diff --git a/.circleci/README.md b/.circleci/README.md deleted file mode 100644 index 5b0d56d1df2e19..00000000000000 --- a/.circleci/README.md +++ /dev/null @@ -1,498 +0,0 @@ -Structure of CI -=============== - -setup job: -1. Does a git checkout -2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why? - We don't always do a Git checkout on all subjobs, but we usually - still want to be able to call scripts one way or another in a subjob. - Persisting files this way lets us have access to them without doing a - checkout. This workspace is conventionally mounted on `~/workspace` - (this is distinguished from `~/project`, which is the conventional - working directory that CircleCI will default to starting your jobs - in.) -3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so - we can determine in subjobs if we should actually run the jobs or - not, even if there isn't a Git checkout. - - - - -CircleCI configuration generator -================================ - -One may no longer make changes to the `.circleci/config.yml` file directly. -Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory. - - -Usage ----------- - -1. Make changes to these scripts. -2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`. - -You'll see a build failure on GitHub if the scripts don't agree with the checked-in version. - - -Motivation ----------- - -These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix. -The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content. - -Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate -multiple parts of the file. - -* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets - -Also see https://github.com/pytorch/pytorch/issues/17038 - - -Future direction ----------------- - -### Declaring sparse config subsets -See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747): - -In contrast with a full recursive tree traversal of configuration dimensions, -> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this. - ----------------- ----------------- - -# How do the binaries / nightlies / releases work? - -### What is a binary? - -A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source. - -A **binary configuration** is a collection of - -* release or nightly - * releases are stable, nightlies are beta and built every night -* python version - * linux: 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists) - * macos: 3.7, 3.8 - * windows: 3.7, 3.8 -* cpu version - * cpu, cuda 9.0, cuda 10.0 - * The supported cuda versions occasionally change -* operating system - * Linux - these are all built on CentOS. 
There haven't been any problems in the past building on CentOS and using on Ubuntu - * MacOS - * Windows - these are built on Azure pipelines -* devtoolset version (gcc compiler version) - * This only matters on Linux cause only Linux uses gcc. tldr is gcc made a backwards incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string - -### Where are the binaries? - -The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to Pytorch releases, usually every few months. - -We have 3 types of binary packages - -* pip packages - nightlies are stored on s3 (pip install -f \). releases are stored in a pip repo (pip install torch) (ask Soumith about this) -* conda packages - nightlies and releases are both stored in a conda repo. Nighty packages have a '_nightly' suffix -* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only - * shared with dependencies (the only supported option for Windows) - * static with dependencies - * shared without dependencies - * static without dependencies - -All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release) - -# CircleCI structure of the binaries - -Some quick vocab: - -* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows. -* **jobs** are a sequence of '**steps**' -* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps* -* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps. - -## How are the workflows structured? - -The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration - -1. binary_builds - 1. every day midnight EST - 2. linux: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml - 3. macos: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml - 4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a - 1. binary_linux_conda_3.7_cpu_build - 1. Builds the build. On linux jobs this uses the 'docker executor'. - 2. Persists the package to the workspace - 2. binary_linux_conda_3.7_cpu_test - 1. Loads the package to the workspace - 2. Spins up a docker image (on Linux), mapping the package and code repos into the docker - 3. Runs some smoke tests in the docker - 4. (Actually, for macos this is a step rather than a separate job) - 3. binary_linux_conda_3.7_cpu_upload - 1. Logs in to aws/conda - 2. Uploads the package -2. update_s3_htmls - 1. every day 5am EST - 2. https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml - 3. See below for what these are for and why they're needed - 4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3 -3. binarysmoketests - 1. every day - 2. 
https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml - 3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a - 1. smoke_linux_conda_3.7_cpu - 1. Downloads the package from the cloud, e.g. using the official pip or conda instructions - 2. Runs the smoke tests - -## How are the jobs structured? - -The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts . - -* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml - * binary_linux_build.sh - * binary_linux_test.sh - * binary_linux_upload.sh -* MacOS jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml - * binary_macos_build.sh - * binary_macos_test.sh - * binary_macos_upload.sh -* Update html jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml - * These delegate from the pytorch/builder repo - * https://github.com/pytorch/builder/blob/master/cron/update_s3_htmls.sh - * https://github.com/pytorch/builder/blob/master/cron/upload_binary_sizes.sh -* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml - * These delegate from the pytorch/builder repo - * https://github.com/pytorch/builder/blob/master/run_tests.sh - * https://github.com/pytorch/builder/blob/master/smoke_test.sh - * https://github.com/pytorch/builder/blob/master/check_binary.sh -* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml - * binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh - * binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps. - * binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables - * binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image - -### **Why do the steps all refer to scripts?** - -CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems. - -### **What is binary_run_in_docker for?** - -So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus - -* linux build jobs use the docker executor. 
Running them on the docker executor was at least 2x faster than running them on the machine executor -* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs -* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use -* linux smoke test jobs use the machine executor for the same reason as the linux test jobs - -binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs - -### **Why does binary_checkout also checkout pytorch? Why shouldn't it?** - -We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where. - -# Azure Pipelines structure of the binaries - -TODO: fill in stuff - -## How are the workflows structured? - -TODO: fill in stuff - -## How are the jobs structured? - -TODO: fill in stuff - -# Code structure of the binaries (circleci agnostic) - -## Overview - -The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is - - -``` -# All code needed to set-up environments for build code to run in, -# but only code that is specific to the current CI system -pytorch/pytorch -- .circleci/ # Folder that holds all circleci related stuff - - config.yml # GENERATED file that actually controls all circleci behavior - - verbatim-sources # Used to generate job/workflow sections in ^ - - scripts/ # Code needed to prepare circleci environments for binary build scripts - -- setup.py # Builds pytorch. This is wrapped in pytorch/builder -- cmake files # used in normal building of pytorch - -# All code needed to prepare a binary build, given an environment -# with all the right variables/packages/paths. -pytorch/builder - -# Given an installed binary and a proper python env, runs some checks -# to make sure the binary was built the proper way. Checks things like -# the library dependencies, symbols present, etc. -- check_binary.sh - -# Given an installed binary, runs python tests to make sure everything -# is in order. These should be de-duped. Right now they both run smoke -# tests, but are called from different places. Usually just call some -# import statements, but also has overlap with check_binary.sh above -- run_tests.sh -- smoke_test.sh - -# Folders that govern how packages are built. See paragraphs below - -- conda/ - - build_pytorch.sh # Entrypoint. 
Delegates to proper conda build folder - - switch_cuda_version.sh # Switches activate CUDA installation in Docker - - pytorch-nightly/ # Build-folder -- manywheel/ - - build_cpu.sh # Entrypoint for cpu builds - - build.sh # Entrypoint for CUDA builds - - build_common.sh # Actual build script that ^^ call into -- wheel/ - - build_wheel.sh # Entrypoint for wheel builds -- windows/ - - build_pytorch.bat # Entrypoint for wheel builds on Windows -``` - -Every type of package has an entrypoint build script that handles the all the important logic. - -## Conda - -Linux, MacOS and Windows use the same code flow for the conda builds. - -Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html - -Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies in what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing. -tl;dr on conda-build is - -1. Creates a brand new conda environment, based off of deps in the meta.yaml - 1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml - 2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is. -2. Calls build.sh in the environment -3. Copies the finished package to a new conda env, also specified by the meta.yaml -4. Runs some simple import tests (if specified in the meta.yaml) -5. Saves the finished package as a tarball - -The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths. - -The entrypoint file `builder/conda/build_conda.sh` is complicated because - -* It works for Linux, MacOS and Windows - * The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed -* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed. - -## Manywheels (linux pip and libtorch packages) - -Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant. - -`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh` - -The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because - -* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unnecessary folders and movements here and there. - * The script is never used this way anymore. This extra machinery could be removed. -* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff - * The script is never used this way anymore. This extra machinery could be removed. -* This also builds libtorch packages - * This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file. 
-* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed. - -## Wheels (MacOS pip and libtorch packages) - -The entrypoint file `builder/wheel/build_wheel.sh` is complicated because - -* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines till circleci). So this script handled siloing itself by setting-up and tearing-down its build env and siloing itself into its own build directory. - * The script is never used this way anymore. This extra machinery could be removed. -* This also builds libtorch packages - * Ditto the comment above. This should definitely be separated out. - -Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda. - -## Windows Wheels (Windows pip and libtorch packages) - -The entrypoint file `builder/windows/build_pytorch.bat` is complicated because - -* This used to handle building for several different python versions at the same time. This is why there are loops everywhere - * The script is never used this way anymore. This extra machinery could be removed. -* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff - * The script is never used this way anymore. This extra machinery could be removed. -* This also builds libtorch packages - * This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file. - -Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda. - -## General notes - -### Note on run_tests.sh, smoke_test.sh, and check_binary.sh - -* These should all be consolidated -* These must run on all OS types: MacOS, Linux, and Windows -* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up. -* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package. - -### Note on libtorch - -Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this - -* It’s confusing. Most of those scripts deal with python specifics. -* The extra conditionals everywhere severely complicate the wheel build scripts -* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script) - -### Note on docker images / Dockerfiles - -All linux builds occur in docker images. The docker images are - -* pytorch/conda-cuda - * Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. 
/usr/local/cuda-10.0 to enable different CUDA builds - * Also used for cpu builds -* pytorch/manylinux-cuda90 -* pytorch/manylinux-cuda100 - * Also used for cpu builds - -The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now. - -### General Python - -* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2 - -# How to manually rebuild the binaries - -tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159 - -Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to check out what you want. - -## How to test changes to the binaries via .circleci - -Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow (a sketch of what such a hardcoded workflow entry can look like is shown at the end of this section), so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this. - -```sh -# Make your changes -touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml - -# Regenerate the yaml, has to be in python 3.7 -.circleci/regenerate.sh - -# Make a commit -git add .circleci * -git commit -m "My real changes" -git push origin my_branch - -# Now hardcode the jobs that you want in the .circleci/config.yml workflows section -# Also eliminate ensure-consistency and should_run_job checks -# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d - -# Make a commit you won't keep -git add .circleci -git commit -m "[DO NOT LAND] testing binaries for above changes" -git push origin my_branch - -# Now you need to make some changes to the first commit. -git rebase -i HEAD~2 # mark the first commit as 'edit' - -# Make the changes -touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml -.circleci/regenerate.sh - -# Amend the commit and continue the rebase -git add .circleci -git commit --amend -git rebase --continue - -# Update the PR, need to force since the commits are different now -git push origin my_branch --force -``` - -The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
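For a concrete picture of the "temporarily hardcode jobs" step above, the sketch below shows roughly what a hand-added build/test pair can look like under the `build` workflow in the regenerated `.circleci/config.yml`. The job types, docker image, and `build_environment` are copied from entries that already exist in this config; the branch name `my_branch` and the exact field set are illustrative assumptions, not the generator's exact output.

```yaml
# Rough sketch only: a temporarily hardcoded binary build/test pair.
# Field names mirror existing entries in .circleci/config.yml; the
# branch filter is the part you loosen so the jobs run on your PR branch.
workflows:
  build:
    jobs:
      - binary_linux_build:
          name: binary_linux_manywheel_3_7m_cu102_devtoolset7_build
          build_environment: "manywheel 3.7m cu102 devtoolset7"
          docker_image: "pytorch/manylinux-cuda102"
          filters:
            branches:
              only:
                - my_branch   # hypothetical PR branch; normally master/ci-all/release
      - binary_linux_test:
          name: binary_linux_manywheel_3_7m_cu102_devtoolset7_test
          build_environment: "manywheel 3.7m cu102 devtoolset7"
          docker_image: "pytorch/manylinux-cuda102"
          requires:
            - binary_linux_manywheel_3_7m_cu102_devtoolset7_build
          resource_class: gpu.nvidia.small   # CUDA smoke tests need a GPU machine
          use_cuda_docker_runtime: "1"
```

This is the kind of edit that lives in the `[DO NOT LAND]` commit in the flow above; re-running `.circleci/regenerate.sh` overwrites it, which is exactly why the real changes and the hardcoded jobs are kept in separate commits.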
- -## How to build a binary locally - -### Linux - -You can build Linux binaries locally easily using docker. - -```sh -# Run the docker -# Use the correct docker image, pytorch/conda-cuda used here as an example -# -# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the -# machine that you're running the command on) accessible to the docker -# container at path/to/bar. So if you then run `touch path/to/bar/baz` -# in the docker container then you will see path/to/foo/baz on your local -# machine. You could also clone the pytorch and builder repos in the docker. -# -# If you know how, add ccache as a volume too and speed up everything -docker run \ - -v your/pytorch/repo:/pytorch \ - -v your/builder/repo:/builder \ - -v where/you/want/packages/to/appear:/final_pkgs \ - -it pytorch/conda-cuda /bin/bash - -# Export whatever variables are important to you. All variables that you'd -# possibly need are in .circleci/scripts/binary_populate_env.sh -# You should probably always export at least these 3 variables -export PACKAGE_TYPE=conda -export DESIRED_PYTHON=3.7 -export DESIRED_CUDA=cpu - -# Call the entrypoint -# `|& tee foo.log` just copies all stdout and stderr output to foo.log -# The builds generate lots of output so you probably need this when -# building locally. -/builder/conda/build_pytorch.sh |& tee build_output.log -``` - -**Building CUDA binaries on docker** - -You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a long time). - -For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast. - -### MacOS - -There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/ - -But if you want to try, then I’d recommend - -```sh -# Create a new terminal -# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you -# know how to do - -# Install a new miniconda -# First remove any other python or conda installation from your PATH -# Always install miniconda 3, even if building for Python <3 -new_conda="~/my_new_conda" -conda_sh="$new_conda/install_miniconda.sh" -curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -chmod +x "$conda_sh" -"$conda_sh" -b -p "$MINICONDA_ROOT" -rm -f "$conda_sh" -export PATH="~/my_new_conda/bin:$PATH" - -# Create a clean python env -# All MacOS builds use conda to manage the python env and dependencies -# that are built with, even the pip packages -conda create -yn binary python=2.7 -conda activate binary - -# Export whatever variables are important to you. All variables that you'd -# possibly need are in .circleci/scripts/binary_populate_env.sh -# You should probably always export at least these 3 variables -export PACKAGE_TYPE=conda -export DESIRED_PYTHON=3.7 -export DESIRED_CUDA=cpu - -# Call the entrypoint you want -path/to/builder/wheel/build_wheel.sh -``` - -N.B. installing a brand new miniconda is important. This has to do with how conda installations work. 
See the “General Python” section above, but tldr; is that - -1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH. -2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH` -3. Now say you (or some code that you ran) call python executable `foo` - 1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected. - 2. But if you forgot to installed `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo` . This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version! - -Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe. - -### Windows - -TODO: fill in diff --git a/.circleci/cimodel/data/binary_build_data.py b/.circleci/cimodel/data/binary_build_data.py index 1c714186568f93..5df203b6ce395f 100644 --- a/.circleci/cimodel/data/binary_build_data.py +++ b/.circleci/cimodel/data/binary_build_data.py @@ -31,13 +31,6 @@ def get_processor_arch_name(gpu_version): ) CONFIG_TREE_DATA = OrderedDict( - windows=( - # Stop building Win+CU102, see https://github.com/pytorch/pytorch/issues/65648 - [v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS and v != "cuda102"], - OrderedDict( - conda=dimensions.STANDARD_PYTHON_VERSIONS, - ) - ), ) # GCC config variants: diff --git a/.circleci/cimodel/data/dimensions.py b/.circleci/cimodel/data/dimensions.py index 57af81de7157eb..efdc363579003b 100644 --- a/.circleci/cimodel/data/dimensions.py +++ b/.circleci/cimodel/data/dimensions.py @@ -4,6 +4,7 @@ "102", "113", "115", + "116", ] ROCM_VERSIONS = [ diff --git a/.circleci/cimodel/data/pytorch_build_definitions.py b/.circleci/cimodel/data/pytorch_build_definitions.py index 036e8a5991919f..e3b9365b6f2607 100644 --- a/.circleci/cimodel/data/pytorch_build_definitions.py +++ b/.circleci/cimodel/data/pytorch_build_definitions.py @@ -185,7 +185,7 @@ def gen_docs_configs(xenial_parent_config): HiddenConf( "pytorch_python_doc_build", parent_build=xenial_parent_config, - filters=gen_filter_dict(branches_list=["master", "nightly"], + filters=gen_filter_dict(branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN), ) ) @@ -201,7 +201,7 @@ def gen_docs_configs(xenial_parent_config): HiddenConf( "pytorch_cpp_doc_build", parent_build=xenial_parent_config, - filters=gen_filter_dict(branches_list=["master", "nightly"], + filters=gen_filter_dict(branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN), ) ) diff --git a/.circleci/cimodel/data/simple/android_definitions.py b/.circleci/cimodel/data/simple/android_definitions.py deleted file mode 100644 index fb6d6f5661b8a0..00000000000000 --- a/.circleci/cimodel/data/simple/android_definitions.py +++ /dev/null @@ -1,103 +0,0 @@ -import cimodel.data.simple.util.branch_filters as branch_filters -from cimodel.data.simple.util.docker_constants import ( - DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK -) - - -class AndroidJob: - def __init__(self, - variant, - template_name, - is_master_only=True): - - self.variant = variant - self.template_name = template_name - self.is_master_only = is_master_only - - def gen_tree(self): - - base_name_parts = [ - "pytorch", - "linux", - "xenial", - "py3", - "clang5", - "android", - "ndk", - "r19c", - 
] + self.variant + [ - "build", - ] - - full_job_name = "_".join(base_name_parts) - build_env_name = "-".join(base_name_parts) - - props_dict = { - "name": full_job_name, - "build_environment": "\"{}\"".format(build_env_name), - "docker_image": "\"{}\"".format(DOCKER_IMAGE_NDK), - "requires": [DOCKER_REQUIREMENT_NDK] - } - - if self.is_master_only: - props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST) - - return [{self.template_name: props_dict}] - - -class AndroidGradleJob: - def __init__(self, - job_name, - template_name, - dependencies, - is_master_only=True, - is_pr_only=False, - extra_props=tuple()): - - self.job_name = job_name - self.template_name = template_name - self.dependencies = dependencies - self.is_master_only = is_master_only - self.is_pr_only = is_pr_only - self.extra_props = dict(extra_props) - - def gen_tree(self): - - props_dict = { - "name": self.job_name, - "requires": self.dependencies, - } - - if self.is_master_only: - props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST) - elif self.is_pr_only: - props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.PR_BRANCH_LIST) - if self.extra_props: - props_dict.update(self.extra_props) - - return [{self.template_name: props_dict}] - - -WORKFLOW_DATA = [ - AndroidJob(["x86_32"], "pytorch_linux_build", is_master_only=False), - AndroidJob(["x86_64"], "pytorch_linux_build"), - AndroidJob(["arm", "v7a"], "pytorch_linux_build"), - AndroidJob(["arm", "v8a"], "pytorch_linux_build"), - AndroidGradleJob( - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", - "pytorch_android_gradle_build-x86_32", - ["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"], - is_master_only=False, - is_pr_only=True), - AndroidGradleJob( - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build", - "pytorch_android_gradle_build", - ["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", - "pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build", - "pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build", - "pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]), -] - - -def get_workflow_jobs(): - return [item.gen_tree() for item in WORKFLOW_DATA] diff --git a/.circleci/cimodel/data/simple/binary_smoketest.py b/.circleci/cimodel/data/simple/binary_smoketest.py deleted file mode 100644 index 6d1d421d029cca..00000000000000 --- a/.circleci/cimodel/data/simple/binary_smoketest.py +++ /dev/null @@ -1,193 +0,0 @@ -""" -TODO: Refactor circleci/cimodel/data/binary_build_data.py to generate this file - instead of doing one offs here - Binary builds (subset, to smoke test that they'll work) - - NB: If you modify this file, you need to also modify - the binary_and_smoke_tests_on_pr variable in - pytorch-ci-hud to adjust the allowed build list - at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js - - Note: - This binary build is currently broken, see https://github_com/pytorch/pytorch/issues/16710 - - binary_linux_conda_3_6_cu90_devtoolset7_build - - binary_linux_conda_3_6_cu90_devtoolset7_test - - TODO - we should test a libtorch cuda build, but they take too long - - binary_linux_libtorch_3_6m_cu90_devtoolset7_static-without-deps_build -""" - -import cimodel.lib.miniutils as miniutils -import cimodel.data.simple.util.branch_filters - - -class SmoketestJob: - def __init__(self, - template_name, - build_env_parts, - docker_image, - job_name, - is_master_only=False, - 
requires=None, - has_libtorch_variant=False, - extra_props=None): - - self.template_name = template_name - self.build_env_parts = build_env_parts - self.docker_image = docker_image - self.job_name = job_name - self.is_master_only = is_master_only - self.requires = requires or [] - self.has_libtorch_variant = has_libtorch_variant - self.extra_props = extra_props or {} - - def gen_tree(self): - - props_dict = { - "build_environment": " ".join(self.build_env_parts), - "name": self.job_name, - "requires": self.requires, - } - - if self.docker_image: - props_dict["docker_image"] = self.docker_image - - if self.is_master_only: - props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict() - - if self.has_libtorch_variant: - props_dict["libtorch_variant"] = "shared-with-deps" - - props_dict.update(self.extra_props) - - return [{self.template_name: props_dict}] - - -WORKFLOW_DATA = [ - SmoketestJob( - "binary_linux_build", - ["manywheel", "3.7m", "cu102", "devtoolset7"], - "pytorch/manylinux-cuda102", - "binary_linux_manywheel_3_7m_cu102_devtoolset7_build", - is_master_only=True, - ), - SmoketestJob( - "binary_linux_build", - ["libtorch", "3.7m", "cpu", "devtoolset7"], - "pytorch/manylinux-cuda102", - "binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build", - is_master_only=True, - has_libtorch_variant=True, - ), - SmoketestJob( - "binary_linux_build", - ["libtorch", "3.7m", "cpu", "gcc5.4_cxx11-abi"], - "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest", - "binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", - is_master_only=False, - has_libtorch_variant=True, - ), - SmoketestJob( - "binary_mac_build", - ["wheel", "3.7", "cpu"], - None, - "binary_macos_wheel_3_7_cpu_build", - is_master_only=True, - ), - # This job has an average run time of 3 hours o.O - # Now only running this on master to reduce overhead - SmoketestJob( - "binary_mac_build", - ["libtorch", "3.7", "cpu"], - None, - "binary_macos_libtorch_3_7_cpu_build", - is_master_only=True, - ), - SmoketestJob( - "binary_windows_build", - ["libtorch", "3.7", "cpu", "debug"], - None, - "binary_windows_libtorch_3_7_cpu_debug_build", - is_master_only=True, - ), - SmoketestJob( - "binary_windows_build", - ["libtorch", "3.7", "cpu", "release"], - None, - "binary_windows_libtorch_3_7_cpu_release_build", - is_master_only=True, - ), - SmoketestJob( - "binary_windows_build", - ["wheel", "3.7", "cu113"], - None, - "binary_windows_wheel_3_7_cu113_build", - is_master_only=True, - ), - - SmoketestJob( - "binary_windows_test", - ["libtorch", "3.7", "cpu", "debug"], - None, - "binary_windows_libtorch_3_7_cpu_debug_test", - is_master_only=True, - requires=["binary_windows_libtorch_3_7_cpu_debug_build"], - ), - SmoketestJob( - "binary_windows_test", - ["libtorch", "3.7", "cpu", "release"], - None, - "binary_windows_libtorch_3_7_cpu_release_test", - is_master_only=False, - requires=["binary_windows_libtorch_3_7_cpu_release_build"], - ), - SmoketestJob( - "binary_windows_test", - ["wheel", "3.7", "cu113"], - None, - "binary_windows_wheel_3_7_cu113_test", - is_master_only=True, - requires=["binary_windows_wheel_3_7_cu113_build"], - extra_props={ - "executor": "windows-with-nvidia-gpu", - }, - ), - - - - SmoketestJob( - "binary_linux_test", - ["manywheel", "3.7m", "cu102", "devtoolset7"], - "pytorch/manylinux-cuda102", - "binary_linux_manywheel_3_7m_cu102_devtoolset7_test", - is_master_only=True, - requires=["binary_linux_manywheel_3_7m_cu102_devtoolset7_build"], - extra_props={ - "resource_class": 
"gpu.nvidia.small", - "use_cuda_docker_runtime": miniutils.quote((str(1))), - }, - ), - SmoketestJob( - "binary_linux_test", - ["libtorch", "3.7m", "cpu", "devtoolset7"], - "pytorch/manylinux-cuda102", - "binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test", - is_master_only=True, - requires=["binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build"], - has_libtorch_variant=True, - ), - SmoketestJob( - "binary_linux_test", - ["libtorch", "3.7m", "cpu", "gcc5.4_cxx11-abi"], - "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest", - "binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test", - is_master_only=True, - requires=["binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build"], - has_libtorch_variant=True, - ), -] - - -def get_workflow_jobs(): - return [item.gen_tree() for item in WORKFLOW_DATA] diff --git a/.circleci/cimodel/data/simple/nightly_android.py b/.circleci/cimodel/data/simple/nightly_android.py deleted file mode 100644 index c6da5bbc4c76b1..00000000000000 --- a/.circleci/cimodel/data/simple/nightly_android.py +++ /dev/null @@ -1,77 +0,0 @@ -from cimodel.data.simple.util.docker_constants import ( - DOCKER_IMAGE_NDK, - DOCKER_REQUIREMENT_NDK -) - - -class AndroidNightlyJob: - def __init__(self, - variant, - template_name, - extra_props=None, - with_docker=True, - requires=None, - no_build_suffix=False): - - self.variant = variant - self.template_name = template_name - self.extra_props = extra_props or {} - self.with_docker = with_docker - self.requires = requires - self.no_build_suffix = no_build_suffix - - def gen_tree(self): - - base_name_parts = [ - "pytorch", - "linux", - "xenial", - "py3", - "clang5", - "android", - "ndk", - "r19c", - ] + self.variant - - build_suffix = [] if self.no_build_suffix else ["build"] - full_job_name = "_".join(["nightly"] + base_name_parts + build_suffix) - build_env_name = "-".join(base_name_parts) - - props_dict = { - "name": full_job_name, - "requires": self.requires, - "filters": {"branches": {"only": "nightly"}}, - } - - props_dict.update(self.extra_props) - - if self.with_docker: - props_dict["docker_image"] = DOCKER_IMAGE_NDK - props_dict["build_environment"] = build_env_name - - return [{self.template_name: props_dict}] - -BASE_REQUIRES = [DOCKER_REQUIREMENT_NDK] - -WORKFLOW_DATA = [ - AndroidNightlyJob(["x86_32"], "pytorch_linux_build", requires=BASE_REQUIRES), - AndroidNightlyJob(["x86_64"], "pytorch_linux_build", requires=BASE_REQUIRES), - AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build", requires=BASE_REQUIRES), - AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build", requires=BASE_REQUIRES), - AndroidNightlyJob(["android_gradle"], "pytorch_android_gradle_build", - with_docker=False, - requires=[ - "nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", - "nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build", - "nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build", - "nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]), - AndroidNightlyJob(["x86_32_android_publish_snapshot"], "pytorch_android_publish_snapshot", - extra_props={"context": "org-member"}, - with_docker=False, - requires=["nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build"], - no_build_suffix=True), -] - - -def get_workflow_jobs(): - return [item.gen_tree() for item in WORKFLOW_DATA] diff --git a/.circleci/cimodel/data/simple/util/branch_filters.py b/.circleci/cimodel/data/simple/util/branch_filters.py 
index dfbc6e4d63bc90..ba4e00a059ef1c 100644 --- a/.circleci/cimodel/data/simple/util/branch_filters.py +++ b/.circleci/cimodel/data/simple/util/branch_filters.py @@ -1,4 +1,5 @@ NON_PR_BRANCH_LIST = [ + "main", "master", r"/ci-all\/.*/", r"/release\/.*/", diff --git a/.circleci/config.yml b/.circleci/config.yml index 57f2fba481373a..8b5d8b87793b57 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -455,234 +455,6 @@ promote_common: &promote_common # Job specs ############################################################################## jobs: - pytorch_linux_build: - <<: *pytorch_params - machine: - image: ubuntu-2004:202104-01 - steps: - # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml - - checkout - - calculate_docker_image_tag - - setup_linux_system_environment - - optional_merge_target_branch - - setup_ci_environment - - run: - name: Build - no_output_timeout: "1h" - command: | - set -e - if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then - echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}" - fi - if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}" - echo 'USE_TBB=1' >> "${BASH_ENV}" - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}" - fi - echo "Parallel backend flags: "${PARALLEL_FLAGS} - # Pull Docker image and run build - echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG} - time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG}) - - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - - docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace - - export COMMAND='((echo "sudo chown -R jenkins workspace && export JOB_BASE_NAME="$CIRCLE_JOB" && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1' - - echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts - - # Copy dist folder back - docker cp $id:/var/lib/jenkins/workspace/dist /home/circleci/project/. || echo "Dist folder not found" - - # Push intermediate Docker image for next phase to use - if [ -z "${BUILD_ONLY}" ]; then - # Note [Special build images] - # The xla build uses the same docker image as - # pytorch_linux_bionic_py3_6_clang9_build. In the push step, we have to - # distinguish between them so the test can pick up the correct image. 
- output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1} - if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-xla - elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-libtorch - elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-parallelnative - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_64"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-x86_64 - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v7a"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v7a - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v8a"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v8a - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_32"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32 - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32 - elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-vulkan - else - export COMMIT_DOCKER_IMAGE=$output_image - fi - docker commit "$id" ${COMMIT_DOCKER_IMAGE} - time docker push ${COMMIT_DOCKER_IMAGE} - fi - - run: - name: upload build & binary data - no_output_timeout: "5m" - command: | - cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - python3 -mpip install requests && \ - SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \ - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - store_artifacts: - path: /home/circleci/project/dist - - pytorch_linux_test: - <<: *pytorch_params - machine: - image: ubuntu-2004:202104-01 - steps: - # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml - - checkout - - calculate_docker_image_tag - - setup_linux_system_environment - - setup_ci_environment - - run: - name: Download Docker image - no_output_timeout: "90m" - command: | - set -e - export PYTHONUNBUFFERED=1 - if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then - export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6" - fi - # See Note [Special build images] - output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1} - if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-xla - elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-libtorch - elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-parallelnative - elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-vulkan - else - export COMMIT_DOCKER_IMAGE=$output_image - fi - echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} - - if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}" - echo 'USE_TBB=1' >> "${BASH_ENV}" - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}" - fi - echo "Parallel backend flags: "${PARALLEL_FLAGS} - - time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null - - # TODO: Make this less painful - if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then - export id=$(docker run --env-file 
"${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all --shm-size=2g -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - elif [[ ${BUILD_ENVIRONMENT} == *"rocm"* ]]; then - hostname - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=8g --ipc=host --device /dev/kfd --device /dev/dri --group-add video -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - else - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=1g --ipc=host -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - fi - echo "id=${id}" >> "${BASH_ENV}" - - - run: - name: Check for no AVX instruction by default - no_output_timeout: "20m" - command: | - set -e - is_vanilla_build() { - if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-bionic-py3.7-clang9-test" ]; then - return 0 - fi - if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-xenial-py3.7-gcc5.4-test" ]; then - return 0 - fi - return 1 - } - - if is_vanilla_build; then - echo "apt-get update || apt-get install libgnutls30" | docker exec -u root -i "$id" bash - echo "apt-get install -y qemu-user gdb" | docker exec -u root -i "$id" bash - echo "cd workspace/build; qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU & gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' -ex 'continue' -ex 'bt' -ex='set confirm off' -ex 'quit \$_isvoid(\$_exitcode)'" | docker exec -u jenkins -i "$id" bash - else - echo "Skipping for ${BUILD_ENVIRONMENT}" - fi - - run: - name: Test - no_output_timeout: "90m" - command: | - set -e - - cat >docker_commands.sh \<> docker_commands.sh - elif [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then - echo ".jenkins/caffe2/test.sh" >> docker_commands.sh - else - echo ".jenkins/pytorch/test.sh" >> docker_commands.sh - fi - echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh - unbuffer bash command.sh | ts - - - run: - name: Report results - no_output_timeout: "5m" - command: | - set -e - # Retrieving test results should be done as very first step as command never fails - # But is always executed if previous step fails for some reason - echo "Retrieving test reports" - docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!' 
- docker stats --all --no-stream - - cat >docker_commands.sh \<&1" > command.sh - unbuffer bash command.sh | ts - when: always - - store_test_results: - path: test-reports binary_linux_build: <<: *binary_linux_build_params steps: @@ -1085,7 +857,7 @@ jobs: parameters: branch: type: string - default: "master" + default: "main" steps: - attach_workspace: at: /tmp/workspace @@ -1125,7 +897,7 @@ jobs: echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} # turn v1.12.0rc3 into 1.12 tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9]*\.[0-9]*\).*/\1/') - target=${tag:-master} + target=${tag:-main} echo "building for ${target}" time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) @@ -1135,7 +907,7 @@ jobs: echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts mkdir -p ~/workspace/build_artifacts - docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/master ~/workspace/build_artifacts + docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/main ~/workspace/build_artifacts docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io /tmp/workspace # Save the docs build so we can debug any problems @@ -1147,7 +919,7 @@ jobs: paths: - . - store_artifacts: - path: ~/workspace/build_artifacts/master + path: ~/workspace/build_artifacts/main destination: docs pytorch_cpp_doc_build: @@ -1171,12 +943,12 @@ jobs: echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} # turn v1.12.0rc3 into 1.12 tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9]*\.[0-9]*\).*/\1/') - target=${tag:-master} + target=${tag:-main} echo "building for ${target}" time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . ./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" master") | docker exec -u jenkins -i "$id" bash) 2>&1' + export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . 
./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" main") | docker exec -u jenkins -i "$id" bash) 2>&1' echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts @@ -1660,7 +1432,7 @@ jobs: time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG}) - echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT" + echo "Do NOT merge main branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT" git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 @@ -1904,654 +1676,8 @@ jobs: # Workflows ############################################################################## workflows: - binary_builds: - jobs: - - binary_windows_build: - name: binary_windows_conda_3_7_cpu_nightly_build - build_environment: "conda 3.7 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_8_cpu_nightly_build - build_environment: "conda 3.8 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_9_cpu_nightly_build - build_environment: "conda 3.9 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_10_cpu_nightly_build - build_environment: "conda 3.10 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_7_cu113_nightly_build - build_environment: "conda 3.7 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_8_cu113_nightly_build - build_environment: "conda 3.8 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_9_cu113_nightly_build - build_environment: "conda 3.9 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_10_cu113_nightly_build - build_environment: "conda 3.10 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_7_cu115_nightly_build - build_environment: "conda 3.7 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_8_cu115_nightly_build - build_environment: "conda 3.8 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_9_cu115_nightly_build - build_environment: "conda 3.9 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_build: - name: binary_windows_conda_3_10_cu115_nightly_build - build_environment: "conda 3.10 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - - binary_windows_test: - name: binary_windows_conda_3_7_cpu_nightly_test - build_environment: "conda 3.7 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - 
/v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_7_cpu_nightly_build - - binary_windows_test: - name: binary_windows_conda_3_8_cpu_nightly_test - build_environment: "conda 3.8 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_8_cpu_nightly_build - - binary_windows_test: - name: binary_windows_conda_3_9_cpu_nightly_test - build_environment: "conda 3.9 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_9_cpu_nightly_build - - binary_windows_test: - name: binary_windows_conda_3_10_cpu_nightly_test - build_environment: "conda 3.10 cpu" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_10_cpu_nightly_build - - binary_windows_test: - name: binary_windows_conda_3_7_cu113_nightly_test - build_environment: "conda 3.7 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_7_cu113_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_8_cu113_nightly_test - build_environment: "conda 3.8 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_8_cu113_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_9_cu113_nightly_test - build_environment: "conda 3.9 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_9_cu113_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_10_cu113_nightly_test - build_environment: "conda 3.10 cu113" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_10_cu113_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_7_cu115_nightly_test - build_environment: "conda 3.7 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_7_cu115_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_8_cu115_nightly_test - build_environment: "conda 3.8 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_8_cu115_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_9_cu115_nightly_test - build_environment: "conda 3.9 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_9_cu115_nightly_build - executor: windows-with-nvidia-gpu - - binary_windows_test: - name: binary_windows_conda_3_10_cu115_nightly_test - build_environment: "conda 3.10 cu115" - filters: - branches: - only: - - /.*/ - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - requires: - - binary_windows_conda_3_10_cu115_nightly_build - executor: windows-with-nvidia-gpu - - binary_upload: - name: binary_windows_conda_3_7_cpu_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_7_cpu_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - 
package_type: conda - upload_subfolder: cpu - - binary_upload: - name: binary_windows_conda_3_8_cpu_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_8_cpu_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cpu - - binary_upload: - name: binary_windows_conda_3_9_cpu_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_9_cpu_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cpu - - binary_upload: - name: binary_windows_conda_3_10_cpu_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_10_cpu_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cpu - - binary_upload: - name: binary_windows_conda_3_7_cu113_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_7_cu113_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu113 - - binary_upload: - name: binary_windows_conda_3_8_cu113_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_8_cu113_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu113 - - binary_upload: - name: binary_windows_conda_3_9_cu113_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_9_cu113_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu113 - - binary_upload: - name: binary_windows_conda_3_10_cu113_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_10_cu113_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu113 - - binary_upload: - name: binary_windows_conda_3_7_cu115_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_7_cu115_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu115 - - binary_upload: - name: binary_windows_conda_3_8_cu115_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_8_cu115_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu115 - - binary_upload: - name: binary_windows_conda_3_9_cu115_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_9_cu115_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu115 - - binary_upload: - name: binary_windows_conda_3_10_cu115_nightly_upload - context: org-member - requires: - - binary_windows_conda_3_10_cu115_nightly_test - filters: - branches: - only: - - nightly - tags: - only: - - /v[0-9]+(\.[0-9]+)*-rc[0-9]+/ - package_type: conda - upload_subfolder: cu115 - when: << pipeline.parameters.run_binary_tests >> build: jobs: - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32-build" - docker_image: 
"308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_android_gradle_build-x86_32: - filters: - branches: - only: - - /gh\/.*\/head/ - - /pull\/.*/ - name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32 - requires: - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - - pytorch_android_gradle_build: - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build - requires: - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - - binary_linux_build: - build_environment: manywheel 3.7m cu102 devtoolset7 - docker_image: pytorch/manylinux-cuda102 - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_linux_manywheel_3_7m_cu102_devtoolset7_build - - binary_linux_build: - build_environment: libtorch 3.7m cpu devtoolset7 - docker_image: pytorch/manylinux-cuda102 - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build - - binary_linux_build: - build_environment: libtorch 3.7m cpu gcc5.4_cxx11-abi - docker_image: pytorch/pytorch-binary-docker-image-ubuntu16.04:latest - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build - - binary_mac_build: - build_environment: wheel 3.7 cpu - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_macos_wheel_3_7_cpu_build - - binary_mac_build: - build_environment: libtorch 3.7 cpu - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_macos_libtorch_3_7_cpu_build - - binary_windows_build: - build_environment: 
libtorch 3.7 cpu debug - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_windows_libtorch_3_7_cpu_debug_build - - binary_windows_build: - build_environment: libtorch 3.7 cpu release - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_windows_libtorch_3_7_cpu_release_build - - binary_windows_build: - build_environment: wheel 3.7 cu113 - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_windows_wheel_3_7_cu113_build - - binary_windows_test: - build_environment: libtorch 3.7 cpu debug - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_windows_libtorch_3_7_cpu_debug_test - requires: - - binary_windows_libtorch_3_7_cpu_debug_build - - binary_windows_test: - build_environment: libtorch 3.7 cpu release - name: binary_windows_libtorch_3_7_cpu_release_test - requires: - - binary_windows_libtorch_3_7_cpu_release_build - - binary_windows_test: - build_environment: wheel 3.7 cu113 - executor: windows-with-nvidia-gpu - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_windows_wheel_3_7_cu113_test - requires: - - binary_windows_wheel_3_7_cu113_build - - binary_linux_test: - build_environment: manywheel 3.7m cu102 devtoolset7 - docker_image: pytorch/manylinux-cuda102 - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - name: binary_linux_manywheel_3_7m_cu102_devtoolset7_test - requires: - - binary_linux_manywheel_3_7m_cu102_devtoolset7_build - resource_class: gpu.nvidia.small - use_cuda_docker_runtime: "1" - - binary_linux_test: - build_environment: libtorch 3.7m cpu devtoolset7 - docker_image: pytorch/manylinux-cuda102 - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test - requires: - - binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build - - binary_linux_test: - build_environment: libtorch 3.7m cpu gcc5.4_cxx11-abi - docker_image: pytorch/pytorch-binary-docker-image-ubuntu16.04:latest - filters: - branches: - only: - - master - - /ci-all\/.*/ - - /release\/.*/ - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test - requires: - - binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build - binary_ios_build: build_environment: libtorch-ios-12.5.1-nightly-x86_64-build context: org-member @@ -2617,60 +1743,6 @@ workflows: requires: - pytorch_ios_full_jit_12_5_1_nightly_x86_64_build - pytorch_ios_full_jit_12_5_1_nightly_arm64_build - - pytorch_linux_build: - build_environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32 - docker_image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64 - docker_image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - 
build_environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a - docker_image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a - docker_image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_android_gradle_build: - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build - requires: - - nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - - nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - - nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - - nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - - pytorch_android_publish_snapshot: - context: org-member - filters: - branches: - only: nightly - name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_android_publish_snapshot - requires: - - nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build - anaconda_prune: name: anaconda-prune-pytorch-nightly context: "org-member" @@ -2689,232 +1761,7 @@ workflows: branches: only: - postnightly - - update_s3_htmls: - context: org-member - filters: - branches: - only: - - postnightly - name: update_s3_htmls - - smoke_windows_test: - name: smoke_windows_conda_3_7_cpu_nightly - build_environment: "conda 3.7 cpu" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - - smoke_windows_test: - name: smoke_windows_conda_3_8_cpu_nightly - build_environment: "conda 3.8 cpu" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - - smoke_windows_test: - name: smoke_windows_conda_3_9_cpu_nightly - build_environment: "conda 3.9 cpu" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - - smoke_windows_test: - name: smoke_windows_conda_3_10_cpu_nightly - build_environment: "conda 3.10 cpu" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - - smoke_windows_test: - name: smoke_windows_conda_3_7_cu113_nightly - build_environment: "conda 3.7 cu113" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_8_cu113_nightly - build_environment: "conda 3.8 cu113" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_9_cu113_nightly - build_environment: "conda 3.9 cu113" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_10_cu113_nightly - build_environment: "conda 3.10 cu113" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_7_cu115_nightly - build_environment: "conda 3.7 cu115" 
- requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_8_cu115_nightly - build_environment: "conda 3.8 cu115" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_9_cu115_nightly - build_environment: "conda 3.9 cu115" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - smoke_windows_test: - name: smoke_windows_conda_3_10_cu115_nightly - build_environment: "conda 3.10 cu115" - requires: - - update_s3_htmls - filters: - branches: - only: - - postnightly - executor: windows-with-nvidia-gpu - - docker_build_job: - name: "docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - image_name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c" when: << pipeline.parameters.run_build >> - master_build: - jobs: - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_linux_build: - build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a-build" - docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - name: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - requires: - - docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - pytorch_android_gradle_build: - name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build - requires: - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build - - pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build - - binary_linux_build: - build_environment: manywheel 3.7m cu102 devtoolset7 - docker_image: pytorch/manylinux-cuda102 - name: binary_linux_manywheel_3_7m_cu102_devtoolset7_build - - binary_linux_build: - build_environment: libtorch 3.7m cpu devtoolset7 - docker_image: pytorch/manylinux-cuda102 - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build - - binary_linux_build: - build_environment: libtorch 3.7m cpu gcc5.4_cxx11-abi - docker_image: pytorch/pytorch-binary-docker-image-ubuntu16.04:latest - libtorch_variant: shared-with-deps - name: 
binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build - - binary_mac_build: - build_environment: wheel 3.7 cpu - name: binary_macos_wheel_3_7_cpu_build - - binary_mac_build: - build_environment: libtorch 3.7 cpu - name: binary_macos_libtorch_3_7_cpu_build - - binary_windows_build: - build_environment: libtorch 3.7 cpu debug - name: binary_windows_libtorch_3_7_cpu_debug_build - - binary_windows_build: - build_environment: libtorch 3.7 cpu release - name: binary_windows_libtorch_3_7_cpu_release_build - - binary_windows_build: - build_environment: wheel 3.7 cu113 - name: binary_windows_wheel_3_7_cu113_build - - binary_windows_test: - build_environment: libtorch 3.7 cpu debug - name: binary_windows_libtorch_3_7_cpu_debug_test - requires: - - binary_windows_libtorch_3_7_cpu_debug_build - - binary_windows_test: - build_environment: wheel 3.7 cu113 - executor: windows-with-nvidia-gpu - name: binary_windows_wheel_3_7_cu113_test - requires: - - binary_windows_wheel_3_7_cu113_build - - binary_linux_test: - build_environment: manywheel 3.7m cu102 devtoolset7 - docker_image: pytorch/manylinux-cuda102 - name: binary_linux_manywheel_3_7m_cu102_devtoolset7_test - requires: - - binary_linux_manywheel_3_7m_cu102_devtoolset7_build - resource_class: gpu.nvidia.small - use_cuda_docker_runtime: "1" - - binary_linux_test: - build_environment: libtorch 3.7m cpu devtoolset7 - docker_image: pytorch/manylinux-cuda102 - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test - requires: - - binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build - - binary_linux_test: - build_environment: libtorch 3.7m cpu gcc5.4_cxx11-abi - docker_image: pytorch/pytorch-binary-docker-image-ubuntu16.04:latest - libtorch_variant: shared-with-deps - name: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test - requires: - - binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build - - docker_build_job: - name: "docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - image_name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c" - when: << pipeline.parameters.run_master_build >> # Promotion workflow promote: jobs: diff --git a/.circleci/docker/build.sh b/.circleci/docker/build.sh index dcd83f7ee0bbc0..0f372a3bb6991b 100755 --- a/.circleci/docker/build.sh +++ b/.circleci/docker/build.sh @@ -145,6 +145,17 @@ case "$image" in VISION=yes KATEX=yes ;; + pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7) + CUDA_VERSION=11.6.0 + CUDNN_VERSION=8 + ANACONDA_PYTHON_VERSION=3.7 + CMAKE_VERSION=3.10.3 + GCC_VERSION=7 + PROTOBUF=yes + DB=yes + VISION=yes + KATEX=yes + ;; pytorch-linux-xenial-py3-clang5-asan) ANACONDA_PYTHON_VERSION=3.7 CLANG_VERSION=5.0 @@ -222,21 +233,21 @@ case "$image" in DB=yes VISION=yes ;; - pytorch-linux-bionic-rocm4.3.1-py3.7) + pytorch-linux-bionic-rocm4.5-py3.7) ANACONDA_PYTHON_VERSION=3.7 GCC_VERSION=9 PROTOBUF=yes DB=yes VISION=yes - ROCM_VERSION=4.3.1 + ROCM_VERSION=4.5.2 ;; - pytorch-linux-bionic-rocm4.5-py3.7) + pytorch-linux-bionic-rocm5.0-py3.7) ANACONDA_PYTHON_VERSION=3.7 GCC_VERSION=9 PROTOBUF=yes DB=yes VISION=yes - ROCM_VERSION=4.5.2 + ROCM_VERSION=5.0 ;; *) # Catch-all for builds that are not hardcoded. 
@@ -283,6 +294,13 @@ fi tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]') +#when using cudnn version 8 install it separately from cuda +if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then + IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}" + if [[ ${CUDNN_VERSION} == 8 ]]; then + IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}" + fi +fi # Build image # TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm @@ -321,6 +339,7 @@ docker build \ --build-arg "KATEX=${KATEX:-}" \ --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \ --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \ + --build-arg "IMAGE_NAME=${IMAGE_NAME}" \ -f $(dirname ${DOCKERFILE})/Dockerfile \ -t "$tmp_tag" \ "$@" \ diff --git a/.circleci/docker/centos-rocm/Dockerfile b/.circleci/docker/centos-rocm/Dockerfile index 264ccaf0ea7c01..f2747d58bfd652 100644 --- a/.circleci/docker/centos-rocm/Dockerfile +++ b/.circleci/docker/centos-rocm/Dockerfile @@ -42,8 +42,10 @@ RUN bash ./install_user.sh && rm install_user.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ADD requirements-ci.txt /opt/conda/requirements-ci.txt ADD ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh +RUN rm /opt/conda/requirements-ci.txt # (optional) Install protobuf for ONNX ARG PROTOBUF diff --git a/.circleci/docker/common/install_conda.sh b/.circleci/docker/common/install_conda.sh index 72f06fb2285c3e..b333051a89e6f2 100755 --- a/.circleci/docker/common/install_conda.sh +++ b/.circleci/docker/common/install_conda.sh @@ -21,7 +21,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then ;; esac - mkdir /opt/conda + mkdir -p /opt/conda chown jenkins:jenkins /opt/conda # Work around bug where devtoolset replaces sudo and breaks it. 
@@ -94,20 +94,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then conda_install nnpack -c killeent # Install some other packages, including those needed for Python test reporting - # Pin SciPy because of failing distribution tests (see #60347) - # Pin MyPy version because new errors are likely to appear with each release - # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136 - as_jenkins pip install --progress-bar off pytest \ - scipy==1.6.3 \ - scikit-image \ - psutil \ - unittest-xml-reporting \ - boto3==1.16.34 \ - hypothesis==4.53.2 \ - expecttest==0.1.3 \ - mypy==0.812 \ - tb-nightly \ - librosa>=0.6.2 + as_jenkins pip install --progress-bar off -r /opt/conda/requirements-ci.txt # Install numba only on python-3.8 or below # For numba issue see https://github.com/pytorch/pytorch/issues/51511 diff --git a/.circleci/docker/common/install_cudnn.sh b/.circleci/docker/common/install_cudnn.sh new file mode 100644 index 00000000000000..1f1c34ea200d4f --- /dev/null +++ b/.circleci/docker/common/install_cudnn.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +if [[ ${CUDNN_VERSION} == 8 ]]; then + # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement + mkdir tmp_cudnn && cd tmp_cudnn + CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive" + curl -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz + tar xf ${CUDNN_NAME}.tar.xz + cp -a ${CUDNN_NAME}/include/* /usr/include/ + cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/ + cp -a ${CUDNN_NAME}/include/* /usr/include/x86_64-linux-gnu/ + + cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/ + cp -a ${CUDNN_NAME}/lib/* /usr/lib/x86_64-linux-gnu/ + cd .. + rm -rf tmp_cudnn + ldconfig +fi diff --git a/.circleci/docker/common/install_rocm.sh b/.circleci/docker/common/install_rocm.sh index f5bbbe85a0d239..1a20b79ec2191b 100644 --- a/.circleci/docker/common/install_rocm.sh +++ b/.circleci/docker/common/install_rocm.sh @@ -6,7 +6,7 @@ install_magma() { # "install" hipMAGMA into /opt/rocm/magma by copying after build git clone https://bitbucket.org/icl/magma.git pushd magma - # Mar 7 - Fixes memory leaks for many linalg UTs + # Fixes memory leaks of magma found while executing linalg UTs git checkout 5959b8783e45f1809812ed96ae762f38ee701972 cp make.inc-examples/make.inc.hip-gcc-mkl make.inc echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc @@ -35,7 +35,7 @@ ver() { } # Map ROCm version to AMDGPU version -declare -A AMDGPU_VERSIONS=( ["4.5.2"]="21.40.2" ) +declare -A AMDGPU_VERSIONS=( ["4.5.2"]="21.40.2" ["5.0"]="21.50" ["5.2"]="22.20" ["5.2.1"]="22.20.1" ["5.2.3"]="22.20.3" ) install_ubuntu() { apt-get update @@ -117,7 +117,7 @@ install_centos() { echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo fi - local rocm_baseurl="http://repo.radeon.com/rocm/yum/${ROCM_VERSION}" + local rocm_baseurl="http://repo.radeon.com/rocm/yum/${ROCM_VERSION}/main" echo "[ROCm]" > /etc/yum.repos.d/rocm.repo echo "name=ROCm" >> /etc/yum.repos.d/rocm.repo echo "baseurl=${rocm_baseurl}" >> /etc/yum.repos.d/rocm.repo diff --git a/.circleci/docker/common/install_user.sh b/.circleci/docker/common/install_user.sh index 69c762350bbfb4..f0a8d86805dc0a 100755 --- a/.circleci/docker/common/install_user.sh +++ b/.circleci/docker/common/install_user.sh @@ -3,8 +3,11 @@ set -ex # Mirror jenkins user in container -echo "jenkins:x:1014:1014::/var/lib/jenkins:" >> /etc/passwd -echo "jenkins:x:1014:" >> /etc/group +# jenkins user as ec2-user should have the same 
user-id +echo "jenkins:x:1000:1000::/var/lib/jenkins:" >> /etc/passwd +echo "jenkins:x:1000:" >> /etc/group +# Needed on focal or newer +echo "jenkins:*:19110:0:99999:7:::" >>/etc/shadow # Create $HOME mkdir -p /var/lib/jenkins diff --git a/.circleci/docker/requirements-ci.txt b/.circleci/docker/requirements-ci.txt new file mode 100644 index 00000000000000..838062474d7a13 --- /dev/null +++ b/.circleci/docker/requirements-ci.txt @@ -0,0 +1,210 @@ +# Python dependencies required for unit tests + +#awscli==1.6 #this breaks some platforms +#Description: AWS command line interface +#Pinned versions: 1.6 +#test that import: + +boto3==1.19.12 +#Description: AWS SDK for python +#Pinned versions: 1.19.12, 1.16.34 +#test that import: + +click +#Description: Command Line Interface Creation Kit +#Pinned versions: +#test that import: + +coremltools==5.0b5 +#Description: Apple framework for ML integration +#Pinned versions: 5.0b5 +#test that import: + +#dataclasses #this breaks some platforms +#Description: Provides decorators for auto adding special methods to user classes +#Pinned versions: +#test that import: + +expecttest==0.1.3 +#Description: method for writing tests where test framework auto populates +# the expected output based on previous runs +#Pinned versions: 0.1.3 +#test that import: + +flatbuffers==2.0 +#Description: cross platform serialization library +#Pinned versions: 2.0 +#test that import: + +#future #this breaks linux-bionic-rocm4.5-py3.7 +#Description: compatibility layer between python 2 and python 3 +#Pinned versions: +#test that import: + +hypothesis==4.53.2 +# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136 +#Description: advanced library for generating parametrized tests +#Pinned versions: 3.44.6, 4.53.2 +#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py + +junitparser==2.1.1 +#Description: unitparser handles JUnit/xUnit Result XML files +#Pinned versions: 2.1.1 +#test that import: + +librosa>=0.6.2 +#Description: A python package for music and audio analysis +#Pinned versions: >=0.6.2 +#test that import: test_spectral_ops.py + +#mkl #this breaks linux-bionic-rocm4.5-py3.7 +#Description: Intel oneAPI Math Kernel Library +#Pinned versions: +#test that import: test_profiler.py, test_public_bindings.py, test_testing.py, +#test_nn.py, test_mkldnn.py, test_jit.py, test_fx_experimental.py, +#test_autograd.py + +#mkl-devel +# see mkl + +#mock # breaks ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c +#Description: A testing library that allows you to replace parts of your +#system under test with mock objects +#Pinned versions: +#test that import: test_module_init.py, test_modules.py, test_nn.py, +#test_testing.py + +#MonkeyType # breaks pytorch-xla-linux-bionic-py3.7-clang8 +#Description: collects runtime types of function arguments and return +#values, and can automatically generate stub files +#Pinned versions: +#test that import: + +mypy==0.960 +# Pin MyPy version because new errors are likely to appear with each release +#Description: linter +#Pinned versions: 0.960 +#test that import: test_typing.py, test_type_hints.py + +#networkx +#Description: creation, manipulation, and study of +#the structure, dynamics, and functions of complex networks +#Pinned versions: 2.0 +#test that import: + +#ninja +#Description: build system. 
Note that it install from +#here breaks things so it is commented out +#Pinned versions: 1.10.0.post1 +#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py + +#numba +#Description: Just-In-Time Compiler for Numerical Functions +#Pinned versions: 0.54.1, 0.49.0, <=0.49.1 +#test that import: test_numba_integration.py + +#numpy +#Description: Provides N-dimensional arrays and linear algebra +#Pinned versions: 1.20 +#test that import: test_view_ops.py, test_unary_ufuncs.py, test_type_promotion.py, +#test_type_info.py, test_torch.py, test_tensorexpr_pybind.py, test_tensorexpr.py, +#test_tensorboard.py, test_tensor_creation_ops.py, test_static_runtime.py, +#test_spectral_ops.py, test_sort_and_select.py, test_shape_ops.py, +#test_segment_reductions.py, test_reductions.py, test_pruning_op.py, +#test_overrides.py, test_numpy_interop.py, test_numba_integration.py +#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py, +#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py, +#test_binary_ufuncs.py + +#onnxruntime +#Description: scoring engine for Open Neural Network Exchange (ONNX) models +#Pinned versions: 1.9.0 +#test that import: + +#pillow +#Description: Python Imaging Library fork +#Pinned versions: +#test that import: + +protobuf==3.20.1 +#Description: Google’s data interchange format +#Pinned versions: 3.20.1 +#test that import: test_tensorboard.py + +psutil +#Description: information on running processes and system utilization +#Pinned versions: +#test that import: test_profiler.py, test_openmp.py, test_dataloader.py + +pytest +#Description: testing framework +#Pinned versions: +#test that import: test_typing.py, test_cpp_extensions_aot.py, run_test.py + +#pytest-benchmark +#Description: fixture for benchmarking code +#Pinned versions: 3.2.3 +#test that import: + +#pytest-sugar +#Description: shows failures and errors instantly +#Pinned versions: +#test that import: + +#PyYAML +#Description: data serialization format +#Pinned versions: +#test that import: + +#requests +#Description: HTTP library +#Pinned versions: +#test that import: test_type_promotion.py + +#rich +#Description: rich text and beautiful formatting in the terminal +#Pinned versions: 10.9.0 +#test that import: + +scikit-image +#Description: image processing routines +#Pinned versions: +#test that import: test_nn.py + +#scikit-learn +#Description: machine learning package +#Pinned versions: 0.20.3 +#test that import: + +scipy==1.6.3 +# Pin SciPy because of failing distribution tests (see #60347) +#Description: scientific python +#Pinned versions: 1.6.3 +#test that import: test_unary_ufuncs.py, test_torch.py,test_tensor_creation_ops.py +#test_spectral_ops.py, test_sparse_csr.py, test_reductions.py,test_nn.py +#test_linalg.py, test_binary_ufuncs.py + +#tabulate +#Description: Pretty-print tabular data +#Pinned versions: +#test that import: + +tb-nightly +#Description: TensorBoard +#Pinned versions: +#test that import: + +#typing-extensions +#Description: type hints for python +#Pinned versions: +#test that import: + +#virtualenv +#Description: virtual environment for python +#Pinned versions: +#test that import: + +unittest-xml-reporting<=3.2.0,>=2.0.0 +#Description: saves unit test results to xml +#Pinned versions: +#test that import: diff --git a/.circleci/docker/ubuntu-cuda/Dockerfile b/.circleci/docker/ubuntu-cuda/Dockerfile index 9c9e40387066e5..241b91cff394d1 100644 --- a/.circleci/docker/ubuntu-cuda/Dockerfile +++ b/.circleci/docker/ubuntu-cuda/Dockerfile @@ 
-1,12 +1,11 @@ ARG UBUNTU_VERSION ARG CUDA_VERSION -ARG CUDNN_VERSION +ARG IMAGE_NAME -FROM nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION} +FROM ${IMAGE_NAME} ARG UBUNTU_VERSION ARG CUDA_VERSION -ARG CUDNN_VERSION ENV DEBIAN_FRONTEND noninteractive @@ -27,8 +26,10 @@ RUN bash ./install_katex.sh && rm install_katex.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ADD requirements-ci.txt /opt/conda/requirements-ci.txt ADD ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh +RUN rm /opt/conda/requirements-ci.txt # Install gcc ARG GCC_VERSION @@ -99,5 +100,11 @@ ENV CUDA_PATH /usr/local/cuda # Install LLVM dev version (Defined in the pytorch/builder github repository) COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm +# Install CUDNN +ARG CUDNN_VERSION +ADD ./common/install_cudnn.sh install_cudnn.sh +RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi +RUN rm install_cudnn.sh + USER jenkins CMD ["bash"] diff --git a/.circleci/docker/ubuntu-rocm/Dockerfile b/.circleci/docker/ubuntu-rocm/Dockerfile index 73f0e1822e895a..26059287636332 100644 --- a/.circleci/docker/ubuntu-rocm/Dockerfile +++ b/.circleci/docker/ubuntu-rocm/Dockerfile @@ -28,8 +28,10 @@ RUN bash ./install_user.sh && rm install_user.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ADD requirements-ci.txt /opt/conda/requirements-ci.txt ADD ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh +RUN rm /opt/conda/requirements-ci.txt # Install gcc ARG GCC_VERSION diff --git a/.circleci/docker/ubuntu/Dockerfile b/.circleci/docker/ubuntu/Dockerfile index e0ae5c096ec9a8..d5940c7a1d55e3 100644 --- a/.circleci/docker/ubuntu/Dockerfile +++ b/.circleci/docker/ubuntu/Dockerfile @@ -36,8 +36,10 @@ RUN bash ./install_katex.sh && rm install_katex.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ADD requirements-ci.txt /opt/conda/requirements-ci.txt ADD ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh +RUN rm /opt/conda/requirements-ci.txt # Install gcc ARG GCC_VERSION diff --git a/.circleci/generate_config_yml.py b/.circleci/generate_config_yml.py index 581a32cd485c80..f089518f4f46cd 100755 --- a/.circleci/generate_config_yml.py +++ b/.circleci/generate_config_yml.py @@ -10,12 +10,8 @@ import sys from collections import namedtuple -import cimodel.data.binary_build_definitions as binary_build_definitions -import cimodel.data.simple.android_definitions -import cimodel.data.simple.binary_smoketest import cimodel.data.simple.docker_definitions import cimodel.data.simple.mobile_definitions -import cimodel.data.simple.nightly_android import cimodel.data.simple.nightly_ios import cimodel.data.simple.anaconda_prune_defintions import cimodel.lib.miniutils as miniutils @@ -83,11 +79,11 @@ def _for_all_items(items, functor) -> None: functor(item_type, item) def filter_master_only_jobs(items): - def _is_master_item(item): + def _is_main_or_master_item(item): filters = item.get('filters', None) branches = filters.get('branches', None) if filters is not None else None branches_only = branches.get('only', None) if branches is not None else None - return 'master' in branches_only if branches_only is not None else False + return ('main' in branches_only or 'master' in branches_only) 
if branches_only is not None else False master_deps = set() @@ -96,7 +92,7 @@ def _save_requires_if_master(item_type, item): item_name = item.get("name", None) if not isinstance(requires, list): return - if _is_master_item(item) or item_name in master_deps: + if _is_main_or_master_item(item) or item_name in master_deps: master_deps.update([n.strip('"') for n in requires]) def _do_filtering(items): @@ -107,7 +103,7 @@ def _do_filtering(items): item_type, item = next(iter(items.items())) item_name = item.get("name", None) item_name = item_name.strip('"') if item_name is not None else None - if not _is_master_item(item) and item_name not in master_deps: + if not _is_main_or_master_item(item) and item_name not in master_deps: return None if 'filters' in item: item = item.copy() @@ -115,7 +111,7 @@ def _do_filtering(items): return {item_type: item} # Scan of dependencies twice to pick up nested required jobs - # I.e. jobs depending on jobs that master-only job depend on + # I.e. jobs depending on jobs that main-only job depend on _for_all_items(items, _save_requires_if_master) _for_all_items(items, _save_requires_if_master) return _do_filtering(items) @@ -137,14 +133,9 @@ def _requires_docker_image(item_type, item): def gen_build_workflows_tree(): build_workflows_functions = [ - cimodel.data.simple.android_definitions.get_workflow_jobs, cimodel.data.simple.mobile_definitions.get_workflow_jobs, - cimodel.data.simple.binary_smoketest.get_workflow_jobs, cimodel.data.simple.nightly_ios.get_workflow_jobs, - cimodel.data.simple.nightly_android.get_workflow_jobs, cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs, - binary_build_definitions.get_post_upload_jobs, - binary_build_definitions.get_binary_smoke_test_jobs, ] build_jobs = [f() for f in build_workflows_functions] build_jobs.extend( @@ -155,28 +146,20 @@ def gen_build_workflows_tree(): ) master_build_jobs = filter_master_only_jobs(build_jobs) - binary_build_functions = [ - binary_build_definitions.get_binary_build_jobs, - binary_build_definitions.get_nightly_tests, - binary_build_definitions.get_nightly_uploads, - ] - - return { + rc = { "workflows": { - "binary_builds": { - "when": r"<< pipeline.parameters.run_binary_tests >>", - "jobs": [f() for f in binary_build_functions], - }, "build": { "when": r"<< pipeline.parameters.run_build >>", "jobs": build_jobs, }, - "master_build": { - "when": r"<< pipeline.parameters.run_master_build >>", - "jobs": master_build_jobs, - }, } } + if len(master_build_jobs) > 0: + rc["workflows"]["master_build"] = { + "when": r"<< pipeline.parameters.run_master_build >>", + "jobs": master_build_jobs, + } + return rc # Order of this list matters to the generated config.yml. 
@@ -189,7 +172,6 @@ def gen_build_workflows_tree(): File("build-parameters/binary-build-params.yml"), File("build-parameters/promote-build-params.yml"), Header("Job specs"), - File("job-specs/pytorch-job-specs.yml"), File("job-specs/binary-job-specs.yml"), File("job-specs/job-specs-custom.yml"), File("job-specs/job-specs-promote.yml"), diff --git a/.circleci/scripts/binary_checkout.sh b/.circleci/scripts/binary_checkout.sh index db2b0660d9f506..86bfeb77e6ac4a 100755 --- a/.circleci/scripts/binary_checkout.sh +++ b/.circleci/scripts/binary_checkout.sh @@ -49,8 +49,9 @@ if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then git reset --hard "$CIRCLE_SHA1" elif [[ -n "${CIRCLE_SHA1:-}" ]]; then # Scheduled workflows & "smoke" binary build on master on PR merges + DEFAULT_BRANCH="$(git remote show $CIRCLE_REPOSITORY_URL | awk '/HEAD branch/ {print $NF}')" git reset --hard "$CIRCLE_SHA1" - git checkout -q -B master + git checkout -q -B $DEFAULT_BRANCH else echo "Can't tell what to checkout" exit 1 diff --git a/.circleci/scripts/binary_linux_build.sh b/.circleci/scripts/binary_linux_build.sh index 42aa728d55a6fb..88561fcd80ec02 100755 --- a/.circleci/scripts/binary_linux_build.sh +++ b/.circleci/scripts/binary_linux_build.sh @@ -26,7 +26,7 @@ else build_script='manywheel/build.sh' fi -if [[ "$CIRCLE_BRANCH" == "master" ]] || [[ "$CIRCLE_BRANCH" == release/* ]]; then +if [[ "$CIRCLE_BRANCH" == "main" ]] || [[ "$CIRCLE_BRANCH" == "master" ]] || [[ "$CIRCLE_BRANCH" == release/* ]]; then export BUILD_DEBUG_INFO=1 fi diff --git a/.circleci/scripts/binary_linux_test.sh b/.circleci/scripts/binary_linux_test.sh index 5be7f7cae21375..e915903ad8746f 100755 --- a/.circleci/scripts/binary_linux_test.sh +++ b/.circleci/scripts/binary_linux_test.sh @@ -53,7 +53,7 @@ if [[ "\$python_nodot" = *39* ]]; then NUMPY_PIN=">=1.20" fi -if [[ "$DESIRED_CUDA" == "cu112" || "$DESIRED_CUDA" == "cu115" ]]; then +if [[ "$DESIRED_CUDA" == "cu115" || "$DESIRED_CUDA" == "cu116" ]]; then EXTRA_CONDA_FLAGS="-c=conda-forge" fi @@ -67,7 +67,8 @@ mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to m # TODO there is duplicated and inconsistent test-python-env setup across this # file, builder/smoke_test.sh, and builder/run_tests.sh, and also in the # conda build scripts themselves. 
These should really be consolidated -pkg="/final_pkgs/\$(ls /final_pkgs)" +# Pick only one package of multiple available (which happens as result of workflow re-runs) +pkg="/final_pkgs/\$(ls -1 /final_pkgs|sort|tail -1)" if [[ "$PACKAGE_TYPE" == conda ]]; then ( # For some reason conda likes to re-activate the conda environment when attempting this install diff --git a/.circleci/scripts/binary_windows_build.sh b/.circleci/scripts/binary_windows_build.sh index 2104e5728f8500..e6500b8d9c93d1 100644 --- a/.circleci/scripts/binary_windows_build.sh +++ b/.circleci/scripts/binary_windows_build.sh @@ -8,15 +8,16 @@ export CUDA_VERSION="${DESIRED_CUDA/cu/}" export USE_SCCACHE=1 export SCCACHE_BUCKET=ossci-compiler-cache-windows export SCCACHE_IGNORE_SERVER_IO_ERROR=1 -export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT" export VC_YEAR=2019 if [[ "${DESIRED_CUDA}" == *"cu11"* ]]; then export BUILD_SPLIT_CUDA=ON fi + echo "Free Space for CUDA DEBUG BUILD" if [[ "${CIRCLECI:-}" == 'true' ]]; then + export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT" if [[ -d "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community" ]]; then rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community" fi @@ -71,6 +72,7 @@ pushd "$BUILDER_ROOT" if [[ "$PACKAGE_TYPE" == 'conda' ]]; then ./windows/internal/build_conda.bat elif [[ "$PACKAGE_TYPE" == 'wheel' || "$PACKAGE_TYPE" == 'libtorch' ]]; then + export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT" ./windows/internal/build_wheels.bat fi diff --git a/.circleci/scripts/cpp_doc_push_script.sh b/.circleci/scripts/cpp_doc_push_script.sh index fa68d07e537eaa..1b4ea71ffd9dbc 100755 --- a/.circleci/scripts/cpp_doc_push_script.sh +++ b/.circleci/scripts/cpp_doc_push_script.sh @@ -20,7 +20,7 @@ echo "cpp_doc_push_script.sh: Invoked with $*" # but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to # try and gather it first, just so we don't potentially break people who rely on this script # Argument 2: What version of the Python API docs we are building. -version="${2:-${DOCS_VERSION:-master}}" +version="${2:-${DOCS_VERSION:-main}}" if [ -z "$version" ]; then echo "error: cpp_doc_push_script.sh: version (arg2) not specified" exit 1 @@ -34,9 +34,9 @@ echo "error: cpp_doc_push_script.sh: install_path (arg1) not specified" exit 1 fi -is_master_doc=false -if [ "$version" == "master" ]; then - is_master_doc=true +is_main_doc=false +if [ "$version" == "main" ]; then + is_main_doc=true fi echo "install_path: $install_path version: $version" @@ -65,8 +65,7 @@ cp torch/_utils_internal.py tools/shared # Generate PyTorch files time python tools/setup_helpers/generate_code.py \ - --native-functions-path aten/src/ATen/native/native_functions.yaml \ - --nn-path aten/src/ + --native-functions-path aten/src/ATen/native/native_functions.yaml # Build the docs pushd docs/cpp diff --git a/.circleci/scripts/python_doc_push_script.sh b/.circleci/scripts/python_doc_push_script.sh index ccfc44917400a7..cb6d520c260de4 100755 --- a/.circleci/scripts/python_doc_push_script.sh +++ b/.circleci/scripts/python_doc_push_script.sh @@ -23,7 +23,7 @@ set -ex # but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to # try and gather it first, just so we don't potentially break people who rely on this script # Argument 2: What version of the docs we are building. 
-version="${2:-${DOCS_VERSION:-master}}" +version="${2:-${DOCS_VERSION:-main}}" if [ -z "$version" ]; then echo "error: python_doc_push_script.sh: version (arg2) not specified" exit 1 @@ -37,9 +37,9 @@ echo "error: python_doc_push_script.sh: install_path (arg1) not specified" exit 1 fi -is_master_doc=false -if [ "$version" == "master" ]; then - is_master_doc=true +is_main_doc=false +if [ "$version" == "main" ]; then + is_main_doc=true fi # Argument 3: The branch to push to. Usually is "site" @@ -86,7 +86,7 @@ pushd docs # Build the docs pip -q install -r requirements.txt -if [ "$is_master_doc" = true ]; then +if [ "$is_main_doc" = true ]; then build_docs html [ $? -eq 0 ] || exit $? make coverage diff --git a/.circleci/scripts/setup_ci_environment.sh b/.circleci/scripts/setup_ci_environment.sh index 1f2e6bfaef61bc..dab183d907a6c6 100755 --- a/.circleci/scripts/setup_ci_environment.sh +++ b/.circleci/scripts/setup_ci_environment.sh @@ -32,7 +32,7 @@ if ! command -v aws >/dev/null; then fi if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then - DRIVER_FN="NVIDIA-Linux-x86_64-495.44.run" + DRIVER_FN="NVIDIA-Linux-x86_64-510.60.02.run" wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false) nvidia-smi diff --git a/.circleci/scripts/trigger_azure_pipeline.py b/.circleci/scripts/trigger_azure_pipeline.py index b35ee5ce9def07..9dc9dff2d54de1 100644 --- a/.circleci/scripts/trigger_azure_pipeline.py +++ b/.circleci/scripts/trigger_azure_pipeline.py @@ -11,7 +11,7 @@ AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "") PIPELINE_ID = "911" PROJECT_ID = "0628bce4-2d33-499e-bac5-530e12db160f" -TARGET_BRANCH = os.environ.get("CIRCLE_BRANCH", "master") +TARGET_BRANCH = os.environ.get("CIRCLE_BRANCH", "main") TARGET_COMMIT = os.environ.get("CIRCLE_SHA1", "") build_base_url = AZURE_PIPELINE_BASE_URL + "_apis/build/builds?api-version=6.0" diff --git a/.circleci/scripts/windows_cuda_install.sh b/.circleci/scripts/windows_cuda_install.sh index abcdcf134b3769..b12ec7516ab7b0 100644 --- a/.circleci/scripts/windows_cuda_install.sh +++ b/.circleci/scripts/windows_cuda_install.sh @@ -22,6 +22,10 @@ case ${CUDA_VERSION} in cuda_installer_name="cuda_11.5.0_496.13_win10" cuda_install_packages="thrust_11.5 nvcc_11.5 cuobjdump_11.5 nvprune_11.5 nvprof_11.5 cupti_11.5 cublas_11.5 cublas_dev_11.5 cudart_11.5 cufft_11.5 cufft_dev_11.5 curand_11.5 curand_dev_11.5 cusolver_11.5 cusolver_dev_11.5 cusparse_11.5 cusparse_dev_11.5 npp_11.5 npp_dev_11.5 nvrtc_11.5 nvrtc_dev_11.5 nvml_dev_11.5" ;; + 11.6) + cuda_installer_name="cuda_11.6.0_511.23_windows" + cuda_install_packages="thrust_11.6 nvcc_11.6 cuobjdump_11.6 nvprune_11.6 nvprof_11.6 cupti_11.6 cublas_11.6 cublas_dev_11.6 cudart_11.6 cufft_11.6 cufft_dev_11.6 curand_11.6 curand_dev_11.6 cusolver_11.6 cusolver_dev_11.6 cusparse_11.6 cusparse_dev_11.6 npp_11.6 npp_dev_11.6 nvrtc_11.6 nvrtc_dev_11.6 nvml_dev_11.6" + ;; *) echo "CUDA_VERSION $CUDA_VERSION is not supported yet" exit 1 diff --git a/.circleci/scripts/windows_cudnn_install.sh b/.circleci/scripts/windows_cudnn_install.sh index 87e8a8dd09bf20..fbcbdc4020e961 100644 --- a/.circleci/scripts/windows_cudnn_install.sh +++ b/.circleci/scripts/windows_cudnn_install.sh @@ -22,6 +22,10 @@ case ${CUDA_VERSION} in # Since cudnn 8.3 the filename have changed cudnn_file_name="cudnn-windows-x86_64-8.3.2.44_cuda${CUDA_VERSION}-archive" ;; + 11.6) + # Use cudnn8.3 with hard-coded cuda11.5 version + 
cudnn_file_name="cudnn-windows-x86_64-8.3.2.44_cuda11.5-archive" + ;; *) echo "CUDA_VERSION: ${CUDA_VERSION} not supported yet" exit 1 diff --git a/.circleci/verbatim-sources/job-specs/binary-job-specs.yml b/.circleci/verbatim-sources/job-specs/binary-job-specs.yml index ab60b0d372d679..f6f16ef7dd651c 100644 --- a/.circleci/verbatim-sources/job-specs/binary-job-specs.yml +++ b/.circleci/verbatim-sources/job-specs/binary-job-specs.yml @@ -1,3 +1,4 @@ +jobs: binary_linux_build: <<: *binary_linux_build_params steps: diff --git a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml index a3c1d932d93eb5..f0f12e09b2d902 100644 --- a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml +++ b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml @@ -5,7 +5,7 @@ parameters: branch: type: string - default: "master" + default: "main" steps: - attach_workspace: at: /tmp/workspace @@ -45,7 +45,7 @@ echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} # turn v1.12.0rc3 into 1.12 tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9]*\.[0-9]*\).*/\1/') - target=${tag:-master} + target=${tag:-main} echo "building for ${target}" time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) @@ -55,7 +55,7 @@ echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts mkdir -p ~/workspace/build_artifacts - docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/master ~/workspace/build_artifacts + docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/main ~/workspace/build_artifacts docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io /tmp/workspace # Save the docs build so we can debug any problems @@ -67,7 +67,7 @@ paths: - . - store_artifacts: - path: ~/workspace/build_artifacts/master + path: ~/workspace/build_artifacts/main destination: docs pytorch_cpp_doc_build: @@ -91,12 +91,12 @@ echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} # turn v1.12.0rc3 into 1.12 tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9]*\.[0-9]*\).*/\1/') - target=${tag:-master} + target=${tag:-main} echo "building for ${target}" time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . ./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" master") | docker exec -u jenkins -i "$id" bash) 2>&1' + export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . 
./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" main") | docker exec -u jenkins -i "$id" bash) 2>&1' echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts @@ -580,7 +580,7 @@ time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG}) - echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT" + echo "Do NOT merge main branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT" git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 diff --git a/.circleci/verbatim-sources/job-specs/pytorch-job-specs.yml b/.circleci/verbatim-sources/job-specs/pytorch-job-specs.yml deleted file mode 100644 index 79f879a13f0197..00000000000000 --- a/.circleci/verbatim-sources/job-specs/pytorch-job-specs.yml +++ /dev/null @@ -1,229 +0,0 @@ -jobs: - pytorch_linux_build: - <<: *pytorch_params - machine: - image: ubuntu-2004:202104-01 - steps: - # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml - - checkout - - calculate_docker_image_tag - - setup_linux_system_environment - - optional_merge_target_branch - - setup_ci_environment - - run: - name: Build - no_output_timeout: "1h" - command: | - set -e - if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then - echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}" - fi - if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}" - echo 'USE_TBB=1' >> "${BASH_ENV}" - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}" - fi - echo "Parallel backend flags: "${PARALLEL_FLAGS} - # Pull Docker image and run build - echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG} - time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG}) - - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - - docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace - - export COMMAND='((echo "sudo chown -R jenkins workspace && export JOB_BASE_NAME="$CIRCLE_JOB" && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1' - - echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts - - # Copy dist folder back - docker cp $id:/var/lib/jenkins/workspace/dist /home/circleci/project/. || echo "Dist folder not found" - - # Push intermediate Docker image for next phase to use - if [ -z "${BUILD_ONLY}" ]; then - # Note [Special build images] - # The xla build uses the same docker image as - # pytorch_linux_bionic_py3_6_clang9_build. In the push step, we have to - # distinguish between them so the test can pick up the correct image. 
- output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1} - if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-xla - elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-libtorch - elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-parallelnative - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_64"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-x86_64 - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v7a"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v7a - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v8a"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v8a - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_32"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32 - elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32 - elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-vulkan - else - export COMMIT_DOCKER_IMAGE=$output_image - fi - docker commit "$id" ${COMMIT_DOCKER_IMAGE} - time docker push ${COMMIT_DOCKER_IMAGE} - fi - - run: - name: upload build & binary data - no_output_timeout: "5m" - command: | - cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - python3 -mpip install requests && \ - SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \ - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - store_artifacts: - path: /home/circleci/project/dist - - pytorch_linux_test: - <<: *pytorch_params - machine: - image: ubuntu-2004:202104-01 - steps: - # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml - - checkout - - calculate_docker_image_tag - - setup_linux_system_environment - - setup_ci_environment - - run: - name: Download Docker image - no_output_timeout: "90m" - command: | - set -e - export PYTHONUNBUFFERED=1 - if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then - export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6" - fi - # See Note [Special build images] - output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1} - if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-xla - elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-libtorch - elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-parallelnative - elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then - export COMMIT_DOCKER_IMAGE=$output_image-vulkan - else - export COMMIT_DOCKER_IMAGE=$output_image - fi - echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE} - - if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then - echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}" - echo 'USE_TBB=1' >> "${BASH_ENV}" - elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then - echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}" - fi - echo "Parallel backend flags: "${PARALLEL_FLAGS} - - time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null - - # TODO: Make this less painful - if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then - export id=$(docker run --env-file 
"${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all --shm-size=2g -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - elif [[ ${BUILD_ENVIRONMENT} == *"rocm"* ]]; then - hostname - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=8g --ipc=host --device /dev/kfd --device /dev/dri --group-add video -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - else - export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=1g --ipc=host -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE}) - fi - echo "id=${id}" >> "${BASH_ENV}" - - - run: - name: Check for no AVX instruction by default - no_output_timeout: "20m" - command: | - set -e - is_vanilla_build() { - if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-bionic-py3.7-clang9-test" ]; then - return 0 - fi - if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-xenial-py3.7-gcc5.4-test" ]; then - return 0 - fi - return 1 - } - - if is_vanilla_build; then - echo "apt-get update || apt-get install libgnutls30" | docker exec -u root -i "$id" bash - echo "apt-get install -y qemu-user gdb" | docker exec -u root -i "$id" bash - echo "cd workspace/build; qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU & gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' -ex 'continue' -ex 'bt' -ex='set confirm off' -ex 'quit \$_isvoid(\$_exitcode)'" | docker exec -u jenkins -i "$id" bash - else - echo "Skipping for ${BUILD_ENVIRONMENT}" - fi - - run: - name: Test - no_output_timeout: "90m" - command: | - set -e - - cat >docker_commands.sh \<> docker_commands.sh - elif [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then - echo ".jenkins/caffe2/test.sh" >> docker_commands.sh - else - echo ".jenkins/pytorch/test.sh" >> docker_commands.sh - fi - echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh - unbuffer bash command.sh | ts - - - run: - name: Report results - no_output_timeout: "5m" - command: | - set -e - # Retrieving test results should be done as very first step as command never fails - # But is always executed if previous step fails for some reason - echo "Retrieving test reports" - docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!' - docker stats --all --no-stream - - cat >docker_commands.sh \<&1" > command.sh - unbuffer bash command.sh | ts - when: always - - store_test_results: - path: test-reports diff --git a/.gitattributes b/.gitattributes index 70246abe9bbbaf..d87495166e5c2b 100644 --- a/.gitattributes +++ b/.gitattributes @@ -2,3 +2,4 @@ .circleci/config.yml linguist-generated=true .github/workflows/generated-*.yml linguist-generated=true .github/generated-* linguist-generated=true +.github/scripts/gql_mocks.json linguist-generated=true diff --git a/.github/actions/build-android/action.yml b/.github/actions/build-android/action.yml new file mode 100644 index 00000000000000..2493bb3a76066a --- /dev/null +++ b/.github/actions/build-android/action.yml @@ -0,0 +1,82 @@ +name: build android + +description: build android for a specific arch + +inputs: + arch: + description: arch to build + required: true + arch-for-build-env: + description: | + arch to pass to build environment. + This is currently different than the arch name we use elswhere, which + should be fixed. 
+ required: true + github-secret: + description: github token + required: true + build-environment: + required: true + description: Top-level label for what's being built/tested. + docker-image: + required: true + description: Name of the base docker image to build with. + branch: + required: true + description: What branch we are building on. +outputs: + container_id: + description: Docker container identifier used to build the artifacts + value: ${{ steps.build.outputs.container_id }} + +runs: + using: composite + steps: + - name: Build-${{ inputs.arch }} + id: build + shell: bash + env: + BRANCH: ${{ inputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-build-and-test + BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-${{ inputs.arch-for-build-env }}-build + AWS_DEFAULT_REGION: us-east-1 + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + DOCKER_IMAGE: ${{ inputs.docker-image }} + MATRIX_ARCH: ${{ inputs.arch }} + run: | + # detached container should get cleaned up by teardown_ec2_linux + set -exo pipefail + export container_name + container_name=$(docker run \ + -e BUILD_ENVIRONMENT \ + -e JOB_BASE_NAME \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e AWS_DEFAULT_REGION \ + -e IS_GHA \ + -e PR_NUMBER \ + -e SHA1 \ + -e BRANCH \ + -e GITHUB_RUN_ID \ + -e SCCACHE_BUCKET \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e SKIP_SCCACHE_INITIALIZATION=1 \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --tty \ + --detach \ + --user jenkins \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 + docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" + (echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete" | docker exec -u jenkins -i "${container_name}" bash) 2>&1 + + # Copy install binaries back + mkdir -p "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}" + docker cp "${container_name}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}" + echo "::set-output name=container_id::${container_name}" diff --git a/.github/actions/calculate-docker-image/action.yml b/.github/actions/calculate-docker-image/action.yml new file mode 100644 index 00000000000000..d32179ac78a7d8 --- /dev/null +++ b/.github/actions/calculate-docker-image/action.yml @@ -0,0 +1,93 @@ +name: Calculate docker image + +description: Determine docker image to pull, building a new one if necessary. + +inputs: + docker-image-name: + description: The name of a docker image, like `pytorch-linux-xenial-py3.7-gcc7` + required: true + xla: + description: | + Whether or not to use a pre-built XLA docker image. + Note that this is a string, either "true" or "false" due to GHA limitations. + required: false + always-rebuild: + description: If set to any value, always build a fresh docker image. + required: false + pull: + description: If set to any value, run `docker pull` on the calculated image.
+ required: false + +outputs: + docker-image: + description: The docker image to use for the rest of the workflow + value: ${{ steps.calculate-tag.outputs.docker-image }} + +runs: + using: composite + steps: + - name: Calculate docker image tag + shell: bash + id: calculate-tag + env: + IS_XLA: ${{ inputs.xla == 'true' && 'true' || '' }} + XLA_IMAGE_TAG: v0.2 + DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ inputs.docker-image-name }} + run: | + if [ -n "${IS_XLA}" ]; then + echo "XLA workflow uses pre-built test image at ${XLA_IMAGE_TAG}" + DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) + echo "::set-output name=docker-tag::${DOCKER_TAG}" + echo "::set-output name=docker-image::${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" + else + DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) + echo "::set-output name=docker-tag::${DOCKER_TAG}" + echo "::set-output name=docker-image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" + fi + + - name: Check if image should be built + shell: bash + id: check + if: ${{ !inputs.always-rebuild }} + env: + BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} + DOCKER_IMAGE: ${{ steps.calculate-tag.outputs.docker-image }} + DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker-tag }} + run: | + set -x + # Check if image already exists, if it does then skip building it + if docker manifest inspect "${DOCKER_IMAGE}"; then + exit 0 + fi + if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then + # if we're on the base branch then use the parent commit + MERGE_BASE=$(git rev-parse HEAD~) + else + # otherwise we're on a PR, so use the most recent base commit + MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") + fi + # Covers the case where a previous tag doesn't exist for the tree + # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly + if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then + echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" + exit 1 + fi + PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") + # If no image exists but the hash is the same as the previous hash then we should error out here + if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then + echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" + echo " contact the PyTorch team to restore the original images" + exit 1 + fi + echo ::set-output name=rebuild::yes + + - name: Build and push docker image + if: inputs.always-rebuild || steps.check.outputs.rebuild + env: + IMAGE_NAME: ${{inputs.docker-image-name}} + DOCKER_SKIP_S3_UPLOAD: "1" + DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker-tag }} + working-directory: .circleci/docker + shell: bash + run: | + ./build_docker.sh diff --git a/.github/actions/checkout-pytorch/action.yml b/.github/actions/checkout-pytorch/action.yml new file mode 100644 index 00000000000000..eb1b728467f8f5 --- /dev/null +++ b/.github/actions/checkout-pytorch/action.yml @@ -0,0 +1,32 @@ +name: Checkout PyTorch + +description: Clean workspace and check out PyTorch + +inputs: + no-sudo: + description: If set to any value, don't use sudo to clean the workspace + required: false + +runs: + using: composite + steps: + - name: Clean workspace + shell: bash + env: + NO_SUDO: ${{ inputs.no-sudo }} + run: | + echo "${GITHUB_WORKSPACE}" + if [ -z "${NO_SUDO}" ]; then + sudo rm -rf "${GITHUB_WORKSPACE}" + else + rm -rf "${GITHUB_WORKSPACE}" + fi + mkdir "${GITHUB_WORKSPACE}" + + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + # deep clone, to allow use of git merge-base + fetch-depth: 0 + submodules: recursive diff --git a/.github/actions/chown-workspace/action.yml b/.github/actions/chown-workspace/action.yml new file mode 100644 index 00000000000000..6adc6cdc217db4 --- /dev/null +++ b/.github/actions/chown-workspace/action.yml @@ -0,0 +1,11 @@ +name: Chown workspace + +description: Ensure that the working directory gets chowned back to the current user + +runs: + using: composite + steps: + - run: docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + shell: bash + env: + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" diff --git a/.github/actions/download-build-artifacts/action.yml b/.github/actions/download-build-artifacts/action.yml new file mode 100644 index 00000000000000..a3c9444c1b98fb --- /dev/null +++ b/.github/actions/download-build-artifacts/action.yml @@ -0,0 +1,34 @@ +name: Download PyTorch Build Artifacts + +description: Download and unzip artifacts from a previous PyTorch build. + +inputs: + name: + description: Name of what artifact to download + required: true + use-gha: + description: If set to any value, use GHA to download the artifact. Otherwise use s3. 
+ required: false + +runs: + using: composite + steps: + - name: Download PyTorch Build Artifacts from S3 + if: ${{ !inputs.use-gha }} + uses: seemethere/download-artifact-s3@v3 + with: + name: ${{ inputs.name }} + + - name: Download PyTorch Build Artifacts from GHA + if: inputs.use-gha + uses: actions/download-artifact@v2 + with: + name: ${{ inputs.name }} + + - name: Unzip artifacts + shell: bash + run: unzip -o artifacts.zip + + - name: Output disk space left + shell: bash + run: df -H diff --git a/.github/actions/get-workflow-job-id/action.yml b/.github/actions/get-workflow-job-id/action.yml new file mode 100644 index 00000000000000..c7ca1e07d6bec8 --- /dev/null +++ b/.github/actions/get-workflow-job-id/action.yml @@ -0,0 +1,31 @@ +name: Get workflow job id + +description: Get the ID of the workflow job that is currently running. + +inputs: + github-token: + description: GITHUB_TOKEN + required: true + +outputs: + job-id: + description: The retrieved workflow job id + value: ${{ steps.get-job-id.outputs.job-id }} + +runs: + using: composite + steps: + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + id: get-job-id + env: + GITHUB_TOKEN: ${{ inputs.github-token }} + with: + shell: bash + timeout_minutes: 10 + max_attempts: 5 + retry_wait_seconds: 30 + command: | + set -x + python3 -m pip install requests==2.26.0 + GHA_WORKFLOW_JOB_ID=$(python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}") + echo "::set-output name=job-id::${GHA_WORKFLOW_JOB_ID}" diff --git a/.github/actions/pull-docker-image/action.yml b/.github/actions/pull-docker-image/action.yml new file mode 100644 index 00000000000000..ad1cc1baf9d3dc --- /dev/null +++ b/.github/actions/pull-docker-image/action.yml @@ -0,0 +1,19 @@ +name: Pull docker image + +description: pull a specific docker image + +inputs: + docker-image: + description: the image to pull + required: true + +runs: + using: composite + steps: + - name: Pull Docker image + shell: bash + env: + DOCKER_IMAGE: ${{ inputs.docker-image }} + run: | + retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } + retry docker pull "${DOCKER_IMAGE}" diff --git a/.github/actions/setup-linux/action.yml b/.github/actions/setup-linux/action.yml new file mode 100644 index 00000000000000..d7500f11de7d63 --- /dev/null +++ b/.github/actions/setup-linux/action.yml @@ -0,0 +1,47 @@ +name: Setup Linux + +description: Set up Docker workspace on EC2 + +runs: + using: composite + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + + - name: Start docker if docker daemon is not running + shell: bash + run: | + if systemctl is-active --quiet docker; then + echo "Docker daemon is running..."; + else + echo "Starting docker daemon..."
&& sudo systemctl start docker; + fi + + - name: Log in to ECR + shell: bash + env: + AWS_RETRY_MODE: standard + AWS_MAX_ATTEMPTS: "5" + AWS_DEFAULT_REGION: us-east-1 + run: | + AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") + retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } + retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ + --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + + - name: Preserve github env variables for use in docker + shell: bash + run: | + env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" diff --git a/.github/actions/setup-rocm/action.yml b/.github/actions/setup-rocm/action.yml new file mode 100644 index 00000000000000..d261a557802919 --- /dev/null +++ b/.github/actions/setup-rocm/action.yml @@ -0,0 +1,57 @@ +name: Setup ROCm host + +description: Set up ROCm host for CI + +runs: + using: composite + steps: + - name: Set DOCKER_HOST + shell: bash + run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}" + + - name: Runner health check system info + if: always() + shell: bash + run: | + cat /etc/os-release || true + cat /etc/apt/sources.list.d/rocm.list || true + cat /opt/rocm/.info/version || true + whoami + + - name: Runner health check rocm-smi + if: always() + shell: bash + run: | + rocm-smi + + - name: Runner health check rocminfo + if: always() + shell: bash + run: | + rocminfo + + - name: Runner health check GPU count + if: always() + shell: bash + run: | + ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') + if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then + echo "Failed to detect GPUs on the runner" + exit 1 + fi + + - name: Runner health check disconnect on failure + if: ${{ failure() }} + shell: bash + run: | + killall runsvc.sh + + - name: Preserve github env variables for use in docker + shell: bash + run: | + env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" + + - name: ROCm set GPU_FLAG + shell: bash + run: | + echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" diff --git a/.github/actions/setup-ssh/action.yml b/.github/actions/setup-ssh/action.yml new file mode 100644 index 00000000000000..9daed4a5f9734f --- /dev/null +++ b/.github/actions/setup-ssh/action.yml @@ -0,0 +1,16 @@ +name: Setup SSH + +description: Adds ssh keys for current user to machine + +inputs: + github-secret: + description: GitHub token + required: true + +runs: + using: composite + steps: + - name: "Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ inputs.github-secret }} diff --git a/.github/actions/setup-win/action.yml b/.github/actions/setup-win/action.yml new file mode 100644 index 00000000000000..12f287b230898a --- /dev/null +++ b/.github/actions/setup-win/action.yml @@ -0,0 +1,60 @@ +name: Setup Windows + +description: Set up for windows jobs + +inputs: + cuda-version: + description: which cuda version to install, 'cpu' for none + required: true + +runs: + using: composite + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata 
instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + + - name: Install Visual Studio 2019 toolchain + shell: powershell + env: + VS_VERSION: "16.8.6" + INSTALL_WINDOWS_SDK: "1" + run: | + .\.circleci\scripts\vs_install.ps1 + + - name: Install CUDA and CUDNN + shell: bash + if: inputs.cuda-version != 'cpu' + env: + CUDA_VERSION: ${{ inputs.cuda-version }} + run: | + .circleci/scripts/windows_cuda_install.sh + .circleci/scripts/windows_cudnn_install.sh + + - name: Setup Python3 + uses: actions/setup-python@v2 + with: + python-version: "3.x" diff --git a/.github/actions/teardown-linux/action.yml b/.github/actions/teardown-linux/action.yml new file mode 100644 index 00000000000000..9238a073a6b621 --- /dev/null +++ b/.github/actions/teardown-linux/action.yml @@ -0,0 +1,28 @@ +name: Teardown Linux + +description: Stuff that should always run at the end of a linux job + +inputs: + skip-wait-ssh: + description: If set, don't wait for ssh to drain before tearing down + required: false + default: "" + +runs: + using: composite + steps: + - name: Hold runner for 2 hours or until ssh sessions have drained + # TODO working-directory: !{{ pytorch_directory }} + # Always hold for active ssh sessions + shell: bash + if: inputs.skip-wait-ssh == '' + run: .github/scripts/wait_for_ssh_to_drain.sh + + - name: Kill containers, clean up images + shell: bash + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/actions/teardown-win/action.yml b/.github/actions/teardown-win/action.yml new file mode 100644 index 00000000000000..49c509444e095a --- /dev/null +++ b/.github/actions/teardown-win/action.yml @@ -0,0 +1,33 @@ +name: Teardown Windows + +description: Set up Docker workspace on linux + +inputs: + extra-delete-dir: + description: If set, cleaning up the workspace will delete this too + required: false + default: "" + +runs: + using: composite + steps: + - name: Wait until all sessions have drained + shell: powershell + if: always() + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + + - name: Cleanup workspace + if: always() + shell: bash + env: + EXTRA_DELETE_DIR: ${{ inputs.extra-delete-dir }} + run: | + [ ! 
-z "${EXTRA_DELETE_DIR}" ] || rm -rf "${EXTRA_DELETE_DIR}" + rm -rf ./* diff --git a/.github/actions/upload-test-artifacts/action.yml b/.github/actions/upload-test-artifacts/action.yml new file mode 100644 index 00000000000000..7a00a377fca41f --- /dev/null +++ b/.github/actions/upload-test-artifacts/action.yml @@ -0,0 +1,94 @@ +name: Upload test artifacts + +description: Upload various artifacts produced by our testing process + +inputs: + use-gha: + description: If set to any value, upload GHA. Otherwise upload to S3. + required: false + file-suffix: + description: | + Suffix to add to the filename of the artifacts. This should include the + workflow job id, see [Job id in artifacts]. + required: true + +runs: + using: composite + steps: + # Mac/Linux zip + - name: Zip JSONs for upload + if: runner.os != 'Windows' && !inputs.use-gha + shell: bash + env: + FILE_SUFFIX: ${{ inputs.file-suffix }} + run: | + # Remove any previous test jsons if they exist + rm -f test-jsons-*.zip + zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' + + - name: Zip test reports for upload + if: runner.os != 'Windows' && !inputs.use-gha + shell: bash + env: + FILE_SUFFIX: ${{ inputs.file-suffix }} + run: | + # Remove any previous test reports if they exist + rm -f test-reports-*.zip + zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' + + # Windows zip + - name: Zip JSONs for upload + if: runner.os == 'Windows' && !inputs.use-gha + shell: powershell + env: + FILE_SUFFIX: ${{ inputs.file-suffix }} + run: | + # -ir => recursive include all files in pattern + 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' + + - name: Zip test reports for upload + if: runner.os == 'Windows' && !inputs.use-gha + shell: powershell + env: + FILE_SUFFIX: ${{ inputs.file-suffix }} + run: | + # -ir => recursive include all files in pattern + 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' + + # S3 upload + - name: Store Test Downloaded JSONs on S3 + uses: seemethere/upload-artifact-s3@v4 + if: ${{ !inputs.use-gha }} + with: + retention-days: 14 + if-no-files-found: warn + path: test-jsons-*.zip + + - name: Store Test Reports on S3 + uses: seemethere/upload-artifact-s3@v4 + if: ${{ !inputs.use-gha }} + with: + retention-days: 14 + if-no-files-found: error + path: test-reports-*.zip + + # GHA upload + - name: Store Test Downloaded JSONs on Github + uses: actions/upload-artifact@v2 + if: inputs.use-gha + with: + # Add the run attempt, see [Artifact run attempt] + name: test-jsons-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip + retention-days: 14 + if-no-files-found: warn + path: test/**/*.json + + - name: Store Test Reports on Github + uses: actions/upload-artifact@v2 + if: inputs.use-gha + with: + # Add the run attempt, see [Artifact run attempt] + name: test-reports-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip + retention-days: 14 + if-no-files-found: error + path: test/**/*.xml diff --git a/.github/generated-ciflow-ruleset.json b/.github/generated-ciflow-ruleset.json deleted file mode 100644 index 3625512b7a804e..00000000000000 --- a/.github/generated-ciflow-ruleset.json +++ /dev/null @@ -1,298 +0,0 @@ -{ - "__comment": "@generated DO NOT EDIT MANUALLY, Generation script: .github/scripts/generate_ci_workflows.py", - "label_rules": { - "ciflow/all": [ - "caffe2-linux-xenial-py3.7-gcc5.4", - "docker-builds", - "ios-12-5-1-arm64", - "ios-12-5-1-arm64-coreml", - "ios-12-5-1-arm64-custom-ops", - "ios-12-5-1-arm64-metal", - "ios-12-5-1-x86-64", - "ios-12-5-1-x86-64-coreml", 
- "libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - "libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - "linux-bionic-cuda10.2-py3.9-gcc7", - "linux-bionic-py3.7-clang9", - "linux-bionic-rocm4.5-py3.7", - "linux-docs", - "linux-docs-push", - "linux-vulkan-bionic-py3.7-clang9", - "linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - "linux-xenial-cuda11.3-py3.7-gcc7-no-ops", - "linux-xenial-py3-clang5-mobile-build", - "linux-xenial-py3-clang5-mobile-custom-build-static", - "linux-xenial-py3.7-clang7-asan", - "linux-xenial-py3.7-clang7-onnx", - "linux-xenial-py3.7-gcc5.4", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "linux-xenial-py3.7-gcc7", - "linux-xenial-py3.7-gcc7-no-ops", - "macos-10-15-py3-arm64", - "macos-10-15-py3-lite-interpreter-x86-64", - "macos-11-py3-x86-64", - "parallelnative-linux-xenial-py3.7-gcc5.4", - "periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck", - "periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug", - "periodic-win-vs2019-cuda11.5-py3", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - "pytorch-xla-linux-bionic-py3.7-clang8", - "win-vs2019-cpu-py3", - "win-vs2019-cuda11.3-py3" - ], - "ciflow/android": [ - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" - ], - "ciflow/bazel": [ - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" - ], - "ciflow/binaries": [ - "linux-binary-conda", - "linux-binary-libtorch-cxx11-abi", - "linux-binary-libtorch-pre-cxx11", - "linux-binary-manywheel", - "macos-arm64-binary-conda", - "macos-arm64-binary-wheel", - "macos-binary-conda", - "macos-binary-libtorch-cxx11-abi", - "macos-binary-libtorch-pre-cxx11", - "macos-binary-wheel", - "windows-binary-libtorch-debug", - "windows-binary-libtorch-release", - "windows-binary-wheel" - ], - "ciflow/binaries_conda": [ - "linux-binary-conda", - "macos-arm64-binary-conda", - "macos-binary-conda" - ], - "ciflow/binaries_libtorch": [ - "linux-binary-libtorch-cxx11-abi", - "linux-binary-libtorch-pre-cxx11", - "macos-binary-libtorch-cxx11-abi", - "macos-binary-libtorch-pre-cxx11", - "windows-binary-libtorch-debug", - "windows-binary-libtorch-release" - ], - "ciflow/binaries_wheel": [ - "linux-binary-manywheel", - "macos-arm64-binary-wheel", - "macos-binary-wheel", - "windows-binary-wheel" - ], - "ciflow/cpu": [ - "caffe2-linux-xenial-py3.7-gcc5.4", - "linux-bionic-py3.7-clang9", - "linux-docs", - "linux-docs-push", - "linux-vulkan-bionic-py3.7-clang9", - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - "linux-xenial-py3.7-clang7-asan", - "linux-xenial-py3.7-clang7-onnx", - "linux-xenial-py3.7-gcc5.4", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "linux-xenial-py3.7-gcc7", - "linux-xenial-py3.7-gcc7-no-ops", - "parallelnative-linux-xenial-py3.7-gcc5.4", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - "pytorch-xla-linux-bionic-py3.7-clang8", - "win-vs2019-cpu-py3" - ], - "ciflow/cuda": [ - 
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - "libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - "linux-bionic-cuda10.2-py3.9-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7-no-ops", - "periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck", - "periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug", - "periodic-win-vs2019-cuda11.5-py3", - "win-vs2019-cuda11.3-py3" - ], - "ciflow/default": [ - "linux-binary-conda", - "linux-binary-libtorch-cxx11-abi", - "linux-binary-libtorch-pre-cxx11", - "linux-binary-manywheel", - "linux-bionic-py3.7-clang9", - "linux-bionic-rocm4.5-py3.7", - "linux-docs", - "linux-vulkan-bionic-py3.7-clang9", - "linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - "linux-xenial-py3-clang5-mobile-build", - "linux-xenial-py3-clang5-mobile-custom-build-static", - "linux-xenial-py3.7-clang7-asan", - "linux-xenial-py3.7-clang7-onnx", - "linux-xenial-py3.7-gcc5.4", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "linux-xenial-py3.7-gcc7", - "linux-xenial-py3.7-gcc7-no-ops", - "macos-arm64-binary-conda", - "macos-arm64-binary-wheel", - "macos-binary-conda", - "macos-binary-libtorch-cxx11-abi", - "macos-binary-libtorch-pre-cxx11", - "macos-binary-wheel", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - "win-vs2019-cpu-py3", - "win-vs2019-cuda11.3-py3", - "windows-binary-libtorch-debug", - "windows-binary-libtorch-release", - "windows-binary-wheel" - ], - "ciflow/docs": [ - "linux-docs" - ], - "ciflow/ios": [ - "ios-12-5-1-arm64", - "ios-12-5-1-arm64-coreml", - "ios-12-5-1-arm64-custom-ops", - "ios-12-5-1-arm64-metal", - "ios-12-5-1-x86-64", - "ios-12-5-1-x86-64-coreml" - ], - "ciflow/libtorch": [ - "libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - "libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7" - ], - "ciflow/linux": [ - "caffe2-linux-xenial-py3.7-gcc5.4", - "libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - "libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - "linux-bionic-cuda10.2-py3.9-gcc7", - "linux-bionic-py3.7-clang9", - "linux-bionic-rocm4.5-py3.7", - "linux-docs", - "linux-docs-push", - "linux-vulkan-bionic-py3.7-clang9", - "linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - "linux-xenial-cuda11.3-py3.7-gcc7-no-ops", - "linux-xenial-py3-clang5-mobile-build", - "linux-xenial-py3-clang5-mobile-custom-build-static", - "linux-xenial-py3.7-clang7-asan", - "linux-xenial-py3.7-clang7-onnx", - "linux-xenial-py3.7-gcc5.4", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "linux-xenial-py3.7-gcc7", - "linux-xenial-py3.7-gcc7-no-ops", - "parallelnative-linux-xenial-py3.7-gcc5.4", - "periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck", - "periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - "pytorch-xla-linux-bionic-py3.7-clang8" - ], - "ciflow/macos": [ - "ios-12-5-1-arm64", - 
"ios-12-5-1-arm64-coreml", - "ios-12-5-1-arm64-custom-ops", - "ios-12-5-1-arm64-metal", - "ios-12-5-1-x86-64", - "ios-12-5-1-x86-64-coreml", - "macos-10-15-py3-arm64", - "macos-10-15-py3-lite-interpreter-x86-64", - "macos-11-py3-x86-64" - ], - "ciflow/mobile": [ - "linux-xenial-py3-clang5-mobile-build", - "linux-xenial-py3-clang5-mobile-custom-build-static", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" - ], - "ciflow/noarch": [ - "linux-bionic-py3.7-clang9" - ], - "ciflow/onnx": [ - "linux-xenial-py3.7-clang7-onnx" - ], - "ciflow/rocm": [ - "linux-bionic-rocm4.5-py3.7" - ], - "ciflow/sanitizers": [ - "linux-xenial-py3.7-clang7-asan" - ], - "ciflow/scheduled": [ - "ios-12-5-1-arm64", - "ios-12-5-1-arm64-coreml", - "ios-12-5-1-arm64-custom-ops", - "ios-12-5-1-arm64-metal", - "linux-docs-push", - "periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-bionic-cuda11.5-py3.7-gcc7", - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck", - "periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug", - "periodic-win-vs2019-cuda11.5-py3" - ], - "ciflow/slow": [ - "linux-bionic-cuda10.2-py3.9-gcc7", - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck" - ], - "ciflow/slow-gradcheck": [ - "periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck" - ], - "ciflow/trunk": [ - "caffe2-linux-xenial-py3.7-gcc5.4", - "docker-builds", - "ios-12-5-1-x86-64", - "ios-12-5-1-x86-64-coreml", - "libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - "libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - "linux-bionic-cuda10.2-py3.9-gcc7", - "linux-bionic-py3.7-clang9", - "linux-bionic-rocm4.5-py3.7", - "linux-docs", - "linux-vulkan-bionic-py3.7-clang9", - "linux-xenial-cuda11.3-py3.7-gcc7", - "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - "linux-xenial-cuda11.3-py3.7-gcc7-no-ops", - "linux-xenial-py3-clang5-mobile-build", - "linux-xenial-py3-clang5-mobile-custom-build-static", - "linux-xenial-py3.7-clang7-asan", - "linux-xenial-py3.7-clang7-onnx", - "linux-xenial-py3.7-gcc5.4", - "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - "linux-xenial-py3.7-gcc7", - "linux-xenial-py3.7-gcc7-no-ops", - "macos-10-15-py3-arm64", - "macos-10-15-py3-lite-interpreter-x86-64", - "macos-11-py3-x86-64", - "parallelnative-linux-xenial-py3.7-gcc5.4", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - "pytorch-xla-linux-bionic-py3.7-clang8", - "win-vs2019-cpu-py3", - "win-vs2019-cuda11.3-py3" - ], - "ciflow/vulkan": [ - "linux-vulkan-bionic-py3.7-clang9" - ], - "ciflow/win": [ - "periodic-win-vs2019-cuda11.5-py3", - "win-vs2019-cpu-py3", - "win-vs2019-cuda11.3-py3" - ], - "ciflow/xla": [ - "pytorch-xla-linux-bionic-py3.7-clang8" - ] - }, - "version": "v1" -} diff --git a/.github/merge_rules.json b/.github/merge_rules.json index dded4737f40509..56268e5381618a 100644 --- a/.github/merge_rules.json +++ b/.github/merge_rules.json @@ -2,54 +2,49 @@ { "name": "ONNX exporter", "patterns": [ - "torch/onnx/**", - "torch/csrc/jit/passes/onnx/**", - "torch/csrc/jit/passes/onnx.*", - "test/onnx/**", + ".jenkins/caffe2/*", "docs/source/onnx.rst", + "test/onnx/**", + "tools/onnx/**", + "torch/_C/__init__.pyi.in", + "torch/csrc/jit/passes/onnx.*", + "torch/csrc/jit/passes/onnx/**", "torch/csrc/jit/serialization/export.*", "torch/csrc/jit/serialization/onnx.*", - "torch/_C/__init__.pyi.in", "torch/csrc/onnx/**", - ".jenkins/caffe2/*" + 
"torch/onnx/**" ], "approved_by": ["BowenBao", "garymm"], - "mandatory_app_id": 12274 + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] }, { "name": "NVFuser", "patterns": ["torch/csrc/jit/codegen/fuser/cuda/**", "torch/csrc/jit/codegen/cuda/**", "benchmarks/cpp/nvfuser/**"], "approved_by": ["csarofeen", "ngimel"], - "mandatory_app_id": 12274 + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] }, { "name": "OSS CI", "patterns": [".github/**", ".circleci/**", ".jenkins/**", "scripts/**", "tools/**"], - "approved_by": ["janeyx99", "ezyang"], - "mandatory_app_id": 12274 + "approved_by": ["ezyang", "pytorch/pytorch-dev-infra"], + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] }, { "name": "Documentation", "patterns": ["docs/**", "torch/*docs.py"], "approved_by": ["mruberry", "ngimel", "janeyx99"], - "mandatory_app_id": 12274 - }, - { - "name": "Android", - "patterns": ["android/**"], - "approved_by": ["linbinyu", "kit1980", "IvanKobzarev"], - "mandatory_app_id": 12274 + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] }, { - "name": "iOS", - "patterns": ["ios/**"], - "approved_by": ["linbinyu", "kit1980", "xta0", "hanton"], - "mandatory_app_id": 12274 + "name": "Mobile", + "patterns": ["ios/**", "android/**", "test/mobile/**"], + "approved_by": ["linbinyu", "kit1980", "IvanKobzarev", "dreiss"], + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] }, { "name": "superuser", "patterns": ["*"], - "approved_by": ["albanD", "jbschlosser", "suo", "osalpekar", "malfet", "seemethere", "ezyang"], - "mandatory_app_id": 12274 + "approved_by": ["pytorch/metamates"], + "mandatory_checks_name": ["Facebook CLA Check", "Lint"] } ] diff --git a/.github/scale-config.yml b/.github/scale-config.yml index 0670ed9598ae63..213a9942ff9071 100644 --- a/.github/scale-config.yml +++ b/.github/scale-config.yml @@ -30,7 +30,7 @@ runner_types: linux.2xlarge: instance_type: c5.2xlarge os: linux - max_available: 500 + max_available: 750 disk_size: 150 is_ephemeral: false linux.4xlarge: # for binary-builds diff --git a/.github/scripts/README.md b/.github/scripts/README.md new file mode 100644 index 00000000000000..22099c3732ea53 --- /dev/null +++ b/.github/scripts/README.md @@ -0,0 +1,58 @@ +# pytorch/.github + +> NOTE: This README contains information for the `.github` directory but cannot be located there because it will overwrite the +repo README. + +This directory contains workflows and scripts to support our CI infrastructure that runs on Github Actions. + +## Workflows + +- Pull CI (`pull.yml`) is run on PRs and on master. +- Trunk CI (`trunk.yml`) is run on trunk to validate incoming commits. Trunk jobs are usually more expensive to run so we do not run them on PRs unless specified. +- Scheduled CI (`periodic.yml`) is a subset of trunk CI that is run every few hours on master. +- Binary CI is run to package binaries for distribution for all platforms. + +## Templates + +Templates written in [Jinja](https://jinja.palletsprojects.com/en/3.0.x/) are located in the `.github/templates` directory +and used to generate workflow files for binary jobs found in the `.github/workflows/` directory. These are also a +couple of utility templates used to discern common utilities that can be used amongst different templates. 
+ +### (Re)Generating workflow files + +You will need `jinja2` in order to regenerate the workflow files; it can be installed using: +```bash +pip install -r .github/requirements.txt +``` + +Workflows can be generated / regenerated using the following command: +```bash +.github/regenerate.sh +``` + +### Adding a new generated binary workflow + +New generated binary workflows can be added in the `.github/scripts/generate_ci_workflows.py` script. You can reference +examples from that script in order to add the workflow to the stream that is relevant to what you +care about. + +Different parameters can be used to achieve different goals, e.g. running jobs on a cron, running only on trunk, etc. + +#### ciflow (trunk) + +The label `ciflow/trunk` can be used to run `trunk`-only workflows. This is especially useful if trying to re-land a PR that was +reverted for failing a `non-default` workflow. + +## Infra + +Currently most of our self-hosted runners are hosted on AWS; for a comprehensive list of available runner types you +can reference `.github/scale-config.yml`. + +Exceptions to AWS for self-hosted: +* ROCm runners + +### Adding new runner types + +New runner types can be added by committing changes to `.github/scale-config.yml`. Example: https://github.com/pytorch/pytorch/pull/70474 + +> NOTE: New runner types can only be used once the changes to `.github/scale-config.yml` have made their way into the default branch diff --git a/.github/scripts/build_publish_nightly_docker.sh b/.github/scripts/build_publish_nightly_docker.sh index 3e953db88b891d..db84704aa3e4c8 100644 --- a/.github/scripts/build_publish_nightly_docker.sh +++ b/.github/scripts/build_publish_nightly_docker.sh @@ -1,9 +1,9 @@ -#!/bin/sh +#!/usr/bin/env bash set -xeuo pipefail PYTORCH_DOCKER_TAG=$(git describe --tags --always)-devel -CUDA_VERSION=11.3 +CUDA_VERSION=11.3.1 # Build PyTorch nightly docker make -f docker.Makefile \
filename.with_suffix("").name.replace("_", "-") - if workflow_name.startswith("generated-"): - workflow_name = workflow_name[len("generated-"):] - return f"{workflow_name}-${{{{ github.event.pull_request.number || github.sha }}}}" \ - "-${{ github.event_name == 'workflow_dispatch' }}" +EXPECTED_GROUP = "${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}" \ + "-${{ github.event_name == 'workflow_dispatch' }}" def should_check(filename: Path) -> bool: @@ -38,12 +32,19 @@ def should_check(filename: Path) -> bool: errors_found = False files = [f for f in files if should_check(f)] + names = set() for filename in files: with open(filename, "r") as f: data = yaml.safe_load(f) + name = data.get("name") + if name is not None and name in names: + print("ERROR: duplicate workflow name:", name, file=sys.stderr) + errors_found = True + names.add(name) + expected = { - "group": concurrency_key(filename), + "group": EXPECTED_GROUP, "cancel-in-progress": True, } actual = data.get("concurrency", None) diff --git a/.github/scripts/generate_binary_build_matrix.py b/.github/scripts/generate_binary_build_matrix.py index 90e509d87c2762..b4476092b71a9e 100644 --- a/.github/scripts/generate_binary_build_matrix.py +++ b/.github/scripts/generate_binary_build_matrix.py @@ -10,10 +10,10 @@ * Latest ROCM """ -from typing import Dict, List, Tuple +from typing import Dict, List, Tuple, Optional -CUDA_ARCHES = ["10.2", "11.3", "11.5"] +CUDA_ARCHES = ["10.2", "11.3", "11.5", "11.6"] ROCM_ARCHES = ["4.5.2", "5.0"] @@ -59,6 +59,14 @@ def arch_type(arch_version: str) -> str: (gpu_arch, CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cuda{gpu_arch}" for gpu_arch in CUDA_ARCHES }, + **{ + (gpu_arch, PRE_CXX11_ABI): f"pytorch/manylinux-builder:rocm{gpu_arch}" + for gpu_arch in ROCM_ARCHES + }, + **{ + (gpu_arch, CXX11_ABI): f"pytorch/libtorch-cxx11-builder:rocm{gpu_arch}" + for gpu_arch in ROCM_ARCHES + }, ("cpu", PRE_CXX11_ABI): "pytorch/manylinux-builder:cpu", ("cpu", CXX11_ABI): "pytorch/libtorch-cxx11-builder:cpu", } @@ -112,23 +120,29 @@ def generate_conda_matrix(os: str) -> List[Dict[str, str]]: return ret -def generate_libtorch_matrix(os: str, abi_version: str) -> List[Dict[str, str]]: - libtorch_variants = [ - "shared-with-deps", - "shared-without-deps", - "static-with-deps", - "static-without-deps", - ] +def generate_libtorch_matrix(os: str, abi_version: str, + arches: Optional[List[str]] = None, + libtorch_variants: Optional[List[str]] = None) -> List[Dict[str, str]]: + if arches is None: + arches = ["cpu"] + if os == "linux": + arches += CUDA_ARCHES + arches += ROCM_ARCHES + elif os == "windows": + # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 + arches += list_without(CUDA_ARCHES, ["10.2"]) + + if libtorch_variants is None: + libtorch_variants = [ + "shared-with-deps", + "shared-without-deps", + "static-with-deps", + "static-without-deps", + ] + ret: List[Dict[str, str]] = [] - arches = ["cpu"] - if os == "linux": - arches += CUDA_ARCHES - elif os == "windows": - # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 - arches += list_without(CUDA_ARCHES, ["10.2"]) for arch_version in arches: for libtorch_variant in libtorch_variants: - # We don't currently build libtorch for rocm # one of the values in the following list must be exactly # CXX11_ABI, but the precise value of the other one doesn't # matter @@ -156,19 +170,29 @@ def generate_libtorch_matrix(os: str, abi_version: str) -> List[Dict[str, str]]: return ret 
-def generate_wheels_matrix(os: str) -> List[Dict[str, str]]: - arches = ["cpu"] +def generate_wheels_matrix(os: str, + arches: Optional[List[str]] = None, + python_versions: Optional[List[str]] = None) -> List[Dict[str, str]]: package_type = "wheel" - python_versions = FULL_PYTHON_VERSIONS if os == "linux": - arches += CUDA_ARCHES + ROCM_ARCHES # NOTE: We only build manywheel packages for linux package_type = "manywheel" - elif os == "windows": - # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 - arches += list_without(CUDA_ARCHES, ["10.2"]) - elif os == "macos-arm64": - python_versions = list_without(python_versions, ["3.7"]) + + if python_versions is None: + # Define default python version + python_versions = FULL_PYTHON_VERSIONS + if os == "macos-arm64": + python_versions = list_without(python_versions, ["3.7"]) + + if arches is None: + # Define default compute archivectures + arches = ["cpu"] + if os == "linux": + arches += CUDA_ARCHES + ROCM_ARCHES + elif os == "windows": + # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 + arches += list_without(CUDA_ARCHES, ["10.2"]) + ret: List[Dict[str, str]] = [] for python_version in python_versions: for arch_version in arches: diff --git a/.github/scripts/generate_ci_workflows.py b/.github/scripts/generate_ci_workflows.py index dab955d3596e57..c8b815bf018036 100755 --- a/.github/scripts/generate_ci_workflows.py +++ b/.github/scripts/generate_ci_workflows.py @@ -2,10 +2,10 @@ from dataclasses import asdict, dataclass, field from pathlib import Path -from typing import Dict, Set, List, Iterable, Any +from typing import Dict, Set, List, Iterable import jinja2 -import json + import os import sys from typing_extensions import Literal, TypedDict @@ -14,88 +14,15 @@ Arch = Literal["windows", "linux", "macos"] -DOCKER_REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com" GITHUB_DIR = Path(__file__).resolve().parent.parent -WINDOWS_CPU_TEST_RUNNER = "windows.4xlarge" -# contains 1 gpu -WINDOWS_CUDA_TEST_RUNNER = "windows.8xlarge.nvidia.gpu" -WINDOWS_RUNNERS = { - WINDOWS_CPU_TEST_RUNNER, - WINDOWS_CUDA_TEST_RUNNER, -} - -LINUX_CPU_TEST_RUNNER = "linux.2xlarge" -# contains 1 gpu -LINUX_CUDA_TEST_RUNNER = "linux.4xlarge.nvidia.gpu" -# contains at least 2 gpus -LINUX_ROCM_TEST_RUNNER = "linux.rocm.gpu" -LINUX_RUNNERS = { - LINUX_CPU_TEST_RUNNER, - LINUX_CUDA_TEST_RUNNER, - LINUX_ROCM_TEST_RUNNER, -} - -LINUX_DISTRIBUTED_GPU_RUNNERS = { - LINUX_CUDA_TEST_RUNNER : "linux.8xlarge.nvidia.gpu", - LINUX_ROCM_TEST_RUNNER : LINUX_ROCM_TEST_RUNNER, -} - -LINUX_MULTIGPU_RUNNERS = { - LINUX_CUDA_TEST_RUNNER : "linux.16xlarge.nvidia.gpu", - LINUX_ROCM_TEST_RUNNER : LINUX_ROCM_TEST_RUNNER, -} - -MACOS_TEST_RUNNER_10_15 = "macos-10.15" -MACOS_TEST_RUNNER_11 = "macos-11" - -MACOS_RUNNERS = { - MACOS_TEST_RUNNER_10_15, - MACOS_TEST_RUNNER_11, -} - -CUDA_RUNNERS = { - WINDOWS_CUDA_TEST_RUNNER, - LINUX_CUDA_TEST_RUNNER, -} -ROCM_RUNNERS = { - LINUX_ROCM_TEST_RUNNER, -} -CPU_RUNNERS = { - WINDOWS_CPU_TEST_RUNNER, - LINUX_CPU_TEST_RUNNER, -} - -LABEL_CIFLOW_ALL = "ciflow/all" -LABEL_CIFLOW_BAZEL = "ciflow/bazel" -LABEL_CIFLOW_CPU = "ciflow/cpu" -LABEL_CIFLOW_CUDA = "ciflow/cuda" -LABEL_CIFLOW_ROCM = "ciflow/rocm" -LABEL_CIFLOW_DOCS = "ciflow/docs" -LABEL_CIFLOW_DEFAULT = "ciflow/default" -LABEL_CIFLOW_LIBTORCH = "ciflow/libtorch" -LABEL_CIFLOW_LINUX = "ciflow/linux" -LABEL_CIFLOW_MOBILE = "ciflow/mobile" -LABEL_CIFLOW_ANDROID = "ciflow/android" -LABEL_CIFLOW_SANITIZERS = "ciflow/sanitizers" 
-LABEL_CIFLOW_ONNX = "ciflow/onnx" -LABEL_CIFLOW_SCHEDULED = "ciflow/scheduled" -LABEL_CIFLOW_SLOW = "ciflow/slow" -LABEL_CIFLOW_WIN = "ciflow/win" -LABEL_CIFLOW_XLA = "ciflow/xla" -LABEL_CIFLOW_NOARCH = "ciflow/noarch" -LABEL_CIFLOW_VULKAN = "ciflow/vulkan" -LABEL_CIFLOW_PREFIX = "ciflow/" -LABEL_CIFLOW_SLOW_GRADCHECK = "ciflow/slow-gradcheck" -LABEL_CIFLOW_DOCKER = "ciflow/docker" -LABEL_CIFLOW_IOS = "ciflow/ios" -LABEL_CIFLOW_MACOS = "ciflow/macos" LABEL_CIFLOW_TRUNK = "ciflow/trunk" +LABEL_CIFLOW_ALL = "ciflow/all" LABEL_CIFLOW_BINARIES = "ciflow/binaries" -LABEL_CIFLOW_BINARIES_WHEEL = "ciflow/binaries_wheel" -LABEL_CIFLOW_BINARIES_CONDA = "ciflow/binaries_conda" +LABEL_CIFLOW_PERIODIC = "ciflow/periodic" LABEL_CIFLOW_BINARIES_LIBTORCH = "ciflow/binaries_libtorch" - +LABEL_CIFLOW_BINARIES_CONDA = "ciflow/binaries_conda" +LABEL_CIFLOW_BINARIES_WHEEL = "ciflow/binaries_wheel" @dataclass class CIFlowConfig: @@ -108,245 +35,13 @@ class CIFlowConfig: def __post_init__(self) -> None: if not self.isolated_workflow: self.labels.add(LABEL_CIFLOW_ALL) - if LABEL_CIFLOW_SCHEDULED not in self.labels: + if LABEL_CIFLOW_PERIODIC not in self.labels: self.labels.add(LABEL_CIFLOW_TRUNK) - assert all(label.startswith(LABEL_CIFLOW_PREFIX) for label in self.labels) - - -@dataclass -class CIFlowRuleset: - version = 'v1' - output_file = f'{GITHUB_DIR}/generated-ciflow-ruleset.json' - label_rules: Dict[str, Set[str]] = field(default_factory=dict) - - def add_label_rule(self, labels: Set[str], workflow_name: str) -> None: - for label in labels: - if label in self.label_rules: - self.label_rules[label].add(workflow_name) - else: - self.label_rules[label] = {workflow_name} - - def generate_json(self) -> None: - GENERATED = "generated" # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file - output = { - "__comment": f"@{GENERATED} DO NOT EDIT MANUALLY, Generation script: .github/scripts/generate_ci_workflows.py", - "version": self.version, - "label_rules": { - label: sorted(list(workflows)) - for label, workflows in self.label_rules.items() - } - } - with open(self.output_file, 'w') as outfile: - json.dump(output, outfile, indent=2, sort_keys=True) - outfile.write('\n') - class Config(TypedDict): num_shards: int runner: str - -@dataclass -class CIWorkflow: - # Required fields - arch: Arch - build_environment: str - - # Optional fields - test_runner_type: str = '' - multigpu_runner_type: str = '' - distributed_gpu_runner_type: str = '' - ciflow_config: CIFlowConfig = field(default_factory=CIFlowConfig) - cuda_version: str = '' - docker_image_base: str = '' - enable_doc_jobs: bool = False - exclude_test: bool = False - build_generates_artifacts: bool = True - build_with_debug: bool = False - is_scheduled: str = '' - is_default: bool = False - on_pull_request: bool = False - num_test_shards: int = 1 - timeout_after: int = 240 - xcode_version: str = '' - ios_arch: str = '' - ios_platform: str = '' - test_jobs: Any = field(default_factory=list) - - enable_default_test: bool = True - enable_jit_legacy_test: bool = False - enable_distributed_test: bool = True - enable_multigpu_test: bool = False - enable_nogpu_no_avx_test: bool = False - enable_nogpu_no_avx2_test: bool = False - enable_slow_test: bool = False - enable_docs_test: bool = False - enable_backwards_compat_test: bool = False - enable_xla_test: bool = False - enable_noarch_test: bool = False - enable_force_on_cpu_test: bool = False - - def __post_init__(self) -> None: - if not self.build_generates_artifacts: - 
self.exclude_test = True - - self.multigpu_runner_type = LINUX_MULTIGPU_RUNNERS.get(self.test_runner_type, "linux.16xlarge.nvidia.gpu") - self.distributed_gpu_runner_type = LINUX_DISTRIBUTED_GPU_RUNNERS.get(self.test_runner_type, "linux.8xlarge.nvidia.gpu") - - if LABEL_CIFLOW_DEFAULT in self.ciflow_config.labels: - self.is_default = True - - if self.is_default: - self.on_pull_request = True - - self.test_jobs = self._gen_test_jobs() - self.assert_valid() - - def assert_valid(self) -> None: - err_message = f"invalid test_runner_type for {self.arch}: {self.test_runner_type}" - if self.arch == 'linux': - assert self.test_runner_type in LINUX_RUNNERS, err_message - if self.arch == 'windows': - assert self.test_runner_type in WINDOWS_RUNNERS, err_message - - if not self.ciflow_config.isolated_workflow: - assert LABEL_CIFLOW_ALL in self.ciflow_config.labels - if self.arch == 'linux': - assert LABEL_CIFLOW_LINUX in self.ciflow_config.labels - if self.arch == 'windows': - assert LABEL_CIFLOW_WIN in self.ciflow_config.labels - if self.arch == 'macos': - assert LABEL_CIFLOW_MACOS in self.ciflow_config.labels - # Make sure that jobs with tests have a test_runner_type - if not self.exclude_test: - assert self.test_runner_type != '' - if self.test_runner_type in CUDA_RUNNERS: - assert LABEL_CIFLOW_CUDA in self.ciflow_config.labels - if self.test_runner_type in ROCM_RUNNERS: - assert LABEL_CIFLOW_ROCM in self.ciflow_config.labels - if self.test_runner_type in CPU_RUNNERS and not self.exclude_test: - assert LABEL_CIFLOW_CPU in self.ciflow_config.labels - if self.is_scheduled: - assert LABEL_CIFLOW_DEFAULT not in self.ciflow_config.labels - assert LABEL_CIFLOW_TRUNK not in self.ciflow_config.labels - assert LABEL_CIFLOW_SCHEDULED in self.ciflow_config.labels - if self.build_with_debug: - assert self.build_environment.endswith("-debug") - - def generate_workflow_file(self, workflow_template: jinja2.Template) -> None: - output_file_path = GITHUB_DIR / f"workflows/generated-{self.build_environment}.yml" - with open(output_file_path, "w") as output_file: - GENERATED = "generated" # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file - output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"]) - try: - content = workflow_template.render(asdict(self)) - except Exception as e: - print(f"Failed on template: {workflow_template}", file=sys.stderr) - raise e - output_file.write(content) - if content[-1] != "\n": - output_file.write("\n") - print(output_file_path) - - def normalized_build_environment(self, suffix: str) -> str: - return self.build_environment.replace(".", "_") + suffix - - def _gen_test_jobs(self) -> Any: - if self.arch == "linux": - MULTIGPU_RUNNER_TYPE = "linux.16xlarge.nvidia.gpu" - DISTRIBUTED_GPU_RUNNER_TYPE = "linux.8xlarge.nvidia.gpu" - NOGPU_RUNNER_TYPE = "linux.2xlarge" - elif self.arch == "windows": - DISTRIBUTED_GPU_RUNNER_TYPE = self.test_runner_type - NOGPU_RUNNER_TYPE = "windows.4xlarge" - - test_jobs = [] - - configs: Dict[str, Config] = {} - if self.enable_jit_legacy_test: - configs["jit_legacy"] = {"num_shards": 1, "runner": self.test_runner_type} - if self.enable_multigpu_test: - configs["multigpu"] = {"num_shards": 1, "runner": MULTIGPU_RUNNER_TYPE} - - if self.enable_nogpu_no_avx_test: - configs["nogpu_NO_AVX"] = {"num_shards": 1, "runner": NOGPU_RUNNER_TYPE} - if self.enable_nogpu_no_avx2_test: - configs["nogpu_NO_AVX2"] = {"num_shards": 1, "runner": NOGPU_RUNNER_TYPE} - if self.enable_force_on_cpu_test: - configs["force_on_cpu"] = 
{"num_shards": 1, "runner": NOGPU_RUNNER_TYPE} - if self.enable_distributed_test: - configs["distributed"] = { - "num_shards": 1, - "runner": DISTRIBUTED_GPU_RUNNER_TYPE - if "cuda" in str(self.build_environment) - else self.test_runner_type, - } - if self.enable_slow_test: - configs["slow"] = {"num_shards": 1, "runner": self.test_runner_type} - if self.enable_docs_test: - configs["docs_test"] = {"num_shards": 1, "runner": self.test_runner_type} - if self.enable_backwards_compat_test: - configs["backwards_compat"] = { - "num_shards": 1, - "runner": self.test_runner_type, - } - if self.enable_xla_test: - configs["xla"] = {"num_shards": 1, "runner": self.test_runner_type} - if self.enable_noarch_test: - configs["noarch"] = {"num_shards": 1, "runner": self.test_runner_type} - - for name, config in configs.items(): - for shard in range(1, config["num_shards"] + 1): - test_jobs.append( - { - "id": f"test_{name}_{shard}_{config['num_shards']}", - "name": f"test ({name}, {shard}, {config['num_shards']}, {config['runner']})", - "config": name, - "shard": shard, - "num_shards": config["num_shards"], - "runner": config["runner"], - } - ) - - if self.enable_default_test: - for shard in range(1, self.num_test_shards + 1): - test_jobs.append( - { - "id": f"test_default_{shard}_{self.num_test_shards}", - "name": f"test (default, {shard}, {self.num_test_shards}, {self.test_runner_type})", - "config": "default", - "shard": shard, - "num_shards": self.num_test_shards, - "runner": self.test_runner_type, - } - ) - return test_jobs - -@dataclass -class DockerWorkflow: - build_environment: str - docker_images: List[str] - - # Optional fields - ciflow_config: CIFlowConfig = field(default_factory=CIFlowConfig) - cuda_version: str = '' - is_scheduled: str = '' - - def generate_workflow_file(self, workflow_template: jinja2.Template) -> None: - output_file_path = GITHUB_DIR / "workflows/generated-docker-builds.yml" - with open(output_file_path, "w") as output_file: - GENERATED = "generated" # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file - output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"]) - try: - content = workflow_template.render(asdict(self)) - except Exception as e: - print(f"Failed on template: {workflow_template}", file=sys.stderr) - raise e - output_file.write(content) - if content[-1] != "\n": - output_file.write("\n") - print(output_file_path) - @dataclass class BinaryBuildWorkflow: os: str @@ -358,6 +53,7 @@ class BinaryBuildWorkflow: abi_version: str = '' ciflow_config: CIFlowConfig = field(default_factory=CIFlowConfig) is_scheduled: str = '' + branches: str = 'nightly' # Mainly for macos cross_compile_arm64: bool = False xcode_version: str = '' @@ -369,7 +65,7 @@ def __post_init__(self) -> None: self.build_environment = f"{self.os}-binary-{self.package_type}" def generate_workflow_file(self, workflow_template: jinja2.Template) -> None: - output_file_path = GITHUB_DIR / f"workflows/generated-{self.build_environment}.yml" + output_file_path = GITHUB_DIR / f"workflows/generated-{self.build_environment}-{self.branches}.yml" with open(output_file_path, "w") as output_file: GENERATED = "generated" # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"]) @@ -383,533 +79,6 @@ def generate_workflow_file(self, workflow_template: jinja2.Template) -> None: output_file.write("\n") print(output_file_path) -WINDOWS_WORKFLOWS = [ - CIWorkflow( - 
arch="windows", - build_environment="win-vs2019-cpu-py3", - cuda_version="cpu", - enable_distributed_test=False, - test_runner_type=WINDOWS_CPU_TEST_RUNNER, - num_test_shards=2, - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_CPU, LABEL_CIFLOW_WIN} - ), - ), - CIWorkflow( - arch="windows", - build_environment="win-vs2019-cuda11.3-py3", - cuda_version="11.3", - enable_distributed_test=False, - test_runner_type=WINDOWS_CUDA_TEST_RUNNER, - num_test_shards=2, - enable_force_on_cpu_test=True, - # TODO: Revert back to default value after https://github.com/pytorch/pytorch/issues/73489 is closed - timeout_after=270, - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_CUDA, LABEL_CIFLOW_WIN} - ), - ), - CIWorkflow( - arch="windows", - build_environment="periodic-win-vs2019-cuda11.5-py3", - cuda_version="11.5", - enable_distributed_test=False, - test_runner_type=WINDOWS_CUDA_TEST_RUNNER, - num_test_shards=2, - enable_force_on_cpu_test=True, - is_scheduled="45 4,10,16,22 * * *", - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_CUDA, LABEL_CIFLOW_WIN} - ), - ), -] - -LINUX_WORKFLOWS = [ - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-gcc5.4", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - test_runner_type=LINUX_CPU_TEST_RUNNER, - enable_jit_legacy_test=True, - enable_backwards_compat_test=True, - enable_docs_test=True, - num_test_shards=2, - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU} - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-docs", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - test_runner_type=LINUX_CPU_TEST_RUNNER, - enable_doc_jobs=True, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_DOCS, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU} - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-docs-push", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - test_runner_type=LINUX_CPU_TEST_RUNNER, - enable_doc_jobs=True, - exclude_test=True, - is_scheduled="0 0 * * *", # run pushes only on a nightly schedule - # NOTE: This is purposefully left without LABEL_CIFLOW_DOCS so that you can run - # docs builds on your PR without the fear of anything pushing - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU} - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc7", - test_runner_type=LINUX_CPU_TEST_RUNNER, - num_test_shards=2, - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU} - ), - ), - # ParallelTBB does not have a maintainer and is currently flaky - # CIWorkflow( - # arch="linux", - # build_environment="paralleltbb-linux-xenial-py3.6-gcc5.4", - # docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4", - # test_runner_type=LINUX_CPU_TEST_RUNNER, - # ciflow_config=CIFlowConfig( - # labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU}, - # ), - # ), - CIWorkflow( - arch="linux", - build_environment="parallelnative-linux-xenial-py3.7-gcc5.4", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - 
test_runner_type=LINUX_CPU_TEST_RUNNER, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU}, - ), - ), - # Build PyTorch with BUILD_CAFFE2=ON - CIWorkflow( - arch="linux", - build_environment="caffe2-linux-xenial-py3.7-gcc5.4", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - test_runner_type=LINUX_CPU_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3-clang5-mobile-build", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-asan", - test_runner_type=LINUX_CPU_TEST_RUNNER, - build_generates_artifacts=False, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_MOBILE, LABEL_CIFLOW_DEFAULT}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3-clang5-mobile-custom-build-static", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c", - test_runner_type=LINUX_CPU_TEST_RUNNER, - build_generates_artifacts=False, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_MOBILE, LABEL_CIFLOW_DEFAULT}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc5.4", - test_runner_type=LINUX_CPU_TEST_RUNNER, - build_generates_artifacts=False, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_MOBILE, LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LIBTORCH, LABEL_CIFLOW_CPU}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-clang7-asan", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang7-asan", - test_runner_type=LINUX_CPU_TEST_RUNNER, - num_test_shards=3, - timeout_after=300, - enable_distributed_test=False, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_SANITIZERS, LABEL_CIFLOW_CPU}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-clang7-onnx", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang7-onnx", - test_runner_type=LINUX_CPU_TEST_RUNNER, - num_test_shards=2, - enable_distributed_test=False, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_ONNX, LABEL_CIFLOW_CPU}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-bionic-cuda10.2-py3.9-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - enable_jit_legacy_test=True, - enable_multigpu_test=True, - enable_nogpu_no_avx_test=True, - enable_nogpu_no_avx2_test=True, - enable_slow_test=True, - num_test_shards=2, - ciflow_config=CIFlowConfig( - run_on_canary=True, - labels={LABEL_CIFLOW_SLOW, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA} - ), - ), - CIWorkflow( - arch="linux", - build_environment="libtorch-linux-xenial-cuda10.2-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - build_generates_artifacts=False, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_LIBTORCH, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - CIWorkflow( - arch="linux", - 
build_environment="periodic-linux-bionic-cuda11.5-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - num_test_shards=2, - is_scheduled="45 4,10,16,22 * * *", - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - build_generates_artifacts=False, - is_scheduled="45 4,10,16,22 * * *", - exclude_test=True, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_LIBTORCH, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-cuda11.3-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - num_test_shards=2, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - # no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated - CIWorkflow( - arch="linux", - build_environment="linux-xenial-cuda11.3-py3.7-gcc7-no-ops", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-xenial-py3.7-gcc7-no-ops", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.7-gcc7", - test_runner_type=LINUX_CPU_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-bionic-rocm4.5-py3.7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-rocm4.5-py3.7", - test_runner_type=LINUX_ROCM_TEST_RUNNER, - num_test_shards=2, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_ROCM]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="libtorch-linux-xenial-cuda11.3-py3.7-gcc7", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - build_generates_artifacts=False, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels=set([LABEL_CIFLOW_LIBTORCH, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA]), - ), - ), - CIWorkflow( - arch="linux", - build_environment="periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - num_test_shards=2, - build_with_debug=True, - is_scheduled="45 0,4,8,12,16,20 * * *", - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA} - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-bionic-py3.7-clang9", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.7-clang9", - test_runner_type=LINUX_CPU_TEST_RUNNER, - num_test_shards=2, - enable_distributed_test=False, - enable_noarch_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, 
LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_NOARCH}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="linux-vulkan-bionic-py3.7-clang9", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.7-clang9", - test_runner_type=LINUX_CPU_TEST_RUNNER, - num_test_shards=1, - enable_distributed_test=False, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_VULKAN}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7", - test_runner_type=LINUX_CUDA_TEST_RUNNER, - num_test_shards=2, - enable_distributed_test=False, - timeout_after=360, - # Only run this on master 4 times per day since it does take a while - is_scheduled="0 */4 * * *", - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CUDA, LABEL_CIFLOW_SLOW_GRADCHECK, LABEL_CIFLOW_SLOW, LABEL_CIFLOW_SCHEDULED}, - ), - ), -] - -XLA_WORKFLOWS = [ - CIWorkflow( - arch="linux", - build_environment="pytorch-xla-linux-bionic-py3.7-clang8", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/xla_base", - test_runner_type=LINUX_CPU_TEST_RUNNER, - enable_distributed_test=False, - enable_xla_test=True, - enable_default_test=False, - on_pull_request=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_XLA}, - ), - ), - -] - -ANDROID_SHORT_WORKFLOWS = [ - CIWorkflow( - arch="linux", - build_environment="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c", - test_runner_type=LINUX_CPU_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_ANDROID, LABEL_CIFLOW_DEFAULT}, - ), - ), - CIWorkflow( - arch="linux", - build_environment="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c", - test_runner_type=LINUX_CPU_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_ANDROID, LABEL_CIFLOW_DEFAULT}, - ), - ), -] - -ANDROID_WORKFLOWS = [ - CIWorkflow( - arch="linux", - build_environment="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c", - test_runner_type=LINUX_CPU_TEST_RUNNER, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_LINUX, LABEL_CIFLOW_CPU, LABEL_CIFLOW_ANDROID}, - ), - ), -] - -BAZEL_WORKFLOWS = [ - CIWorkflow( - arch="linux", - build_environment="linux-xenial-cuda11.3-py3.7-gcc7-bazel-test", - docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7", - test_runner_type=LINUX_CPU_TEST_RUNNER, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BAZEL, LABEL_CIFLOW_CPU, LABEL_CIFLOW_LINUX}, - ), - ), -] - -IOS_WORKFLOWS = [ - CIWorkflow( - arch="macos", - build_environment="ios-12-5-1-arm64", - ios_arch="arm64", - ios_platform="OS", - test_runner_type=MACOS_TEST_RUNNER_10_15, - is_scheduled="45 4,10,16,22 * * *", - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - 
arch="macos", - build_environment="ios-12-5-1-arm64-coreml", - ios_arch="arm64", - ios_platform="OS", - test_runner_type=MACOS_TEST_RUNNER_10_15, - is_scheduled="45 4,10,16,22 * * *", - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="ios-12-5-1-arm64-custom-ops", - ios_arch="arm64", - ios_platform="OS", - test_runner_type=MACOS_TEST_RUNNER_10_15, - is_scheduled="45 4,10,16,22 * * *", - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="ios-12-5-1-arm64-metal", - ios_arch="arm64", - ios_platform="OS", - test_runner_type=MACOS_TEST_RUNNER_10_15, - is_scheduled="45 4,10,16,22 * * *", - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_SCHEDULED, LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="ios-12-5-1-x86-64", - ios_arch="x86_64", - ios_platform="SIMULATOR", - test_runner_type=MACOS_TEST_RUNNER_10_15, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="ios-12-5-1-x86-64-coreml", - ios_arch="x86_64", - ios_platform="SIMULATOR", - test_runner_type=MACOS_TEST_RUNNER_10_15, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_IOS, LABEL_CIFLOW_MACOS}, - ), - ), -] - -MACOS_WORKFLOWS = [ - # Distributed tests are still run on MacOS, but part of regular shards - CIWorkflow( - arch="macos", - build_environment="macos-11-py3-x86-64", - xcode_version="12.4", - test_runner_type=MACOS_TEST_RUNNER_11, - num_test_shards=2, - enable_distributed_test=False, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="macos-10-15-py3-lite-interpreter-x86-64", - xcode_version="12", - test_runner_type=MACOS_TEST_RUNNER_10_15, - exclude_test=True, - build_generates_artifacts=False, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_MACOS}, - ), - ), - CIWorkflow( - arch="macos", - build_environment="macos-10-15-py3-arm64", - test_runner_type=MACOS_TEST_RUNNER_10_15, - exclude_test=True, - ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_MACOS}, - ), - ), -] - -DOCKER_IMAGES = { - f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-rocm4.3.1-py3.7", # for rocm - f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-rocm4.5-py3.7", # for rocm -} - -DOCKER_IMAGES.update({ - workflow.docker_image_base - for workflow in [*LINUX_WORKFLOWS, *BAZEL_WORKFLOWS, *ANDROID_WORKFLOWS] - if workflow.docker_image_base -}) - -DOCKER_WORKFLOWS = [ - DockerWorkflow( - build_environment="docker-builds", - docker_images=sorted(DOCKER_IMAGES), - # Run every Wednesday at 3:01am to ensure they can build - is_scheduled="1 3 * * 3", - ), -] - class OperatingSystem: LINUX = "linux" WINDOWS = "windows" @@ -922,7 +91,7 @@ class OperatingSystem: package_type="manywheel", build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.LINUX), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, isolated_workflow=True, ), ), @@ -931,7 +100,7 @@ class OperatingSystem: package_type="conda", build_configs=generate_binary_build_matrix.generate_conda_matrix(OperatingSystem.LINUX), 
ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, isolated_workflow=True, ), ), @@ -943,7 +112,7 @@ class OperatingSystem: OperatingSystem.LINUX, generate_binary_build_matrix.CXX11_ABI ), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), @@ -955,33 +124,65 @@ class OperatingSystem: OperatingSystem.LINUX, generate_binary_build_matrix.PRE_CXX11_ABI ), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), ] +LINUX_BINARY_SMOKE_WORKFLOWS = [ + BinaryBuildWorkflow( + os=OperatingSystem.LINUX, + package_type="manywheel", + build_configs=generate_binary_build_matrix.generate_wheels_matrix( + OperatingSystem.LINUX, + arches=["10.2"], + python_versions=["3.7"]), + branches="master", + ), + BinaryBuildWorkflow( + os=OperatingSystem.LINUX, + package_type="libtorch", + abi_version=generate_binary_build_matrix.CXX11_ABI, + build_configs=generate_binary_build_matrix.generate_libtorch_matrix( + OperatingSystem.LINUX, generate_binary_build_matrix.CXX11_ABI, + arches=["cpu"], + libtorch_variants=["shared-with-deps"], + ), + branches="master", + ), + BinaryBuildWorkflow( + os=OperatingSystem.LINUX, + package_type="libtorch", + abi_version=generate_binary_build_matrix.PRE_CXX11_ABI, + build_configs=generate_binary_build_matrix.generate_libtorch_matrix( + OperatingSystem.LINUX, generate_binary_build_matrix.CXX11_ABI, + arches=["cpu"], + libtorch_variants=["shared-with-deps"], + ), + branches="master", + ), +] + WINDOWS_BINARY_BUILD_WORKFLOWS = [ BinaryBuildWorkflow( os=OperatingSystem.WINDOWS, package_type="wheel", build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.WINDOWS), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, + isolated_workflow=True, + ), + ), + BinaryBuildWorkflow( + os=OperatingSystem.WINDOWS, + package_type="conda", + build_configs=generate_binary_build_matrix.generate_conda_matrix(OperatingSystem.WINDOWS), + ciflow_config=CIFlowConfig( + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, isolated_workflow=True, ), ), - # NOTE: conda binaries are currently bugged on the installation step - # See, https://github.com/pytorch/pytorch/pull/71484#issuecomment-1022617195 - # BinaryBuildWorkflow( - # os=OperatingSystem.WINDOWS, - # package_type="conda", - # build_configs=generate_binary_build_matrix.generate_conda_matrix(OperatingSystem.WINDOWS), - # ciflow_config=CIFlowConfig( - # labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, - # isolated_workflow=True, - # ), - # ), BinaryBuildWorkflow( os=OperatingSystem.WINDOWS, package_type="libtorch", @@ -990,7 +191,7 @@ class OperatingSystem: OperatingSystem.WINDOWS, generate_binary_build_matrix.RELEASE ), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), @@ -1002,11 +203,44 @@ class OperatingSystem: OperatingSystem.WINDOWS, generate_binary_build_matrix.DEBUG 
), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), ] +WINDOWS_BINARY_SMOKE_WORKFLOWS = [ + BinaryBuildWorkflow( + os=OperatingSystem.WINDOWS, + package_type="wheel", + build_configs=generate_binary_build_matrix.generate_wheels_matrix( + OperatingSystem.WINDOWS, + arches=["11.3"], + python_versions=["3.7"]), + branches="master", + ), + BinaryBuildWorkflow( + os=OperatingSystem.WINDOWS, + package_type="libtorch", + abi_version=generate_binary_build_matrix.RELEASE, + build_configs=generate_binary_build_matrix.generate_libtorch_matrix( + OperatingSystem.WINDOWS, generate_binary_build_matrix.RELEASE, + arches=["cpu"], + libtorch_variants=["shared-with-deps"], + ), + branches="master", + ), + BinaryBuildWorkflow( + os=OperatingSystem.WINDOWS, + package_type="libtorch", + abi_version=generate_binary_build_matrix.DEBUG, + build_configs=generate_binary_build_matrix.generate_libtorch_matrix( + OperatingSystem.WINDOWS, generate_binary_build_matrix.DEBUG, + arches=["cpu"], + libtorch_variants=["shared-with-deps"], + ), + branches="master", + ), +] MACOS_BINARY_BUILD_WORKFLOWS = [ BinaryBuildWorkflow( @@ -1014,7 +248,7 @@ class OperatingSystem: package_type="wheel", build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.MACOS), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, isolated_workflow=True, ), ), @@ -1023,7 +257,7 @@ class OperatingSystem: package_type="conda", build_configs=generate_binary_build_matrix.generate_conda_matrix(OperatingSystem.MACOS), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, isolated_workflow=True, ), ), @@ -1035,7 +269,7 @@ class OperatingSystem: OperatingSystem.MACOS, generate_binary_build_matrix.CXX11_ABI ), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), @@ -1047,7 +281,7 @@ class OperatingSystem: OperatingSystem.MACOS, generate_binary_build_matrix.PRE_CXX11_ABI ), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH}, isolated_workflow=True, ), ), @@ -1057,7 +291,7 @@ class OperatingSystem: build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.MACOS), cross_compile_arm64=True, ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, isolated_workflow=True, ), ), @@ -1067,7 +301,7 @@ class OperatingSystem: cross_compile_arm64=True, build_configs=generate_binary_build_matrix.generate_conda_matrix(OperatingSystem.MACOS_ARM64), ciflow_config=CIFlowConfig( - labels={LABEL_CIFLOW_DEFAULT, LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, + labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA}, isolated_workflow=True, ), ), @@ -1079,18 +313,13 @@ def main() -> None: loader=jinja2.FileSystemLoader(str(GITHUB_DIR.joinpath("templates"))), undefined=jinja2.StrictUndefined, ) + + # not ported yet 
    template_and_workflows = [
-        (jinja_env.get_template("linux_ci_workflow.yml.j2"), LINUX_WORKFLOWS),
-        (jinja_env.get_template("linux_ci_workflow.yml.j2"), XLA_WORKFLOWS),
-        (jinja_env.get_template("windows_ci_workflow.yml.j2"), WINDOWS_WORKFLOWS),
-        (jinja_env.get_template("bazel_ci_workflow.yml.j2"), BAZEL_WORKFLOWS),
-        (jinja_env.get_template("ios_ci_workflow.yml.j2"), IOS_WORKFLOWS),
-        (jinja_env.get_template("macos_ci_workflow.yml.j2"), MACOS_WORKFLOWS),
-        (jinja_env.get_template("docker_builds_ci_workflow.yml.j2"), DOCKER_WORKFLOWS),
-        (jinja_env.get_template("android_ci_full_workflow.yml.j2"), ANDROID_WORKFLOWS),
-        (jinja_env.get_template("android_ci_workflow.yml.j2"), ANDROID_SHORT_WORKFLOWS),
         (jinja_env.get_template("linux_binary_build_workflow.yml.j2"), LINUX_BINARY_BUILD_WORFKLOWS),
+        (jinja_env.get_template("linux_binary_build_workflow.yml.j2"), LINUX_BINARY_SMOKE_WORKFLOWS),
         (jinja_env.get_template("windows_binary_build_workflow.yml.j2"), WINDOWS_BINARY_BUILD_WORKFLOWS),
+        (jinja_env.get_template("windows_binary_build_workflow.yml.j2"), WINDOWS_BINARY_SMOKE_WORKFLOWS),
         (jinja_env.get_template("macos_binary_build_workflow.yml.j2"), MACOS_BINARY_BUILD_WORKFLOWS),
     ]
     # Delete the existing generated files first, this should align with .gitattributes file description.
@@ -1101,16 +330,12 @@ def main() -> None:
         except Exception as e:
             print(f"Error occurred when deleting file {w}: {e}")
-    ciflow_ruleset = CIFlowRuleset()
     for template, workflows in template_and_workflows:
         # added Iterable check to appease the mypy gods
         if not isinstance(workflows, Iterable):
             raise Exception(f"How is workflows not iterable? {workflows}")
         for workflow in workflows:
             workflow.generate_workflow_file(workflow_template=template)
-            ciflow_ruleset.add_label_rule(workflow.ciflow_config.labels, workflow.build_environment)
-    ciflow_ruleset.generate_json()
-

 if __name__ == "__main__":
     main()
diff --git a/.github/scripts/get_workflow_job_id.py b/.github/scripts/get_workflow_job_id.py
new file mode 100644
index 00000000000000..72aed91d55ca96
--- /dev/null
+++ b/.github/scripts/get_workflow_job_id.py
@@ -0,0 +1,60 @@
+# Helper to get the id of the currently running job in a GitHub Actions
+# workflow. GitHub does not provide this information to workflow runs, so we
+# need to figure it out based on what they *do* provide.
+
+import requests
+import os
+import argparse
+
+# Our strategy is to retrieve the parent workflow run, then filter its jobs on
+# RUNNER_NAME to figure out which job we're currently running.
+#
+# Why RUNNER_NAME? Because it's the only thing that uniquely identifies a job within a workflow.
+# GITHUB_JOB doesn't work, as it corresponds to the job yaml id
+# (https://bit.ly/37e78oI), which has two problems:
+# 1. It's not present in the workflow job JSON object, so we can't use it as a filter.
+# 2. It isn't unique; for matrix jobs the job yaml id is the same for all jobs in the matrix.
+#
+# RUNNER_NAME on the other hand is unique across the pool of runners. Also,
+# since only one job can be scheduled on a runner at a time, we know that
+# looking for RUNNER_NAME will uniquely identify the job we're currently
+# running.
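# A usage sketch (an assumed invocation for illustration, not taken from this
# patch): with GITHUB_TOKEN exported in the job's environment, a workflow step
# could run
#
#   python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
#
# and capture the job id the script prints on stdout; the script exits with
# status 1 if no job on that runner is found in the workflow run.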
+parser = argparse.ArgumentParser() +parser.add_argument( + "workflow_run_id", help="The id of the workflow run, should be GITHUB_RUN_ID" +) +parser.add_argument( + "runner_name", + help="The name of the runner to retrieve the job id, should be RUNNER_NAME", +) + +args = parser.parse_args() + + +PYTORCH_REPO = "https://api.github.com/repos/pytorch/pytorch" +GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] +REQUEST_HEADERS = { + "Accept": "application/vnd.github.v3+json", + "Authorization": "token " + GITHUB_TOKEN, +} + +response = requests.get( + f"{PYTORCH_REPO}/actions/runs/{args.workflow_run_id}/jobs?per_page=100", + headers=REQUEST_HEADERS, +) + +jobs = response.json()["jobs"] +while "next" in response.links.keys(): + response = requests.get(response.links["next"]["url"], headers=REQUEST_HEADERS) + jobs.extend(response.json()["jobs"]) + +# Sort the jobs list by start time, in descending order. We want to get the most +# recently scheduled job on the runner. +jobs.sort(key=lambda job: job["started_at"], reverse=True) + +for job in jobs: + if job["runner_name"] == args.runner_name: + print(job["id"]) + exit(0) + +exit(1) diff --git a/.github/scripts/gitutils.py b/.github/scripts/gitutils.py index 7d5d24f7963043..d8d4e8f7cd8592 100644 --- a/.github/scripts/gitutils.py +++ b/.github/scripts/gitutils.py @@ -1,10 +1,11 @@ #!/usr/bin/env python3 +import os +import re +import tempfile from collections import defaultdict from datetime import datetime from typing import cast, Any, Dict, Iterator, List, Optional, Tuple, Union -import os -import re RE_GITHUB_URL_MATCH = re.compile("^https://.*@?github.com/(.+)/(.+)$") @@ -30,9 +31,9 @@ def fuzzy_list_to_dict(items: List[Tuple[str, str]]) -> Dict[str, List[str]]: def _check_output(items: List[str], encoding: str = "utf-8") -> str: - from subprocess import check_output, CalledProcessError + from subprocess import check_output, CalledProcessError, STDOUT try: - return check_output(items).decode(encoding) + return check_output(items, stderr=STDOUT).decode(encoding) except CalledProcessError as e: msg = f"Command `{' '.join(e.cmd)}` returned non-zero exit code {e.returncode}" stdout = e.stdout.decode(encoding) if e.stdout is not None else "" @@ -129,8 +130,13 @@ def current_branch(self) -> str: def checkout(self, branch: str) -> None: self._run_git("checkout", branch) - def fetch(self, ref: str, branch: str) -> None: - self._run_git("fetch", self.remote, f"{ref}:{branch}") + def fetch(self, ref: Optional[str] = None, branch: Optional[str] = None) -> None: + if branch is None and ref is None: + self._run_git("fetch", self.remote) + elif branch is None: + self._run_git("fetch", self.remote, ref) + else: + self._run_git("fetch", self.remote, f"{ref}:{branch}") def show_ref(self, name: str) -> str: refs = self._run_git('show-ref', '-s', name).strip().split('\n') @@ -188,8 +194,15 @@ def compute_branch_diffs(self, from_branch: str, to_branch: str) -> Tuple[List[s while len(from_values) > 0 and len(to_values) > 0: frc = self.get_commit(from_values.pop()) toc = self.get_commit(to_values.pop()) + # FRC branch might have PR number added to the title if frc.title != toc.title or frc.author_date != toc.author_date: - raise RuntimeError(f"Unexpected differences between {frc} and {toc}") + # HACK: Same commit were merged, reverted and landed again + # which creates a tracking problem + if ( + "pytorch/pytorch" not in self.remote_url() or + frc.commit_hash != "0a6a1b27a464ba5be5f587cce2ee12ab8c504dbf" + ): + raise RuntimeError(f"Unexpected differences between {frc} and 
{toc}") from_commits.remove(frc.commit_hash) to_commits.remove(toc.commit_hash) continue @@ -212,11 +225,19 @@ def cherry_pick_commits(self, from_branch: str, to_branch: str) -> None: self.cherry_pick(commit) self.checkout(orig_branch) - def push(self, branch: str, dry_run: bool) -> None: - if dry_run: - self._run_git("push", "--dry-run", self.remote, branch) - else: - self._run_git("push", self.remote, branch) + def push(self, branch: str, dry_run: bool, retry: int = 3) -> None: + for cnt in range(retry): + try: + if dry_run: + self._run_git("push", "--dry-run", self.remote, branch) + else: + self._run_git("push", self.remote, branch) + except RuntimeError as e: + # Check if push were rejected because branch is stale + if len(e.args) == 0 or re.search(r"\[rejected\].+\(fetch first\)\n", e.args[0]) is None: + raise + self.fetch() + self._run_git("rebase", f"{self.remote}/{branch}") def head_hash(self) -> str: return self._run_git("show-ref", "--hash", "HEAD").strip() @@ -240,6 +261,12 @@ def amend_commit_message(self, msg: str) -> None: self._run_git("commit", "--amend", "-m", msg) +def clone_repo(username: str, password: str, org: str, project: str) -> GitRepo: + path = tempfile.mkdtemp() + _check_output(['git', 'clone', f'https://{username}:{password}@github.com/{org}/{project}', path]).strip() + return GitRepo(path=path) + + class PeekableIterator(Iterator[str]): def __init__(self, val: str) -> None: self._val = val diff --git a/.github/scripts/gql_mocks.json b/.github/scripts/gql_mocks.json index 16c563eced73c5..123680f97a33e7 100644 --- a/.github/scripts/gql_mocks.json +++ b/.github/scripts/gql_mocks.json @@ -1 +1,9182 @@ -{"query_sha=fea7527b55661c30013cf0ce69b664e4ffe28199ce44b9af994c72288bde5fa0 name=pytorch number=71759 owner=pytorch": {"data": {"repository": {"pullRequest": {"closed": true, "isCrossRepository": true, "author": {"login": "coolteemf"}, "title": "Optimize grid sample 3d", "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. 
Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", "headRefName": "optimize_grid_sample_3d", "headRepository": {"nameWithOwner": "coolteemf/pytorch"}, "baseRefName": "master", "baseRepository": {"nameWithOwner": "pytorch/pytorch", "isPrivate": false, "defaultBranchRef": {"name": "master"}}, "mergeCommit": null, "commits": {"nodes": [{"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "563ec73747ad53b63b36736c47c4342f962c2a09", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "f683e8aec7aea76097a264eec01511e704c31154", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": 
{"author": {"user": {"login": "coolteemf"}, "email": "67541941+coolteemf@users.noreply.github.com", "name": "Fran\u00e7ois Lecomte"}, "oid": "b932e9e286c22aaf352375186df851ef060b295a", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "ghp_XXXXXX", "name": "coolteemf"}, "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}], "totalCount": 16}, "changedFiles": 9, "files": {"nodes": [{"path": "aten/src/ATen/native/GridSampler.cpp"}, {"path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp"}, {"path": "aten/src/ATen/native/cuda/GridSampler.cpp"}, {"path": "aten/src/ATen/native/cuda/GridSampler.cu"}, {"path": "aten/src/ATen/native/cuda/GridSampler.h"}, {"path": "aten/src/ATen/native/native_functions.yaml"}, {"path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py"}, {"path": "test/test_nn.py"}, {"path": "tools/autograd/derivatives.yaml"}]}, "reviews": {"nodes": [{"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "APPROVED"}, {"author": {"login": "albanD"}, "state": "APPROVED"}], "totalCount": 17}, "comments": {"nodes": [{"bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": {"login": "github-actions"}, "authorAssociation": "NONE", "editor": null}]}}}}}, "query_sha=1847f597bd535a3d45a5751d69792ce57f4e565713118eee6057e5ee89e17997 name=pytorch number=71759 owner=pytorch": {"data": {"repository": {"pullRequest": {"closed": true, "isCrossRepository": true, "author": {"login": "coolteemf"}, "title": "Optimize grid sample 3d", "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). 
Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", "headRefName": "optimize_grid_sample_3d", "headRepository": {"nameWithOwner": "coolteemf/pytorch"}, "baseRefName": "master", "baseRepository": {"nameWithOwner": "pytorch/pytorch", "isPrivate": false, "defaultBranchRef": {"name": "master"}}, "mergeCommit": null, "commits": {"nodes": [{"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "563ec73747ad53b63b36736c47c4342f962c2a09", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470", "checkSuites": {"nodes": []}}}, {"commit": {"author": 
{"user": null, "email": "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", "name": "coolteemf"}, "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5", "checkSuites": {"nodes": []}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "f683e8aec7aea76097a264eec01511e704c31154", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": {"login": "coolteemf"}, "email": "67541941+coolteemf@users.noreply.github.com", "name": "Fran\u00e7ois Lecomte"}, "oid": "b932e9e286c22aaf352375186df851ef060b295a", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}, {"commit": {"author": {"user": null, "email": "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", "name": "coolteemf"}, "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}], "totalCount": 16}, "changedFiles": 9, "files": {"nodes": [{"path": "aten/src/ATen/native/GridSampler.cpp"}, {"path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp"}, {"path": "aten/src/ATen/native/cuda/GridSampler.cpp"}, {"path": "aten/src/ATen/native/cuda/GridSampler.cu"}, {"path": "aten/src/ATen/native/cuda/GridSampler.h"}, {"path": "aten/src/ATen/native/native_functions.yaml"}, {"path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py"}, {"path": "test/test_nn.py"}, {"path": "tools/autograd/derivatives.yaml"}], "pageInfo": {"endCursor": "OQ", "hasNextPage": false}}, "reviews": {"nodes": [{"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "coolteemf"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "COMMENTED"}, {"author": {"login": "albanD"}, "state": "APPROVED"}, {"author": {"login": "albanD"}, "state": "APPROVED"}], "totalCount": 
17}, "comments": {"nodes": [{"bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": {"login": "github-actions"}, "authorAssociation": "NONE", "editor": null}]}}}}}, "query_sha=1847f597bd535a3d45a5751d69792ce57f4e565713118eee6057e5ee89e17997 name=pytorch number=73099 owner=pytorch": {"data": {"repository": {"pullRequest": {"closed": false, "isCrossRepository": false, "author": {"login": "BowenBao"}, "title": "[ONNX] Make graph name spec-compliant (#71961)", "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* #73104\n* #73103\n* #73102\n* #73101\n* #73100\n* __->__ #73099\n\n[According to the ONNX spec](https://github.com/onnx/onnx/blob/main/docs/IR.md#names-within-a-graph),\nall names must adhere to C90 identifier syntax rules, which means no\ndashes.\n\nFixes: #30952", "headRefName": "gh/BowenBao/138/head", "headRepository": {"nameWithOwner": "pytorch/pytorch"}, "baseRefName": "gh/BowenBao/138/base", "baseRepository": {"nameWithOwner": "pytorch/pytorch", "isPrivate": false, "defaultBranchRef": {"name": "master"}}, "mergeCommit": null, "commits": {"nodes": [{"commit": {"author": {"user": {"login": "BowenBao"}, "email": "bowbao@microsoft.com", "name": "BowenBao"}, "oid": "3038b939eb2069653305c419326a0f47d2598e39", "checkSuites": {"nodes": [{"app": {"databaseId": 12274}, "conclusion": "SUCCESS"}]}}}], "totalCount": 1}, "changedFiles": 162, "files": {"nodes": [{"path": "test/onnx/expect/TestOperators.test_acos.expect"}, {"path": "test/onnx/expect/TestOperators.test_add_broadcast.expect"}, {"path": "test/onnx/expect/TestOperators.test_add_left_broadcast.expect"}, {"path": "test/onnx/expect/TestOperators.test_add_size1_broadcast.expect"}, {"path": "test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect"}, {"path": "test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect"}, {"path": "test/onnx/expect/TestOperators.test_addconstant.expect"}, {"path": "test/onnx/expect/TestOperators.test_addmm.expect"}, {"path": "test/onnx/expect/TestOperators.test_arange_dynamic.expect"}, {"path": "test/onnx/expect/TestOperators.test_argmax.expect"}, {"path": "test/onnx/expect/TestOperators.test_asin.expect"}, {"path": "test/onnx/expect/TestOperators.test_at_op.expect"}, {"path": "test/onnx/expect/TestOperators.test_atan.expect"}, {"path": "test/onnx/expect/TestOperators.test_aten_embedding_1.expect"}, {"path": "test/onnx/expect/TestOperators.test_aten_embedding_2.expect"}, {"path": "test/onnx/expect/TestOperators.test_avg_pool2d.expect"}, {"path": "test/onnx/expect/TestOperators.test_baddbmm.expect"}, {"path": "test/onnx/expect/TestOperators.test_basic.expect"}, {"path": "test/onnx/expect/TestOperators.test_batchnorm.expect"}, {"path": "test/onnx/expect/TestOperators.test_batchnorm_1d.expect"}, {"path": "test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect"}, {"path": "test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect"}, {"path": 
"test/onnx/expect/TestOperators.test_batchnorm_training.expect"}, {"path": "test/onnx/expect/TestOperators.test_bitshift.expect"}, {"path": "test/onnx/expect/TestOperators.test_c2_op.expect"}, {"path": "test/onnx/expect/TestOperators.test_chunk.expect"}, {"path": "test/onnx/expect/TestOperators.test_clip.expect"}, {"path": "test/onnx/expect/TestOperators.test_clip_max.expect"}, {"path": "test/onnx/expect/TestOperators.test_clip_min.expect"}, {"path": "test/onnx/expect/TestOperators.test_concat2.expect"}, {"path": "test/onnx/expect/TestOperators.test_conv.expect"}, {"path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect"}, {"path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4_opset8.expect"}, {"path": "test/onnx/expect/TestOperators.test_convtranspose.expect"}, {"path": "test/onnx/expect/TestOperators.test_cos.expect"}, {"path": "test/onnx/expect/TestOperators.test_cumsum.expect"}, {"path": "test/onnx/expect/TestOperators.test_det.expect"}, {"path": "test/onnx/expect/TestOperators.test_dict.expect"}, {"path": "test/onnx/expect/TestOperators.test_dict_str.expect"}, {"path": "test/onnx/expect/TestOperators.test_dim.expect"}, {"path": "test/onnx/expect/TestOperators.test_dropout.expect"}, {"path": "test/onnx/expect/TestOperators.test_dropout_default.expect"}, {"path": "test/onnx/expect/TestOperators.test_dropout_opset12.expect"}, {"path": "test/onnx/expect/TestOperators.test_dropout_training.expect"}, {"path": "test/onnx/expect/TestOperators.test_dropout_training_opset12.expect"}, {"path": "test/onnx/expect/TestOperators.test_dynamic_axes_add.expect"}, {"path": "test/onnx/expect/TestOperators.test_dynamic_axes_add_inputs_same_symbolic_shape.expect"}, {"path": "test/onnx/expect/TestOperators.test_dynamic_axes_matmul.expect"}, {"path": "test/onnx/expect/TestOperators.test_dynamic_axes_reduce_mean.expect"}, {"path": "test/onnx/expect/TestOperators.test_dynamic_axes_unchange.expect"}, {"path": "test/onnx/expect/TestOperators.test_elu.expect"}, {"path": "test/onnx/expect/TestOperators.test_embedding_bags.expect"}, {"path": "test/onnx/expect/TestOperators.test_empty_like.expect"}, {"path": "test/onnx/expect/TestOperators.test_empty_like_opset7.expect"}, {"path": "test/onnx/expect/TestOperators.test_equal.expect"}, {"path": "test/onnx/expect/TestOperators.test_erf.expect"}, {"path": "test/onnx/expect/TestOperators.test_exp.expect"}, {"path": "test/onnx/expect/TestOperators.test_expand.expect"}, {"path": "test/onnx/expect/TestOperators.test_flatten.expect"}, {"path": "test/onnx/expect/TestOperators.test_flatten2D.expect"}, {"path": "test/onnx/expect/TestOperators.test_fmod.expect"}, {"path": "test/onnx/expect/TestOperators.test_frobenius_norm.expect"}, {"path": "test/onnx/expect/TestOperators.test_full.expect"}, {"path": "test/onnx/expect/TestOperators.test_full_like.expect"}, {"path": "test/onnx/expect/TestOperators.test_gather.expect"}, {"path": "test/onnx/expect/TestOperators.test_gather_opset11.expect"}, {"path": "test/onnx/expect/TestOperators.test_ge.expect"}, {"path": "test/onnx/expect/TestOperators.test_gelu.expect"}, {"path": "test/onnx/expect/TestOperators.test_gt.expect"}, {"path": "test/onnx/expect/TestOperators.test_hardtanh.expect"}, {"path": "test/onnx/expect/TestOperators.test_implicit_expand.expect"}, {"path": "test/onnx/expect/TestOperators.test_index.expect"}, {"path": "test/onnx/expect/TestOperators.test_isnan.expect"}, {"path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect"}, {"path": "test/onnx/expect/TestOperators.test_le.expect"}, {"path": 
"test/onnx/expect/TestOperators.test_linear.expect"}, {"path": "test/onnx/expect/TestOperators.test_log_sigmoid.expect"}, {"path": "test/onnx/expect/TestOperators.test_logsoftmax.expect"}, {"path": "test/onnx/expect/TestOperators.test_lstm_none_sequence_lens.expect"}, {"path": "test/onnx/expect/TestOperators.test_lt.expect"}, {"path": "test/onnx/expect/TestOperators.test_master_opset.expect"}, {"path": "test/onnx/expect/TestOperators.test_max.expect"}, {"path": "test/onnx/expect/TestOperators.test_maxpool.expect"}, {"path": "test/onnx/expect/TestOperators.test_maxpool_dilations.expect"}, {"path": "test/onnx/expect/TestOperators.test_maxpool_indices.expect"}, {"path": "test/onnx/expect/TestOperators.test_mean.expect"}, {"path": "test/onnx/expect/TestOperators.test_mean_dtype.expect"}, {"path": "test/onnx/expect/TestOperators.test_meshgrid.expect"}, {"path": "test/onnx/expect/TestOperators.test_min.expect"}, {"path": "test/onnx/expect/TestOperators.test_mm.expect"}, {"path": "test/onnx/expect/TestOperators.test_narrow.expect"}, {"path": "test/onnx/expect/TestOperators.test_ne.expect"}, {"path": "test/onnx/expect/TestOperators.test_nonzero.expect"}, {"path": "test/onnx/expect/TestOperators.test_norm_p1.expect"}, {"path": "test/onnx/expect/TestOperators.test_norm_p2.expect"}, {"path": "test/onnx/expect/TestOperators.test_ones_like.expect"}, {"path": "test/onnx/expect/TestOperators.test_pad.expect"}, {"path": "test/onnx/expect/TestOperators.test_params.expect"}, {"path": "test/onnx/expect/TestOperators.test_params_onnx_irv4.expect"}, {"path": "test/onnx/expect/TestOperators.test_permute2.expect"}], "pageInfo": {"endCursor": "MTAw", "hasNextPage": true}}, "reviews": {"nodes": [{"author": {"login": "garymm"}, "state": "APPROVED"}], "totalCount": 1}, "comments": {"nodes": [{"bodyText": "@malfet Thank you for info. 
Sure, I have separated the rest of stack from this one, we'll wait for the fix to try again.", "author": {"login": "BowenBao"}, "authorAssociation": "COLLABORATOR", "editor": null}]}}}}}, "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=73099 owner=pytorch": {"data": {"repository": {"pullRequest": {"files": {"nodes": [{"path": "test/onnx/expect/TestOperators.test_pixel_shuffle.expect"}, {"path": "test/onnx/expect/TestOperators.test_pow.expect"}, {"path": "test/onnx/expect/TestOperators.test_prelu.expect"}, {"path": "test/onnx/expect/TestOperators.test_prod.expect"}, {"path": "test/onnx/expect/TestOperators.test_prod_dtype.expect"}, {"path": "test/onnx/expect/TestOperators.test_rand.expect"}, {"path": "test/onnx/expect/TestOperators.test_randn.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_mean.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_prod.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_sum.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect"}, {"path": "test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect"}, {"path": "test/onnx/expect/TestOperators.test_reducemax.expect"}, {"path": "test/onnx/expect/TestOperators.test_reducemin.expect"}, {"path": "test/onnx/expect/TestOperators.test_remainder.expect"}, {"path": "test/onnx/expect/TestOperators.test_repeat.expect"}, {"path": "test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect"}, {"path": "test/onnx/expect/TestOperators.test_round.expect"}, {"path": "test/onnx/expect/TestOperators.test_rrelu.expect"}, {"path": "test/onnx/expect/TestOperators.test_rsqrt.expect"}, {"path": "test/onnx/expect/TestOperators.test_rsub.expect"}, {"path": "test/onnx/expect/TestOperators.test_scatter_add.expect"}, {"path": "test/onnx/expect/TestOperators.test_scatter_add_opset11.expect"}, {"path": "test/onnx/expect/TestOperators.test_selu.expect"}, {"path": "test/onnx/expect/TestOperators.test_shape_value_map.expect"}, {"path": "test/onnx/expect/TestOperators.test_sign.expect"}, {"path": "test/onnx/expect/TestOperators.test_sin.expect"}, {"path": "test/onnx/expect/TestOperators.test_slice.expect"}, {"path": "test/onnx/expect/TestOperators.test_slice_dynamic.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d_none.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_4d.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_ignore_index.expect"}, {"path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_weights.expect"}, {"path": "test/onnx/expect/TestOperators.test_split.expect"}, {"path": "test/onnx/expect/TestOperators.test_split_with_sizes.expect"}, {"path": "test/onnx/expect/TestOperators.test_sqrt.expect"}, {"path": "test/onnx/expect/TestOperators.test_std.expect"}, {"path": "test/onnx/expect/TestOperators.test_sum.expect"}, {"path": "test/onnx/expect/TestOperators.test_sum_dtype.expect"}, 
{"path": "test/onnx/expect/TestOperators.test_tan.expect"}, {"path": "test/onnx/expect/TestOperators.test_topk.expect"}, {"path": "test/onnx/expect/TestOperators.test_topk_smallest_unsorted.expect"}, {"path": "test/onnx/expect/TestOperators.test_transpose.expect"}, {"path": "test/onnx/expect/TestOperators.test_type_as.expect"}, {"path": "test/onnx/expect/TestOperators.test_unfold.expect"}, {"path": "test/onnx/expect/TestOperators.test_unique.expect"}, {"path": "test/onnx/expect/TestOperators.test_unsqueeze.expect"}, {"path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect"}, {"path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect"}, {"path": "test/onnx/expect/TestOperators.test_upsample_nearest_size.expect"}, {"path": "test/onnx/expect/TestOperators.test_view.expect"}, {"path": "test/onnx/expect/TestOperators.test_view_flatten.expect"}, {"path": "test/onnx/expect/TestOperators.test_zeros_like.expect"}, {"path": "torch/csrc/jit/serialization/export.cpp"}, {"path": "torch/csrc/jit/serialization/export.h"}], "pageInfo": {"endCursor": "MTYy", "hasNextPage": false}}}}}}} +{ + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=73811 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "seemethere" + }, + "title": "ci: Migrate metrics credentials to managed IAM", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* __->__ #73811\n\r\nMigrates our credentials to upload metrics statistics to managed IAM\r\ncredentials in order to make it easier to know where the credentials are\r\ncoming from and to make it easier to add more permissions / less\r\npermissions later on.\r\n\r\nRelates to work done in [D34535827](https://www.internalfb.com/diff/D34535827)\r\n\r\nSigned-off-by: Eli Uriegas ", + "headRefName": "gh/seemethere/215/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/seemethere/215/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "13c44d16a876a56bca479b4cf30715d21fa16e99" + } + }, + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ], + "totalCount": 2 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOaHA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, 
+ "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, 
+ { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP89A=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObTk=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUJII=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-xla-linux-bionic-py3.7-clang8" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + 
"pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP_28=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQMyA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQcpA=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "mypy", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS" + }, + { + "name": "toc", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcU4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKc=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": 
"pytorch-xla-linux-bionic-py3.7-clang8" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQjCM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUs2w=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObUI=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObSk=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQKq4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQCGQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "FAILURE" + 
} + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqRADE=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSq34=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQFvU=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQ4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRg=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-10-15-py3-arm64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA8=", + 
"hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-coreml" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAc=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-11-py3-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, macos-11)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, macos-11)", + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSQ2M=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-custom-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAA=", + "hasNextPage": true + } + }, + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ] + }, + "changedFiles": 3, + "files": { + "nodes": [ + { + "path": ".github/templates/common.yml.j2" + }, + { + "path": ".github/workflows/generated-macos-11-py3-x86-64.yml" + }, + { + "path": ".github/workflows/update_pytorch_labels.yml" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "kit1980" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "janeyx99" + }, + "state": "APPROVED" + } + ], + "totalCount": 2 + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1988337976", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068270969 + }, + { + "bodyText": "@pytorchbot force merge this", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068436128 + }, + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1989076952", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068437098 + }, + { + "bodyText": "@pytorchbot merge this", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068482921 + }, + { + "bodyText": "Hey @seemethere.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' 
label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1068484404 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP6yFeQ==", + "hasPreviousPage": true + } + } + } + } + } + }, + "query_sha=a1fbb4e3efd3c0ee1c99a701334f73a0d1fd689c8341a4302ded4d4ebfa5b38a cursor=Y3Vyc29yOnYyOpHPAAAAAVFCdAA= name=pytorch number=73811 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ + { + "commit": { + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7", + "checkSuites": { + "nodes": [ + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-x86-64-coreml" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-metal" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-10-15-py3-lite-interpreter-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + } + ], + 
"pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdes=", + "hasNextPage": false + } + } + } + } + ] + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=31093 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": false, + "isCrossRepository": true, + "author": { + "login": "mingxiaoh" + }, + "title": "improve mkldnn convolution test coverage", + "body": "This pr will improve the test coverage of mkldnn convolution.\r\n1.test input: specific sensitive numbers\r\n2.pass criteria: output of mkldnn convolution matches output of thnn convolution\r\n3.coverage: by using coverage tool, we found out the following sensitive parameters. Overall the case will test 4352 patterns, takes 8.8s on my machine.\r\n\r\nto run the test case:\r\n\r\npython test_mkldnn_conv2d_ext.py\r\nor\r\npython run_test.py -i mkldnn_conv2d_ext\r\n\r\nIn case of failure, the pattern will be printed in the log for further debugging.\r\n\r\nactually, this PR is created to replace and improve that PR we created before(https://github.com/pytorch/pytorch/pull/25085) ", + "headRefName": "master", + "headRepository": { + "nameWithOwner": "mingxiaoh/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "11pikachu" + }, + "email": "junx.du@intel.com", + "name": "dujun" + }, + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + } + } + ], + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "clang-format" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "clang-format", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQYu8fQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQYu8qA==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "codecov/project", + "conclusion": "SUCCESS" + }, + { + "name": "codecov/patch", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQZhcFQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "codecov/patch", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQZZsEQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": 
"Y3Vyc29yOnYyOpHOUquzJg==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOWKm2eg==", + "hasNextPage": false + } + }, + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + } + } + ] + }, + "changedFiles": 5, + "files": { + "nodes": [ + { + "path": "test/math_libraries/convolutions.py" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_googlenet_v3.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_maskrcnn_p1.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_mobilenet.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_resnet_50.json" + } + ], + "pageInfo": { + "endCursor": "NQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" + }, + { + "author": { + "login": "ailzhang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "APPROVED" + } + ], + "totalCount": 34 + }, + "comments": { + "nodes": [ + { + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.\n\n@mruberry sorry, we add them now, would you please check it again? 
Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 673402901 + }, + { + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 673760580 + }, + { + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.\n\n@mruberry It is suggested by @VitalyFedyunin that, we need to display fail test to avoid invalid inputs, I guess we should set it as expected failures under the pytest test framework, right? we will change it as expected failure cases under pytest test framework. The result will looks like be low, is it ok?\n2500 passed, 136 skipped, 0 failed, 0 errors, 2 expected failures, 0 unexpected passes", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 673816925 + }, + { + "bodyText": "Displaying tests that fail is fine, but I don't think @VitalyFedyunin meant that it was OK if the tests didn't pass. If these are expected failures then yes, you can use with self.assertRaises(RuntimeError):... when testing them. If you also want to report that the test has test cases with these properties you can print or warn, which will appear in the test output.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 673858224 + }, + { + "bodyText": "Codecov Report\n\nMerging #31093 into master will not change coverage.\nThe diff coverage is n/a.\n\n\n@@ Coverage Diff @@\n## master #31093 +/- ##\n=======================================\n Coverage 68.00% 68.00% \n=======================================\n Files 382 382 \n Lines 49527 49527 \n=======================================\n Hits 33679 33679 \n Misses 15848 15848 \n\nContinue to review full report at Codecov.\n\nLegend - Click here to learn more\n\u0394 = absolute (impact), \u00f8 = not affected, ? = missing data\nPowered by Codecov. Last update 69f6d94...29f6aa6. 
Read the comment docs.", + "author": { + "login": "codecov" + }, + "authorAssociation": "NONE", + "editor": { + "login": "codecov" + }, + "databaseId": 686921371 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOKCNQFQ==", + "hasPreviousPage": true + } + } + } + } + } + }, + "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOKCNQFQ== name=pytorch number=31093 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "Hi, @mingfeima @soumith @Jianhui-Li\nthis will improve the test coverage of mkldnn convolution, would you please review it?\nThe current code is forward only, do we need to cover backward, if yes, we can add backward.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564806270 + }, + { + "bodyText": "@mingxiaoh, what is the value in testing DNNL as part of Pytorch validation for the Pytorch developers? Shouldn't having these tests run in DNNL validation be enough?", + "author": { + "login": "vpirogov" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564808528 + }, + { + "bodyText": "@vpirogov The main value is to serve as a blind test to DNNL. If DNNL adds these test to DNNL test sets, it lost the value as a blind test. The spirit of validation is to cross check.\n@gottbrath @gchanan The test was developed per the request of Pytorch team. Mingxiao made an effort to reduce the execution time to a few second but still with good coverage. Although the test today is focused on DNNL, it could be easily extended to be blind test for any conv implementation used in Pytorch.", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 567826907 + }, + { + "bodyText": "@mruberry thanks for the comment. As for the chainer dependency, we import it is because we would like to use its testing function for pytest test cases combinations, other wise we need to write much more code to achieve same effect. So, can we use it?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 574563012 + }, + { + "bodyText": "@mingxiaoh You cannot import chainer. Looking at the code you should be able to achieve the same effect without it.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 575272358 + }, + { + "bodyText": "@mruberry ok, we will change it according to your requirement. Thanks", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 583917522 + }, + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/31093\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 29f6aa6 (more details on the Dr. CI page):\n\nCommit 29f6aa6 was recently pushed. Waiting for builds...\n\nThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "dr-ci" + }, + "authorAssociation": "NONE", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 628466876 + }, + { + "bodyText": "@mruberry how about those cudnn UT error? 
we add check for it but it should be NV to fix cudnn bugs.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 629955767 + }, + { + "bodyText": "Hey @mingxiaoh! You're right, of course, that you shouldn't have to fix cuDNN bugs. Would you please:\n\nAssert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update.\nFile a new issue explaining the behavior and providing a short PyTorch program to reproduce the issue.\n\nThen we can ping NVIDIA on that issue.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 629997129 + }, + { + "bodyText": "about the suggestion 'Assert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update. ', if we only assert it and continue the following test, I guess users might always ignore them in later test. Anyway, any similar example case for reference?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 630010734 + }, + { + "bodyText": "In this recent PR https://github.com/pytorch/pytorch/pull/38505/files, for example, you can see that the construction of bool tensors wasn't working properly, so the test author cited the relevant issue and asserted that the incorrect behavior happened, as expected. You can also see how these lines are being removed by https://github.com/pytorch/pytorch/pull/38392/files, which fixes the issue.\nAnother common pattern is to use with self.assertRaises(RuntimeError/AssertionError/etc.):.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 630014823 + }, + { + "bodyText": "@mruberry the failed UT case is not introduced by our modification, how to handle this issue?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631187735 + }, + { + "bodyText": "@mingxiaoh You mean the failures on ROCm? You may ignore them. Be sure to re-request review when you're ready.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 631191425 + }, + { + "bodyText": "@mruberry we already skipped those ROCm errors, but there are stil somel error caused by the original code, they are not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631886529 + }, + { + "bodyText": "I understand. Let me know when you're ready for me to review.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 631908011 + }, + { + "bodyText": "@mruberry thanks, we are ready for review now.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631909442 + }, + { + "bodyText": "@mingxiaoh Great! I'll take a look ASAP.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 631910556 + }, + { + "bodyText": "@mruberry we just pull the latest code and updated the patch according to your comment, may you please help double check it? BTW, the new failed case in preci is not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 633430458 + }, + { + "bodyText": "@ailzhang would you please check the comment below? 
Thanks.\nIs there a reason why this TestConv2dExt is a new class instead a test inside TestNN?\n//comment: it is actually suggested by Tongzhou Wang in another thread before.\nAlthough this test sits in generic testing framework, it's actually comparing thnn/mkldnn/cudnn results specially. I feel it's better to make it truly generic so that it compares any device result with CPU result. Alternatively you can mark this test only run when torch.backends.mkldnn.is_available()=True\n//comment: but our goal is to compare the result with that of thnn. Anyway, if you insist, we can start to compare it with cpu.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 634432326 + }, + { + "bodyText": "Pruning reviewers. @ngimel, @VitalyFedyunin, this PR is looking pretty good from a test framework perspective. Would one of you like to review?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 634557563 + }, + { + "bodyText": "@mruberry Thanks, would you please help review it again. BTW: failed case is not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 635256214 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637364148 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637444457 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.\n\n@mruberry thank you", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637479226 + }, + { + "bodyText": "Improving test coverage of math libraries is certainly a good goal and this PR is moving towards it. I have some doubts about implementation decisions made, and about running this PR as part of regular pytorch CI.\nIf the primary goal of this PR is to test correctness of the convolution implementations in the vendor library, then it does not serve this purpose. The absolute majority of the 4000+ test cases come from group 1, where different kernel sizes/strides/dilations are used to produce the output of size 1x1. This can test whether pytorch correctly passes convolution parameters to the backends (although there are cheaper ways to do that), but as actual library correctness check it is almost useless - libraries use very different kernels depending in the input/output sizes, and tests with toy sizes like this don't invoke the real bread-and-butter kernels.\nAlso, if this test suite is meant as primary a means of testing vendor libraries (which is a good goal!) it does not have a place as a part of pytorch regular CI, and should be run when the corresponding vendor libraries are updated. 
I'd suggest moving this test out into a separate file (maybe even outside of torch/test directory) and have it as a part of library update/qualification process rather than regular CI.\nAlso, if the primary goal is to enable easier testing of vendor libraries correctness, perhaps we should rethink the mechanism of the generation of test cases. It should be easy to add a test case with a particular set of parameters that was found to be buggy. Also, running a cross-product of cases in a multi-dimensional space (as this PR does) is rarely an efficient way of getting a signal, some forms of random sampling usually provide a way to get better correctness signal why using less resources.\nAlso, when testing libraries it is important to test both forward and backward functions, whereas this PR does forward only. I'm openminded on whether convTransposed should be tested or not - if we are testing vendor libraries, then it's not necessary, convTransposed calls the same underlying functions, if we are testing pytorch, then it makes sense to test it separately because it takes different codepaths.", + "author": { + "login": "ngimel" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637827507 + }, + { + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637912105 + }, + { + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.\n\nWe know this PR has been open for awhile and we respect that your time is valuable, but we want to make sure we're making the right change here, and I think @ngimel's comments reflect that and should not be too difficult to address. As I understand, her points are:\n\nThis is a good PR with an exciting idea. To let it run longer and test more cases maybe it should run outside the regular PyTorch CI.\nTo remedy this, let's create a test/math_libraries folder and put this test there: test/math_libaries/convolutions.py. Yes, this is different from our requests in the past, which is our mistake, but it should be an easy change.\nTo make the test more interesting it'd be good for the test cases to resemble convolutions used in practice. The current test cases seem like similar \"toy\" examples. Without time pressure we should be able to run larger, more computationally intensive convolutions.\nLet's change the test cases to include some practical convolutions, make it easy to add test cases, and think about how we might generate other interesting cases. (We should also test backwards once we have more time!)\n\nAnd I think these are good points. 
Maybe the PR doesn't create a new way to generate interesting convolutions to start and instead only runs a few representative convolutions, but @ngimel is positioning the work for success so that it's useful and we can continue to improve on it in the future.\nDoes that make sense?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637924703 + }, + { + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 637960626 + }, + { + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.\n\nI'm sorry, I don't think I've talked to @Jianhui-Li before. It's true that the team we expressed a concern about timing if the test was to be run in the CI initially, but I think now that we understand what the test is trying to do better we're not sure the CI is the best place for it. The PR was also closed after a lengthy period of inactivity, and we assumed it had simply been abandoned.\nDo you know who @Jianhui-Li spoke with about this issue originally? Maybe I can follow-up with them for more context.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637967153 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637978356 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 638446723 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.\n\nLet me sync with Mingxiao and follow up with this. 
Thanks.", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 638451670 + }, + { + "bodyText": "@mruberry would you please help review it again?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 653028208 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 654443242 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 656062287 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 658071151 + }, + { + "bodyText": "super nit: renaming files to .json will make it more IDE friendly.", + "author": { + "login": "VitalyFedyunin" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 658464685 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.\n\nCool! I took a look with @ngimel, once these issues are addressed I think we're good to go!", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 659164401 + }, + { + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 660884305 + }, + { + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? 
Thanks.\n\nUpdated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 662678464 + }, + { + "bodyText": "Updated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.\n@mruberry we have finished the modification according to your comment, would you please review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 662930687 + }, + { + "bodyText": "The code looks good, but I tried running the test suite and hit the following failures:\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float16, group:1, batchsize:22input channel:448, output channel:384, bias:False, padding:[1, 1], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float32, group:1, batchsize:22input channel:80, output channel:192, bias:False, padding:[0, 0], dilation:[1, 1], stride:[1, 1], 
kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 106, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\nLooking at the first invalid convolution, for example, it's:\n {\n \"case_name\":\"masknet_p1:conv33\",\n \"mb\":1,\n \"g\":1,\n \"ic\":512,\n \"ih\":64,\n \"iw\":64,\n \"oc\":12,\n \"kh\":1,\n \"kw\":1,\n \"sh\":1,\n \"sw\":1,\n \"ph\":0,\n \"pw\":0,\n \"dh\":0,\n \"dw\":0,\n \"bias\":\"False\"\n },\n\nwhich has a dh and dw of zero, causing it to be added to invalid cases here:\ndh, dw = case['dh'], case['dw']\n has_bias = case['bias']\n if dh == 0 or dw == 0:\n invalid_cases.append(case_name)", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "mruberry" + }, + "databaseId": 663240268 + }, + { + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? 
Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 664373079 + }, + { + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.\n\nBefore I run these tests again, is an atol of 1e-2 needed for all types or just half? Also, how does 1e-2 compare to the values that are being compared?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 664569507 + }, + { + "bodyText": "@mruberry 1e-2 is experimental result, details see below, random means it might be failed sometimes.\n\n\n\natol,rtol\n1e-2,1e-2\n1e-2,1e-3\n1e-3,1e-2\n1e-3,1e-3\n1e-4,1e-3\n1e-3,1e-4\n1e-4,1e-4\n1e-4,1e-5\n1e-5,1e-4\n\n\n\n\nCuda float16\npass\npass\npass\npass\npass\nfail\nFail\nFail\nfail\n\n\nCuda float32\npass\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nfail", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 666894774 + }, + { + "bodyText": "@mruberry would you please find time to review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 668380451 + }, + { + "bodyText": "@mruberry would you please find time to review it again? Thanks.\n\nI was just about to try and run this again locally but it looks like the files describing the convolutions are missing?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 670306210 + }, + { + "bodyText": "@mruberry sorry but what is missing actually?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 670322557 + }, + { + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 670591170 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOIapCfg==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=71759 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "coolteemf" + }, + "title": "Optimize grid sample 3d", + "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. 
It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", + "headRefName": "optimize_grid_sample_3d", + "headRepository": { + "nameWithOwner": "coolteemf/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "563ec73747ad53b63b36736c47c4342f962c2a09" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f" 
+ } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "f683e8aec7aea76097a264eec01511e704c31154" + } + }, + { + "commit": { + "author": { + "user": { + "login": "coolteemf" + }, + "email": "67541941+coolteemf@users.noreply.github.com", + "name": "Fran\u00e7ois Lecomte" + }, + "oid": "b932e9e286c22aaf352375186df851ef060b295a" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + } + } + ], + "totalCount": 16 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGYqY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIob0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1E=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwMsZY=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": 
"build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwZbzg=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "mypy", + "conclusion": "SUCCESS" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "toc", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGbAQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwJC4U=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ_w=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIWu4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": 
"Y3Vyc29yOnYyOpHPAAAAATwGZ1k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwOTJ0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ80=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIUUk=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIXQk=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ9k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ08=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIaFM=", + 
"hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ9Y=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwXcvs=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIjzs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ9Q=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ueg=", + "hasNextPage": false + } + }, + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + } + } + ] + }, + "changedFiles": 9, + "files": { + "nodes": [ + { + "path": "aten/src/ATen/native/GridSampler.cpp" + }, + { + "path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp" + }, + { + "path": "aten/src/ATen/native/cuda/GridSampler.cpp" + }, + { + "path": "aten/src/ATen/native/cuda/GridSampler.cu" + }, + { + "path": "aten/src/ATen/native/cuda/GridSampler.h" + }, + { + "path": "aten/src/ATen/native/native_functions.yaml" + }, + { + "path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py" + }, + { + "path": "test/test_nn.py" + }, + { + "path": "tools/autograd/derivatives.yaml" + } + ], + "pageInfo": { + "endCursor": "OQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": 
"coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "albanD" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "albanD" + }, + "state": "APPROVED" + } + ], + "totalCount": 17 + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to 'NoneType' object is not subscriptable\nRaised by https://github.com/pytorch/pytorch/actions/runs/1887945630", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1048868910 + }, + { + "bodyText": "Thanks for the update! The windows failure is not your fault, you can ignore it!\n\nThank you very much for all of your feedback and sorry for the delay !", + "author": { + "login": "coolteemf" + }, + "authorAssociation": "CONTRIBUTOR", + "editor": null, + "databaseId": 1048983572 + }, + { + "bodyText": "@coolteemf can you please send either me or @albanD an email? (or I can send you and invite to collab on private repo)", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1049048119 + }, + { + "bodyText": "@pytorchbot merge this please", + "author": { + "login": "albanD" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1049131992 + }, + { + "bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1049134520 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOPoR4Lg==", + "hasPreviousPage": true + } + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=68111 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "chunyuan-w" + }, + "title": "Add JIT graph fuser for oneDNN Graph API (Preview4)", + "body": "## Description\r\nPreview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).\r\n\r\nOn the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:\r\n\r\n- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used\r\n- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.\r\n\r\n### User API:\r\nThe optimization pass is disabled by default. 
Users could enable it by:\r\n```\r\ntorch.jit.enable_onednn_fusion(True)\r\n```\r\n\r\n### Performance:\r\n[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:\r\n- SkyLake 8180 (1 socket of 28 cores):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)\r\n\r\n- SkyLake 8180 (single thread):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)\r\n \\* By mapping hardswish to oneDNN Graph, it\u2019s 8% faster than PyTorch JIT (NNC + OFI)\r\n \\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops\r\n\r\n\r\n### Directory structure of the integration code\r\nFuser-related code are placed under:\r\n```\r\ntorch/csrc/jit/codegen/onednn/\r\n```\r\n\r\nOptimization pass registration is done in:\r\n```\r\ntorch/csrc/jit/passes/onednn_graph_fuser.h\r\n```\r\n\r\nCMake for the integration code is:\r\n```\r\ncaffe2/CMakeLists.txt\r\n```\r\n\r\n## Limitations\r\n\r\n- In this PR, we have only supported the optimization on Linux platform. The support on Windows and MacOS will be enabled as the next step.\r\n- We have only optimized the inference use case.", + "headRefName": "chunyuan/llga_preview2", + "headRepository": { + "nameWithOwner": "chunyuan-w/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "0096fcc49f277fd8e006fcb42e0cb28a1422ec98" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "7bcc4de26a5472f1d252735dd425b46794b0844f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "3a2a588bfe6bbf9bf74d88d441cd22affda207da" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "ca7df12fbfaa3ddbabeca39b76300d17f4a33f2f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "81d44f35b8bc043c38837d0694e5bc072203b832" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "14fd5d1bfc2c58a71379f778871e3fca0a8e79b2" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "954dc23663125897f4b199eb2a8607dc5fca3274" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9f77a0b476accc678b6f0569e4ff33fa6bbe97fc" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "fbf3b23bc1288697e1aec539a7c4ee3dc0bcb84c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": 
"f8b8e78f786586c3cdf3966fd83ffa124d3eda70" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6fffa2f7453ee7e0f8d8e2f73ea8a65230539589" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "849385404e6f3cd1cf7cef19f931ecf4fa28afdb" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "adbae7b77f8c0dbc59fccf15207d97ba86cfade2" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6dcf2a4981aff24fa16fc7461ae4ec29690f956f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "54f3e05ad524cffd0911ee93be3c50f589b51f58" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "edbfc640ea79a0af85757d9e73796dcc90231519" + } + }, + { + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "67654db7cba562809d1b4a44cdda58af5cc9daaf" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9c9d99b930b11af9ff03f52d45bf49c652df758d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ffb25119cd9ce815cc4d9d14a2317fcbbfa9ea86" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ab9eee84512ca1bdfbc81e25c6eb67b29d0f302a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "62a4642cf3330524990a69ac29e002c97812320a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ca9b1223be4af2c8b4929303d498eafd71793128" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "6f4a23d24514a02954d2ec792830085f612223c9" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b2a9a9c0926b02d0b2e87722ed61450f224a61d0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e88b492be733f24b6aa395829c76add67d0901e7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c44336d7a914952bfb78e012e08d9a6d6dde5937" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "5157930f7b3921d41a586260582b574c915f6ca1" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": 
"sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "04cb8353813f6bbd0d913a994923cc7e1e291406" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "62991eaad0e638bb0bced327e03f932f66f68732" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "7496bf1588050191595d833d23b8972b2f22655e" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d9d35f23cca0cd29c78a845731b24826152dcf1c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "f74ec134f18a65a7c72455bdf44f72e3ebb27105" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "eb32cc65a975361160948bfc3d6a577991ea262e" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c7665f8d695b680c54db0bad2b7b7df46d886b50" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e6321ad8f59ea01130568c202d186448bb9cb9d0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a72cd0d02693f45e5354a70654581ad514581ec7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b3cd3028b4ed31805e82f7eaf02217ab74ca59b9" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "49a592d9788d08e6cd0593882f867e129057c1cc" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "0575766b2144b13f6a38227c4e2b8d22ec8db80f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b5c9b10ff87d622350e8ca64fae3a476eb70d5aa" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "66bc652a30ccc329adb929870a4ac726bb98b38c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "72b9ca9c8e2dac98cbb7199b3dfac7c7305b80c5" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a7892ed7373207d96406c8b5734a089643c5cdbd" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d54cb084e1daad8a08c3f8de0ad3f7afb5b05ac1" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": 
"aef71d692a8a159e0ca56be363e2cc1225ce7647" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "bf618e205ec31cff962dcc8ab478e0a699a9572d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e4a331f1088448f7d7d86256ce71e0e71da006b0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "0b743523d1430fec759d5fefbb687f17c89335a5" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e80a351a62d98b810ec8985c4b25257af1d6c5bb" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c189eca154b6691919d0e21489d1c322c7435c0b" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "e080a067c75d7b888a8a362682a2d5ba70e0c3a8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "028561fbf8f3ed90e074e6e0e3a4ca4dd7ffa2a8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "d550cf14037badd4caa2f52202e2f20bc4db8432" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "574159ebadd1dec24daaf883879ffeca8d9e71b7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9eb3ee98ea756067ed1c8f52f309f6d3e211a904" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "29929f48be03dcdd1bbfade572de7feafa825547" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "8a7358ca8da547b40ea1a99ddc57ebed19959684" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "6606637d2c5525b43e294a8b366a85052e1be0c6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "5ecfd1f28b87045deb8bc8ffe33b3d8b906f3264" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "be2d4345c65442c4cfbe8afdfb2ae0893945da42" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b5b89d3644a43e2dbda841cafb71b32edbe07c8a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nikita.shulga@gmail.com", + "name": "Nikita Shulga" + }, + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + } + } + ], + "totalCount": 62 + }, + "commits": { + "nodes": [ 
+ { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + }, + { + "name": "Meta Internal-Only Changes Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NXnc=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS" + }, + { + "name": "toc", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NZdg=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NYIw=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "FAILURE" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NZ50=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxQs=", + "hasNextPage": false + } + }, + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + } + } + ] + }, + "changedFiles": 37, + "files": { + "nodes": [ + { + "path": "aten/src/ATen/core/interned_strings.h" + }, + { + "path": "caffe2/CMakeLists.txt" + }, + { + "path": "cmake/Dependencies.cmake" + }, + { + "path": "cmake/Modules/FindMKLDNN.cmake" + }, + { + "path": "cmake/public/mkldnn.cmake" + }, + { + "path": "docs/source/jit.rst" + }, + { + "path": "test/test_jit_llga_fuser.py" + }, + { + "path": 
"torch/_C/__init__.pyi.in" + }, + { + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/README.md" + }, + { + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_helper.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_helper.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_rewriter.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/guard_shape.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/guard_shape.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/interface.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/interface.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/kernel.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/kernel.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/operator.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/register_interface.cpp" + }, + { + "path": "torch/csrc/jit/ir/alias_analysis.cpp" + }, + { + "path": "torch/csrc/jit/ir/ir.cpp" + }, + { + "path": "torch/csrc/jit/passes/inline_autodiff_subgraphs.cpp" + }, + { + "path": "torch/csrc/jit/passes/onednn_graph_fuser.h" + }, + { + "path": "torch/csrc/jit/python/init.cpp" + }, + { + "path": "torch/csrc/jit/runtime/operator.cpp" + }, + { + "path": "torch/jit/__init__.py" + } + ], + "pageInfo": { + "endCursor": "Mzc", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "chunyuan-w" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + 
"login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "wukong1992" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "malfet" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "malfet" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "malfet" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + } + ], + "totalCount": 49 + }, + "comments": { + "nodes": [ + { + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. I am reverting.", + "author": { + "login": "suo" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498483 + }, + { + "bodyText": "@pytorchbot revert this", + "author": { + "login": "suo" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498550 + }, + { + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. I am reverting.\n\nOops! Will fix it ASAP.", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1074499668 + }, + { + "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074508608 + }, + { + "bodyText": "This pull request has been reverted by e5bf879. 
To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1082508130 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQAuLsw==", + "hasPreviousPage": true + } + } + } + } + } + }, + "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOQAuLsw== name=pytorch number=68111 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/chunyuan-w/pytorch/blob/7496bf1588050191595d833d23b8972b2f22655e/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab 
skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries/conda\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-manywheel\nciflow/binaries, ciflow/binaries/wheel\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.1-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.1-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\n\n\nYou can add a comment to the PR and tag @pytorchbot with the following commands:\n\n# ciflow rerun, \"ciflow/default\" will always be added automatically\n@pytorchbot ciflow rerun\n\n# ciflow rerun with additional labels \"-l \", which is equivalent to adding these labels manually and 
trigger the rerun\n@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow\n\nFor more information, please take a look at the CI Flow Wiki.", + "author": { + "login": "pytorch-probot" + }, + "authorAssociation": "NONE", + "editor": { + "login": "pytorch-probot" + }, + "databaseId": 964902865 + }, + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/68111\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 7388141 (more details on the Dr. CI page):\n\n\n29/29 failures introduced in this PR\n\n\n\ud83d\udd75\ufe0f 29 new failures recognized by patterns\nThe following CI failures do not appear to be due to upstream breakages:\n pull / linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge) (1/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:31:38.6978776Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:31:38.3001628Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:31:38.5169168Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:31:38.5362923Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:31:38.5413452Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:31:38.5458747Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:31:38.5484014Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:31:38.5497924Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:31:38.5656491Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:31:38.5678893Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:31:38.6888479Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f6488c20adb4dca4\n2022-03-21T21:31:38.6978776Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:31:38.6992648Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:31:38.7003010Z ##[error]Process completed with exit code 2.\n2022-03-21T21:31:38.7044027Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:31:38.7044261Z with:\n2022-03-21T21:31:38.7044413Z env:\n2022-03-21T21:31:38.7044565Z IN_CI: 1\n2022-03-21T21:31:38.7044709Z IS_GHA: 1\n2022-03-21T21:31:38.7044885Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:31:38.7045067Z ##[endgroup]\n2022-03-21T21:31:38.7060958Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge) (2/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:35:19.2635222Z python3: can't 
ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:35:18.9028722Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:35:19.1132721Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:35:19.1310590Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:35:19.1360251Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:35:19.1386865Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:35:19.1429182Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:35:19.1441925Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:35:19.1468280Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:35:19.1617667Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:35:19.2545368Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-098be2985e0392130\n2022-03-21T21:35:19.2635222Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:35:19.2648463Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:35:19.2658727Z ##[error]Process completed with exit code 2.\n2022-03-21T21:35:19.2706355Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:35:19.2706591Z with:\n2022-03-21T21:35:19.2706748Z env:\n2022-03-21T21:35:19.2706908Z IN_CI: 1\n2022-03-21T21:35:19.2707061Z IS_GHA: 1\n2022-03-21T21:35:19.2707246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:35:19.2707438Z ##[endgroup]\n2022-03-21T21:35:19.2724554Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (3/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:11:52.7662022Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T23:11:53.1213298Z ---------------------------------------- 8.1/8.1 MB 23.6 MB/s eta 0:00:00\n2022-03-21T23:11:53.1644665Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:11:53.2218699Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T23:11:53.2389674Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T23:11:53.2787295Z -------------------------------------- 247.7/247.7 KB 7.4 MB/s eta 0:00:00\n2022-03-21T23:11:53.3761842Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:11:53.5457622Z Installing 
collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T23:11:57.4175080Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T23:11:57.5296815Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0105d4db093574f40\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:11:57.5564814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:11:57.5587712Z ##[error]Process completed with exit code 2.\n2022-03-21T23:11:57.5790311Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T23:11:57.5790832Z with:\n2022-03-21T23:11:57.5791104Z env:\n2022-03-21T23:11:57.5791358Z IN_CI: 1\n2022-03-21T23:11:57.5791620Z IS_GHA: 1\n2022-03-21T23:11:57.5791939Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:11:57.5792425Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T23:11:57.5792884Z ##[endgroup]\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (4/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T02:17:12.6257577Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T02:17:11.9280556Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T02:17:11.9335199Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:11.9682045Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T02:17:11.9850357Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0403171Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T02:17:12.0468875Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0590000Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T02:17:12.0607093Z Installing collected packages: jmespath, urllib3, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T02:17:12.5273459Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T02:17:12.6032812Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-114\n2022-03-22T02:17:12.6257577Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T02:17:12.6259543Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T02:17:12.6291924Z ##[error]Process completed with exit code 2.\n2022-03-22T02:17:12.6387977Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T02:17:12.6388298Z with:\n2022-03-22T02:17:12.6388521Z wait-ssh: false\n2022-03-22T02:17:12.6388727Z env:\n2022-03-22T02:17:12.6388932Z IN_CI: 1\n2022-03-22T02:17:12.6389143Z IS_GHA: 1\n2022-03-22T02:17:12.6389368Z GIT_DEFAULT_BRANCH: 
master\n2022-03-22T02:17:12.6389669Z DOCKER_HOST: unix:///run/user/1121/docker.sock\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge) (5/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:19:24.4890693Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:19:24.0962005Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:19:24.3152253Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:19:24.3341183Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:19:24.3391374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:19:24.3436392Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:19:24.3448982Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:19:24.3474092Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:19:24.3502003Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:19:24.3655072Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:19:24.4799309Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0bc9250521f338cae\n2022-03-21T22:19:24.4890693Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:19:24.4903625Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:19:24.4913841Z ##[error]Process completed with exit code 2.\n2022-03-21T22:19:24.4957338Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:19:24.4957575Z with:\n2022-03-21T22:19:24.4957735Z env:\n2022-03-21T22:19:24.4957900Z IN_CI: 1\n2022-03-21T22:19:24.4958055Z IS_GHA: 1\n2022-03-21T22:19:24.4958246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:19:24.4958437Z ##[endgroup]\n2022-03-21T22:19:24.4989649Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu) (6/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T01:05:07.6983899Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T01:05:06.8364546Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T01:05:06.8431763Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.8949391Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T01:05:06.9180079Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.9803351Z Using cached 
https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T01:05:06.9882133Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:07.0067062Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T01:05:07.0088676Z Installing collected packages: urllib3, jmespath, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T01:05:07.5819667Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T01:05:07.6774717Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-60\n2022-03-22T01:05:07.6983899Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T01:05:07.6988652Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T01:05:07.7023073Z ##[error]Process completed with exit code 2.\n2022-03-22T01:05:07.7102087Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T01:05:07.7102389Z with:\n2022-03-22T01:05:07.7102603Z wait-ssh: false\n2022-03-22T01:05:07.7102820Z env:\n2022-03-22T01:05:07.7103015Z IN_CI: 1\n2022-03-22T01:05:07.7103224Z IS_GHA: 1\n2022-03-22T01:05:07.7103458Z GIT_DEFAULT_BRANCH: master\n2022-03-22T01:05:07.7103737Z DOCKER_HOST: unix:///run/user/1502/docker.sock\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge) (7/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:51:39.3637996Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:51:39.2041249Z Attempting uninstall: s3transfer\n2022-03-21T20:51:39.2043010Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:51:39.2083799Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:51:39.2089675Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:51:39.2480546Z Attempting uninstall: boto3\n2022-03-21T20:51:39.2482953Z Found existing installation: boto3 1.16.34\n2022-03-21T20:51:39.2584292Z Uninstalling boto3-1.16.34:\n2022-03-21T20:51:39.2599474Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:51:39.3130921Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:51:39.3550598Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03ef7efc3078e3da5\n2022-03-21T20:51:39.3637996Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:51:39.3650651Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:51:39.3660484Z ##[error]Process completed with exit code 2.\n2022-03-21T20:51:39.3696465Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:51:39.3696693Z with:\n2022-03-21T20:51:39.3696850Z env:\n2022-03-21T20:51:39.3697012Z IN_CI: 1\n2022-03-21T20:51:39.3697161Z IS_GHA: 1\n2022-03-21T20:51:39.3697342Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:51:39.3697528Z ##[endgroup]\n2022-03-21T20:51:39.3730420Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge) (8/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:36.3916860Z python3: can't ope...ow_job_id.py': [Errno 2] No such 
file or directory\n\n2022-03-21T21:03:36.0096309Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:03:36.2278560Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:03:36.2461618Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:03:36.2513260Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:03:36.2541524Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:03:36.2554899Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:03:36.2598277Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:03:36.2758299Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:03:36.2780690Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:03:36.3825021Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0a4a552890e6ef7d3\n2022-03-21T21:03:36.3916860Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:03:36.3930343Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:03:36.3941263Z ##[error]Process completed with exit code 2.\n2022-03-21T21:03:36.3979258Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:03:36.3979496Z with:\n2022-03-21T21:03:36.3979654Z env:\n2022-03-21T21:03:36.3979814Z IN_CI: 1\n2022-03-21T21:03:36.3979968Z IS_GHA: 1\n2022-03-21T21:03:36.3980157Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:03:36.3980360Z ##[endgroup]\n2022-03-21T21:03:36.3996257Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu) (9/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:41:10.3015614Z Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)\n2022-03-22T00:41:10.3625659Z ---------------------------------------- 79.5/79.5 KB 1.1 MB/s eta 0:00:00\n2022-03-22T00:41:10.4120236Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-22T00:41:10.4170155Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-22T00:41:10.4722115Z -------------------------------------- 247.7/247.7 KB 5.2 MB/s eta 0:00:00\n2022-03-22T00:41:10.4843512Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:41:10.6596108Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:41:10.8733354Z Installing collected packages: python-dateutil, 
jmespath, botocore, s3transfer, boto3\n2022-03-22T00:41:15.3745408Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-22T00:41:15.4987162Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-09cacc848abc3dd32\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:41:15.5373630Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:41:15.5404353Z ##[error]Process completed with exit code 2.\n2022-03-22T00:41:15.5790508Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-22T00:41:15.5791192Z with:\n2022-03-22T00:41:15.5791530Z env:\n2022-03-22T00:41:15.5791849Z IN_CI: 1\n2022-03-22T00:41:15.5792186Z IS_GHA: 1\n2022-03-22T00:41:15.5792599Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:41:15.5793237Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-22T00:41:15.5793831Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge) (10/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:32.9799307Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:32.8167560Z Attempting uninstall: s3transfer\n2022-03-21T20:50:32.8169351Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:50:32.8213295Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:50:32.8219209Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:50:32.8602320Z Attempting uninstall: boto3\n2022-03-21T20:50:32.8603289Z Found existing installation: boto3 1.16.34\n2022-03-21T20:50:32.8704535Z Uninstalling boto3-1.16.34:\n2022-03-21T20:50:32.8719403Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:50:32.9244278Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:50:32.9710449Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0c568461a276d4a71\n2022-03-21T20:50:32.9799307Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:32.9812238Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:32.9823052Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:32.9859290Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:32.9859527Z with:\n2022-03-21T20:50:32.9859664Z env:\n2022-03-21T20:50:32.9859817Z IN_CI: 1\n2022-03-21T20:50:32.9859977Z IS_GHA: 1\n2022-03-21T20:50:32.9860144Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:32.9860327Z ##[endgroup]\n2022-03-21T20:50:32.9893642Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge) (11/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7163042Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.6660824Z #10 0x55fc8a3ea801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.6661768Z #11 0x55fc8a3f57a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.6662455Z #12 0x55fc8a3f580b in PyRun_SimpleStringFlags 
/tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.6663570Z #13 0x55fc8a3f5908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.6663952Z #14 0x55fc8a3f5908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.6664431Z #15 0x55fc8a3f5908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.6665304Z #16 0x55fc8a3f5ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7162113Z #17 0x7f940d00f83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7162534Z #18 0x55fc8a39a554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7162711Z \n2022-03-21T21:05:00.7163042Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.7334595Z + retcode=1\n2022-03-21T21:05:00.7334954Z + set -e\n2022-03-21T21:05:00.7335215Z + return 1\n2022-03-21T21:05:00.7338688Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.7339232Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.7340113Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.7340612Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.7341187Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.7341668Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.7344466Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (12/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:06:03.4437430Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:06:03.0752199Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:06:03.2853252Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:06:03.3032326Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:06:03.3081589Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:06:03.3093911Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:06:03.3120244Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:06:03.3162406Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:06:03.3188431Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:06:03.3337181Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:06:03.4348072Z ++ python3 .github/scripts/get_workflow_job_id.py 
2018440039 i-0ee48c8811fafc444\n2022-03-21T22:06:03.4437430Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:06:03.4450920Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:06:03.4461263Z ##[error]Process completed with exit code 2.\n2022-03-21T22:06:03.4502346Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:06:03.4502576Z with:\n2022-03-21T22:06:03.4502730Z env:\n2022-03-21T22:06:03.4502888Z IN_CI: 1\n2022-03-21T22:06:03.4503038Z IS_GHA: 1\n2022-03-21T22:06:03.4503302Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:06:03.4503492Z ##[endgroup]\n2022-03-21T22:06:03.4519156Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (13/29)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:13.2205634Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:12.8679322Z + python3 -m pip install boto3==1.19.12\n2022-03-21T20:50:13.0744228Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T20:50:13.0916284Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T20:50:13.0964264Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T20:50:13.1005656Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T20:50:13.1017299Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T20:50:13.1041042Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T20:50:13.1189450Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T20:50:13.1208751Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T20:50:13.2119445Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d02da60fd18c22f5\n2022-03-21T20:50:13.2205634Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:13.2217939Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:13.2220259Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:13.2248664Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:13.2249012Z with:\n2022-03-21T20:50:13.2249260Z env:\n2022-03-21T20:50:13.2249500Z IN_CI: 1\n2022-03-21T20:50:13.2249738Z IS_GHA: 1\n2022-03-21T20:50:13.2250025Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:13.2250329Z ##[endgroup]\n2022-03-21T20:50:13.2272735Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (14/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:47:38.0451999Z python3: can't ope...ow_job_id.py': [Errno 2] No such file 
or directory\n\n2022-03-21T23:47:37.5554508Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:47:37.8411473Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:47:37.8631484Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:47:37.8699561Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:47:37.8737037Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:47:37.8754443Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:47:37.8814393Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:47:37.8849540Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:47:37.9059579Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:47:38.0336298Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0b44f47f4292089a2\n2022-03-21T23:47:38.0451999Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:47:38.0469471Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:47:38.0484106Z ##[error]Process completed with exit code 2.\n2022-03-21T23:47:38.0532678Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:47:38.0533007Z with:\n2022-03-21T23:47:38.0533223Z env:\n2022-03-21T23:47:38.0533440Z IN_CI: 1\n2022-03-21T23:47:38.0533649Z IS_GHA: 1\n2022-03-21T23:47:38.0533902Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:47:38.0534170Z GPU_FLAG: --gpus all\n2022-03-21T23:47:38.0534401Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (15/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:04:59.3115800Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:04:59.2595213Z #10 0x55a7f39a4801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:04:59.2595707Z #11 0x55a7f39af7a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:04:59.2597203Z #12 0x55a7f39af80b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:04:59.2598205Z #13 0x55a7f39af908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:04:59.2598697Z #14 0x55a7f39af908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:04:59.2599178Z #15 0x55a7f39af908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:04:59.2599747Z #16 0x55a7f39afccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:04:59.3114751Z #17 0x7f3b3822383f in __libc_start_main 
/build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:04:59.3115277Z #18 0x55a7f3954554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:04:59.3115468Z \n2022-03-21T21:04:59.3115800Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:04:59.3292385Z + retcode=1\n2022-03-21T21:04:59.3292781Z + set -e\n2022-03-21T21:04:59.3293062Z + return 1\n2022-03-21T21:04:59.3295462Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:04:59.3295802Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:04:59.3296394Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:04:59.3296700Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:04:59.3297055Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:04:59.3297416Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:04:59.3299623Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (16/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:14:25.5525714Z Collecting jmespath<1.0.0,>=0.7.1\n2022-03-21T22:14:25.5568155Z Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)\n2022-03-21T22:14:25.5952617Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:14:25.6169392Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:14:25.6629996Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:14:25.6710247Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:14:25.8284354Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:14:25.9816751Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:14:31.6672236Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:14:31.7630473Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ed0915ecee5d2424\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:14:31.7876742Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:14:31.7897140Z ##[error]Process completed with exit code 2.\n2022-03-21T22:14:31.8195621Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:14:31.8196110Z with:\n2022-03-21T22:14:31.8196356Z env:\n2022-03-21T22:14:31.8196614Z IN_CI: 1\n2022-03-21T22:14:31.8196876Z IS_GHA: 1\n2022-03-21T22:14:31.8197169Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:14:31.8197652Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:14:31.8198093Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge) (17/29)\nStep: \"Upload test statistics\" (full log 
| diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:19:15.8845728Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:19:15.5116060Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:19:15.7231476Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:19:15.7409711Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:19:15.7458478Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:19:15.7470508Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:19:15.7496799Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:19:15.7538362Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:19:15.7566161Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:19:15.7711630Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:19:15.8753543Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0e2b3b4ddb246ff2a\n2022-03-21T21:19:15.8845728Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:19:15.8859814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:19:15.8870165Z ##[error]Process completed with exit code 2.\n2022-03-21T21:19:15.8917039Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:19:15.8917279Z with:\n2022-03-21T21:19:15.8917433Z env:\n2022-03-21T21:19:15.8917586Z IN_CI: 1\n2022-03-21T21:19:15.8917734Z IS_GHA: 1\n2022-03-21T21:19:15.8917917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:19:15.8918102Z ##[endgroup]\n2022-03-21T21:19:15.8934572Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (18/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:19:48.5900162Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:19:48.0742254Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:19:48.3742563Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:19:48.3976536Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:19:48.4048700Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:19:48.4065374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:19:48.4128076Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) 
(0.5.2)\n2022-03-21T23:19:48.4164273Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:19:48.4202610Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:19:48.4416723Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:19:48.5773033Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-07ab7a3c4a5402af2\n2022-03-21T23:19:48.5900162Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:19:48.5919822Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:19:48.5936087Z ##[error]Process completed with exit code 2.\n2022-03-21T23:19:48.6007930Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:19:48.6008268Z with:\n2022-03-21T23:19:48.6008483Z env:\n2022-03-21T23:19:48.6008701Z IN_CI: 1\n2022-03-21T23:19:48.6008920Z IS_GHA: 1\n2022-03-21T23:19:48.6009170Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:19:48.6009440Z GPU_FLAG: --gpus all\n2022-03-21T23:19:48.6009671Z ##[endgroup]\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (19/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:53:59.0889659Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T22:53:59.6881416Z ---------------------------------------- 8.1/8.1 MB 14.0 MB/s eta 0:00:00\n2022-03-21T22:53:59.7427779Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:53:59.7691882Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:53:59.7779847Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:53:59.8281663Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:54:00.0185115Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:54:00.2359770Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:54:04.1208891Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:54:04.2505862Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03b4fbe63be8ef4b0\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:54:04.2891082Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:54:04.2919900Z ##[error]Process completed with exit code 2.\n2022-03-21T22:54:04.3377901Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:54:04.3378575Z with:\n2022-03-21T22:54:04.3378930Z env:\n2022-03-21T22:54:04.3379275Z IN_CI: 
1\n2022-03-21T22:54:04.3379600Z IS_GHA: 1\n2022-03-21T22:54:04.3380023Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:54:04.3380691Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:54:04.3381278Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge) (20/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:09:34.0074610Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:09:33.6365531Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:09:33.8475619Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:09:33.8655152Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:09:33.8704395Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:09:33.8716774Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:09:33.8760145Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:09:33.8785000Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:09:33.8811316Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:09:33.8960134Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:09:33.9984866Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d325eb9fd156146f\n2022-03-21T22:09:34.0074610Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:09:34.0087465Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:09:34.0101743Z ##[error]Process completed with exit code 2.\n2022-03-21T22:09:34.0154014Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:09:34.0154246Z with:\n2022-03-21T22:09:34.0154412Z env:\n2022-03-21T22:09:34.0154574Z IN_CI: 1\n2022-03-21T22:09:34.0154728Z IS_GHA: 1\n2022-03-21T22:09:34.0154917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:09:34.0155112Z ##[endgroup]\n2022-03-21T22:09:34.0191047Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (21/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:17.8502655Z [E request_callbac...yUniqueId(created_on=0, local_id=0) to be created.\n\n2022-03-21T21:03:14.4669960Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpxgdsmeer\n2022-03-21T21:03:14.4671407Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpxgdsmeer/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.4973023Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp1i2hfmpc\n2022-03-21T21:03:14.4973800Z INFO:torch.distributed.nn.jit.instantiator:Writing 
/tmp/tmp1i2hfmpc/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.5532339Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpgx4da7b0\n2022-03-21T21:03:14.5533064Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpgx4da7b0/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.7050673Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0\n2022-03-21T21:03:14.7097127Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3\n2022-03-21T21:03:14.7398339Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2\n2022-03-21T21:03:14.7922283Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1\n2022-03-21T21:03:17.8502655Z [E request_callback_no_python.cpp:559] Received error while processing request type 261: false INTERNAL ASSERT FAILED at \"/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp\":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.\n2022-03-21T21:03:17.8503603Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):\n2022-03-21T21:03:17.8504385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x69 (0x7f180df19e19 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505131Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xd2 (0x7f180df160e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505927Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x4e (0x7f180df17a7e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8506674Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f18118b7b64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8507642Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr >) const + 0x70 (0x7f18118a7bf0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8508613Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector >) const + 0xc8 (0x7f1819736208 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8509749Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x194 (0x7f18118ac914 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8510708Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x65 (0x7f1819735865 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8511369Z frame #8: + 0x375249a (0x7f18118a949a in 
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test (22/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERR...t available for the merge-base of your branch\"\ufffd[0m\n\n2022-03-21T20:01:07.7012399Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7012634Z \ufffd[36;1m# Covers the case where a previous tag doesn't exist for the tree\ufffd[0m\n2022-03-21T20:01:07.7012992Z \ufffd[36;1m# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly\ufffd[0m\n2022-03-21T20:01:07.7013373Z \ufffd[36;1mif ! git rev-parse \"$MERGE_BASE:.circleci/docker\"; then\ufffd[0m\n2022-03-21T20:01:07.7013784Z \ufffd[36;1m echo \"Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit\"\ufffd[0m\n2022-03-21T20:01:07.7014149Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7014325Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7014573Z \ufffd[36;1mPREVIOUS_DOCKER_TAG=$(git rev-parse \"$MERGE_BASE:.circleci/docker\")\ufffd[0m\n2022-03-21T20:01:07.7014907Z \ufffd[36;1m# If no image exists but the hash is the same as the previous hash then we should error out here\ufffd[0m\n2022-03-21T20:01:07.7015231Z \ufffd[36;1mif [[ \"${PREVIOUS_DOCKER_TAG}\" = \"${DOCKER_TAG}\" ]]; then\ufffd[0m\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch\"\ufffd[0m\n2022-03-21T20:01:07.7015931Z \ufffd[36;1m echo \" contact the PyTorch team to restore the original images\"\ufffd[0m\n2022-03-21T20:01:07.7016225Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7016400Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7016608Z \ufffd[36;1mecho ::set-output name=rebuild::yes\ufffd[0m\n2022-03-21T20:01:07.7027605Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}\n2022-03-21T20:01:07.7027837Z env:\n2022-03-21T20:01:07.7028006Z IN_CI: 1\n2022-03-21T20:01:07.7028159Z IS_GHA: 1\n2022-03-21T20:01:07.7028346Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:01:07.7028589Z BASE_REVISION: 6643522db9ff595f564b8081de58b3a33c546178\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (23/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:49:54.2949572Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:49:53.8049151Z + python3 -m pip install boto3==1.19.12\n2022-03-22T00:49:54.0981629Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-22T00:49:54.1207562Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-22T00:49:54.1277146Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-22T00:49:54.1315027Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-22T00:49:54.1331813Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-22T00:49:54.1391622Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages 
(from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:49:54.1609217Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-22T00:49:54.1637417Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:49:54.2830197Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f7c32fe13be12fea\n2022-03-22T00:49:54.2949572Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:49:54.2966933Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:49:54.2982588Z ##[error]Process completed with exit code 2.\n2022-03-22T00:49:54.3031464Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T00:49:54.3031794Z with:\n2022-03-22T00:49:54.3032012Z env:\n2022-03-22T00:49:54.3032227Z IN_CI: 1\n2022-03-22T00:49:54.3032434Z IS_GHA: 1\n2022-03-22T00:49:54.3032681Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:49:54.3033084Z GPU_FLAG: --gpus all\n2022-03-22T00:49:54.3033312Z ##[endgroup]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (24/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:56:07.3365589Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T21:56:07.7926584Z ---------------------------------------- 8.1/8.1 MB 17.3 MB/s eta 0:00:00\n2022-03-21T21:56:07.9319362Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T21:56:07.9366132Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T21:56:08.0077590Z -------------------------------------- 247.7/247.7 KB 3.0 MB/s eta 0:00:00\n2022-03-21T21:56:08.0164070Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:56:08.1775537Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:56:08.3393469Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T21:56:12.4576766Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T21:56:12.5641959Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0afad69838118af0e\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:56:12.5905611Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:56:12.5927729Z ##[error]Process completed with exit code 2.\n2022-03-21T21:56:12.6239531Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T21:56:12.6240039Z with:\n2022-03-21T21:56:12.6240299Z env:\n2022-03-21T21:56:12.6240557Z IN_CI: 1\n2022-03-21T21:56:12.6240805Z IS_GHA: 1\n2022-03-21T21:56:12.6241118Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:56:12.6241613Z pythonLocation: 
C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T21:56:12.6242052Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (25/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:46:39.5474616Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:46:39.1884210Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:46:39.3928976Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:46:39.4105069Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:46:39.4152571Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:46:39.4194931Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:46:39.4218947Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:46:39.4230812Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:46:39.4380089Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:46:39.4399461Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:46:39.5387703Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0888bed1149cca415\n2022-03-21T21:46:39.5474616Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:46:39.5487145Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:46:39.5497480Z ##[error]Process completed with exit code 2.\n2022-03-21T21:46:39.5541319Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:46:39.5541544Z with:\n2022-03-21T21:46:39.5541698Z env:\n2022-03-21T21:46:39.5541851Z IN_CI: 1\n2022-03-21T21:46:39.5541997Z IS_GHA: 1\n2022-03-21T21:46:39.5542176Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:46:39.5542361Z ##[endgroup]\n2022-03-21T21:46:39.5557878Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge) (26/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:34:57.0623859Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:34:56.9039884Z Attempting uninstall: s3transfer\n2022-03-21T21:34:56.9041446Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:34:56.9090783Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:34:56.9095968Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:34:56.9453014Z Attempting uninstall: boto3\n2022-03-21T21:34:56.9454356Z Found existing installation: boto3 1.16.34\n2022-03-21T21:34:56.9564320Z Uninstalling boto3-1.16.34:\n2022-03-21T21:34:56.9578035Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:34:57.0091363Z Successfully installed boto3-1.19.12 botocore-1.22.12 
s3transfer-0.5.2\n2022-03-21T21:34:57.0536230Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-034a3afd5d80b91fd\n2022-03-21T21:34:57.0623859Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:34:57.0637167Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:34:57.0647396Z ##[error]Process completed with exit code 2.\n2022-03-21T21:34:57.0688237Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:34:57.0688481Z with:\n2022-03-21T21:34:57.0688631Z env:\n2022-03-21T21:34:57.0688769Z IN_CI: 1\n2022-03-21T21:34:57.0688930Z IS_GHA: 1\n2022-03-21T21:34:57.0689109Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:34:57.0689462Z ##[endgroup]\n2022-03-21T21:34:57.0704768Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge) (27/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7896545Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.7395504Z #10 0x5597fd5a9801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.7396330Z #11 0x5597fd5b47a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.7396688Z #12 0x5597fd5b480b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.7398664Z #13 0x5597fd5b4908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.7399177Z #14 0x5597fd5b4908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.7399663Z #15 0x5597fd5b4908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.7399986Z #16 0x5597fd5b4ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7895241Z #17 0x7f0a5905983f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7895772Z #18 0x5597fd559554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7896033Z \n2022-03-21T21:05:00.7896545Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.8063448Z + retcode=1\n2022-03-21T21:05:00.8063787Z + set -e\n2022-03-21T21:05:00.8064058Z + return 1\n2022-03-21T21:05:00.8067638Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.8068127Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.8069018Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.8069500Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.8070105Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.8070580Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.8072640Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (28/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:48:17.3384813Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or 
directory\n\n2022-03-21T22:48:16.8599645Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:48:17.1464241Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:48:17.1685222Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:48:17.1754164Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:48:17.1771662Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:48:17.1808722Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:48:17.1868636Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:48:17.1903889Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:48:17.2113746Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:48:17.3267404Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-01fe178c405417375\n2022-03-21T22:48:17.3384813Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:48:17.3402286Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:48:17.3418376Z ##[error]Process completed with exit code 2.\n2022-03-21T22:48:17.3470528Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:48:17.3470874Z with:\n2022-03-21T22:48:17.3471096Z env:\n2022-03-21T22:48:17.3471327Z IN_CI: 1\n2022-03-21T22:48:17.3471538Z IS_GHA: 1\n2022-03-21T22:48:17.3471802Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:48:17.3472083Z GPU_FLAG: --gpus all\n2022-03-21T22:48:17.3472322Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (29/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:16:38.9646300Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:16:38.7995969Z Attempting uninstall: s3transfer\n2022-03-21T21:16:38.7998039Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:16:38.8066994Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:16:38.8072844Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:16:38.8449275Z Attempting uninstall: boto3\n2022-03-21T21:16:38.8451430Z Found existing installation: boto3 1.16.34\n2022-03-21T21:16:38.8559828Z Uninstalling boto3-1.16.34:\n2022-03-21T21:16:38.8574290Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:16:38.9100438Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:16:38.9558098Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d779c59d277d32ee\n2022-03-21T21:16:38.9646300Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:16:38.9658894Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:16:38.9673240Z ##[error]Process completed with exit code 2.\n2022-03-21T21:16:38.9720106Z ##[group]Run 
pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:16:38.9720333Z with:\n2022-03-21T21:16:38.9720485Z env:\n2022-03-21T21:16:38.9720645Z IN_CI: 1\n2022-03-21T21:16:38.9720793Z IS_GHA: 1\n2022-03-21T21:16:38.9720970Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:16:38.9721151Z ##[endgroup]\n2022-03-21T21:16:38.9736762Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 964902894 + }, + { + "bodyText": "@vitaly-fedyunin @gottbrath FYI that this is the oneDNN Graph API integration. It depends on the #63748.", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 970451860 + }, + { + "bodyText": "CI failures are currently being caused by some issues in the CI infra, and are also occurring with other PRs.", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 990641309 + }, + { + "bodyText": "CI failures are unrelated.", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 991281407 + }, + { + "bodyText": "The CI failure is unrelated.", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 995389295 + }, + { + "bodyText": "Hi, thank you for the PR!\nDo you mind running a larger amount of torchbench and reporting numbers ? You can look at Jason's post here for what models are supported in script. Initially just the vision models would be useful. @Krovatkin also did some benchmarking of a traced Bert model and found on average a ~16% speedup with this PR.", + "author": { + "login": "eellison" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1015689390 + }, + { + "bodyText": "Thanks a lot for reviewing, @eellison & @Krovatkin!\nWe just wanted to let you know that we're working on the benchmarking & will get back to you in a day, or two.\nUPDATE (Jan 21): While running some TorchBench models, we discovered some composability issues, and are working to ensure that oneDNN Graph would complement PyTorch's existing fusion capabilities, not hinder them.\nUPDATE (Jan 24): We've resolved the issues & will update this PR later today. Thanks!", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1016996190 + }, + { + "bodyText": "Hello @eellison,\nWe used this TorchBench branch for comparison. compare_llga.sh can be run for comparison.\nFor benchmarking mobilenet_v3_large with hardswish support in oneDNN Graph, this oneDNN Graph branch can be used in third_party/ideep/mkl-dnn. 
It delivers a speedup over PyTorch JIT (NNC + OFI) because 21 additional reorders are prevented (the major factor here), and fusion with conv also helps further.\nThe next release of oneDNN Graph would have hardswish support.\nWe're also exploring adding a hardsigmoid op in oneDNN Graph.\nThank you!", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1022709513 + }, + { + "bodyText": "Please note that this PR should be merged after #71546, as #71546 changes the third_party/ideep commit (this PR also uses that ideep commit, but it'd probably be better to merge #71546 first, so that oneDNN v2.5.2 upgrade would be in a separate PR). Thank you!", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1026330085 + }, + { + "bodyText": "@sanchitintel mind rebasing and i'll land ?", + "author": { + "login": "eellison" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1055813984 + }, + { + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1057203495 + }, + { + "bodyText": "Thanks a lot for taking a look, @eellison! To fix this error, we would enable Bazel build for oneDNN Graph.", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1061230087 + }, + { + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1063276600 + }, + { + "bodyText": "@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074355779 + }, + { + "bodyText": "And graph_rewriter.cpp is full of DOS newlines...", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074407452 + }, + { + "bodyText": "Hey @chunyuan-w.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1074471758 + }, + { + "bodyText": "Thanks a ton for your help, @malfet & @eellison! 
:)\nWe'll incorporate your suggestions in subsequent PR(s).", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "NONE", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1074492365 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOOYM_0Q==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=73969 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "malfet" + }, + "title": "Dummy change", + "body": "Test Plan: None at all\n\nDifferential Revision: D34753911\n\n", + "headRefName": "export-D34753911", + "headRepository": { + "nameWithOwner": "malfet/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "4746da707a9912356f5179625da89616b228dc21" + } + } + ], + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRQMQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbTiXw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbY_vU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + 
"name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2ao=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRIt0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRFm4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "shellcheck", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "toc", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + }, + { + "name": "mypy", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO4Es=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2b0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2c8=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + 
}, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2as=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbUkMA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbWDX8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSD-k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + }, + { + "name": "Meta Internal-Only Changes Check", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO574=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-xla-linux-bionic-py3.7-clang8" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (xla, 1, 1, linux.2xlarge)", + "conclusion": 
"SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSGAM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRlJs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRN_c=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRySo=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-UI=", + "hasNextPage": false + } + }, + "oid": "4746da707a9912356f5179625da89616b228dc21" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "tools/build_variables.bzl" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [], + "totalCount": 0 + }, + "comments": { + "nodes": [ + { + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/malfet/pytorch/blob/4746da707a9912356f5179625da89616b228dc21/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\nAdd ciflow labels to this PR to trigger more builds:\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/all, ciflow/binaries, 
ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-manywheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-rocm4.5-py3.7\nciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build\nciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nmacos-arm64-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-arm64-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwindows-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 
triggered\n\n\nwindows-binary-libtorch-debug\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-libtorch-release\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-wheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.3-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\npytorch-xla-linux-bionic-py3.7-clang8\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla\n\ud83d\udeab skipped", + "author": { + "login": "pytorch-bot" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1063079053 + }, + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/73969\n\ud83d\udcc4 \u00a0Preview docs built 
from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 4746da7 (more details on the Dr. CI page):\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1063079113 + }, + { + "bodyText": "This pull request was exported from Phabricator. Differential Revision: D34753911", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1063079731 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP11MjQ==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=73099 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "BowenBao" + }, + "title": "[ONNX] Make graph name spec-compliant (#71961)", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* #73104\n* #73103\n* #73102\n* #73101\n* #73100\n* __->__ #73099\n\n[According to the ONNX spec](https://github.com/onnx/onnx/blob/main/docs/IR.md#names-within-a-graph),\nall names must adhere to C90 identifier syntax rules, which means no\ndashes.\n\nFixes: #30952", + "headRefName": "gh/BowenBao/138/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/BowenBao/138/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "BowenBao" + }, + "email": "bowbao@microsoft.com", + "name": "BowenBao" + }, + "oid": "3038b939eb2069653305c419326a0f47d2598e39" + } + } + ], + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNn9o=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkRE_E=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": 
"linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoIY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiwA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPxgQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoKA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkX070=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": 
"Y3Vyc29yOnYyOpHPAAAAATkPiQA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkdLEE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoIQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoG0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPfnY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiwQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoHU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": 
"SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "FAILURE" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkQmxE=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkQNRA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPqms=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "mypy", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + }, + { + "name": "toc", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNpZc=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNnvQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + 
}, + "conclusion": null + }, + { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAT_KTRw=", + "hasNextPage": false + } + }, + "oid": "3038b939eb2069653305c419326a0f47d2598e39" + } + } + ] + }, + "changedFiles": 162, + "files": { + "nodes": [ + { + "path": "test/onnx/expect/TestOperators.test_acos.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_left_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addconstant.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addmm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_arange_dynamic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_argmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_asin.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_at_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_atan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_avg_pool2d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_baddbmm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_basic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_1d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_training.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_bitshift.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_c2_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_chunk.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_concat2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4_opset8.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_convtranspose.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_cos.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_cumsum.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_det.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dict.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dict_str.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dim.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout.expect" + }, + { + 
"path": "test/onnx/expect/TestOperators.test_dropout_default.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_opset12.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_training.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_training_opset12.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add_inputs_same_symbolic_shape.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_matmul.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_reduce_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_unchange.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_elu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_embedding_bags.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_empty_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_empty_like_opset7.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_equal.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_erf.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_exp.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten2D.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_fmod.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_frobenius_norm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather_opset11.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ge.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gelu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_hardtanh.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_implicit_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_index.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_isnan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_le.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_linear.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_log_sigmoid.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_logsoftmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lstm_none_sequence_lens.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_master_opset.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_dilations.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_indices.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_meshgrid.expect" + 
}, + { + "path": "test/onnx/expect/TestOperators.test_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_narrow.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ne.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_nonzero.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ones_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_pad.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_permute2.expect" + } + ], + "pageInfo": { + "endCursor": "MTAw", + "hasNextPage": true + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "garymm" + }, + "state": "APPROVED" + } + ], + "totalCount": 1 + }, + "comments": { + "nodes": [ + { + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet \n \n \n pytorch/.github/scripts/trymerge.py\n \n \n Line 63\n in\n 932adf2\n \n \n \n \n\n \n \n files(last: 100) { \n \n \n \n\n Can this be relaxed? If not please import.", + "author": { + "login": "BowenBao" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1048084569 + }, + { + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet\nCan this be relaxed? If not please import.\n\nWow, you've hit a really interesting problem. 100 is a limitation enforced by GitHub, see https://docs.github.com/en/graphql/overview/resource-limitations, but I can implement a pagination. Do you mind keeping it like that for a bit, want to land a fix soonish.", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1048088691 + }, + { + "bodyText": "@malfet Thank you for info. Sure, I have separated the rest of stack from this one, we'll wait for the fix to try again.", + "author": { + "login": "BowenBao" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1048090640 + }, + { + "bodyText": "@pytorchbot merge this", + "author": { + "login": "BowenBao" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1050293881 + }, + { + "bodyText": "Hey @BowenBao.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1050295451 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOPniAWQ==", + "hasPreviousPage": true + } + } + } + } + } + }, + "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=73099 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "files": { + "nodes": [ + { + "path": "test/onnx/expect/TestOperators.test_pixel_shuffle.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_pow.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_prelu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_prod.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_prod_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_rand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_randn.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_prod.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_sum.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reducemax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_reducemin.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_remainder.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_repeat.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_round.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_rrelu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_rsqrt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_rsub.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_scatter_add.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_scatter_add_opset11.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_selu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_shape_value_map.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_sign.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_sin.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_slice.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_slice_dynamic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d_none.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_4d.expect" + }, + { + "path": 
"test/onnx/expect/TestOperators.test_softmaxcrossentropy_ignore_index.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_weights.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_split.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_split_with_sizes.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_sqrt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_std.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_sum.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_sum_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_tan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_topk.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_topk_smallest_unsorted.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_transpose.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_type_as.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_unfold.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_unique.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_unsqueeze.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_size.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_view.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_view_flatten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_zeros_like.expect" + }, + { + "path": "torch/csrc/jit/serialization/export.cpp" + }, + { + "path": "torch/csrc/jit/serialization/export.h" + } + ], + "pageInfo": { + "endCursor": "MTYy", + "hasNextPage": false + } + } + } + } + } + }, + "query_sha=a782f66a44a63d21c9e17b1373747a1c07e50b695762a68a8b8db1203ac6c1bb name=pytorch number=74649 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "malfet" + }, + "title": "This should fail flake8", + "body": "Test issue for GHF mandatory checks", + "headRefName": "malfet-patch-8", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "57c86ff1c5ab948888fd329986c9d55796680e33" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + } + } + ], + "totalCount": 2 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "nodes": [ + { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsK3w=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS" + }, + { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + 
"hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "clang-format", + "conclusion": "SUCCESS" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS" + }, + { + "name": "flake8-py3", + "conclusion": "FAILURE" + }, + { + "name": "mypy", + "conclusion": "SUCCESS" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsMNU=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsLW0=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED" + }, + { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsaNA=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkGU=", + 
"hasNextPage": false + } + }, + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "torch/nn/cpp.py" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "seemethere" + }, + "state": "APPROVED" + } + ], + "totalCount": 1 + }, + "comments": { + "nodes": [ + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/74649\n\u21a9\ufe0f \u00a0[fb-only] Re-run with SSH instructions\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 6c3c3de (more details on the Dr. CI page):\n\n\n1/1 failures introduced in this PR\n\n\n1 failure not recognized by patterns:\n\n\n\nJob\nStep\nAction\n\n\n\n\n Lint / flake8-py3\nFail if there were any warnings\n\ud83d\udd01 rerun\n\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1076891218 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQDAOUg==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=pytorch-dev-infra org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "kit1980" + }, + { + "login": "b0noI" + }, + { + "login": "seemethere" + }, + { + "login": "malfet" + }, + { + "login": "tenpercent" + }, + { + "login": "atalman" + }, + { + "login": "osalpekar" + }, + { + "login": "janeyx99" + }, + { + "login": "clee2000" + } + ], + "pageInfo": { + "hasNextPage": false, + "endCursor": "Y3Vyc29yOnYyOpHOAqnOlw==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "dreiss" + }, + { + "login": "kumpera" + }, + { + "login": "ezyang" + }, + { + "login": "stephenroller" + }, + { + "login": "swolchok" + }, + { + "login": "hyuen" + }, + { + "login": "orionr" + }, + { + "login": "dhruvbird" + }, + { + "login": "likethesky" + }, + { + "login": "lw" + }, + { + "login": "raziel" + }, + { + "login": "simpkins" + }, + { + "login": "ebyrne" + }, + { + "login": "Babar" + }, + { + "login": "kostmo" + }, + { + "login": "bhosmer" + }, + { + "login": "zdevito" + }, + { + "login": "bugra" + }, + { + "login": "caraya10" + }, + { + "login": "kit1980" + }, + { + "login": "shoumikhin" + }, + { + "login": "teytaud" + }, + { + "login": "xuzhao9" + }, + { + "login": "jansel" + }, + { + "login": "abhinavarora" + }, + { + "login": "b0noI" + }, + { + "login": "djthorne" + }, + { + "login": "nairbv" + }, + { + "login": "Mortimerp9" + }, + { + "login": "dadkins20" + }, + { + "login": "colesbury" + }, + { + "login": "laurencer" + }, + { + "login": "nickgg" + }, + { + "login": "yzhao30" + }, + { + "login": "bearzx" + }, + { + "login": "mattjgalloway" + }, + { + "login": "chenyang78" + }, + { + "login": "yns88" + }, + { + "login": "lc0" + }, + { + "login": "wenleix" + }, + { + "login": "aivanou" + }, + { + "login": "jingsh" + }, + { + "login": "mthrok" + }, + { + 
"login": "drdarshan" + }, + { + "login": "tvalentius" + }, + { + "login": "d4l3k" + }, + { + "login": "jamiemccrindle" + }, + { + "login": "kazhang" + }, + { + "login": "simonhollis" + }, + { + "login": "lqiao" + }, + { + "login": "ajyu" + }, + { + "login": "bitfort" + }, + { + "login": "govardhan" + }, + { + "login": "yinghai" + }, + { + "login": "zyan0" + }, + { + "login": "ajtulloch" + }, + { + "login": "pbelevich" + }, + { + "login": "VitalyFedyunin" + }, + { + "login": "dbish" + }, + { + "login": "NicolasHug" + }, + { + "login": "efaust" + }, + { + "login": "idning" + }, + { + "login": "soumith" + }, + { + "login": "nimin98" + }, + { + "login": "chaekit" + }, + { + "login": "radkris-git" + }, + { + "login": "javier-m" + }, + { + "login": "mostafaelhoushi" + }, + { + "login": "brianjo" + }, + { + "login": "ShijunK" + }, + { + "login": "suo" + }, + { + "login": "vkuzo" + }, + { + "login": "seemethere" + }, + { + "login": "qihqi" + }, + { + "login": "jackm321" + }, + { + "login": "neerajprad" + }, + { + "login": "rsemenov" + }, + { + "login": "ziky90" + }, + { + "login": "gmagogsfm" + }, + { + "login": "zzzwen" + }, + { + "login": "ikriv" + }, + { + "login": "deeptigp" + }, + { + "login": "andrewor14" + }, + { + "login": "jianyuh" + }, + { + "login": "cykustcc" + }, + { + "login": "highker" + }, + { + "login": "navahgar" + }, + { + "login": "beauby" + }, + { + "login": "jeffreyksmithjr" + }, + { + "login": "suphoff" + }, + { + "login": "smessmer" + }, + { + "login": "ananthsub" + }, + { + "login": "d1jang" + }, + { + "login": "firstprayer" + }, + { + "login": "malfet" + }, + { + "login": "fegin" + }, + { + "login": "hanton" + }, + { + "login": "zanqi" + }, + { + "login": "bujar" + }, + { + "login": "supriyar" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOACiM0Q==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOACiM0Q== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "kausv" + }, + { + "login": "divchenko" + }, + { + "login": "rahuln32" + }, + { + "login": "bilgeacun" + }, + { + "login": "caogao" + }, + { + "login": "blefaudeux" + }, + { + "login": "miguelmartin75" + }, + { + "login": "penguinwu" + }, + { + "login": "shz117" + }, + { + "login": "ajliu" + }, + { + "login": "saketh-are" + }, + { + "login": "jessebrizzi" + }, + { + "login": "msaroufim" + }, + { + "login": "mdundas" + }, + { + "login": "davides" + }, + { + "login": "alannnna" + }, + { + "login": "hlin09" + }, + { + "login": "terrychenism" + }, + { + "login": "xiaomengy" + }, + { + "login": "jisaacso" + }, + { + "login": "fkhan1337" + }, + { + "login": "xing-liu" + }, + { + "login": "alanadakotashine" + }, + { + "login": "desertfire" + }, + { + "login": "banitag1" + }, + { + "login": "letterx" + }, + { + "login": "gchanan" + }, + { + "login": "dbort" + }, + { + "login": "bilalsal" + }, + { + "login": "jaceyca" + }, + { + "login": "serhaty" + }, + { + "login": "yf225" + }, + { + "login": "yifuwang" + }, + { + "login": "piyushmh" + }, + { + "login": "z-a-f" + }, + { + "login": "superzgc" + }, + { + "login": "tenpercent" + }, + { + "login": "spaugh" + }, + { + "login": "bertmaher" + }, + { + "login": "chauhang" + }, + { + "login": "jiayisuse" + }, + { + "login": "bradleyhd" + }, + { + "login": "ZolotukhinM" + }, + { + "login": "jamesr66a" + }, + { + "login": "mullachv" + }, + { + "login": "voznesenskym" + }, + { + "login": "charliechen0401" + }, + { 
+ "login": "bwasti" + }, + { + "login": "cryptopic" + }, + { + "login": "chinannyang" + }, + { + "login": "NivekT" + }, + { + "login": "zhxchen17" + }, + { + "login": "jerryzh168" + }, + { + "login": "MohammadMahdiJavanmard" + }, + { + "login": "rajkar86" + }, + { + "login": "wconstab" + }, + { + "login": "Hangjun" + }, + { + "login": "davidberard98" + }, + { + "login": "Krovatkin" + }, + { + "login": "CamiWilliams" + }, + { + "login": "J0Nreynolds" + }, + { + "login": "datumbox" + }, + { + "login": "aartibasant" + }, + { + "login": "xta0" + }, + { + "login": "zou3519" + }, + { + "login": "xman1979" + }, + { + "login": "suraj813" + }, + { + "login": "gqchen" + }, + { + "login": "jayleverett" + }, + { + "login": "george-qi" + }, + { + "login": "abhikrish" + }, + { + "login": "zhangguanheng66" + }, + { + "login": "mikeiovine" + }, + { + "login": "Adolfo-Karim" + }, + { + "login": "Chillee" + }, + { + "login": "albanD" + }, + { + "login": "robotal" + }, + { + "login": "MarcioPorto" + }, + { + "login": "srsuryadev" + }, + { + "login": "IvanKobzarev" + }, + { + "login": "eprivezentsev" + }, + { + "login": "linux-jedi" + }, + { + "login": "chandlerzuo" + }, + { + "login": "prateek1404" + }, + { + "login": "otsneh" + }, + { + "login": "husthyc" + }, + { + "login": "briancoutinho" + }, + { + "login": "fduwjj" + }, + { + "login": "esqu1" + }, + { + "login": "prabhat00155" + }, + { + "login": "Gamrix" + }, + { + "login": "QuentinDuval" + }, + { + "login": "atalman" + }, + { + "login": "xush6528" + }, + { + "login": "dracifer" + }, + { + "login": "SS-JIA" + }, + { + "login": "helunwencser" + }, + { + "login": "xw285cornell" + }, + { + "login": "hhbyyh" + }, + { + "login": "rohan-varma" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOAHqtWg==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAHqtWg== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "teng-li" + }, + { + "login": "larryliu0820" + }, + { + "login": "lyoka" + }, + { + "login": "cbalioglu" + }, + { + "login": "hl475" + }, + { + "login": "hwangjeff" + }, + { + "login": "Jack-Khuu" + }, + { + "login": "alanwaketan" + }, + { + "login": "mehtanirav" + }, + { + "login": "nateanl" + }, + { + "login": "boyuantan" + }, + { + "login": "muntaqim" + }, + { + "login": "dennysem" + }, + { + "login": "ymao1993" + }, + { + "login": "fmassa" + }, + { + "login": "esantorella" + }, + { + "login": "HamidShojanazeri" + }, + { + "login": "jubinchheda" + }, + { + "login": "mehdimashayekhi" + }, + { + "login": "rkindi" + }, + { + "login": "wanchaol" + }, + { + "login": "zephirefaith" + }, + { + "login": "alexbeloi" + }, + { + "login": "kapilsh" + }, + { + "login": "plahera" + }, + { + "login": "SherlockNoMad" + }, + { + "login": "venkatacrc" + }, + { + "login": "pritamdamania87" + }, + { + "login": "rahxephon89" + }, + { + "login": "iseeyuan" + }, + { + "login": "Matphyler" + }, + { + "login": "protonu" + }, + { + "login": "terhuhf" + }, + { + "login": "aruntonic" + }, + { + "login": "gcatron" + }, + { + "login": "yingrliu" + }, + { + "login": "alexanderguzhva" + }, + { + "login": "zhaoalex" + }, + { + "login": "shahofblah" + }, + { + "login": "vivekmig" + }, + { + "login": "yqhu" + }, + { + "login": "jspisak" + }, + { + "login": "akshaypandian" + }, + { + "login": "HarutMov" + }, + { + "login": "tktrungna" + }, + { + "login": "eellison" + }, + { + "login": "ziab" + }, + { + "login": 
"NarineK" + }, + { + "login": "andrewconnors" + }, + { + "login": "wenwei202" + }, + { + "login": "jg2912" + }, + { + "login": "jwpark1985" + }, + { + "login": "robieta" + }, + { + "login": "davidxili" + }, + { + "login": "mreso" + }, + { + "login": "soulitzer" + }, + { + "login": "prigoyal" + }, + { + "login": "PaliC" + }, + { + "login": "anijain2305" + }, + { + "login": "pvtuan10" + }, + { + "login": "huangyi1979" + }, + { + "login": "osalpekar" + }, + { + "login": "xiaohui-zhang" + }, + { + "login": "jerry39213gh" + }, + { + "login": "jarodhou" + }, + { + "login": "hlu1" + }, + { + "login": "huiguoo" + }, + { + "login": "H-Huang" + }, + { + "login": "vtsyvina" + }, + { + "login": "qchip" + }, + { + "login": "Nitrokitty" + }, + { + "login": "satgera" + }, + { + "login": "ngimel" + }, + { + "login": "dongreenberg" + }, + { + "login": "markkm" + }, + { + "login": "EscapeZero" + }, + { + "login": "bdhirsh" + }, + { + "login": "cccclai" + }, + { + "login": "carolineechen" + }, + { + "login": "tugsbayasgalan" + }, + { + "login": "frankseide" + }, + { + "login": "YazhiGao" + }, + { + "login": "pavithranrao" + }, + { + "login": "VirgileHlav" + }, + { + "login": "mrshenli" + }, + { + "login": "lena-kashtelyan" + }, + { + "login": "brad-mengchi" + }, + { + "login": "kimishpatel" + }, + { + "login": "aaronenyeshi" + }, + { + "login": "shajrawi" + }, + { + "login": "samdow" + }, + { + "login": "dzhulgakov" + }, + { + "login": "great-way" + }, + { + "login": "ashkan-software" + }, + { + "login": "garroud" + }, + { + "login": "knottb" + }, + { + "login": "jbitton" + }, + { + "login": "jdsgomes" + }, + { + "login": "zhangxy988" + }, + { + "login": "samlurye" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOAStXFg==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAStXFg== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "EdwardTyantov" + }, + { + "login": "anjali411" + }, + { + "login": "842974287" + }, + { + "login": "JacobSzwejbka" + }, + { + "login": "nishantpdce" + }, + { + "login": "srinivas212" + }, + { + "login": "cherie11" + }, + { + "login": "shreyanb98" + }, + { + "login": "kavoor" + }, + { + "login": "dzdang" + }, + { + "login": "naveedgol" + }, + { + "login": "Nayef211" + }, + { + "login": "zrphercule" + }, + { + "login": "HengruiX" + }, + { + "login": "langong347" + }, + { + "login": "soapisnotfat" + }, + { + "login": "ebsmothers" + }, + { + "login": "anshuljain1" + }, + { + "login": "b-koopman" + }, + { + "login": "salilsdesai" + }, + { + "login": "vmoens" + }, + { + "login": "xinyang0" + }, + { + "login": "ramvenkat98" + }, + { + "login": "fbbradheintz" + }, + { + "login": "kauterry" + }, + { + "login": "VenkatSubramaniam" + }, + { + "login": "yxia11" + }, + { + "login": "anirbanraywork" + }, + { + "login": "houseroad" + }, + { + "login": "erichan1" + }, + { + "login": "hsrussell" + }, + { + "login": "ilia-cher" + }, + { + "login": "ajitmaths" + }, + { + "login": "awgu" + }, + { + "login": "wz337" + }, + { + "login": "LynneD" + }, + { + "login": "qxy11" + }, + { + "login": "janeyx99" + }, + { + "login": "msedwar" + }, + { + "login": "dustinh1999" + }, + { + "login": "glaringlee" + }, + { + "login": "anj-s" + }, + { + "login": "liuchen9494" + }, + { + "login": "jramseyer" + }, + { + "login": "zengk95" + }, + { + "login": "gtarjun" + }, + { + "login": "mikaylagawarecki" + }, + { + "login": "xianxl" + }, + { + 
"login": "lucasgadams" + }, + { + "login": "mingzhe09088" + }, + { + "login": "Vucibatina" + }, + { + "login": "aazzolini" + }, + { + "login": "nataliakliushkina" + }, + { + "login": "mruberry" + }, + { + "login": "mja314" + }, + { + "login": "HDCharles" + }, + { + "login": "mcr229" + }, + { + "login": "guangy10" + }, + { + "login": "mengwa41" + }, + { + "login": "hx89" + }, + { + "login": "kiukchung" + }, + { + "login": "hanhsienhuang" + }, + { + "login": "clee2000" + }, + { + "login": "lhuang04" + }, + { + "login": "sidneyfletcher" + }, + { + "login": "gottbrath" + }, + { + "login": "lessw2020" + }, + { + "login": "choward232" + }, + { + "login": "mmh683" + }, + { + "login": "dwarakrajagopal" + }, + { + "login": "lazysjb" + }, + { + "login": "zhaojuanmao" + }, + { + "login": "johncalab" + }, + { + "login": "dhthompson" + }, + { + "login": "superwizard2019" + }, + { + "login": "fbhuba" + }, + { + "login": "shunting314" + }, + { + "login": "edward-io" + }, + { + "login": "sean-ngo" + }, + { + "login": "bzinodev" + }, + { + "login": "xcheng16" + }, + { + "login": "adamomainz" + }, + { + "login": "sluks" + }, + { + "login": "poojahp" + }, + { + "login": "ansley" + }, + { + "login": "mvsampath" + }, + { + "login": "cheetah2216" + }, + { + "login": "pinaki-mukerji" + }, + { + "login": "hongxiayang" + }, + { + "login": "kyulee-com" + }, + { + "login": "sstsai-adl" + }, + { + "login": "dahsh" + }, + { + "login": "ohgnoes" + }, + { + "login": "szewaiyuen7" + }, + { + "login": "byterover" + }, + { + "login": "changjishi" + }, + { + "login": "ejguan" + }, + { + "login": "nimaelyasi" + }, + { + "login": "nikithamalgifb" + }, + { + "login": "qxu-fb" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOBECNfg==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOBECNfg== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "sshawnwu" + }, + { + "login": "andrewyounkins" + }, + { + "login": "njuvekar" + }, + { + "login": "iramazanli" + }, + { + "login": "jnkwok1" + }, + { + "login": "jbschlosser" + }, + { + "login": "ccongge" + }, + { + "login": "haichuan-fb" + }, + { + "login": "wwang84" + }, + { + "login": "JustinPinero" + }, + { + "login": "gcramer23" + }, + { + "login": "woo-kim" + }, + { + "login": "chowarfb" + }, + { + "login": "priyaramani" + }, + { + "login": "yidawang-oss" + }, + { + "login": "beback4u" + }, + { + "login": "asalioufb" + }, + { + "login": "four4fish" + }, + { + "login": "kkosik20" + }, + { + "login": "KZFB" + }, + { + "login": "henryliu-bluehills" + } + ], + "pageInfo": { + "hasNextPage": false, + "endCursor": "Y3Vyc29yOnYyOpHOBftYGg==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=qwertyuiop org=pytorch": { + "data": { + "organization": { + "team": null + } + } + } +} diff --git a/.github/scripts/install_nvidia_utils_linux.sh b/.github/scripts/install_nvidia_utils_linux.sh index 0db7de71f4fc80..b854320c9eaa40 100755 --- a/.github/scripts/install_nvidia_utils_linux.sh +++ b/.github/scripts/install_nvidia_utils_linux.sh @@ -3,7 +3,7 @@ set -eou pipefail DISTRIBUTION=$(. 
/etc/os-release;echo $ID$VERSION_ID) \ -DRIVER_FN="NVIDIA-Linux-x86_64-495.44.run" +DRIVER_FN="NVIDIA-Linux-x86_64-510.60.02.run" YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" install_nvidia_docker2_amzn2() { diff --git a/.github/scripts/syncbranches.py b/.github/scripts/syncbranches.py index 163c4b3759b800..8437e1fa9c1818 100755 --- a/.github/scripts/syncbranches.py +++ b/.github/scripts/syncbranches.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 -from gitutils import get_git_repo_dir, GitRepo +from gitutils import get_git_repo_dir, get_git_remote_name, GitRepo from typing import Any @@ -16,7 +16,7 @@ def parse_args() -> Any: def main() -> None: args = parse_args() - repo = GitRepo(get_git_repo_dir(), debug=args.debug) + repo = GitRepo(get_git_repo_dir(), get_git_remote_name(), debug=args.debug) repo.cherry_pick_commits(args.sync_branch, args.default_branch) repo.push(args.default_branch, args.dry_run) diff --git a/.github/scripts/test_trymerge.py b/.github/scripts/test_trymerge.py index 539aec9b9c6933..753936d616a488 100755 --- a/.github/scripts/test_trymerge.py +++ b/.github/scripts/test_trymerge.py @@ -1,10 +1,20 @@ #!/usr/bin/env python3 +# Tests implemented in this file rely on GitHub GraphQL APIs +# In order to avoid test flakiness, results of the queries +# are cached in gql_mocks.json +# PyTorch Lint workflow does not have GITHUB_TOKEN defined to avoid +# flakiness, so if you are making changes to merge_rules or +# GraphQL queries in trymerge.py, please make sure to delete `gql_mocks.json` +# and re-run the test locally with one's PAT + import json import os from hashlib import sha256 -from trymerge import gh_graphql, GitHubPR +from trymerge import find_matching_merge_rule, gh_graphql, gh_get_team_members, GitHubPR +from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo from typing import Any from unittest import TestCase, main, mock +from urllib.error import HTTPError def mocked_gh_graphql(query: str, **kwargs: Any) -> Any: gql_db_fname = os.path.join(os.path.dirname(__file__), "gql_mocks.json") @@ -17,7 +27,8 @@ def get_mocked_queries() -> Any: def save_mocked_queries(obj: Any) -> None: with open(gql_db_fname, encoding="utf-8", mode="w") as f: - json.dump(obj, f) + json.dump(obj, f, indent=2) + f.write("\n") key = f"query_sha={sha256(query.encode('utf-8')).hexdigest()} " + " ".join([f"{k}={kwargs[k]}" for k in sorted(kwargs.keys())]) mocked_queries = get_mocked_queries() @@ -25,7 +36,16 @@ def save_mocked_queries(obj: Any) -> None: if key in mocked_queries: return mocked_queries[key] - rc = gh_graphql(query, **kwargs) + try: + rc = gh_graphql(query, **kwargs) + except HTTPError as err: + if err.code == 401: + err_msg = "If you are seeing this message during workflow run, please make sure to update gql_mocks.json" + err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with " + err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN environment variable" + if os.getenv("GITHUB_TOKEN") is None: + err_msg = "Failed to update cached GraphQL queries as GITHUB_TOKEN is not defined."
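The mock layer in test_trymerge.py keys every recorded GraphQL response by the SHA-256 of the query text plus the sorted query variables, which is exactly the shape of the `query_sha=... cursor=... name=... org=...` keys visible in the gql_mocks.json entries earlier in this diff. A minimal sketch of that lookup, written outside the test harness with a hypothetical `lookup` helper:

```python
import json
from hashlib import sha256
from typing import Any

def mock_key(query: str, **kwargs: Any) -> str:
    # Same key shape as mocked_gh_graphql builds:
    # "query_sha=<sha256 of query text> k1=v1 k2=v2 ..." with kwargs sorted by name.
    return (f"query_sha={sha256(query.encode('utf-8')).hexdigest()} "
            + " ".join(f"{k}={kwargs[k]}" for k in sorted(kwargs)))

def lookup(cache_path: str, query: str, **kwargs: Any) -> Any:
    # Replay a previously recorded response from gql_mocks.json.
    with open(cache_path, encoding="utf-8") as f:
        return json.load(f)[mock_key(query, **kwargs)]
```

To refresh the cache after changing a query or a merge rule, one would delete gql_mocks.json and re-run this test file locally with a personal access token exported as GITHUB_TOKEN, e.g. `GITHUB_TOKEN=<PAT> python3 .github/scripts/test_trymerge.py`.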
+ err_msg + raise RuntimeError(err_msg) from err mocked_queries[key] = rc save_mocked_queries(mocked_queries) @@ -34,6 +54,29 @@ def save_mocked_queries(obj: Any) -> None: class TestGitHubPR(TestCase): + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_match_rules(self, mocked_gql: Any) -> None: + "Tests that PR passes merge rules" + pr = GitHubPR("pytorch", "pytorch", 71759) + repo = GitRepo(get_git_repo_dir(), get_git_remote_name()) + self.assertTrue(find_matching_merge_rule(pr, repo) is not None) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_lint_fails(self, mocked_gql: Any) -> None: + "Tests that PR fails mandatory lint check" + pr = GitHubPR("pytorch", "pytorch", 74649) + repo = GitRepo(get_git_repo_dir(), get_git_remote_name()) + self.assertRaises(RuntimeError, lambda: find_matching_merge_rule(pr, repo)) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_get_last_comment(self, mocked_gql: Any) -> None: + "Tests that last comment can be fetched" + pr = GitHubPR("pytorch", "pytorch", 71759) + comment = pr.get_last_comment() + self.assertEqual(comment.author_login, "github-actions") + self.assertIsNone(comment.editor_login) + self.assertTrue("You've committed this PR" in comment.body_text) + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) def test_get_author_null(self, mocked_gql: Any) -> None: """ Tests that PR author can be computed @@ -43,6 +86,7 @@ def test_get_author_null(self, mocked_gql: Any) -> None: author = pr.get_author() self.assertTrue(author is not None) self.assertTrue("@" in author) + self.assertTrue(pr.get_diff_revision() is None) @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) def test_large_diff(self, mocked_gql: Any) -> None: @@ -52,6 +96,43 @@ def test_large_diff(self, mocked_gql: Any) -> None: flist = pr.get_changed_files() self.assertEqual(len(flist), pr.get_changed_files_count()) + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_internal_changes(self, mocked_gql: Any) -> None: + "Tests that PR with internal changes is detected" + pr = GitHubPR("pytorch", "pytorch", 73969) + self.assertTrue(pr.has_internal_changes()) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_checksuites_pagination(self, mocked_gql: Any) -> None: + "Tests that PR with lots of checksuits can be fetched" + pr = GitHubPR("pytorch", "pytorch", 73811) + self.assertGreater(len(pr.get_checkrun_conclusions()), 0) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_comments_pagination(self, mocked_gql: Any) -> None: + "Tests that PR with 50+ comments can be fetched" + pr = GitHubPR("pytorch", "pytorch", 31093) + self.assertGreater(len(pr.get_comments()), 50) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_gql_complexity(self, mocked_gql: Any) -> None: + "Fetch comments and conclusions for PR with 60 commits" + # Previous version of GrapQL query used to cause HTTP/502 error + # see https://gist.github.com/malfet/9b93bc7eeddeaf1d84546efc4f0c577f + pr = GitHubPR("pytorch", "pytorch", 68111) + self.assertGreater(len(pr.get_comments()), 20) + self.assertGreater(len(pr.get_checkrun_conclusions()), 3) + self.assertGreater(pr.get_commit_count(), 60) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + def test_team_members(self, mocked_gql: Any) -> None: + "Test fetching team members works" + dev_infra_team = gh_get_team_members("pytorch", 
"pytorch-dev-infra") + self.assertGreater(len(dev_infra_team), 2) + with self.assertWarns(Warning): + non_existing_team = gh_get_team_members("pytorch", "qwertyuiop") + self.assertEqual(len(non_existing_team), 0) + if __name__ == "__main__": main() diff --git a/.github/scripts/trymerge.py b/.github/scripts/trymerge.py index 25ba3db7feb112..0f0fadbd13e2b9 100755 --- a/.github/scripts/trymerge.py +++ b/.github/scripts/trymerge.py @@ -8,6 +8,8 @@ from urllib.error import HTTPError from typing import cast, Any, Callable, Dict, List, Optional, Tuple, Union from gitutils import get_git_remote_name, get_git_repo_dir, patterns_to_regex, GitRepo +from functools import lru_cache +from warnings import warn GH_GET_PR_INFO_QUERY = """ @@ -36,7 +38,7 @@ mergeCommit { oid } - commits(first: 100) { + commits_with_authors:commits(first: 100) { nodes { commit { author { @@ -47,17 +49,44 @@ name } oid - checkSuites(filterBy: {appId: 12274}, first: 1) { + } + } + totalCount + } + commits(last: 1) { + nodes { + commit { + checkSuites(first: 50) { nodes { app { + name databaseId } + workflowRun { + workflow { + name + } + } + checkRuns(first: 10) { + nodes { + name + conclusion + } + pageInfo { + endCursor + hasNextPage + } + } conclusion } + pageInfo { + endCursor + hasNextPage + } } + oid } } - totalCount } changedFiles files(first: 100) { @@ -78,7 +107,7 @@ } totalCount } - comments(last: 1) { + comments(last: 5) { nodes { bodyText author { @@ -88,6 +117,11 @@ editor { login } + databaseId + } + pageInfo { + startCursor + hasPreviousPage } } } @@ -113,6 +147,95 @@ } """ +GH_GET_PR_NEXT_CHECK_RUNS = """ +query ($owner: String!, $name: String!, $number: Int!, $cursor: String!) { + repository(name: $name, owner: $owner) { + pullRequest(number: $number) { + commits(last: 1) { + nodes { + commit { + oid + checkSuites(first: 100, after: $cursor) { + nodes { + app { + name + databaseId + } + workflowRun { + workflow { + name + } + } + checkRuns(first: 10) { + nodes { + name + conclusion + } + pageInfo { + endCursor + hasNextPage + } + } + conclusion + } + pageInfo { + endCursor + hasNextPage + } + } + } + } + } + } + } +} +""" + +GH_GET_PR_PREV_COMMENTS = """ +query ($owner: String!, $name: String!, $number: Int!, $cursor: String!) 
{ + repository(name: $name, owner: $owner) { + pullRequest(number: $number) { + comments(last: 100, before: $cursor) { + nodes { + bodyText + author { + login + } + authorAssociation + editor { + login + } + databaseId + } + pageInfo { + startCursor + hasPreviousPage + } + } + } + } +} +""" + +# This query needs read-org permission +GH_GET_TEAM_MEMBERS_QUERY = """ +query($org: String!, $name: String!, $cursor: String) { + organization(login: $org) { + team(slug: $name) { + members(first: 100, after: $cursor) { + nodes { + login + } + pageInfo { + hasNextPage + endCursor + } + } + } + } +} +""" + RE_GHSTACK_HEAD_REF = re.compile(r"^(gh/[^/]+/[0-9]+/)head$") RE_GHSTACK_SOURCE_ID = re.compile(r'^ghstack-source-id: (.+)\n?', re.MULTILINE) RE_PULL_REQUEST_RESOLVED = re.compile( @@ -178,15 +301,41 @@ def gh_get_pr_info(org: str, proj: str, pr_no: int) -> Any: return rc["data"]["repository"]["pullRequest"] +@lru_cache(maxsize=None) +def gh_get_team_members(org: str, name: str) -> List[str]: + rc: List[str] = [] + team_members: Dict[str, Any] = {"pageInfo": {"hasNextPage": "true", "endCursor": None}} + while bool(team_members["pageInfo"]["hasNextPage"]): + query = gh_graphql(GH_GET_TEAM_MEMBERS_QUERY, org=org, name=name, cursor=team_members["pageInfo"]["endCursor"]) + team = query["data"]["organization"]["team"] + if team is None: + warn(f"Requested non-existing team {org}/{name}") + return [] + team_members = team["members"] + rc += [member["login"] for member in team_members["nodes"]] + return rc + + def parse_args() -> Any: from argparse import ArgumentParser parser = ArgumentParser("Merge PR into default branch") parser.add_argument("--dry-run", action="store_true") parser.add_argument("--revert", action="store_true") + parser.add_argument("--force", action="store_true") + parser.add_argument("--comment-id", type=int) parser.add_argument("pr_num", type=int) return parser.parse_args() +@dataclass +class GitHubComment: + body_text: str + author_login: str + author_association: str + editor_login: Optional[str] + database_id: int + + class GitHubPR: def __init__(self, org: str, project: str, pr_num: int) -> None: assert isinstance(pr_num, int) @@ -195,6 +344,8 @@ def __init__(self, org: str, project: str, pr_num: int) -> None: self.pr_num = pr_num self.info = gh_get_pr_info(org, project, pr_num) self.changed_files: Optional[List[str]] = None + self.conclusions: Optional[Dict[str, str]] = None + self.comments: Optional[List[GitHubComment]] = None def is_closed(self) -> bool: return bool(self.info["closed"]) @@ -257,28 +408,56 @@ def get_approved_by(self) -> List[str]: return [login for (login, state) in self._get_reviewers() if state == "APPROVED"] def get_commit_count(self) -> int: - return int(self.info["commits"]["totalCount"]) + return int(self.info["commits_with_authors"]["totalCount"]) def get_pr_creator_login(self) -> str: return cast(str, self.info["author"]["login"]) def get_committer_login(self, num: int = 0) -> str: - user = self.info["commits"]["nodes"][num]["commit"]["author"]["user"] + user = self.info["commits_with_authors"]["nodes"][num]["commit"]["author"]["user"] # If author is not github user, user node will be null if user is None: return "" return cast(str, user["login"]) def get_committer_author(self, num: int = 0) -> str: - node = self.info["commits"]["nodes"][num]["commit"]["author"] + node = self.info["commits_with_authors"]["nodes"][num]["commit"]["author"] return f"{node['name']} <{node['email']}>" - def get_check_suite_conclusions(self) -> Dict[int, str]: - last_commit 
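Because GH_GET_TEAM_MEMBERS_QUERY needs a token with read-org permission, gh_get_team_members is worth exercising on its own before relying on it inside merge rules. A short usage sketch, run with GITHUB_TOKEN set since it hits the live API; the second team slug is deliberately made up, mirroring the behaviour covered by test_team_members:

```python
from trymerge import gh_get_team_members

# Expands a team slug into individual logins, following pagination cursors;
# results are memoized via lru_cache for the lifetime of the process.
members = gh_get_team_members("pytorch", "pytorch-dev-infra")
print(f"fetched {len(members)} members")

# A non-existing team only emits a warning and yields an empty list.
assert gh_get_team_members("pytorch", "no-such-team") == []
```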
= self.info["commits"]["nodes"][-1]["commit"] - rc = {} - for node in last_commit["checkSuites"]["nodes"]: - rc[int(node["app"]["databaseId"])] = node["conclusion"] - return rc + def get_checkrun_conclusions(self) -> Dict[str, str]: + """ Returns list of checkrun / conclusions """ + if self.conclusions is not None: + return self.conclusions + orig_last_commit = self.info["commits"]["nodes"][-1]["commit"] + checksuites = orig_last_commit["checkSuites"] + conclusions = {} + + def add_conclusions(nodes: List[Dict[str, Any]]) -> None: + for node in nodes: + workflow_run = node["workflowRun"] + checkruns = node["checkRuns"] + if workflow_run is not None: + conclusions[workflow_run["workflow"]["name"]] = node["conclusion"] + continue + if checkruns is not None: + for checkrun_node in checkruns["nodes"]: + conclusions[checkrun_node["name"]] = checkrun_node["conclusion"] + + add_conclusions(checksuites["nodes"]) + while bool(checksuites["pageInfo"]["hasNextPage"]): + rc = gh_graphql(GH_GET_PR_NEXT_CHECK_RUNS, + name=self.project, + owner=self.org, + number=self.pr_num, + cursor=checksuites["pageInfo"]["endCursor"]) + info = rc["data"]["repository"]["pullRequest"] + last_commit = info["commits"]["nodes"][-1]["commit"] + if last_commit["oid"] != orig_last_commit["oid"]: + raise RuntimeError("Last commit changed on PR") + checksuites = last_commit["checkSuites"] + add_conclusions(checksuites["nodes"]) + self.conclusions = conclusions + return conclusions def get_authors(self) -> Dict[str, str]: rc = {} @@ -306,20 +485,64 @@ def get_merge_commit(self) -> Optional[str]: def get_pr_url(self) -> str: return f"https://github.com/{self.org}/{self.project}/pull/{self.pr_num}" - def get_comment_body(self, num: int = -1) -> str: - return cast(str, self.info["comments"]["nodes"][num]["bodyText"]) - - def get_comment_author_login(self, num: int = -1) -> str: - return cast(str, self.info["comments"]["nodes"][num]["author"]["login"]) - - def get_comment_editor_login(self, num: int = -1) -> Optional[str]: - rc = self.info["comments"]["nodes"][num]["editor"] - return rc["login"] if rc is not None else None - - def get_comment_author_association(self, num: int = -1) -> str: - return cast(str, self.info["comments"]["nodes"][num]["authorAssociation"]) - - def merge_ghstack_into(self, repo: GitRepo) -> None: + @staticmethod + def _comment_from_node(node: Any) -> GitHubComment: + editor = node["editor"] + return GitHubComment(body_text=node["bodyText"], + author_login=node["author"]["login"], + author_association=node["authorAssociation"], + editor_login=editor["login"] if editor else None, + database_id=node["databaseId"] + ) + + def get_comments(self) -> List[GitHubComment]: + if self.comments is not None: + return self.comments + self.comments = [] + info = self.info["comments"] + # Do not try to fetch more than 10K comments + for _ in range(100): + self.comments = [self._comment_from_node(node) for node in info["nodes"]] + self.comments + if not info["pageInfo"]["hasPreviousPage"]: + break + rc = gh_graphql(GH_GET_PR_PREV_COMMENTS, + name=self.project, + owner=self.org, + number=self.pr_num, + cursor=info["pageInfo"]["startCursor"]) + info = rc["data"]["repository"]["pullRequest"]["comments"] + return self.comments + + def get_last_comment(self) -> GitHubComment: + return self._comment_from_node(self.info["comments"]["nodes"][-1]) + + def get_comment_by_id(self, database_id: int) -> GitHubComment: + if self.comments is None: + # Fastpath - try searching in partial prefetched comments + for node in 
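get_checkrun_conclusions, get_comments and gh_get_team_members all follow the same cursor-driven pattern: take the first page from the already-fetched PR info, then keep issuing the follow-up query with pageInfo's endCursor (or startCursor when walking comments backwards) until the has-more flag clears. A generic sketch of the forward-walking loop, with fetch_page standing in for whichever gh_graphql call applies:

```python
from typing import Any, Callable, Dict, List

def paginate(first_page: Dict[str, Any],
             fetch_page: Callable[[str], Dict[str, Any]]) -> List[Any]:
    # Both arguments yield GraphQL connections: {"nodes": [...], "pageInfo": {...}}.
    nodes: List[Any] = list(first_page["nodes"])
    page_info = first_page["pageInfo"]
    while page_info["hasNextPage"]:
        page = fetch_page(page_info["endCursor"])
        nodes += page["nodes"]
        page_info = page["pageInfo"]
    return nodes
```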
self.info["comments"]["nodes"]: + comment = self._comment_from_node(node) + if comment.database_id == database_id: + return comment + + for comment in self.get_comments(): + if comment.database_id == database_id: + return comment + raise RuntimeError(f"Comment with id {database_id} not found") + + def get_diff_revision(self) -> Optional[str]: + rc = RE_DIFF_REV.search(self.get_body()) + return rc.group(1) if rc is not None else None + + def has_internal_changes(self) -> bool: + checkrun_name = "Meta Internal-Only Changes Check" + if self.get_diff_revision() is None: + return False + checks = self.get_checkrun_conclusions() + if checks is None or checkrun_name not in checks: + return False + return checks[checkrun_name] != "SUCCESS" + + def merge_ghstack_into(self, repo: GitRepo, force: bool) -> None: assert self.is_ghstack_pr() approved_by = self.get_approved_by() # For ghstack, cherry-pick commits based from origin @@ -340,7 +563,7 @@ def merge_ghstack_into(self, repo: GitRepo) -> None: continue approved_by = pr.get_approved_by() # Raises exception if matching rule is not found - find_matching_merge_rule(pr, repo) + find_matching_merge_rule(pr, repo, force=force) # Adding the url here makes it clickable within the Github UI approved_by_urls = ', '.join(prefix_with_github_url(login) for login in approved_by) @@ -349,9 +572,11 @@ def merge_ghstack_into(self, repo: GitRepo) -> None: msg += f"\nApproved by: {approved_by_urls}\n" repo.amend_commit_message(msg) - def merge_into(self, repo: GitRepo, dry_run: bool = False) -> None: + def merge_into(self, repo: GitRepo, *, force: bool = False, dry_run: bool = False) -> None: # Raises exception if matching rule is not found - find_matching_merge_rule(self, repo) + find_matching_merge_rule(self, repo, force=force) + if self.has_internal_changes(): + raise RuntimeError("This PR must be landed via phabricator") if repo.current_branch() != self.default_branch(): repo.checkout(self.default_branch()) if not self.is_ghstack_pr(): @@ -365,7 +590,7 @@ def merge_into(self, repo: GitRepo, dry_run: bool = False) -> None: repo._run_git("merge", "--squash", pr_branch_name) repo._run_git("commit", f"--author=\"{self.get_author()}\"", "-m", msg) else: - self.merge_ghstack_into(repo) + self.merge_ghstack_into(repo, force) repo.push(self.default_branch(), dry_run) @@ -375,7 +600,7 @@ class MergeRule: name: str patterns: List[str] approved_by: List[str] - mandatory_app_id: Optional[int] + mandatory_checks_name: Optional[List[str]] def read_merge_rules(repo: GitRepo) -> List[MergeRule]: @@ -389,57 +614,85 @@ def read_merge_rules(repo: GitRepo) -> List[MergeRule]: return cast(List[MergeRule], rc) - -def find_matching_merge_rule(pr: GitHubPR, repo: GitRepo) -> MergeRule: +def find_matching_merge_rule(pr: GitHubPR, repo: GitRepo, force: bool = False) -> MergeRule: """Returns merge rule matching to this pr or raises an exception""" changed_files = pr.get_changed_files() approved_by = set(pr.get_approved_by()) rules = read_merge_rules(repo) + reject_reason = f"PR {pr.pr_num} does not match merge rules" + # Used to determine best rejection reason + # Score 0 to 10K - how many files rule matched + # Score 10K - matched all files, but no overlapping approvers + # Score 20K - matched all files and approvers, but lacks mandatory checks + reject_reason_score = 0 for rule in rules: rule_name = rule.name - rule_approvers_set = set(rule.approved_by) + rule_approvers_set = set() + for approver in rule.approved_by: + if "/" in approver: + org, name = approver.split("/") + 
rule_approvers_set.update(gh_get_team_members(org, name)) + else: + rule_approvers_set.add(approver) patterns_re = patterns_to_regex(rule.patterns) approvers_intersection = approved_by.intersection(rule_approvers_set) - # If rule requires approvers but they aren't the ones that reviewed PR - if len(approvers_intersection) == 0 and len(rule_approvers_set) > 0: - print(f"Skipping rule {rule_name} due to no approvers overlap") - continue - if rule.mandatory_app_id is not None: - cs_conslusions = pr.get_check_suite_conclusions() - mandatory_app_id = rule.mandatory_app_id - if mandatory_app_id not in cs_conslusions or cs_conslusions[mandatory_app_id] != "SUCCESS": - print(f"Skipping rule {rule_name} as mandatory app {mandatory_app_id} is not in {cs_conslusions}") - continue non_matching_files = [] for fname in changed_files: if not patterns_re.match(fname): non_matching_files.append(fname) if len(non_matching_files) > 0: - print(f"Skipping rule {rule_name} due to non-matching files: {non_matching_files}") + num_matching_files = len(changed_files) - len(non_matching_files) + if num_matching_files > reject_reason_score: + reject_reason_score = num_matching_files + reject_reason = (f"{num_matching_files} files matched rule {rule_name}, but there are still non-matching files: " + + f"{','.join(non_matching_files[:5])}{', ...' if len(non_matching_files) > 5 else ''}") continue - print(f"Matched rule {rule_name} for {pr.pr_num}") + # If rule requires approvers but they aren't the ones that reviewed PR + if len(approvers_intersection) == 0 and len(rule_approvers_set) > 0: + if reject_reason_score < 10000: + reject_reason_score = 10000 + reject_reason = (f"Matched rule {rule_name}, but it was not reviewed yet by any of:" + + f"{','.join(list(rule_approvers_set)[:5])}{', ...' 
if len(rule_approvers_set) > 5 else ''}") + continue + if rule.mandatory_checks_name is not None: + pass_checks = True + checks = pr.get_checkrun_conclusions() + # HACK: We don't want to skip CLA check, even when forced + for checkname in filter(lambda x: force is False or "CLA Check" in x, rule.mandatory_checks_name): + if checkname not in checks or checks[checkname] != "SUCCESS": + if reject_reason_score < 20000: + reject_reason_score = 20000 + reject_reason = f"Refusing to merge as mandatory check {checkname} " + reject_reason += "has not been run" if checkname not in checks else "failed" + reject_reason += f" for rule {rule_name}" + pass_checks = False + if not pass_checks: + continue + if pr.has_internal_changes(): + raise RuntimeError("This PR has internal changes and must be landed via Phabricator") return rule - raise RuntimeError(f"PR {pr.pr_num} does not match merge rules") + raise RuntimeError(reject_reason) -def try_revert(repo: GitRepo, pr: GitHubPR, dry_run: bool = False) -> None: +def try_revert(repo: GitRepo, pr: GitHubPR, *, dry_run: bool = False, comment_id: Optional[int] = None) -> None: def post_comment(msg: str) -> None: gh_post_comment(pr.org, pr.project, pr.pr_num, msg, dry_run=dry_run) if not pr.is_closed(): return post_comment(f"Can't revert open PR #{pr.pr_num}") - if not RE_REVERT_CMD.match(pr.get_comment_body()): - raise RuntimeError(f"Comment {pr.get_comment_body()} does not seem to be a valid revert command") - if pr.get_comment_editor_login() is not None: + comment = pr.get_last_comment() if comment_id is None else pr.get_comment_by_id(comment_id) + if not RE_REVERT_CMD.match(comment.body_text): + raise RuntimeError(f"Comment {comment.body_text} does not seem to be a valid revert command") + if comment.editor_login is not None: return post_comment("Don't want to revert based on edited command") - author_association = pr.get_comment_author_association() - author_login = pr.get_comment_author_login() + author_association = comment.author_association + author_login = comment.author_login # For some reason, one can not be a member of private repo, only CONTRIBUTOR expected_association = "CONTRIBUTOR" if pr.is_base_repo_private() else "MEMBER" if author_association != expected_association and author_association != "OWNER": return post_comment(f"Will not revert as @{author_login} is not a {expected_association}, but {author_association}") - # Raises exception if matching rule is not found - find_matching_merge_rule(pr, repo) + # Raises exception if matching rule is not found, but ignores all status checks + find_matching_merge_rule(pr, repo, force=True) commit_sha = pr.get_merge_commit() if commit_sha is None: commits = repo.commits_resolving_gh_pr(pr.pr_num) @@ -473,7 +726,7 @@ def main() -> None: pr = GitHubPR(org, project, args.pr_num) if args.revert: try: - try_revert(repo, pr, dry_run=args.dry_run) + try_revert(repo, pr, dry_run=args.dry_run, comment_id=args.comment_id) except Exception as e: msg = f"Reverting PR {args.pr_num} failed due to {e}" run_url = os.getenv("GH_RUN_URL") @@ -491,7 +744,7 @@ def main() -> None: return try: - pr.merge_into(repo, dry_run=args.dry_run) + pr.merge_into(repo, dry_run=args.dry_run, force=args.force) except Exception as e: msg = f"Merge failed due to {e}" run_url = os.getenv("GH_RUN_URL") diff --git a/.github/templates/android_ci_full_workflow.yml.j2 b/.github/templates/android_ci_full_workflow.yml.j2 deleted file mode 100644 index 9736bee5c4ed81..00000000000000 --- a/.github/templates/android_ci_full_workflow.yml.j2 +++ 
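The reject_reason_score bookkeeping in find_matching_merge_rule keeps only the most specific failure: a partial file match scores by how many files the rule's patterns covered (0 to 10K), a full file match with no overlapping approver scores 10K, and a full match that only lacks a mandatory check scores 20K, so the message finally raised comes from the rule that got furthest before failing. A toy illustration of that ordering, with entirely hypothetical rule names and messages:

```python
# (score, message) pairs, as the loop above would assign them for three imaginary rules.
candidates = [
    (3, "3 files matched rule docs-only, but there are still non-matching files: setup.py"),
    (10000, "Matched rule core, but it was not reviewed yet by any of: alice, bob"),
    (20000, "Refusing to merge as mandatory check Lint failed for rule ci"),
]
# The highest score wins, so the mandatory-check failure is what the user sees.
score, reason = max(candidates, key=lambda c: c[0])
print(reason)
```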
/dev/null @@ -1,165 +0,0 @@ -{%- extends "linux_ci_workflow.yml.j2" -%} -{% import 'common_android.yml.j2' as common_android %} -{%- set exclude_test = true -%} -{% block name -%} -# Template is at: .github/templates/android_ci_full_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if is_default %} - pull_request: -{%- endif -%} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - push: - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} - -{% block build +%} - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: !{{ test_runner_type }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build-and-test - NUM_TEST_SHARDS: !{{ num_test_shards }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - !{{ common.calculate_docker_image(false) }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - !{{ common.parse_ref() }} - !{{ common_android.build_android("pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a-build", "arm-v7a") }} - !{{ common_android.build_android("pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a-build", "arm-v8a") }} - !{{ common_android.build_android("pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32-build", "x86_32") }} - !{{ common_android.build_android("pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64-build", "x86_64") }} - - name: Build final artifact - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - set -eux - - docker_image_libtorch_android_x86_32="${DOCKER_IMAGE}-x86_32" - docker_image_libtorch_android_x86_64="${DOCKER_IMAGE}-x86_64" - docker_image_libtorch_android_arm_v7a="${DOCKER_IMAGE}-arm-v7a" - docker_image_libtorch_android_arm_v8a="${DOCKER_IMAGE}-arm-v8a" - - echo "docker_image_commit: ${DOCKER_IMAGE}" - echo "docker_image_libtorch_android_x86_32: ${docker_image_libtorch_android_x86_32}" - echo "docker_image_libtorch_android_x86_64: ${docker_image_libtorch_android_x86_64}" - echo "docker_image_libtorch_android_arm_v7a: ${docker_image_libtorch_android_arm_v7a}" - echo "docker_image_libtorch_android_arm_v8a: ${docker_image_libtorch_android_arm_v8a}" - - # x86_32 - time docker pull "${docker_image_libtorch_android_x86_32}" >/dev/null - export id_x86_32 - id_x86_32=$(docker run -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_x86_32}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "${id_x86_32}" bash) 2>&1 - - # arm-v7a - time docker pull "${docker_image_libtorch_android_arm_v7a}" >/dev/null - export id_arm_v7a - id_arm_v7a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_arm_v7a}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u 
jenkins -i "${id_arm_v7a}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" - docker cp "${id_arm_v7a}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" - - # x86_64 - time docker pull "${docker_image_libtorch_android_x86_64}" >/dev/null - export id_x86_64 - id_x86_64=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_x86_64}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "${id_x86_64}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_x86_64" - docker cp "${id_x86_64}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_x86_64" - - # arm-v8a - time docker pull "${docker_image_libtorch_android_arm_v8a}" >/dev/null - export id_arm_v8a - id_arm_v8a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_arm_v8a}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" - docker cp "${id_arm_v8a}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" - - # Putting everything together - docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v7a" - docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_64" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_x86_64" - docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v8a" - - # run gradle buildRelease - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build" \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --user jenkins \ - -u jenkins -i "${id_x86_32}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" - docker cp "${id_x86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" - - output_image="${DOCKER_IMAGE}-android-x86_32-gradle" - docker commit "${id_x86_32}" "${output_image}" - time docker push "${output_image}" - !{{ common_android.upload_androind_binary_size("prebuilt", "${GITHUB_WORKSPACE}/build_android_artifacts/artifacts.tgz") }} - - uses: !{{ common.upload_artifact_s3_action }} - name: Store PyTorch Android Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - build_android_artifacts/artifacts.tgz - !{{ common.teardown_ec2_linux() }} -{%- endblock %} diff --git a/.github/templates/android_ci_workflow.yml.j2 b/.github/templates/android_ci_workflow.yml.j2 deleted file mode 100644 index 
c86b94c1ad48b8..00000000000000 --- a/.github/templates/android_ci_workflow.yml.j2 +++ /dev/null @@ -1,111 +0,0 @@ -{%- extends "linux_ci_workflow.yml.j2" -%} -{% import 'common_android.yml.j2' as common_android %} -{%- set exclude_test = true -%} -{% block name -%} -# Template is at: .github/templates/android_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if is_default %} - pull_request: -{%- endif -%} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - push: - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} - -{% block build +%} - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: !{{ test_runner_type }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build-and-test - NUM_TEST_SHARDS: !{{ num_test_shards }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - !{{ common.calculate_docker_image(false) }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Build - run: | - set -e - # Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because: - # 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build; - # 2) Not parallelizable by architecture: it only builds libtorch for one architecture; - - echo "DOCKER_IMAGE: ${DOCKER_IMAGE}" - time docker pull "${DOCKER_IMAGE}" >/dev/null - - export BUILD_LITE_INTERPRETER - BUILD_LITE_INTERPRETER="1" - if [[ "${BUILD_ENVIRONMENT}" == *"full-jit" ]]; then - BUILD_LITE_INTERPRETER="0" - fi - - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - # shellcheck disable=SC2016 - export id - id=$(docker run -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e PR_LABELS \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e BUILD_LITE_INTERPRETER \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "$(pwd):/var/lib/jenkins/workspace" \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - -t -d -w /var/lib/jenkins "${DOCKER_IMAGE}") - - # shellcheck disable=SC2016 - export COMMAND - # shellcheck disable=SC2016 - COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1' - echo "${COMMAND}" > ./command.sh && bash ./command.sh - # Skip docker push as this job is purely 
for size analysis purpose. - # Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied. - !{{ common.parse_ref() }} - !{{ common_android.upload_androind_binary_size("custom-build-single", "") }} - !{{ common.teardown_ec2_linux() }} -{%- endblock %} diff --git a/.github/templates/bazel_ci_workflow.yml.j2 b/.github/templates/bazel_ci_workflow.yml.j2 deleted file mode 100644 index 0480835794bc84..00000000000000 --- a/.github/templates/bazel_ci_workflow.yml.j2 +++ /dev/null @@ -1,127 +0,0 @@ -{%- extends "linux_ci_workflow.yml.j2" -%} -{% import 'common_android.yml.j2' as common_android %} -{%- set exclude_test = true -%} -{% block name -%} -# Template is at: .github/templates/bazel_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if is_default %} - pull_request: -{%- endif -%} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - push: - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} - -{% block build +%} - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: !{{ test_runner_type }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build-and-test - NUM_TEST_SHARDS: !{{ num_test_shards }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - !{{ common.calculate_docker_image(false) }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Build - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e PR_LABELS \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . 
&& sudo chown -R jenkins /dev && .jenkins/pytorch/build.sh' - !{{ common.parse_ref() }} - !{{ common_android.upload_androind_binary_size("", "")}} - - name: Test - # Time out the test phase after 3.5 hours - timeout-minutes: 210 - run: | - # detached container should get cleaned up by teardown_ec2_linux - export SHARD_NUMBER=0 - # TODO: Stop building test binaries as part of the build phase - # Make sure we copy test results from bazel-testlogs symlink to - # a regular directory ./test/test-reports - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e SHARD_NUMBER \ - -e NUM_TEST_SHARDS \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e PR_LABELS \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && sudo chown -R jenkins /dev && .jenkins/pytorch/test.sh && cp -Lr ./bazel-testlogs ./test/test-reports' - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - !{{ common.upload_test_reports(name='bazel') }} - !{{ common.upload_downloaded_files(name='bazel') }} - !{{ common.upload_test_statistics(build_environment) }} - !{{ common.teardown_ec2_linux() }} -{%- endblock %} diff --git a/.github/templates/common.yml.j2 b/.github/templates/common.yml.j2 index 154745bcc98271..f701f92cf64cee 100644 --- a/.github/templates/common.yml.j2 +++ b/.github/templates/common.yml.j2 @@ -1,4 +1,4 @@ -{%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v3" -%} +{%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v4" -%} {# squid_proxy is an private ELB that only available for GHA custom runners #} {%- set squid_proxy = "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -%} @@ -22,6 +22,37 @@ concurrency: } {%- endmacro -%} +{%- macro gen_dispatch_rules(on_pull_request, is_scheduled, ciflow_labels, branches = ['master', 'main', 'release/*'], enable_doc_jobs = True) -%} +on: +{%- if on_pull_request %} + pull_request: +{%- endif %} + push: +{%- if enable_doc_jobs and is_scheduled %} + tags: + # NOTE: Binary build pipelines should only get triggered on release candidate builds + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ +{%- endif %} +{%- for label in ciflow_labels | sort %} + {%- if loop.first and not (enable_doc_jobs and is_scheduled) %} + tags: + {%- endif %} + - '!{{ label }}/*' +{%- endfor %} +{%- if not is_scheduled %} + branches: +{%- for branch in branches %} + - !{{ branch }} +{%- endfor %} +{%- endif %} +{%- if is_scheduled %} + schedule: + - cron: !{{ is_scheduled }} +{%- endif %} + workflow_dispatch: +{%- endmacro -%} + {%- macro display_ec2_information() -%} - name: Display EC2 information shell: bash @@ -36,6 +67,7 @@ concurrency: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo 
"system info $(uname -a)" {%- endmacro -%} {%- macro parse_ref(pytorch_directory="") -%} @@ -56,20 +88,25 @@ concurrency: if: !{{ when }} env: AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} BRANCH: ${{ steps.parse-ref.outputs.branch }} JOB_BASE_NAME: !{{ build_environment }}-test PR_NUMBER: ${{ github.event.pull_request.number }} SHA1: ${{ github.event.pull_request.head.sha || github.sha }} TAG: ${{ steps.parse-ref.outputs.tag }} WORKFLOW_ID: '${{ github.run_id }}' + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} {%- if needs_credentials %} - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }} + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} {%- endif %} shell: bash run: | + set -x python3 -m pip install -r requirements.txt python3 -m pip install boto3==1.19.12 + GHA_WORKFLOW_JOB_ID=$(python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}") + export GHA_WORKFLOW_JOB_ID python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test {%- endmacro -%} @@ -87,19 +124,23 @@ concurrency: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore {%- endmacro -%} {%- macro setup_ec2_linux() -%} - !{{ display_ec2_information() }} - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - !{{ add_retry_to_env() }} - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | !{{ add_retry_to_env() }} @@ -114,9 +155,6 @@ concurrency: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" {%- endmacro -%} {%- macro setup_rocm_linux() -%} @@ -296,6 +334,25 @@ concurrency: test-reports-*.zip {%- endmacro -%} +{%- macro upload_cores(artifact_name="coredumps", config=None, shard=None, use_s3=True) -%} +{%- if use_s3 %}- uses: !{{ upload_artifact_s3_action }} + name: Store Core dumps on S3 +{%- else %}- uses: actions/upload-artifact@v2 + name: Store Core dumps on Github +{%- endif %} + if: failure() + with: +{%- if config != "" and shard != "" %} + name: !{{ artifact_name }}-!{{ config }}-!{{ shard }} +{%- else %} + name: !{{ artifact_name }} +{%- endif %} + retention-days: 14 + if-no-files-found: ignore + path: + ./**/core.[1-9]* +{%- endmacro -%} + {%- macro render_test_results() -%} - name: Install render_test_results dependencies if: always() diff --git a/.github/templates/common_android.yml.j2 b/.github/templates/common_android.yml.j2 deleted file mode 100644 index a0e4e781b6adf0..00000000000000 --- a/.github/templates/common_android.yml.j2 +++ /dev/null @@ -1,81 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- macro upload_androind_binary_size(build_type, artifacts) -%} - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - AWS_DEFAULT_REGION: us-east-1 - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - # The artifact file is created inside docker container, which contains the result binaries. - # Now unpackage it into the project folder. The subsequent script will scan project folder - # to locate result binaries and report their sizes. - # If artifact file is not provided it assumes that the project folder has been mounted in - # the docker during build and already contains the result binaries, so this step can be skipped. 
- export ARTIFACTS=!{{ artifacts }} - if [ -n "${ARTIFACTS}" ]; then - tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" - cd "${GITHUB_WORKSPACE}" - fi - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - ANDROID_BUILD_TYPE=!{{ build_type}} - export ANDROID_BUILD_TYPE - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 -{%- endmacro -%} - -{%- macro build_android(env_name, container_suffix) -%} - - name: Build-!{{ container_suffix }} - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - #!/bin/bash -eo pipefail - # Pull Docker image and run build - time docker pull "${DOCKER_IMAGE}" >/dev/null - echo "${DOCKER_IMAGE}" - export container_name - container_name=$(docker run \ - -e BUILD_ENVIRONMENT=!{{ env_name }} \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "${container_name}" bash) 2>&1 - - # Copy dist folder back - export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-!{{ container_suffix }} - docker cp "${container_name}:/var/lib/jenkins/workspace/dist" "${GITHUB_WORKSPACE}/." 
|| echo "Dist folder not found" - docker commit "${container_name}" "${COMMIT_DOCKER_IMAGE}" - time docker push "${COMMIT_DOCKER_IMAGE}" -{%- endmacro -%} diff --git a/.github/templates/docker_builds_ci_workflow.yml.j2 b/.github/templates/docker_builds_ci_workflow.yml.j2 deleted file mode 100644 index 224f683a35a47b..00000000000000 --- a/.github/templates/docker_builds_ci_workflow.yml.j2 +++ /dev/null @@ -1,60 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- block name -%} -# Template is at: .github/templates/docker_builds_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: - workflow_dispatch: - pull_request: - types: [opened, synchronize, reopened] - paths: - - '.circleci/docker/**' - - '.github/workflows/generated-docker-builds.yml' -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- endif %} -!{{ common.concurrency(build_environment) }} - -env: - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - AWS_DEFAULT_REGION: us-east-1 - -jobs: -{% block docker_build +%} - docker-build: - runs-on: linux.2xlarge - timeout-minutes: !{{ common.timeout_minutes }} - strategy: - matrix: - include: - {%- for docker_image in docker_images %} - - docker_image_base: '!{{ docker_image }}' - docker_image_short_name: '!{{ docker_image.split('/')[-1] }}' - {%- endfor %} - env: - DOCKER_IMAGE_BASE: '${{ matrix.docker_image_base }}' - name: docker-build (${{ matrix.docker_image_short_name }}) - steps: - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - !{{ common.calculate_docker_image(true) }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - !{{ common.parse_ref() }} - !{{ common.teardown_ec2_linux() }} - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af -{%- endblock %} diff --git a/.github/templates/ios_ci_workflow.yml.j2 b/.github/templates/ios_ci_workflow.yml.j2 deleted file mode 100644 index 0dd6cbbfff8a3a..00000000000000 --- a/.github/templates/ios_ci_workflow.yml.j2 +++ /dev/null @@ -1,184 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- block name -%} -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if is_default %} - pull_request: -{%- endif %} -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- endif %} - push: -{%- if not is_scheduled %} - branches: - - master - - main - - release/* -{%- endif %} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: !{{ build_environment }} - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: !{{ ios_platform }} - IOS_ARCH: !{{ ios_arch }} -!{{ common.set_xcode_version(xcode_version) }} - -jobs: -{% block build +%} - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || 
github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: !{{ common.timeout_minutes }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.checkout() }} - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' 
- exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi -{%- if ios_platform == "SIMULATOR" %} - - name: Run Simulator Tests - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html - # generate models for differnet backends - cd "${GITHUB_WORKSPACE}/ios/TestApp/benchmark" - mkdir -p ../models - if [ "${USE_COREML_DELEGATE}" == 1 ]; then - pip install coremltools==5.0b5 - pip install six==1.16.0 - python coreml_backend.py - else - python trace_model.py - fi - if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then - echo "Setting up the TestApp for LiteInterpreter" - ruby setup.rb --lite 1 - else - echo "Setting up the TestApp for Full JIT" - ruby setup.rb - fi - cd "${GITHUB_WORKSPACE}/ios/TestApp" - instruments -s -devices - if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then - if [ "${USE_COREML_DELEGATE}" == 1 ]; then - fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML - else - fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter - fi - else - fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT - fi -{%- endif -%} -{% endblock +%} - -!{{ common.concurrency(build_environment) }} diff --git a/.github/templates/linux_binary_build_workflow.yml.j2 b/.github/templates/linux_binary_build_workflow.yml.j2 index 97f2795f4f6405..f10b39a72ced0e 100644 --- a/.github/templates/linux_binary_build_workflow.yml.j2 +++ b/.github/templates/linux_binary_build_workflow.yml.j2 @@ -9,17 +9,22 @@ name: !{{ build_environment }} on: push: + {%- if branches == "nightly" %} # NOTE: Meta Employees can trigger new nightlies using: https://fburl.com/trigger_pytorch_nightly_build + {%- endif %} branches: - - nightly + - !{{ branches }} + {%- if branches == "nightly" %} tags: # NOTE: Binary build pipelines should only get triggered on release candidate builds # Release candidate tags look like: v1.11.0-rc1 - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + {%- endif %} {%- for label in ciflow_config.labels | sort %} - {%- if label != "ciflow/default" %} + {%- if loop.first and branches != "nightly" %} + tags: + {%- endif %} - '!{{ label }}/*' - {%- endif %} {%- endfor %} workflow_dispatch: @@ -114,7 +119,7 @@ jobs: !{{ upload.binary_env(config) }} steps: !{{ common.setup_ec2_linux() }} - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: !{{ config["build_name"] }} @@ -172,5 +177,7 @@ jobs: docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" !{{ common.teardown_ec2_linux("pytorch/") }} + {%- if branches == "nightly" %} !{{ upload.upload_binaries(config) }} + {%- endif %} {%- endfor %} diff --git a/.github/templates/linux_ci_workflow.yml.j2 b/.github/templates/linux_ci_workflow.yml.j2 deleted file mode 100644 index 7bbdfe04b3f6e0..00000000000000 --- a/.github/templates/linux_ci_workflow.yml.j2 +++ /dev/null @@ -1,446 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- block name -%} -# Template is at: 
.github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if on_pull_request %} - pull_request: -{%- endif %} - push: -{%- if enable_doc_jobs and is_scheduled %} - tags: - # NOTE: Binary build pipelines should only get triggered on release candidate builds - # Release candidate tags look like: v1.11.0-rc1 - - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ -{%- endif %} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first and not (enable_doc_jobs and is_scheduled) %} - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} -{%- if not is_scheduled %} - branches: - - master - - main - - release/* -{%- endif %} -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- endif %} - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: !{{ build_environment }} - DOCKER_IMAGE_BASE: !{{ docker_image_base }} - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -{%- if enable_xla_test == 1 %} - # This is used for XLA tests only - XLA_CUDA: 0 - XLA_IMAGE_TAG: v0.2 -{%- endif %} -{%- if build_with_debug %} - DEBUG: 1 -{%- endif %} -!{{ common.concurrency(build_environment) }} - -jobs: -{% block build +%} - build: - runs-on: linux.2xlarge - timeout-minutes: !{{ common.timeout_minutes }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - {%- if enable_xla_test == 1 %} - - name: Calculate docker image tag - id: calculate-tag - run: | - echo "XLA workflow uses pre-built test image at ${XLA_IMAGE_TAG}" - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" - {%- else %} - !{{ common.calculate_docker_image(false) }} - {%- endif %} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - !{{ common.parse_ref() }} - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - {%- if enable_xla_test == 1 %} - -e XLA_CUDA \ - {%- endif %} - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e 
CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="!{{ common.squid_proxy }}" -e https_proxy="!{{ common.squid_proxy }}" -e no_proxy="!{{ common.squid_no_proxy }}" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - {%- if build_generates_artifacts %} - - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: !{{ common.upload_artifact_s3_action }} - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - {%- endif %} - !{{ common.teardown_ec2_linux() }} - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af -{%- endblock %} -{%- if not exclude_test %} -{% block test +%} - {%- for test_job in test_jobs %} - !{{ test_job.id }}: - name: !{{ test_job.name }} - needs: build - runs-on: !{{ test_job.runner }} - timeout-minutes: !{{ timeout_after + 30 }} - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: !{{ build_environment }}-test - TEST_CONFIG: !{{ test_job.config }} - SHARD_NUMBER: !{{ test_job.shard }} - NUM_TEST_SHARDS: !{{ test_job.num_shards }} - PR_BODY: ${{ github.event.pull_request.body }} - steps: -{%- if 'rocm' in test_runner_type %} - !{{ common.setup_rocm_linux() }} -{%- else %} - !{{ common.setup_ec2_linux() }} -{%- endif %} - !{{ common.checkout() }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" -{%- if 'rocm' in test_runner_type and "nogpu" not in test_job.config %} - - name: ROCm set GPU_FLAG - run: | - echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" -{%- elif "cuda" in build_environment and "nogpu" not in test_job.config %} - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo 
"GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" -{%- endif %} - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | -{%- if 'rocm' in test_runner_type %} - df -H -{%- else %} - sudo df -H -{%- endif %} - !{{ common.parse_ref() }} - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after !{{ timeout_after }} minutes - timeout-minutes: !{{ timeout_after }} - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi -{%- if 'rocm' not in test_runner_type %} - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=!{{ common.squid_proxy }} -e https_proxy=!{{ common.squid_proxy }} -e no_proxy=!{{ common.squid_no_proxy }}" - fi -{%- endif %} - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - {%- if enable_xla_test == 1 %} - -e XLA_CUDA \ - {%- endif %} - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ -{%- if 'rocm' not in test_runner_type %} - ${PROXY_ENV} \ -{%- endif %} - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ -{%- if 'rocm' not in test_runner_type %} - --ipc=host \ -{%- endif %} - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) -{%- if 'rocm' in test_runner_type %} - # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home - docker exec -t "${container_name}" sh -c "cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" - # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct - docker exec -t "${container_name}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" -{%- else %} - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" -{%- endif %} -{%- if 'rocm' not in test_runner_type %} - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . -{%- endif %} - !{{ common.render_test_results() }} -{%- if 'rocm' in test_runner_type %} - !{{ common.upload_downloaded_files(name='linux', use_s3=False, config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner) }} - !{{ common.upload_test_reports(name='linux', artifact_name="test-reports", use_s3=False, config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner) }} -{%- else %} - !{{ common.upload_downloaded_files(name='linux', config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner) }} - !{{ common.upload_test_reports(name='linux', config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner) }} -{%- endif %} - !{{ common.upload_test_statistics(build_environment) }} -{%- if 'rocm' in test_runner_type %} - !{{ common.teardown_rocm_linux() }} -{%- else %} - !{{ common.teardown_ec2_linux() }} -{%- endif %} -{%- endfor %} -{% endblock %} -{%- endif -%} -{%- if enable_doc_jobs %} - build-docs: - runs-on: linux.2xlarge - timeout-minutes: !{{ common.timeout_minutes }} - strategy: - matrix: - docs_type: [cpp, python] - needs: [build] - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - DOCS_TYPE: ${{ matrix.docs_type }} - WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} - steps: - !{{ common.setup_ec2_linux() }} - !{{ common.checkout() }} - - name: Pull Docker image - run: | - !{{ common.add_retry_to_env() }} - retry docker pull "${DOCKER_IMAGE}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip -{%- if is_scheduled %} - - name: Generate netrc (only for docs-push) - if: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} - env: - GITHUB_PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} - run: | - # set credentials for https pushing - echo "machine github.com" > "${RUNNER_TEMP}/.netrc" - echo "login pytorchbot" >> "${RUNNER_TEMP}/.netrc" - echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" -{%- endif %} - - name: Build ${{ matrix.docs_type }} docs - run: | - set -ex - time docker pull "${DOCKER_IMAGE}" > /dev/null - # Convert refs/tags/v1.12.0rc3 into 1.12 - if [[ "${GITHUB_REF}" =~ ^refs/tags/v([0-9]+\.[0-9]+)\.* ]]; then - target="${BASH_REMATCH[1]}" - else - target="master" - fi - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e IN_CI \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SHA1="$GITHUB_SHA" \ - -e DOCS_VERSION="${target}" \ - -e DOCS_TYPE 
\ - -e PR_LABELS \ - -e WITH_PUSH \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ -{%- if is_scheduled %} - -v "${RUNNER_TEMP}/.netrc":/var/lib/jenkins/.netrc \ -{%- endif %} - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . && pip install dist/*.whl && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: !{{ common.upload_artifact_s3_action }} - name: Upload Python Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} - with: - retention-days: 14 - s3-bucket: doc-previews - if-no-files-found: error - path: pytorch.github.io/docs/master/ - s3-prefix: pytorch/${{ github.event.pull_request.number }} - - uses: !{{ common.upload_artifact_s3_action }} - name: Upload C++ Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} - with: - retention-days: 14 - if-no-files-found: error - s3-bucket: doc-previews - path: cppdocs/ - s3-prefix: pytorch/${{ github.event.pull_request.number }}/cppdocs -{%- endif -%} diff --git a/.github/templates/macos_binary_build_workflow.yml.j2 b/.github/templates/macos_binary_build_workflow.yml.j2 index 10b0f6310d2a66..e788e608619008 100644 --- a/.github/templates/macos_binary_build_workflow.yml.j2 +++ b/.github/templates/macos_binary_build_workflow.yml.j2 @@ -33,9 +33,10 @@ on: # Release candidate tags look like: v1.11.0-rc1 - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ {%- for label in ciflow_config.labels | sort %} - {%- if label != "ciflow/default" %} + {%- if loop.first and branches != "nightly" %} + tags: + {%- endif %} - '!{{ label }}/*' - {%- endif %} {%- endfor %} workflow_dispatch: @@ -59,6 +60,7 @@ env: jobs: {%- for config in build_configs %} !{{ config["build_name"] }}-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 {%- if config["package_type"] == "libtorch" %} # libtorch builds take a long time on github hosted runners diff --git a/.github/templates/macos_ci_workflow.yml.j2 b/.github/templates/macos_ci_workflow.yml.j2 deleted file mode 100644 index 47fa86fac54b05..00000000000000 --- a/.github/templates/macos_ci_workflow.yml.j2 +++ /dev/null @@ -1,131 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- block name -%} -# Template is at: .github/templates/macos_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} -{%- endblock %} - -on: -{%- if is_default -%} - pull_request: -{%- endif -%} - -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- else %} - push: - branches: - - master - - main - - release/* -{%- endif %} -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} - workflow_dispatch: - -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} -env: - BUILD_ENVIRONMENT: !{{ build_environment }} - COMPACT_JOB_NAME: !{{ build_environment }} - IN_CI: 1 - IS_GHA: 1 - PYTORCH_RETRY_TEST_CASES: 1 -!{{ common.set_xcode_version(xcode_version) }} - 
-jobs: -{% block build +%} - build: - runs-on: !{{ test_runner_type }} - env: - JOB_BASE_NAME: !{{ build_environment }} - # For sccache access (only on non-forked PRs) - AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - !{{ common.checkout() }} - !{{ common.setup_miniconda("3.8") }} - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Install sccache (only for non-forked PRs, and pushes to trunk) - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - - name: Build - run: | - echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" - .jenkins/pytorch/macos-build.sh -{%- if build_generates_artifacts %} - - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ - - uses: actions/upload-artifact@v2 - name: Store PyTorch Build Artifacts on GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip -{%- endif %} -{% endblock +%} -{%- if not exclude_test %} -{% block test +%} - {%- for test_job in test_jobs %} - !{{ test_job.id }}: - name: !{{ test_job.name }} - needs: build - runs-on: !{{ test_job.runner }} - timeout-minutes: !{{ common.timeout_minutes }} - env: - JOB_BASE_NAME: !{{ build_environment }}-test - TEST_CONFIG: !{{ test_job.config }} - SHARD_NUMBER: !{{ test_job.shard }} - NUM_TEST_SHARDS: !{{ test_job.num_shards }} - PR_BODY: ${{ github.event.pull_request.body }} - steps: - !{{ common.checkout(submodules="false") }} - - uses: actions/download-artifact@v2 - name: Download PyTorch Build Artifacts from GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: . 
- - name: Unzip artifacts - run: | - unzip -o artifacts.zip - !{{ common.setup_miniconda("3.8") }} - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - !{{ common.parse_ref() }} - - name: Test - run: | - python3 -mpip install dist/*.whl - .jenkins/pytorch/macos-test.sh - !{{ common.render_test_results() }} - !{{ common.upload_downloaded_files(name='macos', config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner, artifact_name="test-jsons", use_s3=False) }} - !{{ common.upload_test_reports("macos", config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner, artifact_name="test-reports", use_s3=False) }} - !{{ common.upload_test_statistics(build_environment, needs_credentials=True) }} -{%- endfor %} -{% endblock +%} -{%- endif %} - -!{{ common.concurrency(build_environment) }} diff --git a/.github/templates/upload.yml.j2 b/.github/templates/upload.yml.j2 index 4c680eea47d714..63bec412997e27 100644 --- a/.github/templates/upload.yml.j2 +++ b/.github/templates/upload.yml.j2 @@ -52,7 +52,7 @@ - name: Clone pytorch/pytorch uses: actions/checkout@v2 {%- if use_s3 %} - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 {%- else %} - uses: actions/download-artifact@v2 {%- endif %} diff --git a/.github/templates/windows_binary_build_workflow.yml.j2 b/.github/templates/windows_binary_build_workflow.yml.j2 index df018fc43919bf..0fcfbf9096b805 100644 --- a/.github/templates/windows_binary_build_workflow.yml.j2 +++ b/.github/templates/windows_binary_build_workflow.yml.j2 @@ -21,17 +21,22 @@ name: !{{ build_environment }} on: push: + {%- if branches == "nightly" %} # NOTE: Meta Employees can trigger new nightlies using: https://fburl.com/trigger_pytorch_nightly_build + {%- endif %} branches: - - nightly + - !{{ branches }} + {%- if branches == "nightly" %} tags: # NOTE: Binary build pipelines should only get triggered on release candidate builds # Release candidate tags look like: v1.11.0-rc1 - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + {%- endif %} {%- for label in ciflow_config.labels | sort %} - {%- if label != "ciflow/default" %} + {%- if loop.first and branches != "nightly" %} + tags: + {%- endif %} - '!{{ label }}/*' - {%- endif %} {%- endfor %} workflow_dispatch: @@ -54,6 +59,7 @@ env: jobs: {%- for config in build_configs %} !{{ config["build_name"] }}-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: !{{ common.timeout_minutes }} !{{ upload.binary_env(config, True) }} @@ -91,7 +97,7 @@ jobs: steps: !{{ common.setup_ec2_windows() }} !{{ set_runner_specific_vars() }} - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: !{{ config["build_name"] }} @@ -107,5 +113,7 @@ jobs: run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" !{{ common.wait_and_kill_ssh_windows('pytorch') }} + {%- if branches == "nightly" %} !{{ upload.upload_binaries(config, True) }} + {%- endif %} {%- endfor %} diff --git a/.github/templates/windows_ci_workflow.yml.j2 b/.github/templates/windows_ci_workflow.yml.j2 deleted file mode 100644 index af1561343a9b05..00000000000000 --- a/.github/templates/windows_ci_workflow.yml.j2 +++ /dev/null @@ -1,208 +0,0 @@ -{% import 'common.yml.j2' as common %} - -{%- macro wait_and_kill_ssh() -%} - - name: Wait until 
all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 -{%- endmacro -%} - -# Template is at: .github/templates/windows_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: !{{ build_environment }} - -on: -{%- if on_pull_request %} - pull_request: -{%- endif %} - push: -{%- for label in ciflow_config.labels | sort %} - {%- if loop.first %} - tags: - {%- endif %} - {%- if label != "ciflow/default" %} - - '!{{ label }}/*' - {%- endif %} -{%- endfor %} -{%- if not is_scheduled %} - branches: - - master - - main - - release/* -{%- endif %} -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- endif %} - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: !{{ build_environment }} - BUILD_WHEEL: 1 - MAX_JOBS: 8 - CUDA_VERSION: "!{{ cuda_version }}" - IN_CI: 1 - IS_GHA: 1 - INSTALL_WINDOWS_SDK: 1 - PYTHON_VERSION: "3.8" - PYTORCH_RETRY_TEST_CASES: 1 - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - SCCACHE_BUCKET: "ossci-compiler-cache" - VC_PRODUCT: "BuildTools" - VC_VERSION: "" - VS_VERSION: "16.8.6" - VC_YEAR: "2019" - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - no_proxy: !{{ common.squid_no_proxy }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} -{%- if build_with_debug %} - DEBUG: 1 -{%- endif %} -{%- if cuda_version != "cpu" %} - TORCH_CUDA_ARCH_LIST: "7.0" -{%- endif %} - USE_CUDA: !{{ 1 if cuda_version != "cpu" else 0 }} - -!{{ common.concurrency(build_environment) }} - -jobs: - build: - runs-on: "windows.4xlarge" - timeout-minutes: !{{ common.timeout_minutes }} - env: - JOB_BASE_NAME: !{{ build_environment }}-build - http_proxy: "!{{ common. 
squid_proxy }}" - https_proxy: "!{{ common.squid_proxy }}" - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - !{{ common.checkout() }} - !{{ common.display_ec2_information() }} - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 -{%- if cuda_version != "cpu" %} - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh -{%- endif %} - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - !{{ common.parse_ref() }} - - name: Build - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - .jenkins/pytorch/win-build.sh - # Upload to github so that people can click and download artifacts - - name: Upload artifacts to s3 - uses: !{{ common.upload_artifact_s3_action }} - with: - retention-days: 14 - if-no-files-found: error - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - !{{ common.wait_and_kill_ssh_windows() }} - - name: Cleanup build-results and workspaces - if: always() - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}" - rm -rf ./* - - {%- for test_job in test_jobs %} - !{{ test_job.id }}: - name: !{{ test_job.name }} - timeout-minutes: !{{ timeout_after + 30 }} - env: - JOB_BASE_NAME: !{{ build_environment }}-test - SHARD_NUMBER: !{{ test_job.shard }} - NUM_TEST_SHARDS: !{{ test_job.num_shards }} - TEST_CONFIG: !{{ test_job.config }} - http_proxy: "!{{ common.squid_proxy }}" - https_proxy: "!{{ common.squid_proxy }}" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: !{{ test_job.runner }} - steps: - !{{ common.display_ec2_information() }} - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - !{{ common.checkout() }} - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 -{%- if cuda_version != "cpu" and not test_job.config == 'force_on_cpu' %} - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh -{%- endif %} - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after !{{ timeout_after }} minutes - timeout-minutes: !{{ timeout_after }} - run: | - .jenkins/pytorch/win-test.sh - !{{ common.upload_downloaded_files(name='windows', config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, 
runner=test_job.runner) }} - !{{ common.upload_test_reports(name='windows', config=test_job.config, shard=test_job.shard, num_shards=test_job.num_shards, runner=test_job.runner) }} - !{{ common.render_test_results() }} - !{{ common.wait_and_kill_ssh_windows() }} - !{{ common.parse_ref() }} - !{{ common.upload_test_statistics(build_environment) }} - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - {%- endfor %} diff --git a/.github/workflows/_android-build-test.yml b/.github/workflows/_android-build-test.yml new file mode 100644 index 00000000000000..a489d7d7e002d4 --- /dev/null +++ b/.github/workflows/_android-build-test.yml @@ -0,0 +1,150 @@ +name: android-build-test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + docker-image-name: + required: true + type: string + description: Name of the base docker image to build with. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + build-and-test: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, linux.2xlarge] + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Calculate docker image + id: calculate-docker-image + uses: ./.github/actions/calculate-docker-image + with: + docker-image-name: ${{ inputs.docker-image-name }} + xla: ${{ contains(inputs.build-environment, 'xla') }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + + - name: Output disk space left + run: | + sudo df -H + + - name: Preserve github env variables for use in docker + run: | + env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" + + - name: Build + env: + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-build-and-test + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + TORCH_CUDA_ARCH_LIST: 5.2 + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} + run: | + set -e + # Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because: + # 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build; + # 2) Not parallelizable by architecture: it only builds libtorch for one architecture; + + echo "DOCKER_IMAGE: ${DOCKER_IMAGE}" + time docker pull "${DOCKER_IMAGE}" >/dev/null + + export BUILD_LITE_INTERPRETER + BUILD_LITE_INTERPRETER="1" + if [[ "${BUILD_ENVIRONMENT}" == *"full-jit" ]]; then + BUILD_LITE_INTERPRETER="0" + fi + + git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 + export id + id=$(docker run -e BUILD_ENVIRONMENT \ + -e JOB_BASE_NAME \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SCCACHE_BUCKET \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e PR_LABELS \ + -e 
SKIP_SCCACHE_INITIALIZATION=1 \ + -e TORCH_CUDA_ARCH_LIST \ + -e BUILD_LITE_INTERPRETER \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --tty \ + --detach \ + --user jenkins \ + -v "$(pwd):/var/lib/jenkins/workspace" \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -t -d -w /var/lib/jenkins "${DOCKER_IMAGE}") + + export COMMAND + # shellcheck disable=SC2016 + COMMAND='(echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh" | docker exec -u jenkins -e BUILD_LITE_INTERPRETER -e GRADLE_OFFLINE=1 -i "$id" bash) 2>&1' + echo "${COMMAND}" > ./command.sh && bash ./command.sh + # Skip docker push as this job is purely for size analysis purpose. + # Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied. + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Display and upload binary build size statistics (Click Me) + # temporary hack: set CIRCLE_* vars, until we update + # tools/stats/print_test_stats.py to natively support GitHub Actions + env: + AWS_DEFAULT_REGION: us-east-1 + BRANCH: ${{ steps.parse-ref.outputs.branch }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + ARTIFACTS: "" + ANDROID_BUILD_TYPE: custom-build-single + run: | + # The artifact file is created inside docker container, which contains the result binaries. + # Now unpackage it into the project folder. The subsequent script will scan project folder + # to locate result binaries and report their sizes. + # If artifact file is not provided it assumes that the project folder has been mounted in + # the docker during build and already contains the result binaries, so this step can be skipped. + if [ -n "${ARTIFACTS}" ]; then + tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" + cd "${GITHUB_WORKSPACE}" + fi + COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) + export COMMIT_TIME + pip3 install requests==2.26 boto3==1.16.34 + python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 + + - name: Chown workspace + uses: ./.github/actions/chown-workspace + if: always() + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/_android-full-build-test.yml b/.github/workflows/_android-full-build-test.yml new file mode 100644 index 00000000000000..d0b8845a662097 --- /dev/null +++ b/.github/workflows/_android-full-build-test.yml @@ -0,0 +1,222 @@ +name: android-full-build-test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + docker-image-name: + required: true + type: string + description: Name of the base docker image to build with. 
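      # Usage sketch: a caller workflow consumes this reusable workflow via `uses:`,
      # passing the inputs above through `with:` and the secrets declared below through
      # an explicit `secrets:` block. The caller job name and the two example values
      # here are illustrative assumptions, not taken from this diff:
      #
      #   jobs:
      #     android-full-build-test:
      #       uses: ./.github/workflows/_android-full-build-test.yml
      #       with:
      #         build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
      #         docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c
      #       secrets:
      #         SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }}
      #         SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }}
      #         ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }}
      #         ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }}
      #         SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}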
+ + secrets: + SONATYPE_NEXUS_USERNAME: + description: nexus user + required: true + SONATYPE_NEXUS_PASSWORD: + description: nexus pass + required: true + ANDROID_SIGN_KEY: + description: android key + required: true + ANDROID_SIGN_PASS: + description: android pass + required: true + SCRIBE_GRAPHQL_ACCESS_TOKEN: + description: token for writing to scribe/scuba + required: true + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + build: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, linux.2xlarge] + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Calculate docker image + id: calculate-docker-image + uses: ./.github/actions/calculate-docker-image + with: + docker-image-name: ${{ inputs.docker-image-name }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + + - name: Output disk space left + shell: bash + run: | + sudo df -H + + - name: Preserve github env variables for use in docker + shell: bash + run: | + env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Build arm-v7a + uses: ./.github/actions/build-android + with: + arch: arm_v7a + arch-for-build-env: arm-v7a + github-secret: ${{ secrets.GITHUB_TOKEN }} + build-environment: ${{ inputs.build-environment }} + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + branch: ${{ steps.parse-ref.outputs.branch }} + + - name: Build arm-v8a + uses: ./.github/actions/build-android + with: + arch: arm_v8a + arch-for-build-env: arm-v8a + github-secret: ${{ secrets.GITHUB_TOKEN }} + build-environment: ${{ inputs.build-environment }} + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + branch: ${{ steps.parse-ref.outputs.branch }} + + - name: Build x86_32 + id: build-x86_32 + uses: ./.github/actions/build-android + with: + arch: x86_32 + arch-for-build-env: x86_32 + github-secret: ${{ secrets.GITHUB_TOKEN }} + build-environment: ${{ inputs.build-environment }} + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + branch: ${{ steps.parse-ref.outputs.branch }} + + - name: Build x86_64 + uses: ./.github/actions/build-android + with: + arch: x86_64 + arch-for-build-env: x86_64 + github-secret: ${{ secrets.GITHUB_TOKEN }} + build-environment: ${{ inputs.build-environment }} + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + branch: ${{ steps.parse-ref.outputs.branch }} + + - name: Build final artifact + env: + BRANCH: ${{ steps.parse-ref.outputs.branch }} + DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} + AWS_DEFAULT_REGION: us-east-1 + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }} + run: | + set -eux + + # Putting everything together + # 
ID_X86_32 container were created during build-x86_32 step + docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v7a" + docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_64" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_x86_64" + docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v8a" + docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_32" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_x86_32" + + # run gradle buildRelease + (echo "./.circleci/scripts/build_android_gradle.sh" | docker exec \ + -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build" \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e AWS_DEFAULT_REGION \ + -e IS_GHA \ + -e PR_NUMBER \ + -e SHA1 \ + -e BRANCH \ + -e GITHUB_RUN_ID \ + -e SCCACHE_BUCKET \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e SKIP_SCCACHE_INITIALIZATION=1 \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --user jenkins \ + -u jenkins -i "${ID_X86_32}" bash) 2>&1 + + mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" + docker cp "${ID_X86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" + + - name: Display and upload binary build size statistics (Click Me) + # temporary hack: set CIRCLE_* vars, until we update + # tools/stats/print_test_stats.py to natively support GitHub Actions + env: + AWS_DEFAULT_REGION: us-east-1 + BRANCH: ${{ steps.parse-ref.outputs.branch }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + ANDROID_BUILD_TYPE: prebuilt + SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} + run: | + # The artifact file is created inside docker container, which contains the result binaries. + # Now unpackage it into the project folder. The subsequent script will scan project folder + # to locate result binaries and report their sizes. + # If artifact file is not provided it assumes that the project folder has been mounted in + # the docker during build and already contains the result binaries, so this step can be skipped. 
+ export ARTIFACTS=${GITHUB_WORKSPACE}/build_android_artifacts/artifacts.tgz + if [ -n "${ARTIFACTS}" ]; then + tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" + cd "${GITHUB_WORKSPACE}" + fi + COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) + export COMMIT_TIME + pip3 install requests==2.26 boto3==1.16.34 + python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 + + - name: Publish android snapshot + if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/nightly' }} + env: + SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} + SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} + ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} + ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} + ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }} + run: | + set -eux + (echo "./.circleci/scripts/publish_android_snapshot.sh" | docker exec \ + -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot" \ + -e SONATYPE_NEXUS_USERNAME \ + -e SONATYPE_NEXUS_PASSWORD \ + -e ANDROID_SIGN_KEY \ + -e ANDROID_SIGN_PASS \ + -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ + -u jenkins -i "${ID_X86_32}" bash) 2>&1 + + - name: Store PyTorch Android Build Artifacts on S3 + uses: seemethere/upload-artifact-s3@v4 + with: + name: ${{ inputs.build-environment }} + retention-days: 14 + if-no-files-found: error + path: build_android_artifacts/artifacts.tgz + + - name: Chown workspace + uses: ./.github/actions/chown-workspace + if: always() + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/_bazel-build-test.yml b/.github/workflows/_bazel-build-test.yml new file mode 100644 index 00000000000000..57a5d47af3c25c --- /dev/null +++ b/.github/workflows/_bazel-build-test.yml @@ -0,0 +1,185 @@ +name: bazel + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + docker-image-name: + required: true + type: string + description: Name of the base docker image to build with. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + build-and-test: + # Don't run on forked repos. 
+ if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, linux.2xlarge] + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Calculate docker image + id: calculate-docker-image + uses: ./.github/actions/calculate-docker-image + with: + docker-image-name: ${{ inputs.docker-image-name }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + + - name: Output disk space left + run: | + sudo df -H + + - name: Preserve github env variables for use in docker + run: | + env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Build + env: + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-build-and-test + # TODO duplicated + AWS_DEFAULT_REGION: us-east-1 + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + TORCH_CUDA_ARCH_LIST: 5.2 + DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} + run: | + # detached container should get cleaned up by teardown_ec2_linux + container_name=$(docker run \ + -e BUILD_ENVIRONMENT \ + -e JOB_BASE_NAME \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SCCACHE_BUCKET \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e PR_LABELS \ + -e SKIP_SCCACHE_INITIALIZATION=1 \ + -e TORCH_CUDA_ARCH_LIST \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --tty \ + --detach \ + --user jenkins \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . 
&& sudo chown -R jenkins /dev && .jenkins/pytorch/build.sh' + + # !{{ common_android.upload_androind_binary_size("", "")}} + - name: Test + # Time out the test phase after 3.5 hours + timeout-minutes: 210 + env: + JOB_BASE_NAME: ${{ inputs.build-environment }}-build-and-test + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + PYTORCH_RETRY_TEST_CASES: 1 + PR_BODY: ${{ github.event.pull_request.body }} + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} + run: | + # detached container should get cleaned up by teardown_ec2_linux + export SHARD_NUMBER=0 + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + export COMMIT_MESSAGES + # TODO: Stop building test binaries as part of the build phase + # Make sure we copy test results from bazel-testlogs symlink to + # a regular directory ./test/test-reports + container_name=$(docker run \ + -e BUILD_ENVIRONMENT \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e GITHUB_ACTIONS \ + -e GIT_DEFAULT_BRANCH="$GIT_DEFAULT_BRANCH" \ + -e IN_CI \ + -e SHARD_NUMBER \ + -e NUM_TEST_SHARDS \ + -e JOB_BASE_NAME \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SCCACHE_BUCKET \ + -e PR_LABELS \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --shm-size="1g" \ + --tty \ + --detach \ + --user jenkins \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . 
&& sudo chown -R jenkins /dev && .jenkins/pytorch/test.sh && cp -Lr ./bazel-testlogs ./test/test-reports' + + - name: Chown workspace + uses: ./.github/actions/chown-workspace + if: always() + + - name: Get workflow job id + id: get-job-id + uses: pytorch/pytorch/.github/actions/get-workflow-job-id@master + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() + with: + file-suffix: bazel-${{ github.job }}_${{ steps.get-job-id.outputs.job-id }} + + - name: Upload test statistics + if: always() + env: + AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} + shell: bash + run: | + set -x + python3 -m pip install -r requirements.txt + python3 -m pip install boto3==1.19.12 + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/_docs.yml b/.github/workflows/_docs.yml new file mode 100644 index 00000000000000..96ed63cbb0f6a4 --- /dev/null +++ b/.github/workflows/_docs.yml @@ -0,0 +1,132 @@ +name: build docs + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + docker-image: + required: true + type: string + description: Docker image to run in. + push: + required: false + type: boolean + default: false + description: If set, push the docs to the docs website. + + secrets: + GH_PYTORCHBOT_TOKEN: + required: false + description: Permissions for pushing to the docs site. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + +jobs: + build-docs: + # Don't run on forked repos. 
+ if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, linux.2xlarge] + strategy: + matrix: + docs_type: [cpp, python] + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ inputs.docker-image }} + + - name: Download build artifacts + uses: ./.github/actions/download-build-artifacts + with: + name: ${{ inputs.build-environment }} + + - name: Generate netrc (only for docs-push) + if: inputs.push + env: + GITHUB_PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} + run: | + # set credentials for https pushing + echo "machine github.com" > "${RUNNER_TEMP}/.netrc" + echo "login pytorchbot" >> "${RUNNER_TEMP}/.netrc" + echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" + + - name: Build ${{ matrix.docs_type }} docs + env: + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} + DOCKER_IMAGE: ${{ inputs.docker-image }} + DOCS_TYPE: ${{ matrix.docs_type }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + run: | + set -ex + # Convert refs/tags/v1.12.0rc3 into 1.12 + if [[ "${GITHUB_REF}" =~ ^refs/tags/v([0-9]+\.[0-9]+)\.* ]]; then + target="${BASH_REMATCH[1]}" + else + target="master" + fi + # detached container should get cleaned up by teardown_ec2_linux + container_name=$(docker run \ + -e BUILD_ENVIRONMENT \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e IN_CI \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SHA1="$GITHUB_SHA" \ + -e DOCS_VERSION="${target}" \ + -e DOCS_TYPE \ + -e PR_LABELS \ + -e WITH_PUSH \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --tty \ + --detach \ + --user jenkins \ + -v "${RUNNER_TEMP}/.netrc":/var/lib/jenkins/.netrc \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . 
&& pip install dist/*.whl && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" + + - name: Chown workspace + uses: ./.github/actions/chown-workspace + if: always() + + - name: Upload Python Docs Preview + uses: seemethere/upload-artifact-s3@v4 + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} + with: + retention-days: 14 + s3-bucket: doc-previews + if-no-files-found: error + path: pytorch.github.io/docs/master/ + s3-prefix: pytorch/${{ github.event.pull_request.number }} + + - name: Upload C++ Docs Preview + uses: seemethere/upload-artifact-s3@v4 + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} + with: + retention-days: 14 + if-no-files-found: error + s3-bucket: doc-previews + path: cppdocs/ + s3-prefix: pytorch/${{ github.event.pull_request.number }}/cppdocs diff --git a/.github/workflows/generated-ios-12-5-1-x86-64.yml b/.github/workflows/_ios-build-test.yml similarity index 79% rename from .github/workflows/generated-ios-12-5-1-x86-64.yml rename to .github/workflows/_ios-build-test.yml index b4e762094b8a3b..fa3b7e2836f8f3 100644 --- a/.github/workflows/generated-ios-12-5-1-x86-64.yml +++ b/.github/workflows/_ios-build-test.yml @@ -1,58 +1,62 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-x86-64 +name: ios-build-test on: - push: - branches: - - master - - main - - release/* - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/trunk/*' - workflow_dispatch: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + ios-platform: + required: true + type: string + description: Which iOS platform to build for. + ios-arch: + required: true + type: string + description: Which iOS arch to build for. 
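      # Usage sketch: a generated caller (for example, a regenerated ios-12-5-1-x86-64
      # workflow) would invoke this reusable workflow roughly as below. The with: values
      # mirror the env block of the old generated workflow in this diff; the caller job
      # name is an illustrative assumption:
      #
      #   jobs:
      #     ios-12-5-1-x86-64:
      #       uses: ./.github/workflows/_ios-build-test.yml
      #       with:
      #         build-environment: ios-12-5-1-x86-64
      #         ios-platform: SIMULATOR
      #         ios-arch: x86_64
      #       secrets:
      #         IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }}
      #         IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }}
      #         IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }}
      #         IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }}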
+ + secrets: + IOS_CERT_KEY_2022: + required: true + description: ios cert + IOS_CERT_SECRET: + required: true + description: ios cert + IOS_DEV_TEAM_ID: + required: true + description: ios cert + IOS_SIGN_KEY_2022: + required: true + description: ios cert env: - BUILD_ENVIRONMENT: ios-12-5-1-x86-64 IN_CI: 1 IS_GHA: 1 - IOS_PLATFORM: SIMULATOR - IOS_ARCH: x86_64 - + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + IOS_PLATFORM: ${{ inputs.ios-platform }} + IOS_ARCH: ${{ inputs.ios-arch }} jobs: - build: # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} + if: github.repository_owner == 'pytorch' runs-on: macos-10.15 timeout-minutes: 240 env: - JOB_BASE_NAME: ios-12-5-1-x86-64-build + JOB_BASE_NAME: ${{ inputs.build-environment }}-build IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} steps: - - name: print labels - run: echo "${PR_LABELS}" + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Populate CI build options run: | # Most builds use the lite interpreter, if certain builds shouldn't @@ -74,10 +78,12 @@ jobs: echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" ;; esac + - name: Install brew dependencies run: | # Install dependencies brew install libtool + - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on @@ -98,6 +104,7 @@ jobs: requests \ setuptools \ typing_extensions + - name: Run Fastlane run: | set -x @@ -118,6 +125,7 @@ jobs: echo "${IOS_SIGN_KEY_2022}" >> cert.txt base64 --decode cert.txt -o ${PROFILE} rm cert.txt + - name: Build run: | # shellcheck disable=SC1091 @@ -126,6 +134,7 @@ jobs: python -VV export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} scripts/build_ios.sh + - name: Run Build Test run: | PROFILE=PyTorch_CI_2022 @@ -139,7 +148,9 @@ jobs: else ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" fi + - name: Run Simulator Tests + if: inputs.ios-platform == 'SIMULATOR' run: | # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" @@ -152,8 +163,10 @@ jobs: pip install six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${GITHUB_WORKSPACE}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${GITHUB_WORKSPACE}/ios/TestApp/benchmark" if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -167,12 +180,8 @@ jobs: if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing 
TestAppTests/TestAppTests/testCoreML else - fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter + fastlane scan --skip_testing TestAppTests/TestAppTests/testCoreML fi else fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT fi - -concurrency: - group: ios-12-5-1-x86-64-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/_linux-build.yml b/.github/workflows/_linux-build.yml new file mode 100644 index 00000000000000..cf6419f208e2a8 --- /dev/null +++ b/.github/workflows/_linux-build.yml @@ -0,0 +1,158 @@ +name: linux-build + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + docker-image-name: + required: true + type: string + description: Name of the base docker image to build with. + build-generates-artifacts: + required: false + type: boolean + default: true + description: If set, upload generated build artifacts. + build-with-debug: + required: false + type: boolean + default: false + description: If set, build in debug mode. + + outputs: + docker-image: + value: ${{ jobs.build.outputs.docker-image }} + description: The docker image containing the built PyTorch. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + +jobs: + build: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, linux.2xlarge] + timeout-minutes: 240 + outputs: + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + steps: + # [pytorch repo ref] + # Use a pytorch/pytorch reference instead of a reference to the local + # checkout because when we run this action we don't *have* a local + # checkout. In other cases you should prefer a local checkout. + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Check for new workflows + run: | + if [ ! -f "./.github/actions/setup-linux/action.yml" ]; then + echo "::error::Your PR is based on a version of master that is too old for our CI to work. Please rebase your PR on latest master and resubmit." 
+ exit 1 + fi + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Calculate docker image + id: calculate-docker-image + uses: ./.github/actions/calculate-docker-image + with: + docker-image-name: ${{ inputs.docker-image-name }} + xla: ${{ contains(inputs.build-environment, 'xla') }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Build + env: + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-build + # TODO duplicated + AWS_DEFAULT_REGION: us-east-1 + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + TORCH_CUDA_ARCH_LIST: 5.2 + DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} + XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }} + DEBUG: ${{ inputs.build-with-debug && '1' || '0' }} + run: | + # detached container should get cleaned up by teardown_ec2_linux + container_name=$(docker run \ + -e BUILD_ENVIRONMENT \ + -e JOB_BASE_NAME \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e AWS_DEFAULT_REGION \ + -e IS_GHA \ + -e PR_NUMBER \ + -e SHA1 \ + -e BRANCH \ + -e GITHUB_RUN_ID \ + -e SCCACHE_BUCKET \ + -e XLA_CUDA \ + -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e SKIP_SCCACHE_INITIALIZATION=1 \ + -e TORCH_CUDA_ARCH_LIST \ + -e PR_LABELS \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --tty \ + --detach \ + --user jenkins \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + docker exec -t "${container_name}" sh -c '.jenkins/pytorch/build.sh' + + - name: Display and upload binary build size statistics (Click Me) + # temporary hack: set CIRCLE_* vars, until we update + # tools/stats/print_test_stats.py to natively support GitHub Actions + env: + BRANCH: ${{ steps.parse-ref.outputs.branch }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + run: | + COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) + export COMMIT_TIME + pip3 install requests==2.26 boto3==1.16.34 + python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 + + - name: Archive artifacts into zip + if: inputs.build-generates-artifacts + run: | + zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json + + - name: Store PyTorch Build Artifacts on S3 + uses: seemethere/upload-artifact-s3@v4 + if: inputs.build-generates-artifacts + with: + name: ${{ inputs.build-environment }} + retention-days: 14 + if-no-files-found: error + path: artifacts.zip + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/_linux-test.yml b/.github/workflows/_linux-test.yml new file mode 100644 index 00000000000000..8c203b87ebcc5b --- /dev/null +++ 
b/.github/workflows/_linux-test.yml @@ -0,0 +1,193 @@ +name: linux-test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + test-matrix: + required: true + type: string + description: JSON description of what test configs to run. + docker-image: + required: true + type: string + description: Docker image to run in. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + test: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + strategy: + matrix: ${{ fromJSON(inputs.test-matrix) }} + fail-fast: false + runs-on: ${{ matrix.runner }} + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ inputs.docker-image }} + + - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + + - name: Download build artifacts + uses: ./.github/actions/download-build-artifacts + with: + name: ${{ inputs.build-environment }} + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Test + env: + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + PYTORCH_RETRY_TEST_CASES: 1 + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + TEST_CONFIG: ${{ matrix.config }} + SHARD_NUMBER: ${{ matrix.shard }} + NUM_TEST_SHARDS: ${{ matrix.num_shards }} + PR_BODY: ${{ github.event.pull_request.body }} + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }} + DOCKER_IMAGE: ${{ inputs.docker-image }} + XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }} + XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla + timeout-minutes: 240 + run: | + set -x + + if [[ $TEST_CONFIG == 'multigpu' ]]; then + TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh + elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then + TEST_COMMAND=.jenkins/caffe2/test.sh + else + TEST_COMMAND=.jenkins/pytorch/test.sh + fi + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + export COMMIT_MESSAGES + + # detached container should get cleaned up by teardown_ec2_linux + # TODO: Stop building test binaries as part of the build phase + # Used for GPU_FLAG since that doesn't play nice + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BUILD_ENVIRONMENT \ + -e PR_NUMBER \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e GITHUB_ACTIONS \ + -e IN_CI \ + -e IS_GHA \ + -e BRANCH \ + -e SHA1 \ + -e 
AWS_DEFAULT_REGION \ + -e IN_WHEEL_TEST \ + -e SHARD_NUMBER \ + -e JOB_BASE_NAME \ + -e TEST_CONFIG \ + -e NUM_TEST_SHARDS \ + -e PR_BODY \ + -e COMMIT_MESSAGES \ + -e PYTORCH_RETRY_TEST_CASES \ + -e PR_LABELS \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SCCACHE_BUCKET \ + -e XLA_CUDA \ + -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --ulimit stack=10485760:83886080 \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --ipc=host \ + --shm-size="${SHM_SIZE}" \ + --tty \ + --detach \ + --name="${container_name}" \ + --user jenkins \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + docker exec -t "${container_name}" sh -c "pip install dist/*.whl && ${TEST_COMMAND}" + + - name: Get workflow job id + id: get-job-id + uses: pytorch/pytorch/.github/actions/get-workflow-job-id@master + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() + with: + file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }} + + - name: Store Core dumps on S3 + uses: seemethere/upload-artifact-s3@v4 + if: failure() + with: + name: coredumps-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }} + retention-days: 14 + if-no-files-found: ignore + path: + ./**/core.[1-9]* + + - name: Upload test statistics + if: always() + env: + AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} + shell: bash + run: | + set -x + python3 -m pip install -r requirements.txt + python3 -m pip install boto3==1.19.12 + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/_mac-build.yml b/.github/workflows/_mac-build.yml new file mode 100644 index 00000000000000..bfda5df5dd104e --- /dev/null +++ b/.github/workflows/_mac-build.yml @@ -0,0 +1,102 @@ +name: mac-build + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + runner-type: + required: true + type: string + description: Name of the GitHub-managed runner type to use for the build. + build-generates-artifacts: + required: true + type: boolean + description: If set, upload generated build artifacts. + xcode-version: + required: false + type: string + default: "" + description: What xcode version to build with. + + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: + required: true + description: Access key for S3 bucket for macOS sccache. + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: + required: true + description: Secret for S3 bucket for macOS sccache. 
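The same calling convention ties the build and test halves together: `_linux-build.yml` publishes the resolved `docker-image` as a workflow output, and `_linux-test.yml` consumes it together with a JSON `test-matrix` string that `fromJSON` expands into the job's strategy matrix. A minimal sketch of how a hypothetical top-level workflow might chain the two (the caller's job names, matrix contents, and exact environment label are illustrative assumptions; the input and output names come from the two workflow files above):

```yaml
# Hypothetical caller jobs (illustrative wiring); build feeds its docker-image output into test
jobs:
  linux-xenial-py3_7-gcc5_4-build:
    uses: ./.github/workflows/_linux-build.yml
    with:
      build-environment: linux-xenial-py3.7-gcc5.4
      docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4

  linux-xenial-py3_7-gcc5_4-test:
    needs: linux-xenial-py3_7-gcc5_4-build
    uses: ./.github/workflows/_linux-test.yml
    with:
      build-environment: linux-xenial-py3.7-gcc5.4
      docker-image: ${{ needs.linux-xenial-py3_7-gcc5_4-build.outputs.docker-image }}
      test-matrix: |
        { "include": [
            { "config": "default", "shard": 1, "num_shards": 2, "runner": "linux.2xlarge" },
            { "config": "default", "shard": 2, "num_shards": 2, "runner": "linux.2xlarge" }
        ]}
```

The macOS, ROCm, and Windows counterparts that follow repeat this pattern, each with its own inputs and forwarded secrets.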
+ +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + +# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 +defaults: + run: + shell: bash -e -l {0} + +jobs: + build: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + runs-on: ${{ inputs.runner-type }} + env: + JOB_BASE_NAME: ${{ inputs.build-environment }} + # For sccache access (only on non-forked PRs) + AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + COMPACT_JOB_NAME: ${{ inputs.build-environment }} + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Set xcode version + env: + XCODE_VERSION: ${{ inputs.xcode-version }} + run: | + if [ -n "${XCODE_VERSION}" ]; then + echo "DEVELOPER_DIR=/Applications/Xcode_${XCODE_VERSION}.app/Contents/Developer" >> "${GITHUB_ENV}" + fi + + - name: Setup miniconda + uses: conda-incubator/setup-miniconda@v2 + with: + auto-update-conda: true + python-version: 3.8 + activate-environment: build + + - name: Install macOS homebrew dependencies + run: | + # Install dependencies + brew install libomp + + - name: Install sccache (only for non-forked PRs, and pushes to trunk) + if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} + run: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + + - name: Build + run: | + echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" + .jenkins/pytorch/macos-build.sh + + - name: Archive artifacts into zip + if: inputs.build-generates-artifacts + run: | + zip -1 -r artifacts.zip dist/ + + - name: Store PyTorch Build Artifacts on GHA + uses: actions/upload-artifact@v2 + if: inputs.build-generates-artifacts + with: + name: ${{ env.BUILD_ENVIRONMENT }} + retention-days: 14 + if-no-files-found: error + path: artifacts.zip diff --git a/.github/workflows/_mac-test.yml b/.github/workflows/_mac-test.yml new file mode 100644 index 00000000000000..2234ae78f3206a --- /dev/null +++ b/.github/workflows/_mac-test.yml @@ -0,0 +1,120 @@ +name: mac-test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + test-matrix: + required: true + type: string + description: JSON description of what test configs to run. + + secrets: + AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: + required: true + description: access key id for test stats upload + AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: + required: true + description: secret acess key for test stats upload + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + +# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 +defaults: + run: + shell: bash -e -l {0} + +jobs: + test: + # Don't run on forked repos. 
+ if: github.repository_owner == 'pytorch' + strategy: + matrix: ${{ fromJSON(inputs.test-matrix) }} + fail-fast: false + runs-on: ${{ matrix.runner }} + timeout-minutes: 240 + env: + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + COMPACT_JOB_NAME: ${{ inputs.build-environment }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + TEST_CONFIG: ${{ matrix.config }} + SHARD_NUMBER: ${{ matrix.shard }} + NUM_TEST_SHARDS: ${{ matrix.num_shards }} + PR_BODY: ${{ github.event.pull_request.body }} + PYTORCH_RETRY_TEST_CASES: 1 + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Download build artifacts + uses: ./.github/actions/download-build-artifacts + with: + name: ${{ inputs.build-environment }} + use-gha: true + + - name: Setup miniconda + uses: conda-incubator/setup-miniconda@v2 + with: + auto-update-conda: true + python-version: 3.8 + activate-environment: build + + - name: Install macOS homebrew dependencies + run: | + # Install dependencies + brew install libomp + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Test + run: | + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + export COMMIT_MESSAGES + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + + - name: Get workflow job id + id: get-job-id + uses: pytorch/pytorch/.github/actions/get-workflow-job-id@master + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() + with: + use-gha: true + file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }} + + - name: Upload test statistics + if: always() + env: + AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} + GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} + shell: bash + run: | + set -x + python3 -m pip install -r requirements.txt + python3 -m pip install boto3==1.19.12 + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test diff --git a/.github/workflows/_rocm-test.yml b/.github/workflows/_rocm-test.yml new file mode 100644 index 00000000000000..167e73424def88 --- /dev/null +++ b/.github/workflows/_rocm-test.yml @@ -0,0 +1,176 @@ +# TODO: this looks sort of similar to _linux-test, but there are like a dozen +# places where you would have to insert an if statement. Probably it's better to +# just use a different workflow altogether + +name: test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + test-matrix: + required: true + type: string + description: JSON description of what test configs to run. 
+ docker-image: + required: true + type: string + description: Docker image to run in. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + test: + # Don't run on forked repos. + if: github.repository_owner == 'pytorch' + timeout-minutes: 270 + strategy: + matrix: ${{ fromJSON(inputs.test-matrix) }} + fail-fast: false + runs-on: ${{ matrix.runner }} + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + no-sudo: true + + - name: Setup ROCm + uses: ./.github/actions/setup-rocm + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ inputs.docker-image }} + + - name: Download build artifacts + uses: ./.github/actions/download-build-artifacts + with: + name: ${{ inputs.build-environment }} + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Test + env: + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + PYTORCH_RETRY_TEST_CASES: 1 + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + TEST_CONFIG: ${{ matrix.config }} + SHARD_NUMBER: ${{ matrix.shard }} + NUM_TEST_SHARDS: ${{ matrix.num_shards }} + PR_BODY: ${{ github.event.pull_request.body }} + SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + DOCKER_IMAGE: ${{ inputs.docker-image }} + XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla + timeout-minutes: 240 + run: | + set -x + + if [[ $TEST_CONFIG == 'multigpu' ]]; then + TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh + elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then + TEST_COMMAND=.jenkins/caffe2/test.sh + else + TEST_COMMAND=.jenkins/pytorch/test.sh + fi + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + export COMMIT_MESSAGES + + # detached container should get cleaned up by teardown_ec2_linux + # TODO: Stop building test binaries as part of the build phase + # Used for GPU_FLAG since that doesn't play nice + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BUILD_ENVIRONMENT \ + -e PR_NUMBER \ + -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ + -e GITHUB_ACTIONS \ + -e IN_CI \ + -e IS_GHA \ + -e BRANCH \ + -e SHA1 \ + -e AWS_DEFAULT_REGION \ + -e IN_WHEEL_TEST \ + -e SHARD_NUMBER \ + -e JOB_BASE_NAME \ + -e TEST_CONFIG \ + -e NUM_TEST_SHARDS \ + -e PR_BODY \ + -e COMMIT_MESSAGES \ + -e PYTORCH_RETRY_TEST_CASES \ + -e PR_LABELS \ + -e MAX_JOBS="$(nproc --ignore=2)" \ + -e SCCACHE_BUCKET \ + -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ + --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ + --ulimit stack=10485760:83886080 \ + --security-opt seccomp=unconfined \ + --cap-add=SYS_PTRACE \ + --shm-size="8g" \ + --tty \ + --detach \ + --name="${container_name}" \ + --user jenkins \ + -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ + -w /var/lib/jenkins/workspace \ + "${DOCKER_IMAGE}" + ) + # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home + docker exec -t "${container_name}" sh -c "cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" + # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct + docker exec -t "${container_name}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" + + - name: Get workflow job id + id: get-job-id + uses: pytorch/pytorch/.github/actions/get-workflow-job-id@master + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() + with: + use-gha: true + file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }} + + - name: Upload test statistics + if: always() + env: + AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} + shell: bash + run: | + set -x + python3 -m pip install -r requirements.txt + python3 -m pip install boto3==1.19.12 + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() + with: + skip-wait-ssh: true diff --git a/.github/workflows/_win-build.yml b/.github/workflows/_win-build.yml new file mode 100644 index 00000000000000..abd7aca07f7a6d --- /dev/null +++ b/.github/workflows/_win-build.yml @@ -0,0 +1,94 @@ +name: windows-build + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + cuda-version: + required: true + type: string + description: What CUDA version to build with, "cpu" for none. + build-with-debug: + required: false + type: boolean + default: false + description: If set, build in debug mode. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + build: + # Don't run on forked repos. 
+ if: github.repository_owner == 'pytorch' + runs-on: [self-hosted, windows.4xlarge] + timeout-minutes: 240 + env: + JOB_BASE_NAME: ${{ inputs.build-environment }}-build + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + no-sudo: true + + - name: Setup Windows + uses: ./.github/actions/setup-win + with: + cuda-version: ${{ inputs.cuda-version }} + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Build + shell: bash + env: + PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ + BRANCH: ${{ steps.parse-ref.outputs.branch }} + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + BUILD_WHEEL: 1 + MAX_JOBS: 8 + CUDA_VERSION: ${{ inputs.cuda-version }} + PYTHON_VERSION: "3.8" + PYTORCH_RETRY_TEST_CASES: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + SCCACHE_BUCKET: "ossci-compiler-cache" + VC_PRODUCT: "BuildTools" + VC_VERSION: "" + VC_YEAR: "2019" + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + AWS_DEFAULT_REGION: us-east-1 + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + DEBUG: ${{ inputs.build-with-debug && '1' || '0' }} + TORCH_CUDA_ARCH_LIST: "7.0" + USE_CUDA: ${{ inputs.cuda-version != 'cpu' && '1' || '0' }} + run: | + .jenkins/pytorch/win-build.sh + + # Upload to github so that people can click and download artifacts + - name: Upload artifacts to s3 + uses: seemethere/upload-artifact-s3@v4 + with: + retention-days: 14 + if-no-files-found: error + name: ${{ env.BUILD_ENVIRONMENT }} + path: C:\${{ github.run_id }}\build-results + + - name: Teardown Windows + uses: ./.github/actions/teardown-win + if: always() + timeout-minutes: 120 + with: + extra-delete-dir: /c/${{ github.run_id }}/build-results/ diff --git a/.github/workflows/_win-test.yml b/.github/workflows/_win-test.yml new file mode 100644 index 00000000000000..9aa3eb17648639 --- /dev/null +++ b/.github/workflows/_win-test.yml @@ -0,0 +1,132 @@ +name: win-test + +on: + workflow_call: + inputs: + build-environment: + required: true + type: string + description: Top-level label for what's being built/tested. + cuda-version: + required: true + type: string + description: What CUDA version to build with, "cpu" for none. + test-matrix: + required: true + type: string + description: JSON description of what test configs to run. + +env: + IN_CI: 1 # TODO delete in favor of GITHUB_ACTIONS + IS_GHA: 1 # TODO delete in favor of GITHUB_ACTIONS + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + +jobs: + test: + # Don't run on forked repos. 
+ if: github.repository_owner == 'pytorch' + strategy: + matrix: ${{ fromJSON(inputs.test-matrix) }} + fail-fast: false + runs-on: ${{ matrix.runner }} + timeout-minutes: 300 + steps: + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + no-sudo: true + + - name: Setup Windows + uses: ./.github/actions/setup-win + with: + cuda-version: ${{ inputs.cuda-version }} + + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Download PyTorch Build Artifacts + uses: seemethere/download-artifact-s3@v3 + with: + name: ${{ env.BUILD_ENVIRONMENT }} + path: C:\${{ github.run_id }}\build-results + + - name: Check build-results folder + shell: powershell + run: | + tree /F C:\$Env:GITHUB_RUN_ID\build-results + + - name: Test + shell: bash + env: + USE_CUDA: ${{ inputs.cuda-version != 'cpu' && '1' || '0' }} + INSTALL_WINDOWS_SDK: 1 + PYTHON_VERSION: 3.8 + PYTORCH_RETRY_TEST_CASES: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + VC_PRODUCT: "BuildTools" + VC_VERSION: "" + VS_VERSION: "16.8.6" + VC_YEAR: "2019" + AWS_DEFAULT_REGION: us-east-1 + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + CUDA_VERSION: ${{ inputs.cuda-version }} + PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + SHARD_NUMBER: ${{ matrix.shard }} + NUM_TEST_SHARDS: ${{ matrix.num_shards }} + TEST_CONFIG: ${{ matrix.config }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + PR_BODY: ${{ github.event.pull_request.body }} + TORCH_CUDA_ARCH_LIST: "7.0" + run: | + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + export COMMIT_MESSAGES + .jenkins/pytorch/win-test.sh + + - name: Get workflow job id + id: get-job-id + uses: pytorch/pytorch/.github/actions/get-workflow-job-id@master + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() + with: + file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }} + + - name: Parse ref + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Upload test statistics + if: always() + env: + AWS_DEFAULT_REGION: us-east-1 + GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} + BRANCH: ${{ steps.parse-ref.outputs.branch }} + JOB_BASE_NAME: ${{ inputs.build-environment }}-test + BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + PR_NUMBER: ${{ github.event.pull_request.number }} + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + TAG: ${{ steps.parse-ref.outputs.tag }} + WORKFLOW_ID: ${{ github.run_id }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} + shell: bash + run: | + set -x + python3 -m pip install -r requirements.txt + python3 -m pip install boto3==1.19.12 + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test + + - name: Teardown Windows + uses: ./.github/actions/teardown-win + if: always() + timeout-minutes: 120 diff --git a/.github/workflows/create_release.yml b/.github/workflows/create_release.yml index f32c3021e3a2ed..b23282536789c4 100644 
--- a/.github/workflows/create_release.yml +++ b/.github/workflows/create_release.yml @@ -20,6 +20,7 @@ jobs: - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: submodules: 'recursive' + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - name: Fake name for PRs if: ${{ github.event_name == 'pull_request' }} run: echo "PT_GITHUB_REF=refs/tags/pr-tag" >> "$GITHUB_ENV" @@ -51,5 +52,5 @@ jobs: files: ${{env.PT_RELEASE_FILE}} concurrency: - group: create-release-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} cancel-in-progress: true diff --git a/.github/workflows/docker-builds.yml b/.github/workflows/docker-builds.yml new file mode 100644 index 00000000000000..d294c63e7b3a30 --- /dev/null +++ b/.github/workflows/docker-builds.yml @@ -0,0 +1,76 @@ +name: docker-builds + +on: + workflow_dispatch: + pull_request: + paths: + - .circleci/docker/** + - .github/workflows/docker-builds.yml + schedule: + - cron: 1 3 * * 3 + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +env: + ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine + AWS_DEFAULT_REGION: us-east-1 + +jobs: + docker-build: + runs-on: [self-hosted, linux.2xlarge] + timeout-minutes: 240 + strategy: + matrix: + include: + - docker-image-name: pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 + - docker-image-name: pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7 + - docker-image-name: pytorch-linux-bionic-py3.7-clang9 + - docker-image-name: pytorch-linux-bionic-rocm4.5-py3.7 + - docker-image-name: pytorch-linux-bionic-rocm5.0-py3.7 + - docker-image-name: pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 + - docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + - docker-image-name: pytorch-linux-xenial-py3-clang5-asan + - docker-image-name: pytorch-linux-xenial-py3-clang7-asan + - docker-image-name: pytorch-linux-xenial-py3-clang7-onnx + - docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + - docker-image-name: pytorch-linux-xenial-py3.7-gcc7 + env: + DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }} + steps: + - name: Clean workspace + shell: bash + run: | + echo "${GITHUB_WORKSPACE}" + sudo rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required for git merge-base + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Build docker image + id: build-docker-image + uses: ./.github/actions/calculate-docker-image + with: + docker-image-name: ${{ matrix.docker-image-name }} + always-rebuild: true + + - name: Pull docker image + uses: ./.github/actions/pull-docker-image + with: + docker-image: ${{ steps.build-docker-image.outputs.docker-image }} + + - name: Chown workspace + uses: ./.github/actions/chown-workspace + if: always() + + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/generated-caffe2-linux-xenial-py3.7-gcc5.4.yml 
b/.github/workflows/generated-caffe2-linux-xenial-py3.7-gcc5.4.yml deleted file mode 100644 index d8b08b4ac55bea..00000000000000 --- a/.github/workflows/generated-caffe2-linux-xenial-py3.7-gcc5.4.yml +++ /dev/null @@ -1,251 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: caffe2-linux-xenial-py3.7-gcc5.4 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: caffe2-linux-xenial-py3.7-gcc5.4 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: caffe2-linux-xenial-py3.7-gcc5.4-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: caffe2-linux-xenial-py3.7-gcc5.4-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-docker-builds.yml b/.github/workflows/generated-docker-builds.yml deleted file mode 100644 index 357305f2b3b2db..00000000000000 --- a/.github/workflows/generated-docker-builds.yml +++ /dev/null @@ -1,173 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/docker_builds_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: docker-builds - -on: - workflow_dispatch: - pull_request: - types: [opened, synchronize, reopened] - paths: - - '.circleci/docker/**' - - '.github/workflows/generated-docker-builds.yml' - schedule: - - cron: 1 3 * * 3 -concurrency: - group: docker-builds-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -env: - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - AWS_DEFAULT_REGION: us-east-1 - -jobs: - - docker-build: - runs-on: linux.2xlarge - timeout-minutes: 240 - strategy: - matrix: - include: - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7' - docker_image_short_name: 'pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7' - docker_image_short_name: 'pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.7-clang9' - docker_image_short_name: 'pytorch-linux-bionic-py3.7-clang9' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-rocm4.3.1-py3.7' - docker_image_short_name: 'pytorch-linux-bionic-rocm4.3.1-py3.7' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-rocm4.5-py3.7' - docker_image_short_name: 'pytorch-linux-bionic-rocm4.5-py3.7' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7' - docker_image_short_name: 'pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7' - - docker_image_base: 
'308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7' - docker_image_short_name: 'pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c' - docker_image_short_name: 'pytorch-linux-xenial-py3-clang5-android-ndk-r19c' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan' - docker_image_short_name: 'pytorch-linux-xenial-py3-clang5-asan' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang7-asan' - docker_image_short_name: 'pytorch-linux-xenial-py3-clang7-asan' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang7-onnx' - docker_image_short_name: 'pytorch-linux-xenial-py3-clang7-onnx' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4' - docker_image_short_name: 'pytorch-linux-xenial-py3.7-gcc5.4' - - docker_image_base: '308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc7' - docker_image_short_name: 'pytorch-linux-xenial-py3.7-gcc7' - env: - DOCKER_IMAGE_BASE: '${{ matrix.docker_image_base }}' - name: docker-build (${{ matrix.docker_image_short_name }}) - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-ios-12-5-1-arm64-coreml.yml b/.github/workflows/generated-ios-12-5-1-arm64-coreml.yml deleted file mode 100644 index 7640a34c634a67..00000000000000 --- a/.github/workflows/generated-ios-12-5-1-arm64-coreml.yml +++ /dev/null @@ -1,143 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-arm64-coreml - -on: - schedule: - - cron: 45 4,10,16,22 * * * - push: - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/scheduled/*' - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: ios-12-5-1-arm64-coreml - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: OS - IOS_ARCH: arm64 - - -jobs: - - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: 240 - env: - JOB_BASE_NAME: ios-12-5-1-arm64-coreml-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash 
"${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' - exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi - -concurrency: - group: ios-12-5-1-arm64-coreml-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-ios-12-5-1-arm64-custom-ops.yml b/.github/workflows/generated-ios-12-5-1-arm64-custom-ops.yml deleted file mode 100644 index 75bc1f77252b21..00000000000000 --- a/.github/workflows/generated-ios-12-5-1-arm64-custom-ops.yml +++ /dev/null @@ -1,143 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-arm64-custom-ops - -on: - schedule: - - cron: 45 4,10,16,22 * * * - push: - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/scheduled/*' - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: ios-12-5-1-arm64-custom-ops - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: OS - IOS_ARCH: arm64 - - -jobs: - - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: 240 - env: - JOB_BASE_NAME: ios-12-5-1-arm64-custom-ops-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: 
zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' 
- exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi - -concurrency: - group: ios-12-5-1-arm64-custom-ops-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-ios-12-5-1-arm64-metal.yml b/.github/workflows/generated-ios-12-5-1-arm64-metal.yml deleted file mode 100644 index 2a9da911d79b8d..00000000000000 --- a/.github/workflows/generated-ios-12-5-1-arm64-metal.yml +++ /dev/null @@ -1,143 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-arm64-metal - -on: - schedule: - - cron: 45 4,10,16,22 * * * - push: - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/scheduled/*' - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: ios-12-5-1-arm64-metal - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: OS - IOS_ARCH: arm64 - - -jobs: - - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: 240 - env: - JOB_BASE_NAME: ios-12-5-1-arm64-metal-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p 
"${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' - exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi - -concurrency: - group: ios-12-5-1-arm64-metal-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-ios-12-5-1-arm64.yml b/.github/workflows/generated-ios-12-5-1-arm64.yml deleted file mode 100644 index 3463fc5c48ac63..00000000000000 --- a/.github/workflows/generated-ios-12-5-1-arm64.yml +++ /dev/null @@ -1,143 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-arm64 - -on: - schedule: - - cron: 45 4,10,16,22 * * * - push: - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/scheduled/*' - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: ios-12-5-1-arm64 - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: OS - IOS_ARCH: arm64 - - -jobs: - - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: 240 - env: - JOB_BASE_NAME: ios-12-5-1-arm64-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && 
github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' 
- exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi - -concurrency: - group: ios-12-5-1-arm64-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-ios-12-5-1-x86-64-coreml.yml b/.github/workflows/generated-ios-12-5-1-x86-64-coreml.yml deleted file mode 100644 index d9fdd93b79abdf..00000000000000 --- a/.github/workflows/generated-ios-12-5-1-x86-64-coreml.yml +++ /dev/null @@ -1,178 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/ios_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: ios-12-5-1-x86-64-coreml - -on: - push: - branches: - - master - - main - - release/* - tags: - - 'ciflow/all/*' - - 'ciflow/ios/*' - - 'ciflow/macos/*' - - 'ciflow/trunk/*' - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: ios-12-5-1-x86-64-coreml - IN_CI: 1 - IS_GHA: 1 - IOS_PLATFORM: SIMULATOR - IOS_ARCH: x86_64 - - -jobs: - - build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - runs-on: macos-10.15 - timeout-minutes: 240 - env: - JOB_BASE_NAME: ios-12-5-1-x86-64-coreml-build - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Populate CI build options - run: | - # Most builds use the lite interpreter, if certain builds shouldn't - # build the lite interpreter this env variable should get over-written - # in the following case statement - echo "BUILD_LITE_INTERPRETER=1" >> "${GITHUB_ENV}" - - case ${BUILD_ENVIRONMENT} in - *metal*) - echo "USE_PYTORCH_METAL=1" >> "${GITHUB_ENV}" - ;; - *full_jit*) - echo "BUILD_LITE_INTERPRETER=0" >> "${GITHUB_ENV}" - ;; - *custom*) - echo "SELECTED_OP_LIST=${GITHUB_WORKSPACE}/ios/TestApp/custom_build/mobilenetv2.yaml" >> "${GITHUB_ENV}" - ;; - *coreml*) - echo "USE_COREML_DELEGATE=1" >> "${GITHUB_ENV}" - ;; - esac - - name: Install brew dependencies - run: | - # Install dependencies - brew install libtool - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p 
"${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions - - name: Run Fastlane - run: | - set -x - cd ios/TestApp - # install fastlane - sudo gem install bundler && bundle install - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - export TCLLIBPATH="/usr/local/lib" - python -VV - export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} - scripts/build_ios.sh - - name: Run Build Test - run: | - PROFILE=PyTorch_CI_2022 - # run the ruby build script - if ! [ -x "$(command -v xcodebuild)" ]; then - echo 'Error: xcodebuild is not installed.' - exit 1 - fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi - - name: Run Simulator Tests - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html - # generate models for differnet backends - cd "${GITHUB_WORKSPACE}/ios/TestApp/benchmark" - mkdir -p ../models - if [ "${USE_COREML_DELEGATE}" == 1 ]; then - pip install coremltools==5.0b5 - pip install six==1.16.0 - python coreml_backend.py - else - python trace_model.py - fi - if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then - echo "Setting up the TestApp for LiteInterpreter" - ruby setup.rb --lite 1 - else - echo "Setting up the TestApp for Full JIT" - ruby setup.rb - fi - cd "${GITHUB_WORKSPACE}/ios/TestApp" - instruments -s -devices - if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then - if [ "${USE_COREML_DELEGATE}" == 1 ]; then - fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML - else - fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter - fi - else - fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT - fi - -concurrency: - group: ios-12-5-1-x86-64-coreml-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-libtorch-linux-xenial-cuda10.2-py3.7-gcc7.yml b/.github/workflows/generated-libtorch-linux-xenial-cuda10.2-py3.7-gcc7.yml deleted file mode 100644 index 5889466d9b0824..00000000000000 --- a/.github/workflows/generated-libtorch-linux-xenial-cuda10.2-py3.7-gcc7.yml +++ /dev/null @@ -1,241 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: libtorch-linux-xenial-cuda10.2-py3.7-gcc7 - -on: 
- push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/libtorch/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: libtorch-linux-xenial-cuda10.2-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: libtorch-linux-xenial-cuda10.2-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: libtorch-linux-xenial-cuda10.2-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
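The ECR login and workspace-chown steps in this build job lean on a small inline retry helper rather than a marketplace action. For reference, a minimal sketch of that pattern as a standalone step (step name is a placeholder; the helper and commands are taken from the workflow above):

```yaml
    steps:
      - name: Pull helper image with retries (sketch)
        run: |
          # Retry a command up to three times with a short backoff,
          # mirroring the inline retry() helper used throughout these jobs.
          retry () {
            "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
          }
          retry docker pull "${ALPINE_IMAGE}"
          # Chown the workspace back to the current user so later checkout
          # and cleanup steps are not blocked by root-owned files.
          docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" \
            chown -R "$(id -u):$(id -g)" .
```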
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
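The "Calculate docker image tag" and "Check if image should be built" steps above key the CI image tag to the git tree hash of `.circleci/docker`, so the image is only rebuilt when something under that directory changes. Condensed into a single illustrative step (variable names follow the workflow; error handling from the full step is omitted):

```yaml
      - name: Decide whether the CI Docker image needs a rebuild (sketch)
        run: |
          # The image tag is the tree hash of .circleci/docker, so any change
          # under that directory produces a new tag.
          DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
          DOCKER_IMAGE="${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"
          # If a manifest for that tag already exists in the registry, skip
          # the rebuild entirely.
          if docker manifest inspect "${DOCKER_IMAGE}"; then
            echo "Image ${DOCKER_IMAGE} already exists, skipping build"
            exit 0
          fi
          echo "Image ${DOCKER_IMAGE} is missing, a rebuild is required"
```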
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-libtorch-linux-xenial-cuda11.3-py3.7-gcc7.yml b/.github/workflows/generated-libtorch-linux-xenial-cuda11.3-py3.7-gcc7.yml deleted file mode 100644 index 7c9e9f19ff3fda..00000000000000 --- a/.github/workflows/generated-libtorch-linux-xenial-cuda11.3-py3.7-gcc7.yml +++ /dev/null @@ -1,241 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: libtorch-linux-xenial-cuda11.3-py3.7-gcc7 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/libtorch/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: libtorch-linux-xenial-cuda11.3-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: libtorch-linux-xenial-cuda11.3-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: libtorch-linux-xenial-cuda11.3-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata 
instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
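The Build step starts a long-lived, detached container with the workspace bind-mounted and then runs the actual build through `docker exec`; the teardown steps later stop whatever containers remain. A trimmed sketch of that shape, keeping only a few of the `-e` flags from the full step:

```yaml
      - name: Build inside a detached container (sketch)
        run: |
          # Start the build container detached so follow-up steps (and the
          # always-run teardown steps) can still reach it by id.
          container_name=$(docker run \
            -e BUILD_ENVIRONMENT \
            -e MAX_JOBS="$(nproc --ignore=2)" \
            --detach --tty --user jenkins \
            -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
            -w /var/lib/jenkins/workspace \
            "${DOCKER_IMAGE}")
          # Run the build as the jenkins user inside that container.
          docker exec -t "${container_name}" \
            sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh'
```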
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-binary-conda.yml b/.github/workflows/generated-linux-binary-conda-nightly.yml similarity index 77% rename from .github/workflows/generated-linux-binary-conda.yml rename to .github/workflows/generated-linux-binary-conda-nightly.yml index f1ff75db90d386..63861bbe87c13a 100644 --- a/.github/workflows/generated-linux-binary-conda.yml +++ b/.github/workflows/generated-linux-binary-conda-nightly.yml @@ -54,30 +54,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -94,9 +74,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -161,7 +138,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
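These jobs pass the runner's `GITHUB_*` variables into the build container by dumping them to a file and handing that file to `docker run`; the conda workflow diff that follows replaces the inline version of this (together with the EC2/ECR preamble) with shared composite actions. A minimal sketch of the env-file hand-off, assuming the same `/tmp` path convention used above:

```yaml
      - name: Preserve github env variables for use in docker (sketch)
        run: |
          # Capture every GITHUB_* variable the runner exposes.
          env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
      - name: Use those variables inside the container (sketch)
        run: |
          # --env-file re-injects the captured variables into the container.
          docker run --rm \
            --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
            "${DOCKER_IMAGE}" env | grep '^GITHUB'
```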
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: conda-py3_7-cpu retention-days: 14 @@ -201,30 +178,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -241,10 +198,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cpu @@ -343,30 +297,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -383,12 +317,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cpu @@ -459,30 +390,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function 
get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -499,9 +410,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -566,7 +474,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: conda-py3_7-cuda10_2 retention-days: 14 @@ -607,30 +515,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -647,10 +535,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda10_2 @@ -761,30 +646,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -801,12 +666,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda10_2 @@ -877,30 +739,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -917,9 +759,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -987,7 +826,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
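The build, test, and upload jobs in this nightly workflow hand the built package off through S3-backed artifacts: the build job uploads under a per-configuration name and the downstream jobs download that same name. A hedged sketch of the pairing, using only the fields visible in the diff (any `with:` options not shown here are omitted rather than guessed):

```yaml
      # In the *-build job:
      - uses: seemethere/upload-artifact-s3@v4
        with:
          name: conda-py3_7-cuda10_2
          retention-days: 14

      # In the matching *-test / *-upload job:
      - uses: seemethere/download-artifact-s3@v3
        name: Download Build Artifacts
        with:
          name: conda-py3_7-cuda10_2
```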
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: conda-py3_7-cuda11_3 retention-days: 14 @@ -1028,30 +867,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1068,10 +887,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda11_3 @@ -1182,30 +998,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1222,12 +1018,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda11_3 @@ -1298,30 +1091,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail 
- function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1338,9 +1111,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1408,7 +1178,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: conda-py3_7-cuda11_5 retention-days: 14 @@ -1449,30 +1219,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1489,10 +1239,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda11_5 @@ -1603,30 +1350,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1643,12 +1370,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: conda-py3_7-cuda11_5 @@ -1704,7 +1428,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cpu-build: + conda-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -1712,36 +1436,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1758,9 +1463,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1784,6 
+1486,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -1825,9 +1530,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_8-cpu + name: conda-py3_7-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -1850,45 +1555,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cpu-test: # Testing + conda-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cpu-build - runs-on: linux.4xlarge + needs: conda-py3_7-cuda11_6-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1905,13 +1591,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cpu + name: conda-py3_7-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1936,6 +1619,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> 
"${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -1993,44 +1687,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cpu-upload: # Uploading + conda-py3_7-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cpu-test + needs: conda-py3_7-cuda11_6-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2047,15 +1722,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cpu + name: conda-py3_7-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2108,7 +1780,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda10_2-build: + conda-py3_8-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2116,37 +1788,16 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint 
for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2163,9 +1814,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2230,9 +1878,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_8-cuda10_2 + name: conda-py3_8-cpu retention-days: 14 if-no-files-found: error path: @@ -2255,46 +1903,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda10_2-test: # Testing + conda-py3_8-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda10_2-build - runs-on: linux.4xlarge.nvidia.gpu + needs: conda-py3_8-cpu-build + runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ 
-2311,13 +1938,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda10_2 + name: conda-py3_8-cpu path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2342,17 +1966,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - pushd pytorch - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - popd - name: Pull Docker image run: | retry () { @@ -2410,45 +2023,24 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda10_2-upload: # Uploading + conda-py3_8-cpu-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda10_2-test + needs: conda-py3_8-cpu-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2465,15 +2057,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda10_2 + name: conda-py3_8-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ 
github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2526,7 +2115,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_3-build: + conda-py3_8-cuda10_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2534,37 +2123,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2581,9 +2150,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2607,9 +2173,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -2651,9 +2214,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_8-cuda11_3 + name: conda-py3_8-cuda10_2 retention-days: 14 if-no-files-found: error path: @@ -2676,46 +2239,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_3-test: # Testing + conda-py3_8-cuda10_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-build + needs: conda-py3_8-cuda10_2-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2732,13 +2275,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_3 + name: conda-py3_8-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2831,45 +2371,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_3-upload: # Uploading + conda-py3_8-cuda10_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-test + needs: conda-py3_8-cuda10_2-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from 
instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2886,15 +2406,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_3 + name: conda-py3_8-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2947,7 +2464,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_5-build: + conda-py3_8-cuda11_3-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2955,37 +2472,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3002,9 +2499,6 
@@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3072,9 +2566,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_8-cuda11_5 + name: conda-py3_8-cuda11_3 retention-days: 14 if-no-files-found: error path: @@ -3097,46 +2591,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_5-test: # Testing + conda-py3_8-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_5-build + needs: conda-py3_8-cuda11_3-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3153,13 +2627,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_5 + name: conda-py3_8-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3252,45 +2723,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_8-cuda11_5-upload: # Uploading + conda-py3_8-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_5-test + needs: 
conda-py3_8-cuda11_3-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3307,15 +2758,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_5 + name: conda-py3_8-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3368,7 +2816,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cpu-build: + conda-py3_8-cuda11_5-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -3376,36 +2824,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - 
AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3422,9 +2851,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3448,6 +2874,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -3489,9 +2918,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_9-cpu + name: conda-py3_8-cuda11_5 retention-days: 14 if-no-files-found: error path: @@ -3514,45 +2943,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cpu-test: # Testing + conda-py3_8-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cpu-build - runs-on: linux.4xlarge + needs: conda-py3_8-cuda11_5-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3569,13 +2979,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: 
Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cpu + name: conda-py3_8-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3600,6 +3007,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -3657,44 +3075,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cpu-upload: # Uploading + conda-py3_8-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cpu-test + needs: conda-py3_8-cuda11_5-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3711,15 +3110,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cpu + name: conda-py3_8-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || 
(startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3772,7 +3168,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda10_2-build: + conda-py3_8-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -3780,37 +3176,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3827,9 +3203,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3853,6 +3226,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -3894,9 +3270,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_9-cuda10_2 + name: conda-py3_8-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -3919,46 +3295,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda10_2-test: # Testing + conda-py3_8-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda10_2-build + needs: conda-py3_8-cuda11_6-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3975,13 +3331,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda10_2 + name: conda-py3_8-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4074,45 +3427,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda10_2-upload: # Uploading + conda-py3_8-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda10_2-test + needs: conda-py3_8-cuda11_6-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo 
pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4129,15 +3462,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda10_2 + name: conda-py3_8-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -4190,7 +3520,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_3-build: + conda-py3_9-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -4198,37 +3528,16 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown 
workspace run: | retry () { @@ -4245,9 +3554,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4271,9 +3577,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -4315,9 +3618,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_9-cuda11_3 + name: conda-py3_9-cpu retention-days: 14 if-no-files-found: error path: @@ -4340,7 +3643,695 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_3-test: # Testing + conda-py3_9-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cpu-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cpu-test + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda10_2-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/conda/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+      - uses: seemethere/upload-artifact-s3@v4
+        with:
+          name: conda-py3_9-cuda10_2
+          retention-days: 14
+          if-no-files-found: error
+          path:
+            ${{ runner.temp }}/artifacts/*
+      - name: Hold runner for 2 hours or until ssh sessions have drained
+        working-directory: pytorch/
+        # Always hold for active ssh sessions
+        if: always()
+        run: .github/scripts/wait_for_ssh_to_drain.sh
+      - name: Chown workspace
+        if: always()
+        run: |
+          # Ensure the working directory gets chowned back to the current user
+          docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
+      - name: Kill containers, clean up images
+        if: always()
+        run: |
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
+          # Prune all of the docker images
+          docker system prune -af
+  conda-py3_9-cuda10_2-test: # Testing
+    if: ${{ github.repository_owner == 'pytorch' }}
+    needs: conda-py3_9-cuda10_2-build
+    runs-on: linux.4xlarge.nvidia.gpu
+    timeout-minutes: 240
+    env:
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      # favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cu102
+      GPU_ARCH_VERSION: 10.2
+      GPU_ARCH_TYPE: cuda
+      DOCKER_IMAGE: pytorch/conda-builder:cuda10.2
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.9"
+    steps:
+      - name: Checkout PyTorch
+        uses: pytorch/pytorch/.github/actions/checkout-pytorch@master
+      - name: Setup Linux
+        uses: ./.github/actions/setup-linux
+      - name: Chown workspace
+        run: |
+          retry () {
+              "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
+          }
+          retry docker pull "${ALPINE_IMAGE}"
+          # Ensure the working directory gets chowned back to the current user
+          docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda10_2 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda10_2-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda10_2-test + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda10_2 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/conda/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: conda-py3_9-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} needs: conda-py3_9-cuda11_3-build runs-on: linux.4xlarge.nvidia.gpu @@ -4349,37 +4340,721 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_3-test + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/conda/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: conda-py3_9-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_5-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_5-test + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e 
LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/conda/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: conda-py3_9-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_6-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4396,13 +5071,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda11_3 + name: conda-py3_9-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4495,45 +5167,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_3-upload: # Uploading + conda-py3_9-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_3-test + needs: conda-py3_9-cuda11_6-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: 
pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4550,15 +5202,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda11_3 + name: conda-py3_9-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -4611,7 +5260,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_5-build: + conda-py3_10-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -4619,37 +5268,16 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin 
"$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4666,9 +5294,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4692,9 +5317,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -4736,9 +5358,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_9-cuda11_5 + name: conda-py3_10-cpu retention-days: 14 if-no-files-found: error path: @@ -4761,46 +5383,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_5-test: # Testing + conda-py3_10-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_5-build - runs-on: linux.4xlarge.nvidia.gpu + needs: conda-py3_10-cpu-build + runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4817,13 +5418,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - 
name: conda-py3_9-cuda11_5 + name: conda-py3_10-cpu path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4848,17 +5446,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - pushd pytorch - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - popd - name: Pull Docker image run: | retry () { @@ -4916,45 +5503,24 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_9-cuda11_5-upload: # Uploading + conda-py3_10-cpu-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_5-test + needs: conda-py3_10-cpu-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/conda-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4971,15 +5537,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda11_5 + name: conda-py3_10-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -5032,7 +5595,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cpu-build: + conda-py3_10-cuda10_2-build: if: ${{ 
github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5040,36 +5603,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5086,9 +5630,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5153,9 +5694,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_10-cpu + name: conda-py3_10-cuda10_2 retention-days: 14 if-no-files-found: error path: @@ -5178,45 +5719,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cpu-test: # Testing + conda-py3_10-cuda10_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cpu-build - runs-on: linux.4xlarge + needs: conda-py3_10-cuda10_2-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5233,13 +5755,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cpu + name: conda-py3_10-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -5264,6 +5783,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -5321,44 +5851,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cpu-upload: # Uploading + conda-py3_10-cuda10_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cpu-test + needs: conda-py3_10-cuda10_2-test env: 
PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/conda-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5375,15 +5886,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cpu + name: conda-py3_10-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -5436,7 +5944,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda10_2-build: + conda-py3_10-cuda11_3-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5444,37 +5952,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts 
get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5491,9 +5979,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5517,6 +6002,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -5558,9 +6046,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_10-cuda10_2 + name: conda-py3_10-cuda11_3 retention-days: 14 if-no-files-found: error path: @@ -5583,46 +6071,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda10_2-test: # Testing + conda-py3_10-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda10_2-build + needs: conda-py3_10-cuda11_3-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5639,13 +6107,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > 
"/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda10_2 + name: conda-py3_10-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -5738,45 +6203,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda10_2-upload: # Uploading + conda-py3_10-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda10_2-test + needs: conda-py3_10-cuda11_3-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5793,15 +6238,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda10_2 + name: conda-py3_10-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -5854,7 +6296,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_3-build: + conda-py3_10-cuda11_5-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5862,37 +6304,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: 
cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5909,9 +6331,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5979,9 +6398,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_10-cuda11_3 + name: conda-py3_10-cuda11_5 retention-days: 14 if-no-files-found: error path: @@ -6004,46 +6423,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_3-test: # Testing + conda-py3_10-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-build + needs: conda-py3_10-cuda11_5-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6060,13 +6459,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda11_3 + name: conda-py3_10-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -6159,45 +6555,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_3-upload: # Uploading + conda-py3_10-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-test + needs: conda-py3_10-cuda11_5-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - 
# Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6214,15 +6590,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda11_3 + name: conda-py3_10-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -6275,7 +6648,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_5-build: + conda-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -6283,37 +6656,17 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { 
@@ -6330,9 +6683,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6400,9 +6750,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: conda-py3_10-cuda11_5 + name: conda-py3_10-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -6425,46 +6775,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_5-test: # Testing + conda-py3_10-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_5-build + needs: conda-py3_10-cuda11_6-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6481,13 +6811,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda11_5 + name: conda-py3_10-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -6580,45 +6907,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - conda-py3_10-cuda11_5-upload: # Uploading + conda-py3_10-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: 
conda-py3_10-cuda11_5-test + needs: conda-py3_10-cuda11_6-test env: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.5 + DOCKER_IMAGE: pytorch/conda-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6635,15 +6942,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: conda-py3_10-cuda11_5 + name: conda-py3_10-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} diff --git a/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-master.yml b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-master.yml new file mode 100644 index 00000000000000..3fa24203231b66 --- /dev/null +++ b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-master.yml @@ -0,0 +1,283 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/linux_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: linux-binary-libtorch-cxx11-abi + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BINARY_ENV_FILE: /tmp/env + BUILD_ENVIRONMENT: linux-binary-libtorch-cxx11-abi + BUILDER_ROOT: /builder + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_FINAL_PACKAGE_DIR: /artifacts + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_ROOT: /pytorch + SHA1: ${{ 
github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: linux-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + libtorch-cpu-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cpu-shared-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cpu-shared-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cpu-shared-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cpu-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-linux-binary-libtorch-cxx11-abi.yml b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml similarity index 56% rename from .github/workflows/generated-linux-binary-libtorch-cxx11-abi.yml rename to .github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml index 5505e4a86971e5..46a8370c1c57e1 100644 --- a/.github/workflows/generated-linux-binary-libtorch-cxx11-abi.yml +++ b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml @@ -55,30 +55,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -95,9 +75,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -162,7 +139,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-shared-with-deps-cxx11-abi retention-days: 14 @@ -203,30 +180,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -243,10 +200,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-cxx11-abi @@ -346,30 +300,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -386,12 +320,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-cxx11-abi @@ -462,30 +393,10 @@ jobs: LIBTORCH_VARIANT: 
shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -502,9 +413,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -569,7 +477,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-shared-without-deps-cxx11-abi retention-days: 14 @@ -610,30 +518,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -650,10 +538,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-cxx11-abi @@ -753,30 +638,10 @@ jobs: LIBTORCH_VARIANT: 
shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -793,12 +658,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-cxx11-abi @@ -869,30 +731,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -909,9 +751,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -976,7 +815,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-static-with-deps-cxx11-abi retention-days: 14 @@ -1017,30 +856,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1057,10 +876,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-cxx11-abi @@ -1160,30 +976,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1200,12 +996,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-cxx11-abi @@ -1276,30 +1069,10 @@ jobs: LIBTORCH_VARIANT: 
static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1316,9 +1089,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1383,7 +1153,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-static-without-deps-cxx11-abi retention-days: 14 @@ -1424,30 +1194,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1464,10 +1214,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-cxx11-abi @@ -1567,30 +1314,10 @@ jobs: 
LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1607,12 +1334,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-cxx11-abi @@ -1684,30 +1408,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1724,9 +1428,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1791,7 +1492,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-shared-with-deps-cxx11-abi retention-days: 14 @@ -1833,30 +1534,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1873,10 +1554,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-with-deps-cxx11-abi @@ -1988,30 +1666,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2028,12 +1686,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-with-deps-cxx11-abi @@ -2105,30 +1760,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2145,9 +1780,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2212,7 +1844,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-shared-without-deps-cxx11-abi retention-days: 14 @@ -2254,30 +1886,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2294,10 +1906,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-without-deps-cxx11-abi @@ 
-2409,30 +2018,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2449,12 +2038,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-without-deps-cxx11-abi @@ -2526,30 +2112,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2566,9 +2132,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2633,7 +2196,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-static-with-deps-cxx11-abi retention-days: 14 @@ -2675,30 +2238,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2715,10 +2258,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-with-deps-cxx11-abi @@ -2830,30 +2370,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2870,12 +2390,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-with-deps-cxx11-abi @@ -2947,30 +2464,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2987,9 +2484,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3054,7 +2548,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-static-without-deps-cxx11-abi retention-days: 14 @@ -3096,30 +2590,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3136,10 +2610,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-without-deps-cxx11-abi @@ 
-3251,30 +2722,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3291,12 +2742,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-without-deps-cxx11-abi @@ -3368,30 +2816,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3408,9 +2836,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3478,7 +2903,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-shared-with-deps-cxx11-abi retention-days: 14 @@ -3520,30 +2945,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3560,10 +2965,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-cxx11-abi @@ -3675,30 +3077,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3715,12 +3097,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-cxx11-abi @@ -3792,30 +3171,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3832,9 +3191,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3902,7 +3258,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-shared-without-deps-cxx11-abi retention-days: 14 @@ -3944,30 +3300,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3984,10 +3320,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-cxx11-abi @@ 
-4099,30 +3432,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4139,12 +3452,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-cxx11-abi @@ -4216,30 +3526,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4256,9 +3546,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4326,7 +3613,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-static-with-deps-cxx11-abi retention-days: 14 @@ -4368,30 +3655,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4408,10 +3675,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-cxx11-abi @@ -4523,30 +3787,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4563,12 +3807,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-cxx11-abi @@ -4640,30 +3881,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4680,9 +3901,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4750,7 +3968,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-static-without-deps-cxx11-abi retention-days: 14 @@ -4792,30 +4010,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4832,10 +4030,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-cxx11-abi @@ 
-4947,30 +4142,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4987,12 +4162,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-cxx11-abi @@ -5064,31 +4236,11 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") @@ -5104,9 +4256,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5174,7 +4323,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-shared-with-deps-cxx11-abi retention-days: 14 @@ -5216,30 +4365,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5256,10 +4385,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-cxx11-abi @@ -5371,30 +4497,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5411,12 +4517,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-cxx11-abi @@ -5488,30 +4591,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5528,9 +4611,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5598,7 +4678,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-shared-without-deps-cxx11-abi retention-days: 14 @@ -5640,30 +4720,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5680,10 +4740,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-cxx11-abi @@ 
-5795,30 +4852,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5835,12 +4872,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-cxx11-abi @@ -5912,30 +4946,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5952,9 +4966,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6022,7 +5033,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-static-with-deps-cxx11-abi retention-days: 14 @@ -6064,30 +5075,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6104,10 +5095,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-cxx11-abi @@ -6219,30 +5207,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6259,12 +5227,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-cxx11-abi @@ -6336,30 +5301,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6376,9 +5321,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6446,7 +5388,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-static-without-deps-cxx11-abi retention-days: 14 @@ -6488,30 +5430,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6528,10 +5450,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-cxx11-abi @@ 
-6643,30 +5562,104 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_5-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | 
+ # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6683,15 +5676,4066 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Clone pytorch/pytorch - uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download Build Artifacts + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: - name: libtorch-cuda11_5-static-without-deps-cxx11-abi + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v 
"${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-shared-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-cxx11-abi-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to 
the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run 
--rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-shared-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-cxx11-abi-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + 
docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-static-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-cxx11-abi-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to 
the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run 
--rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-static-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-cxx11-abi-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R 
"$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-shared-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" 
chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-shared-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-without-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory 
gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown 
-R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-static-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" 
chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-static-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-without-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory 
gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R 
"$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-shared-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to 
the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" 
. + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-shared-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-without-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id 
-u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-static-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-with-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to 
the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" 
. + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-static-without-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-without-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-without-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-cxx11-abi-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-without-deps-cxx11-abi-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-without-deps-cxx11-abi path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} diff --git a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml new file mode 100644 index 00000000000000..922dbc27b7f250 --- /dev/null +++ b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml @@ -0,0 +1,283 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/linux_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: linux-binary-libtorch-pre-cxx11 + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BINARY_ENV_FILE: /tmp/env + BUILD_ENVIRONMENT: linux-binary-libtorch-pre-cxx11 + BUILDER_ROOT: /builder + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_FINAL_PACKAGE_DIR: /artifacts + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_ROOT: /pytorch + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: linux-binary-libtorch-pre-cxx11-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + 
libtorch-cpu-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cpu-shared-with-deps-cxx11-abi + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cpu-shared-with-deps-cxx11-abi-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cpu-shared-with-deps-cxx11-abi-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: cxx11-abi + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cpu-shared-with-deps-cxx11-abi + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11.yml b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml similarity index 56% rename from .github/workflows/generated-linux-binary-libtorch-pre-cxx11.yml rename to .github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml index 0354e9061c546b..b34a3f3b322862 100644 --- a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11.yml +++ b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml @@ -55,30 +55,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -95,9 +75,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -162,7 +139,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-shared-with-deps-pre-cxx11 retention-days: 14 @@ -203,30 +180,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -243,10 +200,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-pre-cxx11 @@ -346,30 +300,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -386,12 +320,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-pre-cxx11 @@ -462,30 +393,10 @@ jobs: LIBTORCH_VARIANT: 
shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -502,9 +413,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -569,7 +477,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-shared-without-deps-pre-cxx11 retention-days: 14 @@ -610,30 +518,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -650,10 +538,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-pre-cxx11 @@ -753,30 +638,10 @@ jobs: LIBTORCH_VARIANT: 
shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -793,12 +658,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-pre-cxx11 @@ -869,30 +731,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -909,9 +751,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -976,7 +815,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-static-with-deps-pre-cxx11 retention-days: 14 @@ -1017,30 +856,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1057,10 +876,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-pre-cxx11 @@ -1160,30 +976,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1200,12 +996,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-pre-cxx11 @@ -1276,30 +1069,10 @@ jobs: LIBTORCH_VARIANT: 
static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1316,9 +1089,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1383,7 +1153,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cpu-static-without-deps-pre-cxx11 retention-days: 14 @@ -1424,30 +1194,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1464,10 +1214,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-pre-cxx11 @@ -1567,30 +1314,10 @@ jobs: 
LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1607,12 +1334,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-pre-cxx11 @@ -1684,30 +1408,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1724,9 +1428,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1791,7 +1492,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 retention-days: 14 @@ -1833,30 +1534,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1873,10 +1554,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 @@ -1988,30 +1666,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2028,12 +1686,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 @@ -2105,30 +1760,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2145,9 +1780,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2212,7 +1844,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 retention-days: 14 @@ -2254,30 +1886,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2294,10 +1906,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 @@ 
-2409,30 +2018,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2449,12 +2038,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 @@ -2526,30 +2112,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2566,9 +2132,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2633,7 +2196,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-static-with-deps-pre-cxx11 retention-days: 14 @@ -2675,30 +2238,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2715,10 +2258,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-with-deps-pre-cxx11 @@ -2830,30 +2370,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2870,12 +2390,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-with-deps-pre-cxx11 @@ -2947,30 +2464,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2987,9 +2484,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3054,7 +2548,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda10_2-static-without-deps-pre-cxx11 retention-days: 14 @@ -3096,30 +2590,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3136,10 +2610,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-without-deps-pre-cxx11 @@ 
-3251,30 +2722,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3291,12 +2742,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda10_2-static-without-deps-pre-cxx11 @@ -3368,30 +2816,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3408,9 +2836,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3478,7 +2903,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 retention-days: 14 @@ -3520,30 +2945,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3560,10 +2965,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 @@ -3675,30 +3077,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3715,12 +3097,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 @@ -3792,30 +3171,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3832,9 +3191,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3902,7 +3258,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 retention-days: 14 @@ -3944,30 +3300,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3984,10 +3320,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 @@ 
-4099,30 +3432,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4139,12 +3452,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 @@ -4216,30 +3526,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4256,9 +3546,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4326,7 +3613,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-static-with-deps-pre-cxx11 retention-days: 14 @@ -4368,30 +3655,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4408,10 +3675,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-pre-cxx11 @@ -4523,30 +3787,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4563,12 +3807,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-pre-cxx11 @@ -4640,30 +3881,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4680,9 +3901,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4750,7 +3968,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_3-static-without-deps-pre-cxx11 retention-days: 14 @@ -4792,30 +4010,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4832,10 +4030,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-pre-cxx11 @@ 
-4947,30 +4142,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4987,12 +4162,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-pre-cxx11 @@ -5064,31 +4236,11 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") @@ -5104,9 +4256,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5174,7 +4323,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-shared-with-deps-pre-cxx11 retention-days: 14 @@ -5216,30 +4365,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5256,10 +4385,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-pre-cxx11 @@ -5371,30 +4497,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5411,12 +4517,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-pre-cxx11 @@ -5488,30 +4591,10 @@ 
jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5528,9 +4611,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5598,7 +4678,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-shared-without-deps-pre-cxx11 retention-days: 14 @@ -5640,30 +4720,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5680,10 +4740,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-pre-cxx11 @@ 
-5795,30 +4852,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5835,12 +4872,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-pre-cxx11 @@ -5912,30 +4946,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5952,9 +4966,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6022,7 +5033,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-static-with-deps-pre-cxx11 retention-days: 14 @@ -6064,30 +5075,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6104,10 +5095,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-pre-cxx11 @@ -6219,30 +5207,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6259,12 +5227,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-pre-cxx11 @@ -6336,30 +5301,10 @@ 
jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6376,9 +5321,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6446,7 +5388,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: libtorch-cuda11_5-static-without-deps-pre-cxx11 retention-days: 14 @@ -6488,30 +5430,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6528,10 +5450,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-pre-cxx11 @@ 
-6643,30 +5562,104 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_5-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | 
+ # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6683,15 +5676,4066 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Clone pytorch/pytorch - uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download Build Artifacts + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: - name: libtorch-cuda11_5-static-without-deps-pre-cxx11 + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v 
"${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-shared-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-pre-cxx11-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the 
current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v 
"${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-shared-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-pre-cxx11-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned 
back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm 
-v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-static-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-pre-cxx11-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the 
current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v 
"${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-cuda11_6-static-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-pre-cxx11-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned 
back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id 
-g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-shared-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-with-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned 
back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id 
-u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-shared-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-without-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-shared-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-shared-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id 
-u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-static-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-with-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned 
back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id 
-u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm4_5_2-static-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-without-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm4_5_2-static-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm4_5_2-static-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm4_5_2-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets 
chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id 
-g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-shared-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-with-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the 
current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-shared-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-without-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-shared-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-shared-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: shared-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-shared-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned 
back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-static-with-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-with-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-with-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-with-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-with-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-with-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the 
current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/libtorch/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: libtorch-rocm5_0-static-without-deps-pre-cxx11 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-pre-cxx11-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-without-deps-pre-cxx11-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-without-deps-pre-cxx11 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-rocm5_0-static-without-deps-pre-cxx11-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-rocm5_0-static-without-deps-pre-cxx11-test + env: + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + LIBTORCH_VARIANT: static-without-deps + DESIRED_DEVTOOLSET: pre-cxx11 + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-rocm5_0-static-without-deps-pre-cxx11 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} diff --git a/.github/workflows/generated-linux-binary-manywheel-master.yml b/.github/workflows/generated-linux-binary-manywheel-master.yml new file mode 100644 index 00000000000000..d384b3e79bd0d1 --- /dev/null +++ b/.github/workflows/generated-linux-binary-manywheel-master.yml @@ -0,0 +1,294 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/linux_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: linux-binary-manywheel + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BINARY_ENV_FILE: /tmp/env + BUILD_ENVIRONMENT: linux-binary-manywheel + BUILDER_ROOT: /builder + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_FINAL_PACKAGE_DIR: /artifacts + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_ROOT: /pytorch + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: linux-binary-manywheel-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + manywheel-py3_7-cuda10_2-build: + if: ${{ github.repository_owner == 
'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/manywheel/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: manywheel-py3_7-cuda10_2 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_7-cuda10_2-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_7-cuda10_2-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_7-cuda10_2 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-linux-binary-manywheel.yml b/.github/workflows/generated-linux-binary-manywheel-nightly.yml similarity index 78% rename from .github/workflows/generated-linux-binary-manywheel.yml rename to .github/workflows/generated-linux-binary-manywheel-nightly.yml index c35b6389328010..c8a7c1d73efff7 100644 --- a/.github/workflows/generated-linux-binary-manywheel.yml +++ b/.github/workflows/generated-linux-binary-manywheel-nightly.yml @@ -54,30 +54,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -94,9 +74,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -161,7 +138,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_7-cpu retention-days: 14 @@ -201,30 +178,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -241,10 +198,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cpu @@ -343,30 +297,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -383,12 +317,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cpu @@ -459,30 +390,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - 
function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -499,9 +410,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -566,7 +474,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_7-cuda10_2 retention-days: 14 @@ -607,30 +515,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -647,10 +535,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda10_2 @@ -761,30 +646,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -801,12 +666,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda10_2 @@ -877,30 +739,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -917,9 +759,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -987,7 +826,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_7-cuda11_3 retention-days: 14 @@ -1028,30 +867,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1068,10 +887,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda11_3 @@ -1182,30 +998,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1222,12 +1018,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda11_3 @@ -1298,30 +1091,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set 
-euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1338,9 +1111,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1408,7 +1178,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_7-cuda11_5 retention-days: 14 @@ -1449,30 +1219,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1489,10 +1239,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda11_5 @@ -1603,30 +1350,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata 
endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1643,12 +1370,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_7-cuda11_5 @@ -1704,7 +1428,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm4_5_2-build: + manywheel-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -1712,37 +1436,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1759,9 +1463,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: 
zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -1785,6 +1486,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -1826,9 +1530,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_7-rocm4_5_2 + name: manywheel-py3_7-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -1851,46 +1555,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm4_5_2-test: # Testing + manywheel-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm4_5_2-build - runs-on: linux.4xlarge + needs: manywheel-py3_7-cuda11_6-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1907,13 +1591,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm4_5_2 + name: manywheel-py3_7-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1938,6 +1619,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + 
timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -1995,45 +1687,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm4_5_2-upload: # Uploading + manywheel-py3_7-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm4_5_2-test + needs: manywheel-py3_7-cuda11_6-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2050,15 +1722,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm4_5_2 + name: manywheel-py3_7-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2111,7 +1780,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm5_0-build: + manywheel-py3_7-rocm4_5_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2119,37 +1788,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: 
pytorch/manylinux-builder:rocm5.0 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2166,9 +1815,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2233,9 +1879,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_7-rocm5_0 + name: manywheel-py3_7-rocm4_5_2 retention-days: 14 if-no-files-found: error path: @@ -2258,46 +1904,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm5_0-test: # Testing + manywheel-py3_7-rocm4_5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_0-build + needs: manywheel-py3_7-rocm4_5_2-build runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" 
| docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2314,13 +1940,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm5_0 + name: manywheel-py3_7-rocm4_5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2402,45 +2025,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm5_0-upload: # Uploading + manywheel-py3_7-rocm4_5_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_0-test + needs: manywheel-py3_7-rocm4_5_2-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2457,15 +2060,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm5_0 + name: manywheel-py3_7-rocm4_5_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 
'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2518,7 +2118,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cpu-build: + manywheel-py3_7-rocm5_0-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2526,36 +2126,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2572,9 +2153,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -2639,9 +2217,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-cpu + name: manywheel-py3_7-rocm5_0 retention-days: 14 if-no-files-found: error path: @@ -2664,45 +2242,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cpu-test: # Testing + manywheel-py3_7-rocm5_0-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cpu-build + needs: manywheel-py3_7-rocm5_0-build runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2719,13 +2278,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cpu + name: manywheel-py3_7-rocm5_0 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2807,44 +2363,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cpu-upload: # Uploading + manywheel-py3_7-rocm5_0-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cpu-test + needs: manywheel-py3_7-rocm5_0-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set 
-euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2861,15 +2398,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cpu + name: manywheel-py3_7-rocm5_0 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2922,7 +2456,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda10_2-build: + manywheel-py3_8-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -2930,37 +2464,16 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: 
./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2977,9 +2490,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3044,9 +2554,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-cuda10_2 + name: manywheel-py3_8-cpu retention-days: 14 if-no-files-found: error path: @@ -3069,46 +2579,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda10_2-test: # Testing + manywheel-py3_8-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda10_2-build - runs-on: linux.4xlarge.nvidia.gpu + needs: manywheel-py3_8-cpu-build + runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3125,13 +2614,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda10_2 + name: manywheel-py3_8-cpu path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3156,17 +2642,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker 
runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - pushd pytorch - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - popd - name: Pull Docker image run: | retry () { @@ -3224,45 +2699,24 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda10_2-upload: # Uploading + manywheel-py3_8-cpu-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda10_2-test + needs: manywheel-py3_8-cpu-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3279,15 +2733,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda10_2 + name: manywheel-py3_8-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3340,7 +2791,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_3-build: + manywheel-py3_8-cuda10_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -3348,37 +2799,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + 
DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3395,9 +2826,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3421,9 +2849,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -3465,9 +2890,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-cuda11_3 + name: manywheel-py3_8-cuda10_2 retention-days: 14 if-no-files-found: error path: @@ -3490,46 +2915,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_3-test: # Testing + manywheel-py3_8-cuda10_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_3-build + needs: manywheel-py3_8-cuda10_2-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3546,13 +2951,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda11_3 + name: manywheel-py3_8-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3645,45 +3047,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_3-upload: # Uploading + manywheel-py3_8-cuda10_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_3-test + needs: manywheel-py3_8-cuda10_2-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - 
run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3700,15 +3082,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda11_3 + name: manywheel-py3_8-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3761,7 +3140,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_5-build: + manywheel-py3_8-cuda11_3-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -3769,37 +3148,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup 
Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3816,9 +3175,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -3886,9 +3242,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-cuda11_5 + name: manywheel-py3_8-cuda11_3 retention-days: 14 if-no-files-found: error path: @@ -3911,46 +3267,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_5-test: # Testing + manywheel-py3_8-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_5-build + needs: manywheel-py3_8-cuda11_3-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3967,13 +3303,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda11_5 + name: manywheel-py3_8-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4066,45 +3399,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-cuda11_5-upload: # Uploading + manywheel-py3_8-cuda11_3-upload: # Uploading runs-on: 
linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_5-test + needs: manywheel-py3_8-cuda11_3-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4121,15 +3434,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-cuda11_5 + name: manywheel-py3_8-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -4182,7 +3492,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm4_5_2-build: + manywheel-py3_8-cuda11_5-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -4190,37 +3500,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo 
"ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4237,9 +3527,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4263,6 +3550,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -4304,9 +3594,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-rocm4_5_2 + name: manywheel-py3_8-cuda11_5 retention-days: 14 if-no-files-found: error path: @@ -4329,46 +3619,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm4_5_2-test: # Testing + manywheel-py3_8-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm4_5_2-build - runs-on: linux.4xlarge + needs: manywheel-py3_8-cuda11_5-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: 
pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4385,13 +3655,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm4_5_2 + name: manywheel-py3_8-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4416,6 +3683,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -4473,45 +3751,1388 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm4_5_2-upload: # Uploading + manywheel-py3_8-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm4_5_2-test + needs: manywheel-py3_8-cuda11_5-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 - SKIP_ALL_TESTS: 1 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/manywheel/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: manywheel-py3_8-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-cuda11_6-build + runs-on: linux.4xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" 
"${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-cuda11_6-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-cuda11_6-test + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm4_5_2-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/manywheel/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: manywheel-py3_8-rocm4_5_2 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm4_5_2-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-rocm4_5_2-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-rocm4_5_2 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - 
name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm4_5_2-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-rocm4_5_2-test + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-rocm4_5_2 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c 
'.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm5_0-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash 
/builder/manywheel/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - uses: seemethere/upload-artifact-s3@v4 + with: + name: manywheel-py3_8-rocm5_0 + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm5_0-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-rocm5_0-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-rocm5_0 + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_8-rocm5_0-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_8-rocm5_0-test + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_8-rocm5_0 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Build PyTorch binary + run: | + set -x + mkdir -p artifacts/ + container_name=$(docker run \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/manywheel/build.sh" + - name: Chown artifacts + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - uses: seemethere/upload-artifact-s3@v4 + with: + name: manywheel-py3_9-cpu + retention-days: 14 + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_9-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_9-cpu-build + runs-on: linux.4xlarge + timeout-minutes: 240 + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: manywheel-py3_9-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Pull Docker image + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${DOCKER_IMAGE}" + - name: Test PyTorch binary + run: | + set -x + # shellcheck disable=SC2086,SC2090 + container_name=$(docker run \ + ${GPU_FLAG:-} \ + -e BINARY_ENV_FILE \ + -e BUILDER_ROOT \ + -e BUILD_ENVIRONMENT \ + -e BUILD_SPLIT_CUDA \ + -e DESIRED_CUDA \ + -e DESIRED_DEVTOOLSET \ + -e DESIRED_PYTHON \ + -e GPU_ARCH_TYPE \ + -e GPU_ARCH_VERSION \ + -e IS_GHA \ + -e LIBTORCH_VARIANT \ + -e PACKAGE_TYPE \ + -e PYTORCH_FINAL_PACKAGE_DIR \ + -e PYTORCH_ROOT \ + -e SKIP_ALL_TESTS \ + --tty \ + --detach \ + -v 
"${GITHUB_WORKSPACE}/pytorch:/pytorch" \ + -v "${GITHUB_WORKSPACE}/builder:/builder" \ + -v "${RUNNER_TEMP}/artifacts:/final_pkgs" \ + -w / \ + "${DOCKER_IMAGE}" + ) + docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh" + # Generate test script + docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh" + docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh" + - name: Hold runner for 2 hours or until ssh sessions have drained + working-directory: pytorch/ + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + manywheel-py3_9-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: manywheel-py3_9-cpu-test + env: + PACKAGE_TYPE: manywheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4528,15 +5149,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm4_5_2 + name: manywheel-py3_9-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -4589,7 +5207,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm5_0-build: + manywheel-py3_9-cuda10_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -4597,37 +5215,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + 
DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4644,9 +5242,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -4711,9 +5306,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_8-rocm5_0 + name: manywheel-py3_9-cuda10_2 retention-days: 14 if-no-files-found: error path: @@ -4736,46 +5331,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm5_0-test: # Testing + manywheel-py3_9-cuda10_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_0-build - runs-on: linux.4xlarge + needs: manywheel-py3_9-cuda10_2-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4792,13 +5367,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm5_0 + name: manywheel-py3_9-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -4823,6 +5395,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -4880,45 +5463,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm5_0-upload: # Uploading + manywheel-py3_9-cuda10_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: 
${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_0-test + needs: manywheel-py3_9-cuda10_2-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -4935,15 +5498,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm5_0 + name: manywheel-py3_9-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -4996,7 +5556,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cpu-build: + manywheel-py3_9-cuda11_3-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5004,36 +5564,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: 
$(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5050,9 +5591,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5076,6 +5614,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -5117,9 +5658,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-cpu + name: manywheel-py3_9-cuda11_3 retention-days: 14 if-no-files-found: error path: @@ -5142,45 +5683,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cpu-test: # Testing + manywheel-py3_9-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cpu-build - runs-on: linux.4xlarge + needs: manywheel-py3_9-cuda11_3-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown 
workspace run: | retry () { @@ -5197,13 +5719,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cpu + name: manywheel-py3_9-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -5228,6 +5747,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -5285,44 +5815,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cpu-upload: # Uploading + manywheel-py3_9-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cpu-test + needs: manywheel-py3_9-cuda11_3-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5339,15 +5850,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cpu + name: manywheel-py3_9-cuda11_3 path: "${{ 
runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -5400,7 +5908,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda10_2-build: + manywheel-py3_9-cuda11_5-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5408,37 +5916,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5455,9 +5943,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5481,6 +5966,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -5522,9 +6010,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-cuda10_2 + name: manywheel-py3_9-cuda11_5 retention-days: 14 if-no-files-found: error path: @@ -5547,46 +6035,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda10_2-test: # Testing + manywheel-py3_9-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda10_2-build + needs: manywheel-py3_9-cuda11_5-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5603,13 +6071,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda10_2 + name: manywheel-py3_9-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -5702,45 +6167,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda10_2-upload: # Uploading + manywheel-py3_9-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda10_2-test + needs: manywheel-py3_9-cuda11_5-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display 
EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5757,15 +6202,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda10_2 + name: manywheel-py3_9-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -5818,7 +6260,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda11_3-build: + manywheel-py3_9-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -5826,37 +6268,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: 
pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -5873,9 +6295,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -5943,9 +6362,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-cuda11_3 + name: manywheel-py3_9-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -5968,46 +6387,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda11_3-test: # Testing + manywheel-py3_9-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_3-build + needs: manywheel-py3_9-cuda11_6-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6024,13 +6423,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda11_3 + name: manywheel-py3_9-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -6123,45 +6519,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - 
manywheel-py3_9-cuda11_3-upload: # Uploading + manywheel-py3_9-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_3-test + needs: manywheel-py3_9-cuda11_6-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6178,15 +6554,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda11_3 + name: manywheel-py3_9-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -6239,7 +6612,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda11_5-build: + manywheel-py3_9-rocm4_5_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -6247,37 +6620,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6294,9 +6647,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6320,9 +6670,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -6364,9 +6711,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-cuda11_5 + name: manywheel-py3_9-rocm4_5_2 retention-days: 14 if-no-files-found: error path: @@ -6389,46 +6736,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda11_5-test: # Testing + manywheel-py3_9-rocm4_5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_5-build - runs-on: linux.4xlarge.nvidia.gpu + needs: manywheel-py3_9-rocm4_5_2-build + runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region 
"$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6445,13 +6772,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda11_5 + name: manywheel-py3_9-rocm4_5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -6476,17 +6800,6 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - pushd pytorch - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - popd - name: Pull Docker image run: | retry () { @@ -6544,45 +6857,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-cuda11_5-upload: # Uploading + manywheel-py3_9-rocm4_5_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_5-test + needs: manywheel-py3_9-rocm4_5_2-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DESIRED_CUDA: rocm4.5.2 + GPU_ARCH_VERSION: 4.5.2 + GPU_ARCH_TYPE: rocm + DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6599,15 +6892,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | 
grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-cuda11_5 + name: manywheel-py3_9-rocm4_5_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -6660,7 +6950,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm4_5_2-build: + manywheel-py3_9-rocm5_0-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -6668,37 +6958,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6715,9 +6985,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -6782,9 +7049,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
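A pattern worth calling out: nearly every Chown / docker step in these jobs re-defines the same tiny `retry` helper inline. A standalone sketch of that helper follows (the function body is copied from the steps above; the `docker pull` call and the default image tag are illustrative only):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Three-attempt retry with a short fixed back-off, exactly as the workflow's
# run: steps define it inline; "$@" re-executes the command that was passed in.
retry () {
  "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
}

# Illustrative usage only; the workflow wraps commands such as docker pull,
# docker login and chown in the same helper. The image tag here is just an
# example default, not taken from any particular job.
retry docker pull "${DOCKER_IMAGE:-pytorch/manylinux-builder:cpu}"
```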
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-rocm4_5_2 + name: manywheel-py3_9-rocm5_0 retention-days: 14 if-no-files-found: error path: @@ -6807,46 +7074,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm4_5_2-test: # Testing + manywheel-py3_9-rocm5_0-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm4_5_2-build + needs: manywheel-py3_9-rocm5_0-build runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -6863,13 +7110,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm4_5_2 + name: manywheel-py3_9-rocm5_0 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -6951,45 +7195,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm4_5_2-upload: # Uploading + manywheel-py3_9-rocm5_0-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm4_5_2-test + needs: manywheel-py3_9-rocm5_0-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm4.5.2 - GPU_ARCH_VERSION: 4.5.2 + DESIRED_CUDA: rocm5.0 + GPU_ARCH_VERSION: 5.0 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm4.5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - 
run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7006,15 +7230,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm4_5_2 + name: manywheel-py3_9-rocm5_0 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -7067,7 +7288,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm5_0-build: + manywheel-py3_10-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -7075,37 +7296,16 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - 
name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7122,9 +7322,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -7189,9 +7386,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_9-rocm5_0 + name: manywheel-py3_10-cpu retention-days: 14 if-no-files-found: error path: @@ -7214,46 +7411,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm5_0-test: # Testing + manywheel-py3_10-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_0-build + needs: manywheel-py3_10-cpu-build runs-on: linux.4xlarge timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7270,13 +7446,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm5_0 + name: manywheel-py3_10-cpu path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -7358,45 +7531,24 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm5_0-upload: # Uploading + manywheel-py3_10-cpu-upload: # Uploading runs-on: 
linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_0-test + needs: manywheel-py3_10-cpu-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.0 - GPU_ARCH_VERSION: 5.0 - GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.0 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7413,15 +7565,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm5_0 + name: manywheel-py3_10-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -7474,7 +7623,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cpu-build: + manywheel-py3_10-cuda10_2-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -7482,36 +7631,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata 
ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7528,9 +7658,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -7595,9 +7722,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_10-cpu + name: manywheel-py3_10-cuda10_2 retention-days: 14 if-no-files-found: error path: @@ -7620,45 +7747,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cpu-test: # Testing + manywheel-py3_10-cuda10_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cpu-build - runs-on: linux.4xlarge + needs: manywheel-py3_10-cuda10_2-build + runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7675,13 +7783,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - 
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cpu + name: manywheel-py3_10-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -7706,6 +7811,17 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + pushd pytorch + bash .github/scripts/install_nvidia_utils_linux.sh + echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" + popd - name: Pull Docker image run: | retry () { @@ -7763,44 +7879,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cpu-upload: # Uploading + manywheel-py3_10-cuda10_2-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cpu-test + needs: manywheel-py3_10-cuda10_2-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu + DESIRED_CUDA: cu102 + GPU_ARCH_VERSION: 10.2 + GPU_ARCH_TYPE: cuda + DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7817,15 +7914,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cpu + name: manywheel-py3_10-cuda10_2 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && 
!startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -7878,7 +7972,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda10_2-build: + manywheel-py3_10-cuda11_3-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -7886,37 +7980,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -7933,9 +8007,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -7959,6 +8030,9 @@ jobs: # Remove any artifacts from the previous checkouts git clean -fxd working-directory: builder + - name: Set BUILD_SPLIT_CUDA + run: | + echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image run: | retry () { @@ -8000,9 +8074,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
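The steps this diff adds, such as `Set BUILD_SPLIT_CUDA` and the `GPU_FLAG` export inside the nvidia-driver install step, hand values to later steps by appending `KEY=VALUE` lines to the file named by `$GITHUB_ENV`. A minimal sketch of that mechanism, using a hypothetical `EXAMPLE_FLAG` variable:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Intended to run inside a GitHub Actions run: step, where GITHUB_ENV points
# at a file; KEY=VALUE lines appended to it become environment variables for
# the remaining steps of the same job. EXAMPLE_FLAG is a hypothetical name
# used only for illustration.
echo "EXAMPLE_FLAG=--gpus all" >> "${GITHUB_ENV}"

# The added workflow steps use the same mechanism, e.g.
#   echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV"
#   echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}"
```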
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_10-cuda10_2 + name: manywheel-py3_10-cuda11_3 retention-days: 14 if-no-files-found: error path: @@ -8025,46 +8099,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda10_2-test: # Testing + manywheel-py3_10-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda10_2-build + needs: manywheel-py3_10-cuda11_3-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8081,13 +8135,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda10_2 + name: manywheel-py3_10-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -8180,45 +8231,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda10_2-upload: # Uploading + manywheel-py3_10-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda10_2-test + needs: manywheel-py3_10-cuda11_3-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - 
shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8235,15 +8266,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda10_2 + name: manywheel-py3_10-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -8296,7 +8324,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_3-build: + manywheel-py3_10-cuda11_5-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -8304,37 +8332,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master 
+ - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8351,9 +8359,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -8421,9 +8426,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_10-cuda11_3 + name: manywheel-py3_10-cuda11_5 retention-days: 14 if-no-files-found: error path: @@ -8446,46 +8451,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_3-test: # Testing + manywheel-py3_10-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_3-build + needs: manywheel-py3_10-cuda11_5-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8502,13 +8487,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda11_3 + name: manywheel-py3_10-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -8601,45 +8583,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_3-upload: # Uploading + 
manywheel-py3_10-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_3-test + needs: manywheel-py3_10-cuda11_5-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8656,15 +8618,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda11_3 + name: manywheel-py3_10-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -8717,7 +8676,7 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_5-build: + manywheel-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: linux.4xlarge timeout-minutes: 240 @@ -8725,37 +8684,17 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL 
"http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8772,9 +8711,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -8842,9 +8778,9 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: - name: manywheel-py3_10-cuda11_5 + name: manywheel-py3_10-cuda11_6 retention-days: 14 if-no-files-found: error path: @@ -8867,46 +8803,26 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_5-test: # Testing + manywheel-py3_10-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_5-build + needs: manywheel-py3_10-cuda11_6-build runs-on: linux.4xlarge.nvidia.gpu timeout-minutes: 240 env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -8923,13 +8839,10 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ 
secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda11_5 + name: manywheel-py3_10-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -9022,45 +8935,25 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-cuda11_5-upload: # Uploading + manywheel-py3_10-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_5-test + needs: manywheel-py3_10-cuda11_6-test env: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.5 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9077,15 +8970,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-cuda11_5 + name: manywheel-py3_10-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -9153,30 +9043,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 
- category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9193,9 +9063,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -9260,7 +9127,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_10-rocm4_5_2 retention-days: 14 @@ -9301,30 +9168,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9341,10 +9188,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_10-rocm4_5_2 @@ -9444,30 +9288,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL 
"http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9484,12 +9308,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_10-rocm4_5_2 @@ -9560,30 +9381,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9600,9 +9401,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -9667,7 +9465,7 @@ jobs: run: | # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 with: name: manywheel-py3_10-rocm5_0 retention-days: 14 @@ -9708,30 +9506,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9748,10 +9526,7 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_10-rocm5_0 @@ -9851,30 +9626,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -9891,12 +9646,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: manywheel-py3_10-rocm5_0 diff --git a/.github/workflows/generated-linux-bionic-cuda10.2-py3.9-gcc7.yml 
b/.github/workflows/generated-linux-bionic-cuda10.2-py3.9-gcc7.yml deleted file mode 100644 index 2ce53ab2ecba2c..00000000000000 --- a/.github/workflows/generated-linux-bionic-cuda10.2-py3.9-gcc7.yml +++ /dev/null @@ -1,2283 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-bionic-cuda10.2-py3.9-gcc7 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/slow/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-bionic-cuda10.2-py3.9-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-bionic-cuda10.2-py3.9-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_jit_legacy_1_1: - name: test (jit_legacy, 1, 1, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: jit_legacy - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-jit_legacy-1-1-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-jit_legacy-1-1-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_multigpu_1_1: - name: test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu) - needs: build - runs-on: linux.16xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: multigpu - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-multigpu-1-1-linux.16xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-multigpu-1-1-linux.16xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_nogpu_NO_AVX_1_1: - name: test (nogpu_NO_AVX, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: nogpu_NO_AVX - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-nogpu_NO_AVX-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-nogpu_NO_AVX-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_nogpu_NO_AVX2_1_1: - name: test (nogpu_NO_AVX2, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: nogpu_NO_AVX2 - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-nogpu_NO_AVX2-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-nogpu_NO_AVX2-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_distributed_1_1: - name: test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) - needs: build - runs-on: linux.8xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_slow_1_1: - name: test (slow, 1, 1, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: slow - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-slow-1-1-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-slow-1-1-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-cuda10.2-py3.9-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-bionic-py3.7-clang9.yml b/.github/workflows/generated-linux-bionic-py3.7-clang9.yml deleted file mode 100644 index b77d051c6b62cd..00000000000000 --- a/.github/workflows/generated-linux-bionic-py3.7-clang9.yml +++ /dev/null @@ -1,995 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-bionic-py3.7-clang9 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/noarch/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-bionic-py3.7-clang9 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.7-clang9 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-bionic-py3.7-clang9-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-bionic-py3.7-clang9-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id 
-g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_noarch_1_1: - name: test (noarch, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - TEST_CONFIG: noarch - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-noarch-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-noarch-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-py3.7-clang9-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-bionic-rocm4.5-py3.7.yml b/.github/workflows/generated-linux-bionic-rocm4.5-py3.7.yml deleted file mode 100644 index bc7d226e5c1e42..00000000000000 --- a/.github/workflows/generated-linux-bionic-rocm4.5-py3.7.yml +++ /dev/null @@ -1,922 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-bionic-rocm4.5-py3.7 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/linux/*' - - 'ciflow/rocm/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-bionic-rocm4.5-py3.7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-rocm4.5-py3.7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-bionic-rocm4.5-py3.7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.rocm.gpu) - needs: build - runs-on: linux.rocm.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: Set DOCKER_HOST - run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}" - - name: Runner health check system info - if: always() - run: | - cat /etc/os-release || true - cat /etc/apt/sources.list.d/rocm.list || true - cat /opt/rocm/.info/version || true - whoami - - name: Runner health check rocm-smi - if: always() - run: | - rocm-smi - - name: Runner health check rocminfo - if: always() - run: | - rocminfo - - name: Runner health check GPU count - if: always() - run: | - ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') - if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" - exit 1 - fi - - name: Runner health check disconnect on failure - if: ${{ failure() }} - run: | - killall runsvc.sh - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: ROCm set GPU_FLAG - run: | - echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo 
"SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home - docker exec -t "${container_name}" sh -c "cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" - # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct - docker exec -t "${container_name}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.rocm.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.rocm.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github - if: always() - with: - name: test-reports - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.rocm.gpu) - needs: build - runs-on: linux.rocm.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: Set DOCKER_HOST - run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}" - - name: Runner health check system info - if: always() - run: | - cat /etc/os-release || true - cat /etc/apt/sources.list.d/rocm.list || true - cat /opt/rocm/.info/version || true - whoami - - name: Runner health check rocm-smi - if: always() - run: | - rocm-smi - - name: Runner health check rocminfo - if: always() - run: | - rocminfo - - name: Runner health check GPU count - if: always() - run: | - ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') - if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on 
the runner" - exit 1 - fi - - name: Runner health check disconnect on failure - if: ${{ failure() }} - run: | - killall runsvc.sh - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: ROCm set GPU_FLAG - run: | - echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home - docker exec -t "${container_name}" sh -c "cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" - # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct - docker exec -t "${container_name}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.rocm.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.rocm.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github - if: always() - with: - name: test-reports - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.rocm.gpu) - needs: build - runs-on: linux.rocm.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: Set DOCKER_HOST - run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}" - - name: Runner health check system info - if: always() - run: | - cat /etc/os-release || true - cat /etc/apt/sources.list.d/rocm.list || true - cat /opt/rocm/.info/version || true - whoami - - name: Runner health check rocm-smi - if: always() - run: | - rocm-smi - - name: Runner health check rocminfo - if: always() - run: | - rocminfo - - name: Runner health check GPU count - if: always() - run: | - ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') - if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the 
runner" - exit 1 - fi - - name: Runner health check disconnect on failure - if: ${{ failure() }} - run: | - killall runsvc.sh - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: ROCm set GPU_FLAG - run: | - echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home - docker exec -t "${container_name}" sh -c "cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" - # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct - docker exec -t "${container_name}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.rocm.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.rocm.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github - if: always() - with: - name: test-reports - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-bionic-rocm4.5-py3.7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-docs-push.yml b/.github/workflows/generated-linux-docs-push.yml deleted file mode 100644 index 0a99fcf684f9ba..00000000000000 --- a/.github/workflows/generated-linux-docs-push.yml +++ /dev/null @@ -1,395 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-docs-push - -on: - push: - tags: - # NOTE: Binary build pipelines should only get triggered on release candidate builds - # Release candidate tags look like: v1.11.0-rc1 - - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/scheduled/*' - schedule: - - cron: 0 0 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-docs-push - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # 
This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-docs-push-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-docs-push-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - build-docs: - runs-on: linux.2xlarge - timeout-minutes: 240 - strategy: - matrix: - docs_type: [cpp, python] - needs: [build] - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - DOCS_TYPE: ${{ matrix.docs_type }} - WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Generate netrc (only for docs-push) - if: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} - env: - GITHUB_PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} - run: | - # set credentials for https pushing - echo "machine github.com" > "${RUNNER_TEMP}/.netrc" - echo "login pytorchbot" >> "${RUNNER_TEMP}/.netrc" - echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" - - name: Build ${{ matrix.docs_type }} docs - run: | - set -ex - time docker pull "${DOCKER_IMAGE}" > /dev/null - # Convert refs/tags/v1.12.0rc3 into 1.12 - if [[ "${GITHUB_REF}" =~ ^refs/tags/v([0-9]+\.[0-9]+)\.* ]]; then - target="${BASH_REMATCH[1]}" - else - target="master" - fi - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e IN_CI \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SHA1="$GITHUB_SHA" \ - -e DOCS_VERSION="${target}" \ - -e DOCS_TYPE \ - -e PR_LABELS \ - -e WITH_PUSH \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${RUNNER_TEMP}/.netrc":/var/lib/jenkins/.netrc \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . && pip install dist/*.whl && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 - name: Upload Python Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} - with: - retention-days: 14 - s3-bucket: doc-previews - if-no-files-found: error - path: pytorch.github.io/docs/master/ - s3-prefix: pytorch/${{ github.event.pull_request.number }} - - uses: seemethere/upload-artifact-s3@v3 - name: Upload C++ Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} - with: - retention-days: 14 - if-no-files-found: error - s3-bucket: doc-previews - path: cppdocs/ - s3-prefix: pytorch/${{ github.event.pull_request.number }}/cppdocs diff --git a/.github/workflows/generated-linux-docs.yml b/.github/workflows/generated-linux-docs.yml deleted file mode 100644 index f5c73edb01f531..00000000000000 --- a/.github/workflows/generated-linux-docs.yml +++ /dev/null @@ -1,386 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-docs - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/docs/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-docs - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-docs-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-docs-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: 
Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - build-docs: - runs-on: linux.2xlarge - timeout-minutes: 240 - strategy: - matrix: - docs_type: [cpp, python] - needs: [build] - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - DOCS_TYPE: ${{ matrix.docs_type }} - WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Build ${{ matrix.docs_type }} docs - run: | - set -ex - time docker pull "${DOCKER_IMAGE}" > /dev/null - # Convert refs/tags/v1.12.0rc3 into 1.12 - if [[ "${GITHUB_REF}" =~ ^refs/tags/v([0-9]+\.[0-9]+)\.* ]]; then - target="${BASH_REMATCH[1]}" - else - target="master" - fi - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e IN_CI \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SHA1="$GITHUB_SHA" \ - -e DOCS_VERSION="${target}" \ - -e DOCS_TYPE \ - -e PR_LABELS \ - -e WITH_PUSH \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . && pip install dist/*.whl && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v3 - name: Upload Python Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} - with: - retention-days: 14 - s3-bucket: doc-previews - if-no-files-found: error - path: pytorch.github.io/docs/master/ - s3-prefix: pytorch/${{ github.event.pull_request.number }} - - uses: seemethere/upload-artifact-s3@v3 - name: Upload C++ Docs Preview - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} - with: - retention-days: 14 - if-no-files-found: error - s3-bucket: doc-previews - path: cppdocs/ - s3-prefix: pytorch/${{ github.event.pull_request.number }}/cppdocs diff --git a/.github/workflows/generated-linux-vulkan-bionic-py3.7-clang9.yml b/.github/workflows/generated-linux-vulkan-bionic-py3.7-clang9.yml deleted file mode 100644 index 3aeaf3d6b49996..00000000000000 --- a/.github/workflows/generated-linux-vulkan-bionic-py3.7-clang9.yml +++ /dev/null @@ -1,501 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-vulkan-bionic-py3.7-clang9 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - - 'ciflow/vulkan/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-vulkan-bionic-py3.7-clang9 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.7-clang9 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-vulkan-bionic-py3.7-clang9-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-vulkan-bionic-py3.7-clang9-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr 
get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_default_1_1: - name: test (default, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-vulkan-bionic-py3.7-clang9-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-vulkan-bionic-py3.7-clang9-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-bazel-test.yml b/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-bazel-test.yml deleted file mode 100644 index dc0d30c1c72e88..00000000000000 --- a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-bazel-test.yml +++ /dev/null @@ -1,337 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/bazel_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/bazel/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: linux.2xlarge - env: - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test-build-and-test - NUM_TEST_SHARDS: 1 - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # 
Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Build - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e PR_LABELS \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && sudo chown -R jenkins /dev && .jenkins/pytorch/build.sh' - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - AWS_DEFAULT_REGION: us-east-1 - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - # The artifact file is created inside docker container, which contains the result binaries. - # Now unpackage it into the project folder. The subsequent script will scan project folder - # to locate result binaries and report their sizes. 
- # If artifact file is not provided it assumes that the project folder has been mounted in - # the docker during build and already contains the result binaries, so this step can be skipped. - export ARTIFACTS= - if [ -n "${ARTIFACTS}" ]; then - tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" - cd "${GITHUB_WORKSPACE}" - fi - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - ANDROID_BUILD_TYPE= - export ANDROID_BUILD_TYPE - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 - - name: Test - # Time out the test phase after 3.5 hours - timeout-minutes: 210 - run: | - # detached container should get cleaned up by teardown_ec2_linux - export SHARD_NUMBER=0 - # TODO: Stop building test binaries as part of the build phase - # Make sure we copy test results from bazel-testlogs symlink to - # a regular directory ./test/test-reports - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e SHARD_NUMBER \ - -e NUM_TEST_SHARDS \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && sudo chown -R jenkins /dev && .jenkins/pytorch/test.sh && cp -Lr ./bazel-testlogs ./test/test-reports' - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: 'bazel-${{ github.job }}' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: 'bazel-${{ github.job }}' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-no-ops.yml b/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-no-ops.yml deleted file mode 100644 index 362e4db272ebe9..00000000000000 --- a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7-no-ops.yml +++ /dev/null @@ -1,251 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-cuda11.3-py3.7-gcc7-no-ops - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-cuda11.3-py3.7-gcc7-no-ops - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-cuda11.3-py3.7-gcc7-no-ops-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-no-ops-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run 
--pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7.yml b/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7.yml deleted file mode 100644 index 2cc43a1c3ec55d..00000000000000 --- a/.github/workflows/generated-linux-xenial-cuda11.3-py3.7-gcc7.yml +++ /dev/null @@ -1,1021 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-cuda11.3-py3.7-gcc7 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-cuda11.3-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-cuda11.3-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance 
metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) - needs: build - runs-on: linux.8xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
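
Several steps above define the same tiny `retry` helper inline before calling flaky network commands (the ECR login, `docker pull`). A standalone sketch of that helper:

```bash
#!/usr/bin/env bash
# The inline retry helper used by the ECR login and docker pull steps:
# run the command, and on failure retry once after 1s and once more after 2s.
retry () {
  "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
}

# Example usage, assuming ALPINE_IMAGE is set as in the workflow env.
retry docker pull "${ALPINE_IMAGE}"
```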
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
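
Each test job sizes `/dev/shm` from the build environment and picks its test entry point from `TEST_CONFIG`/`BUILD_ENVIRONMENT`. A condensed sketch of those two decisions as they appear in the "Determine shm-size" and "Test" steps above:

```bash
#!/usr/bin/env bash
# Condensed from the "Determine shm-size" and "Test" steps above.
set -euo pipefail

# Larger shared memory for CUDA and ROCm builds.
shm_size="1g"
case "${BUILD_ENVIRONMENT}" in
  *cuda*) shm_size="2g" ;;
  *rocm*) shm_size="8g" ;;
esac

# Pick the test entry point.
if [[ "${TEST_CONFIG}" == 'multigpu' ]]; then
  TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh
elif [[ "${BUILD_ENVIRONMENT}" == *onnx* ]]; then
  TEST_COMMAND=.jenkins/caffe2/test.sh
else
  TEST_COMMAND=.jenkins/pytorch/test.sh
fi

echo "shm-size=${shm_size} test command=${TEST_COMMAND}"
```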
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
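
After the test container exits, the job packages any JSON and JUnit XML output under `test/` so it can be stored on S3 and fed to the test-stats tooling. A sketch of that packaging, assuming `FILE_SUFFIX` is set as in the workflow:

```bash
#!/usr/bin/env bash
# Sketch of the "Zip JSONs/test reports for upload" steps above.
set -euo pipefail

# Drop any archives left over from a previous run, then zip only the
# relevant file types under test/.
rm -f test-jsons-*.zip test-reports-*.zip
zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json'
zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml'
```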
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-cuda11.3-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
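
Because the build and test commands run inside Docker, each job first dumps its `GITHUB_*` variables to a file on the runner and later hands that file to `docker run` with `--env-file`, so scripts inside the container see the same GitHub Actions context. A minimal sketch, assuming `DOCKER_IMAGE` is set as in the workflow:

```bash
#!/usr/bin/env bash
# Sketch of the "Preserve github env variables for use in docker" handoff above.
set -euo pipefail

# On the runner: capture the GitHub Actions context variables.
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"

# Later: expose them to the container; here we just print one back out.
docker run --rm \
  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
  "${DOCKER_IMAGE}" \
  printenv GITHUB_RUN_ID
```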
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3-clang5-mobile-build.yml b/.github/workflows/generated-linux-xenial-py3-clang5-mobile-build.yml deleted file mode 100644 index d093e0a976732e..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3-clang5-mobile-build.yml +++ /dev/null @@ -1,241 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3-clang5-mobile-build - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/linux/*' - - 'ciflow/mobile/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3-clang5-mobile-build - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3-clang5-mobile-build-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3-clang5-mobile-build-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v 
"$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3-clang5-mobile-custom-build-static.yml b/.github/workflows/generated-linux-xenial-py3-clang5-mobile-custom-build-static.yml deleted file mode 100644 index 409a0e3e95a345..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3-clang5-mobile-custom-build-static.yml +++ /dev/null @@ -1,241 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3-clang5-mobile-custom-build-static - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/linux/*' - - 'ciflow/mobile/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3-clang5-mobile-custom-build-static - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3-clang5-mobile-custom-build-static-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3-clang5-mobile-custom-build-static-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: 
$(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-clang7-asan.yml b/.github/workflows/generated-linux-xenial-py3.7-clang7-asan.yml deleted file mode 100644 index 0f8858cb178f51..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-clang7-asan.yml +++ /dev/null @@ -1,995 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-clang7-asan - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/sanitizers/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-clang7-asan - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang7-asan - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-clang7-asan-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" 
- - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_default_1_3: - name: test (default, 1, 3, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 330 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 3 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 300 minutes - timeout-minutes: 300 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-3-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-3-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_3: - name: test (default, 2, 3, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 330 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 3 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 300 minutes - timeout-minutes: 300 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-3-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-3-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_3_3: - name: test (default, 3, 3, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 330 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - TEST_CONFIG: default - SHARD_NUMBER: 3 - NUM_TEST_SHARDS: 3 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 300 minutes - timeout-minutes: 300 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-3-3-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-3-3-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-asan-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-clang7-onnx.yml b/.github/workflows/generated-linux-xenial-py3.7-clang7-onnx.yml deleted file mode 100644 index a2ceb91d987b1f..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-clang7-onnx.yml +++ /dev/null @@ -1,748 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-clang7-onnx - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/onnx/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-clang7-onnx - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang7-onnx - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-clang7-onnx-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-clang7-onnx-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v 
"${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_default_1_2: - name: test (default, 1, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-onnx-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-onnx-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-onnx-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-clang7-onnx-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build.yml b/.github/workflows/generated-linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build.yml deleted file mode 100644 index 80eaabc04c7f92..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build.yml +++ /dev/null @@ -1,243 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/libtorch/*' - - 'ciflow/linux/*' - - 'ciflow/mobile/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () 
{ - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-gcc5.4.yml b/.github/workflows/generated-linux-xenial-py3.7-gcc5.4.yml deleted file mode 100644 index 87df9f6ff116c1..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-gcc5.4.yml +++ /dev/null @@ -1,1735 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-gcc5.4 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-gcc5.4 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-gcc5.4-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - 
AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_jit_legacy_1_1: - name: test (jit_legacy, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: jit_legacy - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-jit_legacy-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-jit_legacy-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_distributed_1_1: - name: test (distributed, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_docs_test_1_1: - name: test (docs_test, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: docs_test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-docs_test-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-docs_test-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_backwards_compat_1_1: - name: test (backwards_compat, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: backwards_compat - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-backwards_compat-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-backwards_compat-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
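In the Test step above, the script run inside the container is chosen from `TEST_CONFIG` and `BUILD_ENVIRONMENT`: multigpu configs use the multi-GPU Jenkins script, onnx build environments use the Caffe2 test script, and everything else falls back to the standard PyTorch test script. The dispatch on its own, as a sketch (both variables are assumed to be exported by the job environment):

    # Sketch of the TEST_COMMAND dispatch from the Test step above.
    # TEST_CONFIG and BUILD_ENVIRONMENT are assumed to be set by the workflow.
    if [[ "${TEST_CONFIG}" == 'multigpu' ]]; then
      TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh
    elif [[ "${BUILD_ENVIRONMENT}" == *onnx* ]]; then
      TEST_COMMAND=.jenkins/caffe2/test.sh
    else
      TEST_COMMAND=.jenkins/pytorch/test.sh
    fi
    echo "selected test script: ${TEST_COMMAND}"
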
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
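Each job above logs in to ECR by asking STS for the account id and piping a short-lived registry password straight into `docker login`, so no credentials are written to disk. A condensed sketch of that login (it assumes `AWS_DEFAULT_REGION` is set and that the runner's instance role provides AWS credentials):

    # Condensed sketch of the "Log in to ECR" step used by every job above.
    AWS_ACCOUNT_ID=$(aws sts get-caller-identity | grep Account | cut -f4 -d\")
    aws ecr get-login-password --region "${AWS_DEFAULT_REGION}" \
      | docker login --username AWS --password-stdin \
          "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com"
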
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-gcc7-no-ops.yml b/.github/workflows/generated-linux-xenial-py3.7-gcc7-no-ops.yml deleted file mode 100644 index 1b507bc4831625..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-gcc7-no-ops.yml +++ /dev/null @@ -1,252 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-gcc7-no-ops - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-gcc7-no-ops - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-gcc7-no-ops-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-no-ops-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id 
-u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
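The "Calculate docker image tag" and "Check if image should be built" steps above key the CI image entirely off the git tree hash of `.circleci/docker`: that hash becomes the image tag, and a rebuild only happens when no image with that tag exists and the hash differs from the one at the merge-base. A condensed, illustrative version of that decision (the base-branch special case, the real error handling, and the `::set-output` plumbing are omitted):

    # Illustrative condensation of the docker-tag / rebuild decision above.
    # DOCKER_IMAGE_BASE and BASE_REVISION are assumed to be provided by the workflow.
    DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)        # tree hash of .circleci/docker
    DOCKER_IMAGE="${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"
    if docker manifest inspect "${DOCKER_IMAGE}" >/dev/null 2>&1; then
      echo "image ${DOCKER_IMAGE} already exists, nothing to build"
    else
      MERGE_BASE=$(git merge-base HEAD "${BASE_REVISION}")
      PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:.circleci/docker")
      if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then
        echo "ERROR: tag unchanged but image is missing" >&2
      else
        echo "would rebuild and push ${DOCKER_IMAGE}"
      fi
    fi
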
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-linux-xenial-py3.7-gcc7.yml b/.github/workflows/generated-linux-xenial-py3.7-gcc7.yml deleted file mode 100644 index 59c1e771d7b611..00000000000000 --- a/.github/workflows/generated-linux-xenial-py3.7-gcc7.yml +++ /dev/null @@ -1,994 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: linux-xenial-py3.7-gcc7 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: linux-xenial-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: linux-xenial-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
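The "Determine shm-size" step above sizes the container's /dev/shm by build flavor: 1g by default, 2g for CUDA builds, 8g for ROCm builds, with the result written to `GITHUB_ENV` so the later `docker run --shm-size` picks it up. The same mapping as a standalone sketch (`BUILD_ENVIRONMENT` is assumed to be set, e.g. `linux-xenial-py3.7-gcc7`):

    # Sketch of the shm-size selection used by the test jobs above.
    shm_size="1g"
    case "${BUILD_ENVIRONMENT}" in
      *cuda*) shm_size="2g" ;;
      *rocm*) shm_size="8g" ;;
    esac
    echo "SHM_SIZE=${shm_size}"   # the workflow appends this line to "${GITHUB_ENV}"
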
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
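The zip-and-upload steps above package only the JSON timing files and the XML JUnit reports under `test/`, and name each archive with a `FILE_SUFFIX` that encodes the job, test config, shard, and runner type so parallel shards do not clobber each other on S3. A small sketch of the packaging (the suffix value shown is an illustrative placeholder):

    # Illustrative packaging of test results, mirroring the zip steps above.
    FILE_SUFFIX="test-default-1-2-linux.2xlarge"   # placeholder suffix
    rm -f test-jsons-*.zip test-reports-*.zip
    zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json'
    zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml'
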
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: linux-xenial-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-macos-10-15-py3-arm64.yml b/.github/workflows/generated-macos-10-15-py3-arm64.yml deleted file mode 100644 index 5a6c089249f661..00000000000000 --- a/.github/workflows/generated-macos-10-15-py3-arm64.yml +++ /dev/null @@ -1,89 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/macos_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: macos-10-15-py3-arm64 - -on: - push: - branches: - - master - - main - - release/* - tags: - - 'ciflow/all/*' - - 'ciflow/macos/*' - - 'ciflow/trunk/*' - workflow_dispatch: - -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} -env: - BUILD_ENVIRONMENT: macos-10-15-py3-arm64 - COMPACT_JOB_NAME: macos-10-15-py3-arm64 - IN_CI: 1 - IS_GHA: 1 - PYTORCH_RETRY_TEST_CASES: 1 - - -jobs: - - build: - runs-on: macos-10.15 - env: - JOB_BASE_NAME: macos-10-15-py3-arm64 - # For sccache access (only on non-forked PRs) - AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: 3.8 - activate-environment: build - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Install sccache (only for non-forked PRs, and pushes to trunk) - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - - name: Build - run: | - echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" - .jenkins/pytorch/macos-build.sh - - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ - - uses: actions/upload-artifact@v2 - name: Store PyTorch Build Artifacts on GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - -concurrency: - group: macos-10-15-py3-arm64-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-macos-10-15-py3-lite-interpreter-x86-64.yml b/.github/workflows/generated-macos-10-15-py3-lite-interpreter-x86-64.yml deleted file mode 100644 index af9859b138280b..00000000000000 --- a/.github/workflows/generated-macos-10-15-py3-lite-interpreter-x86-64.yml +++ 
/dev/null @@ -1,80 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/macos_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: macos-10-15-py3-lite-interpreter-x86-64 - -on: - push: - branches: - - master - - main - - release/* - tags: - - 'ciflow/all/*' - - 'ciflow/macos/*' - - 'ciflow/trunk/*' - workflow_dispatch: - -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} -env: - BUILD_ENVIRONMENT: macos-10-15-py3-lite-interpreter-x86-64 - COMPACT_JOB_NAME: macos-10-15-py3-lite-interpreter-x86-64 - IN_CI: 1 - IS_GHA: 1 - PYTORCH_RETRY_TEST_CASES: 1 - - # Set xcode xcode version to 12 - DEVELOPER_DIR: /Applications/Xcode_12.app/Contents/Developer - -jobs: - - build: - runs-on: macos-10.15 - env: - JOB_BASE_NAME: macos-10-15-py3-lite-interpreter-x86-64 - # For sccache access (only on non-forked PRs) - AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: 3.8 - activate-environment: build - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Install sccache (only for non-forked PRs, and pushes to trunk) - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - - name: Build - run: | - echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" - .jenkins/pytorch/macos-build.sh - - -concurrency: - group: macos-10-15-py3-lite-interpreter-x86-64-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-macos-11-py3-x86-64.yml b/.github/workflows/generated-macos-11-py3-x86-64.yml deleted file mode 100644 index 7961cff18fd119..00000000000000 --- a/.github/workflows/generated-macos-11-py3-x86-64.yml +++ /dev/null @@ -1,319 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/macos_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: macos-11-py3-x86-64 - -on: - push: - branches: - - master - - main - - release/* - tags: - - 'ciflow/all/*' - - 'ciflow/macos/*' - - 'ciflow/trunk/*' - workflow_dispatch: - -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} -env: - BUILD_ENVIRONMENT: macos-11-py3-x86-64 - COMPACT_JOB_NAME: macos-11-py3-x86-64 - IN_CI: 1 - IS_GHA: 1 - PYTORCH_RETRY_TEST_CASES: 1 - - # Set xcode 
xcode version to 12.4 - DEVELOPER_DIR: /Applications/Xcode_12.4.app/Contents/Developer - -jobs: - - build: - runs-on: macos-11 - env: - JOB_BASE_NAME: macos-11-py3-x86-64 - # For sccache access (only on non-forked PRs) - AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: 3.8 - activate-environment: build - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Install sccache (only for non-forked PRs, and pushes to trunk) - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - - name: Build - run: | - echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" - .jenkins/pytorch/macos-build.sh - - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ - - uses: actions/upload-artifact@v2 - name: Store PyTorch Build Artifacts on GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - - test_default_1_2: - name: test (default, 1, 2, macos-11) - needs: build - runs-on: macos-11 - timeout-minutes: 240 - env: - JOB_BASE_NAME: macos-11-py3-x86-64-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: false - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - uses: actions/download-artifact@v2 - name: Download PyTorch Build Artifacts from GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: . 
- - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: 3.8 - activate-environment: build - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - run: | - python3 -mpip install dist/*.whl - .jenkins/pytorch/macos-test.sh - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-macos-11' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github - if: always() - with: - name: test-jsons - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-macos-11' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github - if: always() - with: - name: test-reports - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: macos-11-py3-x86-64-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }} - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - test_default_2_2: - name: test (default, 2, 2, macos-11) - needs: build - runs-on: macos-11 - timeout-minutes: 240 - env: - JOB_BASE_NAME: macos-11-py3-x86-64-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: false - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - uses: actions/download-artifact@v2 - name: Download PyTorch Build Artifacts from GHA - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: . 
- - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: 3.8 - activate-environment: build - - name: Install macOS homebrew dependencies - run: | - # Install dependencies - brew install libomp - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - run: | - python3 -mpip install dist/*.whl - .jenkins/pytorch/macos-test.sh - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-macos-11' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github - if: always() - with: - name: test-jsons - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-macos-11' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github - if: always() - with: - name: test-reports - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: macos-11-py3-x86-64-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }} - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - -concurrency: - group: macos-11-py3-x86-64-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/generated-macos-arm64-binary-conda.yml b/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml similarity index 86% rename from .github/workflows/generated-macos-arm64-binary-conda.yml rename to .github/workflows/generated-macos-arm64-binary-conda-nightly.yml index 593ca5a37b6445..37e922583ae4a6 100644 --- a/.github/workflows/generated-macos-arm64-binary-conda.yml +++ b/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml @@ -38,6 +38,7 @@ concurrency: jobs: conda-py3_8-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -133,30 +134,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance 
metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -173,9 +154,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -235,6 +213,7 @@ jobs: # Prune all of the docker images docker system prune -af conda-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -330,30 +309,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -370,9 +329,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -432,6 +388,7 @@ jobs: # Prune all of the docker images docker system prune -af conda-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -527,30 +484,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl 
-fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -567,9 +504,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-macos-arm64-binary-wheel.yml b/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml similarity index 86% rename from .github/workflows/generated-macos-arm64-binary-wheel.yml rename to .github/workflows/generated-macos-arm64-binary-wheel-nightly.yml index b17db22d2a7c1c..a0267de766e2ac 100644 --- a/.github/workflows/generated-macos-arm64-binary-wheel.yml +++ b/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml @@ -38,6 +38,7 @@ concurrency: jobs: wheel-py3_7-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -133,30 +134,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -173,9 +154,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -235,6 +213,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_8-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -330,30 +309,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: 
"3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -370,9 +329,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -432,6 +388,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -527,30 +484,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -567,9 +504,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -629,6 +563,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -724,30 +659,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from 
instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -764,9 +679,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-macos-binary-conda.yml b/.github/workflows/generated-macos-binary-conda-nightly.yml similarity index 86% rename from .github/workflows/generated-macos-binary-conda.yml rename to .github/workflows/generated-macos-binary-conda-nightly.yml index 3fb1852c859169..d5c6eae896cb31 100644 --- a/.github/workflows/generated-macos-binary-conda.yml +++ b/.github/workflows/generated-macos-binary-conda-nightly.yml @@ -36,6 +36,7 @@ concurrency: jobs: conda-py3_7-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -131,30 +132,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -171,9 +152,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -233,6 +211,7 @@ jobs: # Prune all of the docker images docker system prune -af conda-py3_8-cpu-build: + if: ${{ github.repository_owner == 
'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -328,30 +307,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -368,9 +327,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -430,6 +386,7 @@ jobs: # Prune all of the docker images docker system prune -af conda-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -525,30 +482,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -565,9 +502,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -627,6 +561,7 @@ jobs: # Prune all of the docker images docker system prune -af conda-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -722,30 +657,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: 
Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -762,9 +677,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-macos-binary-libtorch-cxx11-abi.yml b/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml similarity index 87% rename from .github/workflows/generated-macos-binary-libtorch-cxx11-abi.yml rename to .github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml index a1f39d4ceea408..eac3e4019cd350 100644 --- a/.github/workflows/generated-macos-binary-libtorch-cxx11-abi.yml +++ b/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml @@ -36,6 +36,7 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -137,30 +138,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -177,9 +158,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | 
grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -239,6 +217,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-shared-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -340,30 +319,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -380,9 +339,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -442,6 +398,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-with-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -543,30 +500,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -583,9 +520,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - 
name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -645,6 +579,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-without-deps-cxx11-abi-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -746,30 +681,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: cxx11-abi steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -786,9 +701,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-macos-binary-libtorch-pre-cxx11.yml b/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml similarity index 87% rename from .github/workflows/generated-macos-binary-libtorch-pre-cxx11.yml rename to .github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml index cf6936d467744b..b943ea97a97011 100644 --- a/.github/workflows/generated-macos-binary-libtorch-pre-cxx11.yml +++ b/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml @@ -36,6 +36,7 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -137,30 +138,10 @@ jobs: LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry 
() { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -177,9 +158,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -239,6 +217,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-shared-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -340,30 +319,10 @@ jobs: LIBTORCH_VARIANT: shared-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -380,9 +339,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -442,6 +398,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-with-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -543,30 +500,10 @@ jobs: LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - 
AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -583,9 +520,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -645,6 +579,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-without-deps-pre-cxx11-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 # libtorch builds take a long time on github hosted runners timeout-minutes: 720 @@ -746,30 +681,10 @@ jobs: LIBTORCH_VARIANT: static-without-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -786,9 +701,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-macos-binary-wheel.yml b/.github/workflows/generated-macos-binary-wheel-nightly.yml similarity index 86% rename from .github/workflows/generated-macos-binary-wheel.yml rename to .github/workflows/generated-macos-binary-wheel-nightly.yml index 1db195ea06d6e0..2dd93eea93ca9c 100644 --- a/.github/workflows/generated-macos-binary-wheel.yml +++ b/.github/workflows/generated-macos-binary-wheel-nightly.yml @@ -36,6 +36,7 @@ concurrency: jobs: wheel-py3_7-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -131,30 +132,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -171,9 +152,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -233,6 +211,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_8-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -328,30 +307,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -368,9 +327,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -430,6 +386,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -525,30 +482,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL 
"http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -565,9 +502,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 @@ -627,6 +561,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-10.15 timeout-minutes: 240 env: @@ -722,30 +657,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -762,9 +677,6 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - uses: actions/download-artifact@v2 diff --git a/.github/workflows/generated-parallelnative-linux-xenial-py3.7-gcc5.4.yml b/.github/workflows/generated-parallelnative-linux-xenial-py3.7-gcc5.4.yml deleted file mode 100644 index 17322971c3fc84..00000000000000 --- a/.github/workflows/generated-parallelnative-linux-xenial-py3.7-gcc5.4.yml +++ /dev/null @@ -1,746 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: parallelnative-linux-xenial-py3.7-gcc5.4 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - 
workflow_dispatch: - -env: - BUILD_ENVIRONMENT: parallelnative-linux-xenial-py3.7-gcc5.4 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: parallelnative-linux-xenial-py3.7-gcc5.4-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: parallelnative-linux-xenial-py3.7-gcc5.4-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: parallelnative-linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: parallelnative-linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_1: - name: test (default, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: parallelnative-linux-xenial-py3.7-gcc5.4-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - 
--ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: parallelnative-linux-xenial-py3.7-gcc5.4-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7.yml b/.github/workflows/generated-periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7.yml deleted file mode 100644 index bcf59941a8c7f2..00000000000000 --- a/.github/workflows/generated-periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7.yml +++ /dev/null @@ -1,239 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/libtorch/*' - - 'ciflow/linux/*' - - 'ciflow/scheduled/*' - schedule: - - cron: 45 4,10,16,22 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull 
"${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-periodic-linux-bionic-cuda11.5-py3.7-gcc7.yml b/.github/workflows/generated-periodic-linux-bionic-cuda11.5-py3.7-gcc7.yml deleted file mode 100644 index ff85e17659c075..00000000000000 --- a/.github/workflows/generated-periodic-linux-bionic-cuda11.5-py3.7-gcc7.yml +++ /dev/null @@ -1,1018 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: periodic-linux-bionic-cuda11.5-py3.7-gcc7 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/scheduled/*' - schedule: - - cron: 45 4,10,16,22 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: periodic-linux-bionic-cuda11.5-py3.7-gcc7 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: periodic-linux-bionic-cuda11.5-py3.7-gcc7-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo 
"instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) - needs: build - runs-on: linux.8xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-bionic-cuda11.5-py3.7-gcc7-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck.yml b/.github/workflows/generated-periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck.yml deleted file mode 100644 index 5d5c901859f0bd..00000000000000 --- a/.github/workflows/generated-periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck.yml +++ /dev/null @@ -1,764 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/scheduled/*' - - 'ciflow/slow/*' - - 'ciflow/slow-gradcheck/*' - schedule: - - cron: 0 */4 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 
&& "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_default_1_2: - name: test (default, 1, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 390 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 360 minutes - timeout-minutes: 360 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 390 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 360 minutes - timeout-minutes: 360 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug.yml b/.github/workflows/generated-periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug.yml deleted file mode 100644 index 8e4f047facad57..00000000000000 --- a/.github/workflows/generated-periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug.yml +++ /dev/null @@ -1,1019 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/linux/*' - - 'ciflow/scheduled/*' - schedule: - - cron: 45 0,4,8,12,16,20 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 - DEBUG: 1 -concurrency: - group: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working 
directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_distributed_1_1: - name: test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) - needs: build - runs-on: linux.8xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - TEST_CONFIG: distributed - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-distributed-1-1-linux.8xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_1_2: - name: test (default, 1, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - TEST_CONFIG: default - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - test_default_2_2: - name: test (default, 2, 2, linux.4xlarge.nvidia.gpu) - needs: build - runs-on: linux.4xlarge.nvidia.gpu - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - TEST_CONFIG: default - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e 
PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-linux.4xlarge.nvidia.gpu' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-periodic-win-vs2019-cuda11.5-py3.yml b/.github/workflows/generated-periodic-win-vs2019-cuda11.5-py3.yml deleted file mode 100644 index 8041eca3762360..00000000000000 --- a/.github/workflows/generated-periodic-win-vs2019-cuda11.5-py3.yml +++ /dev/null @@ -1,601 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/windows_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: periodic-win-vs2019-cuda11.5-py3 - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/scheduled/*' - - 'ciflow/win/*' - schedule: - - cron: 45 4,10,16,22 * * * - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: periodic-win-vs2019-cuda11.5-py3 - BUILD_WHEEL: 1 - MAX_JOBS: 8 - CUDA_VERSION: "11.5" - IN_CI: 1 - IS_GHA: 1 - INSTALL_WINDOWS_SDK: 1 - PYTHON_VERSION: "3.8" - PYTORCH_RETRY_TEST_CASES: 1 - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - SCCACHE_BUCKET: "ossci-compiler-cache" - VC_PRODUCT: "BuildTools" - VC_VERSION: "" - VS_VERSION: "16.8.6" - VC_YEAR: "2019" - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - no_proxy: localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TORCH_CUDA_ARCH_LIST: "7.0" - USE_CUDA: 1 - -concurrency: - group: periodic-win-vs2019-cuda11.5-py3-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - build: - runs-on: "windows.4xlarge" - timeout-minutes: 240 - env: - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-build - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - 
run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - .jenkins/pytorch/win-build.sh - # Upload to github so that people can click and download artifacts - - name: Upload artifacts to s3 - uses: seemethere/upload-artifact-s3@v3 - with: - retention-days: 14 - if-no-files-found: error - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Cleanup build-results and workspaces - if: always() - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}" - rm -rf ./* - test_force_on_cpu_1_1: - name: test (force_on_cpu, 1, 1, windows.4xlarge) - timeout-minutes: 270 - env: - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - TEST_CONFIG: force_on_cpu - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.4xlarge - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: 
- python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-force_on_cpu-1-1-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-force_on_cpu-1-1-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - test_default_1_2: - name: test (default, 1, 2, windows.8xlarge.nvidia.gpu) - timeout-minutes: 270 - env: - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.8xlarge.nvidia.gpu - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL 
"http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - 
AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - test_default_2_2: - name: test (default, 2, 2, windows.8xlarge.nvidia.gpu) - timeout-minutes: 270 - env: - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.8xlarge.nvidia.gpu - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: 
Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: periodic-win-vs2019-cuda11.5-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* diff --git a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build.yml b/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build.yml deleted file mode 100644 index c198168b1cd883..00000000000000 --- a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build.yml +++ /dev/null @@ -1,510 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/android_ci_full_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build - -on: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/android/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: 
build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: linux.2xlarge - env: - JOB_BASE_NAME: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build-build-and-test - NUM_TEST_SHARDS: 1 - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build-arm-v7a - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - #!/bin/bash -eo pipefail - # Pull Docker image and run build - time docker pull "${DOCKER_IMAGE}" >/dev/null - echo "${DOCKER_IMAGE}" - export container_name - container_name=$(docker run \ - -e BUILD_ENVIRONMENT=pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a-build \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "${container_name}" bash) 2>&1 - - # Copy dist folder back - export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-arm-v7a - docker cp "${container_name}:/var/lib/jenkins/workspace/dist" "${GITHUB_WORKSPACE}/." 
|| echo "Dist folder not found" - docker commit "${container_name}" "${COMMIT_DOCKER_IMAGE}" - time docker push "${COMMIT_DOCKER_IMAGE}" - - name: Build-arm-v8a - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - #!/bin/bash -eo pipefail - # Pull Docker image and run build - time docker pull "${DOCKER_IMAGE}" >/dev/null - echo "${DOCKER_IMAGE}" - export container_name - container_name=$(docker run \ - -e BUILD_ENVIRONMENT=pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a-build \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "${container_name}" bash) 2>&1 - - # Copy dist folder back - export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-arm-v8a - docker cp "${container_name}:/var/lib/jenkins/workspace/dist" "${GITHUB_WORKSPACE}/." 
|| echo "Dist folder not found" - docker commit "${container_name}" "${COMMIT_DOCKER_IMAGE}" - time docker push "${COMMIT_DOCKER_IMAGE}" - - name: Build-x86_32 - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - #!/bin/bash -eo pipefail - # Pull Docker image and run build - time docker pull "${DOCKER_IMAGE}" >/dev/null - echo "${DOCKER_IMAGE}" - export container_name - container_name=$(docker run \ - -e BUILD_ENVIRONMENT=pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32-build \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "${container_name}" bash) 2>&1 - - # Copy dist folder back - export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-x86_32 - docker cp "${container_name}:/var/lib/jenkins/workspace/dist" "${GITHUB_WORKSPACE}/." 
|| echo "Dist folder not found" - docker commit "${container_name}" "${COMMIT_DOCKER_IMAGE}" - time docker push "${COMMIT_DOCKER_IMAGE}" - - name: Build-x86_64 - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - #!/bin/bash -eo pipefail - # Pull Docker image and run build - time docker pull "${DOCKER_IMAGE}" >/dev/null - echo "${DOCKER_IMAGE}" - export container_name - container_name=$(docker run \ - -e BUILD_ENVIRONMENT=pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64-build \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace" - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins . && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "${container_name}" bash) 2>&1 - - # Copy dist folder back - export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-x86_64 - docker cp "${container_name}:/var/lib/jenkins/workspace/dist" "${GITHUB_WORKSPACE}/." 
|| echo "Dist folder not found" - docker commit "${container_name}" "${COMMIT_DOCKER_IMAGE}" - time docker push "${COMMIT_DOCKER_IMAGE}" - - name: Build final artifact - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - set -eux - - docker_image_libtorch_android_x86_32="${DOCKER_IMAGE}-x86_32" - docker_image_libtorch_android_x86_64="${DOCKER_IMAGE}-x86_64" - docker_image_libtorch_android_arm_v7a="${DOCKER_IMAGE}-arm-v7a" - docker_image_libtorch_android_arm_v8a="${DOCKER_IMAGE}-arm-v8a" - - echo "docker_image_commit: ${DOCKER_IMAGE}" - echo "docker_image_libtorch_android_x86_32: ${docker_image_libtorch_android_x86_32}" - echo "docker_image_libtorch_android_x86_64: ${docker_image_libtorch_android_x86_64}" - echo "docker_image_libtorch_android_arm_v7a: ${docker_image_libtorch_android_arm_v7a}" - echo "docker_image_libtorch_android_arm_v8a: ${docker_image_libtorch_android_arm_v8a}" - - # x86_32 - time docker pull "${docker_image_libtorch_android_x86_32}" >/dev/null - export id_x86_32 - id_x86_32=$(docker run -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_x86_32}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "${id_x86_32}" bash) 2>&1 - - # arm-v7a - time docker pull "${docker_image_libtorch_android_arm_v7a}" >/dev/null - export id_arm_v7a - id_arm_v7a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_arm_v7a}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "${id_arm_v7a}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" - docker cp "${id_arm_v7a}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" - - # x86_64 - time docker pull "${docker_image_libtorch_android_x86_64}" >/dev/null - export id_x86_64 - id_x86_64=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_x86_64}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "${id_x86_64}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_x86_64" - docker cp "${id_x86_64}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_x86_64" - - # arm-v8a - time docker pull "${docker_image_libtorch_android_arm_v8a}" >/dev/null - export id_arm_v8a - id_arm_v8a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins "${docker_image_libtorch_android_arm_v8a}") - - # shellcheck disable=SC1105 - ((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" - docker cp "${id_arm_v8a}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" - - # Putting everything together - docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v7a" - docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_64" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_x86_64" - docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" "${id_x86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v8a" - - # run gradle buildRelease - # shellcheck disable=SC1105 - ((echo 
"sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build" \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --user jenkins \ - -u jenkins -i "${id_x86_32}" bash) 2>&1 - - mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" - docker cp "${id_x86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" - - output_image="${DOCKER_IMAGE}-android-x86_32-gradle" - docker commit "${id_x86_32}" "${output_image}" - time docker push "${output_image}" - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - AWS_DEFAULT_REGION: us-east-1 - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - # The artifact file is created inside docker container, which contains the result binaries. - # Now unpackage it into the project folder. The subsequent script will scan project folder - # to locate result binaries and report their sizes. - # If artifact file is not provided it assumes that the project folder has been mounted in - # the docker during build and already contains the result binaries, so this step can be skipped. - export ARTIFACTS=${GITHUB_WORKSPACE}/build_android_artifacts/artifacts.tgz - if [ -n "${ARTIFACTS}" ]; then - tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" - cd "${GITHUB_WORKSPACE}" - fi - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - ANDROID_BUILD_TYPE=prebuilt - export ANDROID_BUILD_TYPE - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Android Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - build_android_artifacts/artifacts.tgz - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit.yml b/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit.yml deleted file mode 100644 index 471b0bb759f336..00000000000000 --- a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit.yml +++ /dev/null @@ -1,277 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/android_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/android/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: linux.2xlarge - env: - JOB_BASE_NAME: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit-build-and-test - NUM_TEST_SHARDS: 1 - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr 
get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Build - run: | - set -e - # Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because: - # 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build; - # 2) Not parallelizable by architecture: it only builds libtorch for one architecture; - - echo "DOCKER_IMAGE: ${DOCKER_IMAGE}" - time docker pull "${DOCKER_IMAGE}" >/dev/null - - export BUILD_LITE_INTERPRETER - BUILD_LITE_INTERPRETER="1" - if [[ "${BUILD_ENVIRONMENT}" == *"full-jit" ]]; then - BUILD_LITE_INTERPRETER="0" - fi - - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - # shellcheck disable=SC2016 - export id - id=$(docker run -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e PR_LABELS \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e BUILD_LITE_INTERPRETER \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "$(pwd):/var/lib/jenkins/workspace" \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - -t -d -w /var/lib/jenkins "${DOCKER_IMAGE}") - - # shellcheck disable=SC2016 - export COMMAND - # shellcheck disable=SC2016 - COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1' - echo "${COMMAND}" > ./command.sh && bash 
./command.sh - # Skip docker push as this job is purely for size analysis purpose. - # Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied. - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - AWS_DEFAULT_REGION: us-east-1 - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - # The artifact file is created inside docker container, which contains the result binaries. - # Now unpackage it into the project folder. The subsequent script will scan project folder - # to locate result binaries and report their sizes. - # If artifact file is not provided it assumes that the project folder has been mounted in - # the docker during build and already contains the result binaries, so this step can be skipped. - export ARTIFACTS= - if [ -n "${ARTIFACTS}" ]; then - tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" - cd "${GITHUB_WORKSPACE}" - fi - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - ANDROID_BUILD_TYPE=custom-build-single - export ANDROID_BUILD_TYPE - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
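The "Calculate docker image tag" and "Check if image should be built" steps above key the CI image off the contents of `.circleci/docker`: the tag is the git tree hash of that directory, so the image is rebuilt only when those files change, and a missing image whose tag matches the merge-base's tag is treated as an error. A condensed sketch of the same decision, assuming a hypothetical registry and using `origin/master` in place of the event-provided base revision:

```bash
#!/usr/bin/env bash
set -euxo pipefail

DOCKER_IMAGE_BASE="example.registry/pytorch/ci-image"   # hypothetical registry

# Tag the image with the tree hash of the Docker build context; any change to
# files under .circleci/docker produces a new tag.
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)

# If an image already exists for this tag, there is nothing to build.
if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >/dev/null 2>&1; then
  echo "Image already exists, skipping rebuild"
  exit 0
fi

# Compare against the merge-base: if that commit points at the same tree hash,
# an image for this tag should already exist, so a miss means something is wrong.
MERGE_BASE=$(git merge-base HEAD origin/master)          # assumed base branch
PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:.circleci/docker")
if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then
  echo "ERROR: expected an existing image for this tag" >&2
  exit 1
fi

# Otherwise the docker context changed on this branch: rebuild and push.
echo "rebuild=yes"
```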
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single.yml b/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single.yml deleted file mode 100644 index 7d0f98c29bd698..00000000000000 --- a/.github/workflows/generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single.yml +++ /dev/null @@ -1,277 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/android_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/android/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 -concurrency: - group: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - # building and testing in a single job since bazel runs only small subset of tests - build-and-test: - runs-on: linux.2xlarge - env: - JOB_BASE_NAME: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-build-and-test - NUM_TEST_SHARDS: 1 - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login 
--username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! 
git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - name: Output disk space left - run: | - sudo df -H - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Build - run: | - set -e - # Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because: - # 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build; - # 2) Not parallelizable by architecture: it only builds libtorch for one architecture; - - echo "DOCKER_IMAGE: ${DOCKER_IMAGE}" - time docker pull "${DOCKER_IMAGE}" >/dev/null - - export BUILD_LITE_INTERPRETER - BUILD_LITE_INTERPRETER="1" - if [[ "${BUILD_ENVIRONMENT}" == *"full-jit" ]]; then - BUILD_LITE_INTERPRETER="0" - fi - - git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0 - # shellcheck disable=SC2016 - export id - id=$(docker run -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e PR_LABELS \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e BUILD_LITE_INTERPRETER \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "$(pwd):/var/lib/jenkins/workspace" \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - -t -d -w /var/lib/jenkins "${DOCKER_IMAGE}") - - # shellcheck disable=SC2016 - export COMMAND - # shellcheck disable=SC2016 - COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1' - echo "${COMMAND}" > ./command.sh && bash 
./command.sh - # Skip docker push as this job is purely for size analysis purpose. - # Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied. - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - AWS_DEFAULT_REGION: us-east-1 - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - # The artifact file is created inside docker container, which contains the result binaries. - # Now unpackage it into the project folder. The subsequent script will scan project folder - # to locate result binaries and report their sizes. - # If artifact file is not provided it assumes that the project folder has been mounted in - # the docker during build and already contains the result binaries, so this step can be skipped. - export ARTIFACTS= - if [ -n "${ARTIFACTS}" ]; then - tar xf "${ARTIFACTS}" -C "${GITHUB_WORKSPACE}" - cd "${GITHUB_WORKSPACE}" - fi - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - ANDROID_BUILD_TYPE=custom-build-single - export ANDROID_BUILD_TYPE - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba "android" || exit 0 - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
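The Build step above (identical to the full-jit variant except for the `BUILD_LITE_INTERPRETER` default) drives the in-container build by piping a short preamble of `export` statements into `docker exec ... bash`, so toggles reach the build script without being baked into the image. A stripped-down sketch of that pattern, with a placeholder image name and an explicit `bash` keep-alive command:

```bash
#!/usr/bin/env bash
set -euo pipefail

image="example.registry/pytorch-android-ndk:latest"   # placeholder image

# Lite interpreter is the default; "full-jit" build environments turn it off.
BUILD_LITE_INTERPRETER=1
if [[ "${BUILD_ENVIRONMENT:-}" == *"full-jit"* ]]; then
  BUILD_LITE_INTERPRETER=0
fi

# Start a long-lived container with the checkout mounted at the Jenkins path.
cid=$(docker run -t -d --user jenkins \
  -v "$(pwd):/var/lib/jenkins/workspace" \
  -w /var/lib/jenkins "${image}" bash)

# Feed the exports and the build command to a shell inside the container.
{
  echo "export GRADLE_OFFLINE=1"
  echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}"
  echo "cd workspace && ./.circleci/scripts/build_android_gradle.sh"
} | docker exec -u jenkins -i "${cid}" bash
```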
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-pytorch-xla-linux-bionic-py3.7-clang8.yml b/.github/workflows/generated-pytorch-xla-linux-bionic-py3.7-clang8.yml deleted file mode 100644 index 8890295d6253cb..00000000000000 --- a/.github/workflows/generated-pytorch-xla-linux-bionic-py3.7-clang8.yml +++ /dev/null @@ -1,468 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/linux_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: pytorch-xla-linux-bionic-py3.7-clang8 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/linux/*' - - 'ciflow/trunk/*' - - 'ciflow/xla/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: pytorch-xla-linux-bionic-py3.7-clang8 - DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base - SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 - XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - TORCH_CUDA_ARCH_LIST: 5.2 - IN_CI: 1 - IS_GHA: 1 - # This is used for the phase of adding wheel tests only, will be removed once completed - IN_WHEEL_TEST: 1 - # Used for custom_opertor, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh - CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - PYTORCH_RETRY_TEST_CASES: 1 - # This is used for XLA tests only - XLA_CUDA: 0 - XLA_IMAGE_TAG: v0.2 -concurrency: - group: pytorch-xla-linux-bionic-py3.7-clang8-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - - build: - runs-on: linux.2xlarge - timeout-minutes: 240 - env: - JOB_BASE_NAME: pytorch-xla-linux-bionic-py3.7-clang8-build - outputs: - docker_image: ${{ steps.calculate-tag.outputs.docker_image }} - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets 
chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Calculate docker image tag - id: calculate-tag - run: | - echo "XLA workflow uses pre-built test image at ${XLA_IMAGE_TAG}" - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Build - env: - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - # detached container should get cleaned up by teardown_ec2_linux - container_name=$(docker run \ - -e BUILD_ENVIRONMENT \ - -e JOB_BASE_NAME \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e AWS_DEFAULT_REGION \ - -e IS_GHA \ - -e PR_NUMBER \ - -e SHA1 \ - -e BRANCH \ - -e GITHUB_RUN_ID \ - -e SCCACHE_BUCKET \ - -e XLA_CUDA \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e SKIP_SCCACHE_INITIALIZATION=1 \ - -e TORCH_CUDA_ARCH_LIST \ - -e PR_LABELS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --tty \ - --detach \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c 'sudo chown -R jenkins . 
&& .jenkins/pytorch/build.sh' - - name: Display and upload binary build size statistics (Click Me) - # temporary hack: set CIRCLE_* vars, until we update - # tools/stats/print_test_stats.py to natively support GitHub Actions - env: - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - run: | - COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0) - export COMMIT_TIME - pip3 install requests==2.26 boto3==1.16.34 - python3 -m tools.stats.upload_binary_size_to_scuba || exit 0 - - name: Chown workspace - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Archive artifacts into zip - run: | - zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - - uses: seemethere/upload-artifact-s3@v3 - name: Store PyTorch Build Artifacts on S3 - with: - name: ${{ env.BUILD_ENVIRONMENT }} - retention-days: 14 - if-no-files-found: error - path: - artifacts.zip - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Clean up docker images - if: always() - run: | - # Prune all of the docker images - docker system prune -af - - test_xla_1_1: - name: test (xla, 1, 1, linux.2xlarge) - needs: build - runs-on: linux.2xlarge - timeout-minutes: 270 - env: - DOCKER_IMAGE: ${{ needs.build.outputs.docker_image }} - JOB_BASE_NAME: pytorch-xla-linux-bionic-py3.7-clang8-test - TEST_CONFIG: xla - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - PR_BODY: ${{ github.event.pull_request.body }} - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets 
chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" - - name: Determine shm-size - run: | - shm_size="1g" - case "${BUILD_ENVIRONMENT}" in - *cuda*) - shm_size="2g" - ;; - *rocm*) - shm_size="8g" - ;; - esac - echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - - name: Unzip artifacts - run: | - unzip -o artifacts.zip - - name: Output disk space left - run: | - sudo df -H - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Test - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - set -x - - if [[ $TEST_CONFIG == 'multigpu' ]]; then - TEST_COMMAND=.jenkins/pytorch/multigpu-test.sh - elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then - TEST_COMMAND=.jenkins/caffe2/test.sh - else - TEST_COMMAND=.jenkins/pytorch/test.sh - fi - PROXY_ENV= - # NOTE: XLA multiprocessing tests appear to have issues with squid proxy, going to disable for now - # We should investigate whether or not there's a list of hostnames we can add to no_proxy to - # make it so that we shouldn't have to fully disable squid for XLA tests - if [[ $TEST_CONFIG != 'xla' ]]; then - # shellcheck disable=SC2089 - PROXY_ENV="-e http_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e https_proxy=http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128 -e no_proxy=localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" - fi - # detached container should get cleaned up by teardown_ec2_linux - # TODO: Stop building test binaries as part of the build phase - # Used for GPU_FLAG since that doesn't play nice - # shellcheck disable=SC2086,SC2090 - container_name=$(docker run \ - ${GPU_FLAG:-} \ - -e BUILD_ENVIRONMENT \ - -e PR_NUMBER \ - -e CUSTOM_TEST_ARTIFACT_BUILD_DIR \ - -e GITHUB_ACTIONS \ - -e IN_CI \ - -e IS_GHA \ - -e BRANCH \ - -e SHA1 \ - -e AWS_DEFAULT_REGION \ - -e IN_WHEEL_TEST \ - -e SHARD_NUMBER \ - -e JOB_BASE_NAME \ - -e TEST_CONFIG \ - -e NUM_TEST_SHARDS \ - -e PR_BODY \ - -e PYTORCH_RETRY_TEST_CASES \ - -e PR_LABELS \ - -e MAX_JOBS="$(nproc --ignore=2)" \ - -e SCCACHE_BUCKET \ - -e XLA_CUDA \ - -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ - ${PROXY_ENV} \ - 
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ - --ulimit stack=10485760:83886080 \ - --security-opt seccomp=unconfined \ - --cap-add=SYS_PTRACE \ - --ipc=host \ - --shm-size="${SHM_SIZE}" \ - --tty \ - --detach \ - --name="${container_name}" \ - --user jenkins \ - -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ - -w /var/lib/jenkins/workspace \ - "${DOCKER_IMAGE}" - ) - docker exec -t "${container_name}" sh -c "sudo chown -R jenkins . && pip install dist/*.whl && ${TEST_COMMAND}" - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-xla-1-1-linux.2xlarge' - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-xla-1-1-linux.2xlarge' - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: pytorch-xla-linux-bionic-py3.7-clang8-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
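The XLA test job above injects the squid proxy variables into the test container only for non-XLA configs, since the XLA multiprocessing tests misbehave behind the proxy. A small sketch of that conditional environment injection, with a placeholder proxy URL; using a bash array instead of a space-separated string avoids the SC2086/SC2090 shellcheck suppressions the generated workflow needs:

```bash
#!/usr/bin/env bash
set -eo pipefail

TEST_CONFIG="${TEST_CONFIG:-default}"
PROXY_URL="http://proxy.example.internal:3128"   # placeholder, not the real ELB address

PROXY_ENV=()
if [[ "${TEST_CONFIG}" != "xla" ]]; then
  # Only non-XLA shards go through the proxy; XLA multiprocessing tests run
  # with direct network access instead.
  PROXY_ENV+=( -e "http_proxy=${PROXY_URL}" -e "https_proxy=${PROXY_URL}" \
               -e "no_proxy=localhost,127.0.0.1" )
fi

# The array expands to nothing for XLA, so the docker invocation stays valid.
docker run --rm "${PROXY_ENV[@]}" alpine:3 env | grep -i proxy || true
```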
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/workflows/generated-win-vs2019-cpu-py3.yml b/.github/workflows/generated-win-vs2019-cpu-py3.yml deleted file mode 100644 index 070d41bd20714d..00000000000000 --- a/.github/workflows/generated-win-vs2019-cpu-py3.yml +++ /dev/null @@ -1,430 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/windows_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: win-vs2019-cpu-py3 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cpu/*' - - 'ciflow/trunk/*' - - 'ciflow/win/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: win-vs2019-cpu-py3 - BUILD_WHEEL: 1 - MAX_JOBS: 8 - CUDA_VERSION: "cpu" - IN_CI: 1 - IS_GHA: 1 - INSTALL_WINDOWS_SDK: 1 - PYTHON_VERSION: "3.8" - PYTORCH_RETRY_TEST_CASES: 1 - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - SCCACHE_BUCKET: "ossci-compiler-cache" - VC_PRODUCT: "BuildTools" - VC_VERSION: "" - VS_VERSION: "16.8.6" - VC_YEAR: "2019" - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - no_proxy: localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - USE_CUDA: 0 - -concurrency: - group: win-vs2019-cpu-py3-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - build: - runs-on: "windows.4xlarge" - timeout-minutes: 240 - env: - JOB_BASE_NAME: win-vs2019-cpu-py3-build - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Parse ref - shell: bash - id: parse-ref - run: 
./.github/scripts/parse_ref.py - - name: Build - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - .jenkins/pytorch/win-build.sh - # Upload to github so that people can click and download artifacts - - name: Upload artifacts to s3 - uses: seemethere/upload-artifact-s3@v3 - with: - retention-days: 14 - if-no-files-found: error - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Cleanup build-results and workspaces - if: always() - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}" - rm -rf ./* - test_default_1_2: - name: test (default, 1, 2, windows.4xlarge) - timeout-minutes: 270 - env: - JOB_BASE_NAME: win-vs2019-cpu-py3-test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.4xlarge - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ 
github.job }}-default-1-2-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: win-vs2019-cpu-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - test_default_2_2: - name: test (default, 2, 2, windows.4xlarge) - timeout-minutes: 270 - env: - JOB_BASE_NAME: win-vs2019-cpu-py3-test - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.4xlarge - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: 
zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 240 minutes - timeout-minutes: 240 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: win-vs2019-cpu-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - 
run: | - rm -rf ./* diff --git a/.github/workflows/generated-win-vs2019-cuda11.3-py3.yml b/.github/workflows/generated-win-vs2019-cuda11.3-py3.yml deleted file mode 100644 index fe218f09ec6d9d..00000000000000 --- a/.github/workflows/generated-win-vs2019-cuda11.3-py3.yml +++ /dev/null @@ -1,604 +0,0 @@ -# @generated DO NOT EDIT MANUALLY -# Template is at: .github/templates/windows_ci_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: win-vs2019-cuda11.3-py3 - -on: - pull_request: - push: - tags: - - 'ciflow/all/*' - - 'ciflow/cuda/*' - - 'ciflow/trunk/*' - - 'ciflow/win/*' - branches: - - master - - main - - release/* - workflow_dispatch: - -env: - BUILD_ENVIRONMENT: win-vs2019-cuda11.3-py3 - BUILD_WHEEL: 1 - MAX_JOBS: 8 - CUDA_VERSION: "11.3" - IN_CI: 1 - IS_GHA: 1 - INSTALL_WINDOWS_SDK: 1 - PYTHON_VERSION: "3.8" - PYTORCH_RETRY_TEST_CASES: 1 - PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - SCCACHE_BUCKET: "ossci-compiler-cache" - VC_PRODUCT: "BuildTools" - VC_VERSION: "" - VS_VERSION: "16.8.6" - VC_YEAR: "2019" - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - no_proxy: localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock - AWS_DEFAULT_REGION: us-east-1 - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TORCH_CUDA_ARCH_LIST: "7.0" - USE_CUDA: 1 - -concurrency: - group: win-vs2019-cuda11.3-py3-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - build: - runs-on: "windows.4xlarge" - timeout-minutes: 240 - env: - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-build - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - steps: - - name: print labels - run: echo "${PR_LABELS}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Parse ref - shell: bash - id: parse-ref - 
run: ./.github/scripts/parse_ref.py - - name: Build - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - BRANCH: ${{ steps.parse-ref.outputs.branch }} - run: | - .jenkins/pytorch/win-build.sh - # Upload to github so that people can click and download artifacts - - name: Upload artifacts to s3 - uses: seemethere/upload-artifact-s3@v3 - with: - retention-days: 14 - if-no-files-found: error - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Cleanup build-results and workspaces - if: always() - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}" - rm -rf ./* - test_force_on_cpu_1_1: - name: test (force_on_cpu, 1, 1, windows.4xlarge) - timeout-minutes: 300 - env: - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 1 - TEST_CONFIG: force_on_cpu - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.4xlarge - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 270 minutes - timeout-minutes: 270 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - 
FILE_SUFFIX: '${{ github.job }}-force_on_cpu-1-1-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-force_on_cpu-1-1-windows.4xlarge' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - test_default_1_2: - name: test (default, 1, 2, windows.8xlarge.nvidia.gpu) - timeout-minutes: 300 - env: - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - SHARD_NUMBER: 1 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.8xlarge.nvidia.gpu - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ 
secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 270 minutes - timeout-minutes: 270 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-1-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip 
install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* - test_default_2_2: - name: test (default, 2, 2, windows.8xlarge.nvidia.gpu) - timeout-minutes: 300 - env: - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - SHARD_NUMBER: 2 - NUM_TEST_SHARDS: 2 - TEST_CONFIG: default - http_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - https_proxy: "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" - PR_BODY: ${{ github.event.pull_request.body }} - needs: build - runs-on: windows.8xlarge.nvidia.gpu - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - # deep clone, to allow use of git merge-base - fetch-depth: 0 - submodules: recursive - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - - name: Install Visual Studio 2019 toolchain - shell: powershell - run: | - .\.circleci\scripts\vs_install.ps1 - - name: Install Cuda - shell: bash - run: | - .circleci/scripts/windows_cuda_install.sh - - name: Install Cudnn - shell: bash - run: | - .circleci/scripts/windows_cudnn_install.sh - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b - name: Download PyTorch Build Artifacts - with: - name: ${{ env.BUILD_ENVIRONMENT }} - path: C:\${{ github.run_id }}\build-results - - name: Check build-results folder - shell: powershell - run: | - tree /F C:\$Env:GITHUB_RUN_ID\build-results - # Needed for coverage in win-test.sh - - uses: actions/setup-python@v2 - name: Setup Python3 - with: - python-version: '3.x' - - name: Test - shell: bash - env: - PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ - # Time out the test phase after 270 minutes - timeout-minutes: 270 - run: | - .jenkins/pytorch/win-test.sh - - name: Zip JSONs for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Downloaded JSONs on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip - - name: Zip test reports for upload - if: always() - env: - FILE_SUFFIX: '${{ github.job }}-default-2-2-windows.8xlarge.nvidia.gpu' - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a 
"test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' - - uses: seemethere/upload-artifact-s3@v3 - name: Store Test Reports on S3 - if: always() - with: - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ - - name: Wait until all sessions have drained - shell: powershell - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - - name: Parse ref - shell: bash - id: parse-ref - run: ./.github/scripts/parse_ref.py - - name: Upload test statistics - if: always() - env: - AWS_DEFAULT_REGION: us-east-1 - BRANCH: ${{ steps.parse-ref.outputs.branch }} - JOB_BASE_NAME: win-vs2019-cuda11.3-py3-test - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - shell: bash - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - - name: Cleanup workspace - if: always() - shell: bash - # Should remove the entirety of pytorch-${{ github.run_id }} - run: | - rm -rf ./* diff --git a/.github/workflows/generated-windows-binary-conda-nightly.yml b/.github/workflows/generated-windows-binary-conda-nightly.yml new file mode 100644 index 00000000000000..a65be8ad607705 --- /dev/null +++ b/.github/workflows/generated-windows-binary-conda-nightly.yml @@ -0,0 +1,4834 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/windows_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: windows-binary-conda + +on: + push: + # NOTE: Meta Employees can trigger new nightlies using: https://fburl.com/trigger_pytorch_nightly_build + branches: + - nightly + tags: + # NOTE: Binary build pipelines should only get triggered on release candidate builds + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + - 'ciflow/binaries/*' + - 'ciflow/binaries_conda/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BUILD_ENVIRONMENT: windows-binary-conda + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_RETRY_TEST_CASES: 1 + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: windows-binary-conda-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + conda-py3_7-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + 
BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
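+      # Roughly how this works: any step that appends KEY=value lines to "${GITHUB_ENV}" (as the
+      # next step does for BINARY_ENV_FILE and PYTORCH_FINAL_PACKAGE_DIR) makes those values
+      # available as environment variables to every later step in the same job. An echo without
+      # the ">> ${GITHUB_ENV}" redirection, like the WIN_PACKAGE_WORK_DIR line below, only shows
+      # up in the step log.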
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_7-cpu + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cpu-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cpu + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cpu-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
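+      # These upload jobs run on persistent self-hosted Linux runners, so a previous run's
+      # workspace may still be present (possibly with root-owned files left behind by containers).
+      # The chown above and the rm -rf/mkdir below reset it to a clean, user-owned state before
+      # anything else runs.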
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
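+      # Same idea as the workspace cleanup at the top of this job: since the runner is reused
+      # across workflow runs, leftover containers are stopped and images pruned here so they
+      # don't accumulate (and consume disk) between runs.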
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_7-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
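+      # Even this cu113 variant builds on a CPU-only windows.4xlarge runner; a GPU instance
+      # (windows.8xlarge.nvidia.gpu) is only used by the matching -test job further down, which
+      # exercises the built binary.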
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_7-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_3-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
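+      # Publishing is gated further down in this job: DRY_RUN is only flipped to "disabled" on
+      # pushes to the nightly branch or to a non-ciflow tag, and a tag with an -rc suffix
+      # additionally switches UPLOAD_CHANNEL to "test"; everything else leaves DRY_RUN at its
+      # default.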
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_7-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
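+      # DESIRED_CUDA (cu115 here) also drives where the package goes: the matching -upload job
+      # below passes it to binary_upload.sh as UPLOAD_SUBFOLDER, presumably to keep the
+      # cpu / cu113 / cu115 / cu116 variants apart on the upload side.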
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_7-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_5 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_5-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_7-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_7-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_6 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_7-cuda11_6-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_7-cuda11_6-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_7-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+      - name: Kill containers, clean up images
+        if: always()
+        run: |
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
+          # Prune all of the docker images
+          docker system prune -af
+  conda-py3_8-cpu-build:
+    if: ${{ github.repository_owner == 'pytorch' }}
+    runs-on: windows.4xlarge
+    timeout-minutes: 240
+    env:
+      PYTORCH_ROOT: ${{ github.workspace }}/pytorch
+      BUILDER_ROOT: ${{ github.workspace }}/builder
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      # favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cpu
+      GPU_ARCH_TYPE: cpu
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.8"
+    steps:
+      - name: Display EC2 information
+        shell: bash
+        run: |
+          set -euo pipefail
+          function get_ec2_metadata() {
+            # Pulled from instance metadata endpoint for EC2
+            # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
+            category=$1
+            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
+          }
+          echo "ami-id: $(get_ec2_metadata ami-id)"
+          echo "instance-id: $(get_ec2_metadata instance-id)"
+          echo "instance-type: $(get_ec2_metadata instance-type)"
+          echo "system info $(uname -a)"
+      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        uses: seemethere/add-github-ssh-key@v1
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
+      - name: Enable long paths on Windows
+        shell: powershell
+        run: |
+          Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
+      # Since it's just a defensive command, the workflow should continue even if the command fails
+      - name: Disable Windows Defender scheduled and real-time scanning for files in the pytorch directory
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
+      # NOTE: These environment variables are put here so that they can be applied on every job equally
+      # They are also here because setting them at a workflow level doesn't give us access to the
+      # runner.temp variable, which we need.
+      - name: Populate binary env
+        shell: bash
+        run: |
+          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
+          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
+          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
+      - name: Checkout PyTorch
+        uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9
+        with:
+          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+          submodules: recursive
+          path: pytorch
+      - name: Clean PyTorch checkout
+        run: |
+          # Remove any artifacts from the previous checkouts
+          git clean -fxd
+        working-directory: pytorch
+      - name: Checkout pytorch/builder
+        uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9
+        with:
+          ref: main
+          submodules: recursive
+          repository: pytorch/builder
+          path: builder
+      - name: Clean pytorch/builder checkout
+        run: |
+          # Remove any artifacts from the previous checkouts
+          git clean -fxd
+        working-directory: builder
+      - name: Populate binary env
+        shell: bash
+        run: |
+          "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
+      - name: Build PyTorch binary
+        shell: bash
+        run: |
+          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
+      - uses: seemethere/upload-artifact-s3@v4
+        if: always()
+        with:
+          name: conda-py3_8-cpu
+          retention-days: 14
+          if-no-files-found: error
+          path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
+      - name: Wait until all sessions have drained
+        shell: powershell
+        working-directory: pytorch
+        if: always()
+        timeout-minutes: 120
+        run: |
+          .github\scripts\wait_for_ssh_to_drain.ps1
+      - name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
+        shell: powershell
+        working-directory: pytorch
+        if: always()
+        run: |
+          .github\scripts\kill_active_ssh_sessions.ps1
+  conda-py3_8-cpu-test: # Testing
+    if: ${{ github.repository_owner == 'pytorch' }}
+    needs: conda-py3_8-cpu-build
+    runs-on: windows.4xlarge
+    timeout-minutes: 240
+    env:
+      PYTORCH_ROOT: ${{ github.workspace }}/pytorch
+      BUILDER_ROOT: ${{ github.workspace }}/builder
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      # favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cpu
+      GPU_ARCH_TYPE: cpu
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.8"
+    steps:
+      - name: Display EC2 information
+        shell: bash
+        run: |
+          set -euo pipefail
+          function get_ec2_metadata() {
+            # Pulled from instance metadata endpoint for EC2
+            # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
+            category=$1
+            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
+          }
+          echo "ami-id: $(get_ec2_metadata ami-id)"
+          echo "instance-id: $(get_ec2_metadata instance-id)"
+          echo "instance-type: $(get_ec2_metadata instance-type)"
+          echo "system info $(uname -a)"
+      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        uses: seemethere/add-github-ssh-key@v1
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
+      - name: Enable long paths on Windows
+        shell: powershell
+        run: |
+          Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
+      # Since it's just a defensive command, the workflow should continue even if the command fails
+      - name: Disable Windows Defender scheduled and real-time scanning for files in the pytorch directory
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cpu + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cpu-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_8-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_8-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_3-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_8-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_8-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_5 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_5-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_8-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_8-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_6 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_8-cuda11_6-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_8-cuda11_6-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_8-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+      - name: Kill containers, clean up images
+        if: always()
+        run: |
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
+          # Prune all of the docker images
+          docker system prune -af
+  conda-py3_9-cpu-build:
+    if: ${{ github.repository_owner == 'pytorch' }}
+    runs-on: windows.4xlarge
+    timeout-minutes: 240
+    env:
+      PYTORCH_ROOT: ${{ github.workspace }}/pytorch
+      BUILDER_ROOT: ${{ github.workspace }}/builder
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      # favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cpu
+      GPU_ARCH_TYPE: cpu
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.9"
+    steps:
+      - name: Display EC2 information
+        shell: bash
+        run: |
+          set -euo pipefail
+          function get_ec2_metadata() {
+            # Pulled from instance metadata endpoint for EC2
+            # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
+            category=$1
+            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
+          }
+          echo "ami-id: $(get_ec2_metadata ami-id)"
+          echo "instance-id: $(get_ec2_metadata instance-id)"
+          echo "instance-type: $(get_ec2_metadata instance-type)"
+          echo "system info $(uname -a)"
+      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        uses: seemethere/add-github-ssh-key@v1
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
+      - name: Enable long paths on Windows
+        shell: powershell
+        run: |
+          Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
+      # Since it's just a defensive command, the workflow should continue even if the command fails
+      - name: Disable Windows Defender scheduled and real-time scanning for files in the pytorch directory
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
+      # NOTE: These environment variables are put here so that they can be applied on every job equally
+      # They are also here because setting them at a workflow level doesn't give us access to the
+      # runner.temp variable, which we need.
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_9-cpu + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cpu-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cpu + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cpu-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_9-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_3-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_9-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_5 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_5-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_9-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need.
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_9-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_6 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_9-cuda11_6-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_9-cuda11_6-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_9-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_10-cpu + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cpu-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cpu + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cpu-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_10-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_10-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cuda11_3-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + conda-py3_10-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_10-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cuda11_5 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cuda11_5-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
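+      # linux.2xlarge is a self-hosted runner (see the runs-on comment above), so the workspace is
+      # chowned back to the current user (files written from Docker are typically root-owned) and
+      # then wiped below before this run touches it.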
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
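+      # The step below leaves the shared runner clean: "|| true" keeps it from failing when no
+      # containers are running, and the prune drops all unused images.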
+      - name: Kill containers, clean up images
+        if: always()
+        run: |
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
+          # Prune all of the docker images
+          docker system prune -af
+  conda-py3_10-cuda11_6-build:
+    if: ${{ github.repository_owner == 'pytorch' }}
+    runs-on: windows.4xlarge
+    timeout-minutes: 240
+    env:
+      PYTORCH_ROOT: ${{ github.workspace }}/pytorch
+      BUILDER_ROOT: ${{ github.workspace }}/builder
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      #       favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cu116
+      GPU_ARCH_VERSION: 11.6
+      GPU_ARCH_TYPE: cuda
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.10"
+    steps:
+      - name: Display EC2 information
+        shell: bash
+        run: |
+          set -euo pipefail
+          function get_ec2_metadata() {
+            # Pulled from instance metadata endpoint for EC2
+            # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
+            category=$1
+            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
+          }
+          echo "ami-id: $(get_ec2_metadata ami-id)"
+          echo "instance-id: $(get_ec2_metadata instance-id)"
+          echo "instance-type: $(get_ec2_metadata instance-type)"
+          echo "system info $(uname -a)"
+      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        uses: seemethere/add-github-ssh-key@v1
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
+      - name: Enable long paths on Windows
+        shell: powershell
+        run: |
+          Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
+      # Since it's just a defensive command, the workflow should continue even the command fails
+      - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory.
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
+      # NOTE: These environment variables are put here so that they can be applied on every job equally
+      # They are also here because setting them at a workflow level doesn't give us access to the
+      # runner.temp variable, which we need.
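+      # From here the steps mirror the cu115 build job above; presumably only DESIRED_CUDA,
+      # GPU_ARCH_VERSION and the artifact name differ, as the jobs appear to share one template.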
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: conda-py3_10-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + conda-py3_10-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: conda-py3_10-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: conda + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.10" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
+      # NOTE: These environment variables are put here so that they can be applied on every job equally
+      # They are also here because setting them at a workflow level doesn't give us access to the
+      # runner.temp variable, which we need.
+      - name: Populate binary env
+        shell: bash
+        run: |
+          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
+          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
+          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
+      - uses: seemethere/download-artifact-s3@v3
+        name: Download Build Artifacts
+        with:
+          name: conda-py3_10-cuda11_6
+          path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
+      - name: Checkout PyTorch
+        uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9
+        with:
+          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+          submodules: recursive
+          path: pytorch
+      - name: Clean PyTorch checkout
+        run: |
+          # Remove any artifacts from the previous checkouts
+          git clean -fxd
+        working-directory: pytorch
+      - name: Checkout pytorch/builder
+        uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9
+        with:
+          ref: main
+          submodules: recursive
+          repository: pytorch/builder
+          path: builder
+      - name: Clean pytorch/builder checkout
+        run: |
+          # Remove any artifacts from the previous checkouts
+          git clean -fxd
+        working-directory: builder
+      - name: Populate binary env
+        shell: bash
+        run: |
+          "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
+      - name: Test PyTorch binary
+        shell: bash
+        run: |
+          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
+      - name: Wait until all sessions have drained
+        shell: powershell
+        working-directory: pytorch
+        if: always()
+        timeout-minutes: 120
+        run: |
+          .github\scripts\wait_for_ssh_to_drain.ps1
+      - name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
+        shell: powershell
+        working-directory: pytorch
+        if: always()
+        run: |
+          .github\scripts\kill_active_ssh_sessions.ps1
+  conda-py3_10-cuda11_6-upload: # Uploading
+    runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts
+    if: ${{ github.repository_owner == 'pytorch' }}
+    needs: conda-py3_10-cuda11_6-test
+    env:
+      PYTORCH_ROOT: ${{ github.workspace }}/pytorch
+      BUILDER_ROOT: ${{ github.workspace }}/builder
+      PACKAGE_TYPE: conda
+      # TODO: This is a legacy variable that we eventually want to get rid of in
+      #       favor of GPU_ARCH_VERSION
+      DESIRED_CUDA: cu116
+      GPU_ARCH_VERSION: 11.6
+      GPU_ARCH_TYPE: cuda
+      SKIP_ALL_TESTS: 1
+      DESIRED_PYTHON: "3.10"
+    steps:
+      - name: Checkout PyTorch
+        uses: pytorch/pytorch/.github/actions/checkout-pytorch@master
+      - name: Setup Linux
+        uses: ./.github/actions/setup-linux
+      - name: Chown workspace
+        run: |
+          retry () {
+              "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
+          }
+          retry docker pull "${ALPINE_IMAGE}"
+          # Ensure the working directory gets chowned back to the current user
+          docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
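+      # The retry() helper above re-runs the docker pull up to two more times with a short
+      # sleep in between, to ride out transient registry or network failures.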
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: conda-py3_10-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
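+      # The "Upload binaries" step above runs binary_upload.sh inside a pinned miniconda image;
+      # the AWS/Anaconda secrets are expected to be blank on pull_request events, and DRY_RUN is
+      # only switched to "disabled" for nightly/tag pushes, so other runs presumably stay dry runs.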
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-windows-binary-libtorch-debug-master.yml b/.github/workflows/generated-windows-binary-libtorch-debug-master.yml new file mode 100644 index 00000000000000..04188e958fecfe --- /dev/null +++ b/.github/workflows/generated-windows-binary-libtorch-debug-master.yml @@ -0,0 +1,247 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/windows_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: windows-binary-libtorch-debug + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BUILD_ENVIRONMENT: windows-binary-libtorch-debug + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_RETRY_TEST_CASES: 1 + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: windows-binary-libtorch-debug-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + libtorch-cpu-shared-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cpu-shared-with-deps-debug + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cpu-shared-with-deps-debug-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cpu-shared-with-deps-debug-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ 
secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cpu-shared-with-deps-debug + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 diff --git a/.github/workflows/generated-windows-binary-libtorch-debug.yml b/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml similarity index 68% rename from .github/workflows/generated-windows-binary-libtorch-debug.yml rename to .github/workflows/generated-windows-binary-libtorch-debug-nightly.yml index 38ff3b9c519437..22a6b60056f4b2 100644 --- a/.github/workflows/generated-windows-binary-libtorch-debug.yml +++ b/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml @@ -37,6 +37,7 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -67,10 +68,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH 
(Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -111,7 +123,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-shared-with-deps-debug @@ -164,10 +176,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -177,7 +200,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-debug @@ -245,30 +268,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -285,12 +288,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-debug @@ -347,6 +347,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-shared-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -377,10 +378,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -421,7 +433,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-shared-without-deps-debug @@ -474,10 +486,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -487,7 +510,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-debug @@ -555,30 +578,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -595,12 +598,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: 
libtorch-cpu-shared-without-deps-debug @@ -657,6 +657,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -687,10 +688,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -731,7 +743,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-static-with-deps-debug @@ -784,10 +796,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -797,7 +820,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-debug @@ -865,30 +888,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -905,12 +908,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-debug @@ -967,6 +967,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -997,10 +998,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1041,7 +1053,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-static-without-deps-debug @@ -1094,10 +1106,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1107,7 +1130,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-debug @@ -1175,30 +1198,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1215,12 +1218,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: 
libtorch-cpu-static-without-deps-debug @@ -1277,6 +1277,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-shared-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1308,10 +1309,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1352,7 +1364,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-shared-with-deps-debug @@ -1406,10 +1418,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1419,7 +1442,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-debug @@ -1488,30 +1511,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1528,12 +1531,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-debug @@ -1590,6 +1590,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-shared-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1621,10 +1622,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1665,7 +1677,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-shared-without-deps-debug @@ -1719,10 +1731,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1732,7 +1755,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-debug @@ -1801,30 +1824,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1841,12 +1844,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-debug @@ -1903,6 +1903,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-static-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1934,10 +1935,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1978,7 +1990,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-static-with-deps-debug @@ -2032,10 +2044,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2045,7 +2068,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-debug @@ -2114,30 +2137,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2154,12 +2157,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-debug @@ -2216,6 +2216,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-static-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2247,10 +2248,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2291,7 +2303,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-static-without-deps-debug @@ -2345,10 +2357,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2358,7 +2381,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-debug @@ -2427,30 +2450,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2467,12 +2470,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-debug @@ -2529,6 +2529,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-shared-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2560,10 +2561,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2604,7 +2616,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-shared-with-deps-debug @@ -2658,10 +2670,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2671,7 +2694,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-debug @@ -2740,34 +2763,14 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } retry docker pull "${ALPINE_IMAGE}" # Ensure the working directory gets chowned back to the current user @@ -2780,12 +2783,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-debug @@ -2842,6 +2842,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-shared-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2873,10 +2874,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2917,7 +2929,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-shared-without-deps-debug @@ -2971,10 +2983,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2984,7 +3007,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-debug @@ -3053,30 +3076,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3093,12 +3096,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-debug @@ -3155,6 +3155,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-static-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3186,10 +3187,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -3230,7 +3242,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-static-with-deps-debug @@ -3284,10 +3296,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -3297,7 +3320,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-debug @@ -3366,30 +3389,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3406,12 +3409,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-debug @@ -3468,6 +3468,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-static-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3499,10 +3500,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -3543,7 +3555,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-static-without-deps-debug @@ -3597,10 +3609,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -3610,7 +3633,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-debug @@ -3679,30 +3702,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3719,12 +3722,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-debug @@ -3780,3 +3780,1247 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af + libtorch-cuda11_6-shared-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable 
SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-shared-with-deps-debug + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-with-deps-debug-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-debug-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # 
Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-debug + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-with-deps-debug-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-debug-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: 
cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-debug + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-shared-without-deps-debug + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-without-deps-debug-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-debug-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-debug + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-without-deps-debug-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-debug-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-debug + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-static-with-deps-debug + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-with-deps-debug-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-debug-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-debug + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-with-deps-debug-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-debug-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-debug + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-debug-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-static-without-deps-debug + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-without-deps-debug-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-debug-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-debug + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-without-deps-debug-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-debug-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: debug + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-debug + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-windows-binary-libtorch-release-master.yml b/.github/workflows/generated-windows-binary-libtorch-release-master.yml new file mode 100644 index 00000000000000..422cbb27cbb7e2 --- /dev/null +++ b/.github/workflows/generated-windows-binary-libtorch-release-master.yml @@ -0,0 +1,247 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/windows_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: windows-binary-libtorch-release + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BUILD_ENVIRONMENT: windows-binary-libtorch-release + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_RETRY_TEST_CASES: 1 + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: windows-binary-libtorch-release-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + libtorch-cpu-shared-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cpu-shared-with-deps-release + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cpu-shared-with-deps-release-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cpu-shared-with-deps-release-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: 
${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cpu-shared-with-deps-release + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 diff --git a/.github/workflows/generated-windows-binary-libtorch-release.yml b/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml similarity index 68% rename from .github/workflows/generated-windows-binary-libtorch-release.yml rename to .github/workflows/generated-windows-binary-libtorch-release-nightly.yml index 262561c2b199d8..9ee9a85b3ce314 100644 --- a/.github/workflows/generated-windows-binary-libtorch-release.yml +++ b/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml @@ -37,6 +37,7 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -67,10 +68,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB 
EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -111,7 +123,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-shared-with-deps-release @@ -164,10 +176,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -177,7 +200,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-release @@ -245,30 +268,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -285,12 +288,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-release @@ -347,6 +347,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-shared-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -377,10 +378,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -421,7 +433,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-shared-without-deps-release @@ -474,10 +486,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -487,7 +510,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-release @@ -555,30 +578,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -595,12 +598,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: 
libtorch-cpu-shared-without-deps-release @@ -657,6 +657,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -687,10 +688,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -731,7 +743,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-static-with-deps-release @@ -784,10 +796,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -797,7 +820,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-release @@ -865,30 +888,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -905,12 +908,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-release @@ -967,6 +967,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cpu-static-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -997,10 +998,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1041,7 +1053,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cpu-static-without-deps-release @@ -1094,10 +1106,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1107,7 +1130,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-release @@ -1175,30 +1198,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1215,12 +1218,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: 
libtorch-cpu-static-without-deps-release @@ -1277,6 +1277,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-shared-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1308,10 +1309,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1352,7 +1364,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-shared-with-deps-release @@ -1406,10 +1418,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1419,7 +1442,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-release @@ -1488,30 +1511,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1528,12 +1531,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-with-deps-release @@ -1590,6 +1590,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-shared-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1621,10 +1622,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1665,7 +1677,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-shared-without-deps-release @@ -1719,10 +1731,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1732,7 +1755,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-release @@ -1801,30 +1824,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1841,12 +1844,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-shared-without-deps-release @@ -1903,6 +1903,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-static-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1934,10 +1935,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
       # NOTE: These environment variables are put here so that they can be applied on every job equally
       # They are also here because setting them at a workflow level doesn't give us access to the
       # runner.temp variable, which we need.
@@ -1978,7 +1990,7 @@ jobs:
         shell: bash
         run: |
           "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v3
+      - uses: seemethere/upload-artifact-s3@v4
         if: always()
         with:
           name: libtorch-cuda11_3-static-with-deps-release
@@ -2032,10 +2044,21 @@ jobs:
           echo "ami-id: $(get_ec2_metadata ami-id)"
           echo "instance-id: $(get_ec2_metadata instance-id)"
           echo "instance-type: $(get_ec2_metadata instance-type)"
+          echo "system info $(uname -a)"
       - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
         uses: seemethere/add-github-ssh-key@v1
         with:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
+      - name: Enable long paths on Windows
+        shell: powershell
+        run: |
+          Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
+      # Since it's just a defensive command, the workflow should continue even the command fails
+      - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory.
+        shell: powershell
+        run: |
+          Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore
       # NOTE: These environment variables are put here so that they can be applied on every job equally
       # They are also here because setting them at a workflow level doesn't give us access to the
       # runner.temp variable, which we need.
@@ -2045,7 +2068,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-release @@ -2114,30 +2137,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2154,12 +2157,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-with-deps-release @@ -2216,6 +2216,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_3-static-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2247,10 +2248,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2291,7 +2303,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_3-static-without-deps-release @@ -2345,10 +2357,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2358,7 +2381,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-release @@ -2427,30 +2450,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2467,12 +2470,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_3-static-without-deps-release @@ -2529,6 +2529,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-shared-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2560,10 +2561,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2604,7 +2616,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-shared-with-deps-release @@ -2658,10 +2670,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2671,7 +2694,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-release @@ -2740,34 +2763,14 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } retry docker pull "${ALPINE_IMAGE}" # Ensure the working directory gets chowned back to the current user @@ -2780,12 +2783,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-with-deps-release @@ -2842,6 +2842,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-shared-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2873,10 +2874,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2917,7 +2929,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-shared-without-deps-release @@ -2971,10 +2983,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -2984,7 +3007,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-release @@ -3053,30 +3076,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3093,12 +3096,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-shared-without-deps-release @@ -3155,6 +3155,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-static-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3186,10 +3187,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -3230,7 +3242,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-static-with-deps-release @@ -3284,10 +3296,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -3297,7 +3320,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-release @@ -3366,30 +3389,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3406,12 +3409,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-with-deps-release @@ -3468,6 +3468,7 @@ jobs: # Prune all of the docker images docker system prune -af libtorch-cuda11_5-static-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3499,10 +3500,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -3543,7 +3555,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: libtorch-cuda11_5-static-without-deps-release @@ -3597,10 +3609,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -3610,7 +3633,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-release @@ -3679,30 +3702,10 @@ jobs: # without this value pip does not get installed for some reason DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3719,12 +3722,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: libtorch-cuda11_5-static-without-deps-release @@ -3780,3 +3780,1247 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af + libtorch-cuda11_6-shared-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] 
Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-shared-with-deps-release + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-with-deps-release-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-release-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function 
get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-release + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-with-deps-release-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-with-deps-release-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of 
GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-with-deps-release + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-shared-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-shared-without-deps-release + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-without-deps-release-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-release-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path 
"HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-release + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-shared-without-deps-release-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-shared-without-deps-release-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: shared-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R 
"$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-shared-without-deps-release + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-with-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-static-with-deps-release + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-with-deps-release-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-release-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-release + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-with-deps-release-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-with-deps-release-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-with-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-with-deps-release + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + libtorch-cuda11_6-static-without-deps-release-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: libtorch-cuda11_6-static-without-deps-release + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-without-deps-release-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-release-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path 
"HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-release + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + libtorch-cuda11_6-static-without-deps-release-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: libtorch-cuda11_6-static-without-deps-release-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: libtorch + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + LIBTORCH_CONFIG: release + LIBTORCH_VARIANT: static-without-deps + # This is a dummy value for libtorch to work correctly with our batch scripts + # without this value pip does not get installed for some reason + DESIRED_PYTHON: "3.7" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R 
"$(id -u):$(id -g)" . + - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: libtorch-cuda11_6-static-without-deps-release + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af diff --git a/.github/workflows/generated-windows-binary-wheel-master.yml b/.github/workflows/generated-windows-binary-wheel-master.yml new file mode 100644 index 00000000000000..befb73dd15c241 --- /dev/null +++ b/.github/workflows/generated-windows-binary-wheel-master.yml @@ -0,0 +1,241 @@ +# @generated DO NOT EDIT MANUALLY + +# Template is at: .github/templates/windows_binary_build_workflow.yml.j2 +# Generation script: .github/scripts/generate_ci_workflows.py +name: windows-binary-wheel + +on: + push: + branches: + - master + tags: + - 'ciflow/all/*' + - 'ciflow/trunk/*' + workflow_dispatch: + +env: + # Needed for conda builds + ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" + ANACONDA_USER: pytorch + AWS_DEFAULT_REGION: us-east-1 + BUILD_ENVIRONMENT: windows-binary-wheel + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + IN_CI: 1 + IS_GHA: 1 + PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} + PR_NUMBER: ${{ github.event.pull_request.number }} + PYTORCH_RETRY_TEST_CASES: 1 + SHA1: ${{ github.event.pull_request.head.sha || github.sha }} + SKIP_ALL_TESTS: 1 +concurrency: + group: windows-binary-wheel-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + wheel-py3_7-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: wheel-py3_7-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_7-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_7-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.7" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_7-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 diff --git a/.github/workflows/generated-windows-binary-wheel.yml b/.github/workflows/generated-windows-binary-wheel-nightly.yml similarity index 68% rename from .github/workflows/generated-windows-binary-wheel.yml rename to .github/workflows/generated-windows-binary-wheel-nightly.yml index 0e763245267990..95e163841eabba 100644 --- a/.github/workflows/generated-windows-binary-wheel.yml +++ b/.github/workflows/generated-windows-binary-wheel-nightly.yml @@ -37,6 +37,7 @@ concurrency: jobs: wheel-py3_7-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -63,10 +64,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch 
directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -107,7 +119,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: wheel-py3_7-cpu @@ -156,10 +168,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -169,7 +192,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cpu @@ -233,30 +256,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -273,12 +276,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > 
"/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cpu @@ -335,6 +335,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_7-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -362,10 +363,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -406,7 +418,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: wheel-py3_7-cuda11_3 @@ -456,10 +468,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -469,7 +492,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cuda11_3 @@ -534,30 +557,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -574,12 +577,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cuda11_3 @@ -636,6 +636,7 @@ jobs: # Prune all of the docker images docker system prune -af wheel-py3_7-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -663,10 +664,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -707,7 +719,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: name: wheel-py3_7-cuda11_5 @@ -757,10 +769,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -770,7 +793,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cuda11_5 @@ -835,30 +858,10 @@ jobs: SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -875,12 +878,9 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: name: wheel-py3_7-cuda11_5 @@ -936,7 +936,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of 
the docker images docker system prune -af - wheel-py3_8-cpu-build: + wheel-py3_7-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -945,10 +946,11 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - name: Display EC2 information shell: bash @@ -963,10 +965,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1007,10 +1019,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_8-cpu + name: wheel-py3_7-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1027,10 +1039,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cpu-test: # Testing + wheel-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cpu-build - runs-on: windows.4xlarge + needs: wheel-py3_7-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1038,10 +1050,11 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - name: Display EC2 information shell: bash @@ -1056,10 +1069,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1069,10 +1092,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cpu + name: wheel-py3_7-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1118,45 +1141,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cpu-upload: # Uploading + wheel-py3_7-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cpu-test + needs: wheel-py3_7-cuda11_6-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.7" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1173,15 +1177,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cpu + name: wheel-py3_7-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -1234,7 +1235,8 @@ 
jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_8-cuda11_3-build: + wheel-py3_8-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1243,9 +1245,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1262,10 +1263,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1306,10 +1318,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_8-cuda11_3 + name: wheel-py3_8-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1326,10 +1338,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_3-test: # Testing + wheel-py3_8-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu + needs: wheel-py3_8-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1337,9 +1349,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1356,10 +1367,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue 
even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -1369,10 +1391,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_3 + name: wheel-py3_8-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1418,46 +1440,25 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_3-upload: # Uploading + wheel-py3_8-cpu-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_3-test + needs: wheel-py3_8-cpu-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -1474,15 +1475,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_3 + name: wheel-py3_8-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || 
(startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -1535,7 +1533,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_8-cuda11_5-build: + wheel-py3_8-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1544,8 +1543,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1563,10 +1562,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1607,10 +1617,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_8-cuda11_5 + name: wheel-py3_8-cuda11_3 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1627,9 +1637,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_5-test: # Testing + wheel-py3_8-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_5-build + needs: wheel-py3_8-cuda11_3-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -1638,8 +1648,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1657,10 +1667,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1670,10 +1691,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_5 + name: wheel-py3_8-cuda11_3 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1719,48 +1740,1227 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_5-upload: # Uploading + wheel-py3_8-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_5-test + needs: wheel-py3_8-cuda11_3-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" - - name: Chown workspace - run: | + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_8-cuda11_3 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
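In the upload jobs above, the "Set DRY_RUN" step only writes `DRY_RUN=disabled` for pushes to the `nightly` branch or to tags outside `refs/tags/ciflow/`, so any other trigger leaves `DRY_RUN` at whatever default `binary_upload.sh` applies, and "Set UPLOAD_CHANNEL" redirects uploads to the `test` channel when the pushed tag carries a release-candidate suffix. A small illustration of that tag check (the tag names are hypothetical and used only to show how the glob matches):

    # Sketch only: how the *-rc[0-9]* glob in "Set UPLOAD_CHANNEL (only for tagged pushes)" behaves.
    - name: Illustrate UPLOAD_CHANNEL selection
      shell: bash
      run: |
        for ref_name in v1.13.0-rc2 v1.13.0; do
          if [[ ${ref_name} = *-rc[0-9]* ]]; then
            echo "${ref_name} -> UPLOAD_CHANNEL=test"        # release candidates go to the test channel
          else
            echo "${ref_name} -> UPLOAD_CHANNEL left as-is"  # final release tags keep the default channel
          fi
        done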
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + wheel-py3_8-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: wheel-py3_8-cuda11_5 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_8-cuda11_5-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_8-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_8-cuda11_5 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_8-cuda11_5-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_8-cuda11_5-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_8-cuda11_5 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + wheel-py3_8-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
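One detail worth noting about the cuda11_6 jobs introduced above: they disable Defender scanning with `Set-MpPreference` and without `-ErrorAction Ignore`, whereas the other jobs in this file use `Add-MpPreference ... -ErrorAction Ignore`. In standard Defender cmdlet terms, `Add-MpPreference -ExclusionPath` appends a path to the existing exclusion list while `Set-MpPreference -ExclusionPath` replaces it, and omitting `-ErrorAction Ignore` means an error from the cmdlet can fail the step; whether that difference is intentional cannot be told from the diff. For reference, the more defensive variant used by the other jobs looks like this:

    # Sketch only; copies the append-and-ignore-errors form used by the other Windows jobs in this file.
    - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory.
      shell: powershell
      run: |
        Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore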
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: wheel-py3_8-cuda11_6 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_8-cuda11_6-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_8-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_8-cuda11_6 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_8-cuda11_6-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_8-cuda11_6-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.8" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_8-cuda11_6 + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + wheel-py3_9-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: wheel-py3_9-cpu + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_9-cpu-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_9-cpu-build + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_9-cpu + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_9-cpu-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_9-cpu-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | + retry () { + "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") + } + retry docker pull "${ALPINE_IMAGE}" + # Ensure the working directory gets chowned back to the current user + docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Clean workspace + run: | + rm -rf "${GITHUB_WORKSPACE}" + mkdir "${GITHUB_WORKSPACE}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + - name: Clone pytorch/pytorch + uses: actions/checkout@v2 + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_9-cpu + path: "${{ runner.temp }}/artifacts/" + - name: Set DRY_RUN (only for tagged pushes) + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + run: | + echo "DRY_RUN=disabled" >> "$GITHUB_ENV" + - name: Set UPLOAD_CHANNEL (only for tagged pushes) + if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + run: | + # reference ends with an RC suffix + if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then + echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV" + fi + - name: Upload binaries + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} + ANACONDA_API_TOKEN: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} + run: | + docker run --rm -i \ + -e ANACONDA_API_TOKEN \ + -e AWS_ACCESS_KEY_ID \ + -e AWS_SECRET_ACCESS_KEY \ + -e DRY_RUN \ + -e PACKAGE_TYPE \ + -e PKG_DIR=/artifacts \ + -e UPLOAD_CHANNEL \ + -e UPLOAD_SUBFOLDER \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -v "${GITHUB_WORKSPACE}:/v" \ + -w /v \ + 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ + bash -c '.circleci/scripts/binary_upload.sh' + - name: Hold runner for 2 hours or until ssh sessions have drained + # Always hold for active ssh sessions + if: always() + run: .github/scripts/wait_for_ssh_to_drain.sh + - name: Chown workspace + if: always() + run: | + # Ensure the working directory gets chowned back to the current user + docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
+ - name: Kill containers, clean up images + if: always() + run: | + # ignore expansion of "docker ps -q" since it could be empty + # shellcheck disable=SC2046 + docker stop $(docker ps -q) || true + # Prune all of the docker images + docker system prune -af + wheel-py3_9-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} + runs-on: windows.4xlarge + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. 
+ - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Build PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" + - uses: seemethere/upload-artifact-s3@v4 + if: always() + with: + name: wheel-py3_9-cuda11_3 + retention-days: 14 + if-no-files-found: error + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_9-cuda11_3-test: # Testing + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_9-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu + timeout-minutes: 240 + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Display EC2 information + shell: bash + run: | + set -euo pipefail + function get_ec2_metadata() { + # Pulled from instance metadata endpoint for EC2 + # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html + category=$1 + curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" + } + echo "ami-id: $(get_ec2_metadata ami-id)" + echo "instance-id: $(get_ec2_metadata instance-id)" + echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: seemethere/add-github-ssh-key@v1 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore + # NOTE: These environment variables are put here so that they can be applied on every job equally + # They are also here because setting them at a workflow level doesn't give us access to the + # runner.temp variable, which we need. + - name: Populate binary env + shell: bash + run: | + echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" + echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" + echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" + - uses: seemethere/download-artifact-s3@v3 + name: Download Build Artifacts + with: + name: wheel-py3_9-cuda11_3 + path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + submodules: recursive + path: pytorch + - name: Clean PyTorch checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: pytorch + - name: Checkout pytorch/builder + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: main + submodules: recursive + repository: pytorch/builder + path: builder + - name: Clean pytorch/builder checkout + run: | + # Remove any artifacts from the previous checkouts + git clean -fxd + working-directory: builder + - name: Populate binary env + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" + - name: Test PyTorch binary + shell: bash + run: | + "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" + - name: Wait until all sessions have drained + shell: powershell + working-directory: pytorch + if: always() + timeout-minutes: 120 + run: | + .github\scripts\wait_for_ssh_to_drain.ps1 + - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) + shell: powershell + working-directory: pytorch + if: always() + run: | + .github\scripts\kill_active_ssh_sessions.ps1 + wheel-py3_9-cuda11_3-upload: # Uploading + runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts + if: ${{ github.repository_owner == 'pytorch' }} + needs: wheel-py3_9-cuda11_3-test + env: + PYTORCH_ROOT: ${{ github.workspace }}/pytorch + BUILDER_ROOT: ${{ github.workspace }}/builder + PACKAGE_TYPE: wheel + # TODO: This is a legacy variable that we eventually want to get rid of in + # favor of GPU_ARCH_VERSION + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda + SKIP_ALL_TESTS: 1 + DESIRED_PYTHON: "3.9" + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Chown workspace + run: | retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } @@ -1775,15 +2975,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_5 + name: wheel-py3_9-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && 
(github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -1836,7 +3033,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_9-cpu-build: + wheel-py3_9-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -1845,8 +3043,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1863,10 +3062,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1907,10 +3117,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_9-cpu + name: wheel-py3_9-cuda11_5 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1927,10 +3137,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cpu-test: # Testing + wheel-py3_9-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cpu-build - runs-on: windows.4xlarge + needs: wheel-py3_9-cuda11_5-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1938,8 +3148,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1956,10 +3167,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
@@ -1969,10 +3191,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cpu + name: wheel-py3_9-cuda11_5 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2018,45 +3240,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cpu-upload: # Uploading + wheel-py3_9-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cpu-test + needs: wheel-py3_9-cuda11_5-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2073,15 +3276,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cpu + name: wheel-py3_9-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2134,7 +3334,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_9-cuda11_3-build: + wheel-py3_9-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2143,8 +3344,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that 
we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" @@ -2162,10 +3363,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2206,10 +3417,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_9-cuda11_3 + name: wheel-py3_9-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -2226,9 +3437,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_3-test: # Testing + wheel-py3_9-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_3-build + needs: wheel-py3_9-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -2237,8 +3448,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" @@ -2256,10 +3467,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
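The rename from a CPU to a CUDA configuration has to be applied consistently in three places: the upload step of the build job, the download step of the test job, and the `needs:` edge between them. A minimal sketch of that hand-off for the configuration in this hunk, assuming only what is visible above (the workflow name and trigger are illustrative; the action versions are the ones the hunks pin):

```yaml
name: wheel-artifact-handoff-sketch   # illustrative
on: workflow_dispatch
jobs:
  wheel-py3_9-cuda11_6-build:
    runs-on: windows.4xlarge
    steps:
      - uses: seemethere/upload-artifact-s3@v4
        if: always()
        with:
          name: wheel-py3_9-cuda11_6              # one artifact per binary configuration
          retention-days: 14
          if-no-files-found: error
          path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"   # set earlier in the job via GITHUB_ENV
  wheel-py3_9-cuda11_6-test:
    needs: wheel-py3_9-cuda11_6-build
    runs-on: windows.8xlarge.nvidia.gpu           # CUDA configs test on GPU runners
    steps:
      - uses: seemethere/download-artifact-s3@v3
        name: Download Build Artifacts
        with:
          name: wheel-py3_9-cuda11_6               # must match the upload name exactly
          path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
```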
@@ -2269,10 +3490,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cuda11_3 + name: wheel-py3_9-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2318,46 +3539,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_3-upload: # Uploading + wheel-py3_9-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_3-test + needs: wheel-py3_9-cuda11_6-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2374,15 +3575,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cuda11_3 + name: wheel-py3_9-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -2435,7 +3633,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_9-cuda11_5-build: + wheel-py3_10-cpu-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2444,11 +3643,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a 
legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2463,10 +3661,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2507,10 +3716,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_9-cuda11_5 + name: wheel-py3_10-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -2527,10 +3736,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_5-test: # Testing + wheel-py3_10-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_5-build - runs-on: windows.8xlarge.nvidia.gpu + needs: wheel-py3_10-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -2538,11 +3747,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2557,10 +3765,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
+ shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2570,10 +3789,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cuda11_5 + name: wheel-py3_10-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2619,46 +3838,25 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_5-upload: # Uploading + wheel-py3_10-cpu-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_5-test + needs: wheel-py3_10-cpu-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2675,15 +3873,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cuda11_5 + name: wheel-py3_10-cpu path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} 
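Note how the upload jobs shrink in these hunks: the inline "Display EC2 information" and "Log in to ECR" scripts are dropped in favor of the shared checkout-pytorch and setup-linux actions. A sketch of the resulting prologue, assuming nothing beyond what the hunk above shows (the remaining steps of the real job are omitted):

```yaml
jobs:
  wheel-py3_10-cpu-upload:   # excerpt; the real job also carries the full env block and later upload steps
    runs-on: linux.2xlarge   # self-hosted runner used to download EC2 artifacts
    if: ${{ github.repository_owner == 'pytorch' }}
    needs: wheel-py3_10-cpu-test
    steps:
      - name: Checkout PyTorch
        uses: pytorch/pytorch/.github/actions/checkout-pytorch@master
      - name: Setup Linux
        uses: ./.github/actions/setup-linux
      - uses: seemethere/download-artifact-s3@v3
        name: Download Build Artifacts
        with:
          name: wheel-py3_10-cpu
          path: "${{ runner.temp }}/artifacts/"
```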
@@ -2736,7 +3931,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_10-cpu-build: + wheel-py3_10-cuda11_3-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -2745,8 +3941,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: @@ -2763,10 +3960,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2807,10 +4015,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_10-cpu + name: wheel-py3_10-cuda11_3 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -2827,10 +4035,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cpu-test: # Testing + wheel-py3_10-cuda11_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cpu-build - runs-on: windows.4xlarge + needs: wheel-py3_10-cuda11_3-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -2838,8 +4046,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: @@ -2856,10 +4065,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive 
command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -2869,10 +4089,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cpu + name: wheel-py3_10-cuda11_3 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2918,45 +4138,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cpu-upload: # Uploading + wheel-py3_10-cuda11_3-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cpu-test + needs: wheel-py3_10-cuda11_3-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu113 + GPU_ARCH_VERSION: 11.3 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -2973,15 +4174,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cpu + name: wheel-py3_10-cuda11_3 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && 
(github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3034,7 +4232,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_10-cuda11_3-build: + wheel-py3_10-cuda11_5-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3043,8 +4242,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" @@ -3062,10 +4261,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
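The only substantive change per configuration is the env block: moving a job from CUDA 11.3 to 11.5 means renaming the job and updating DESIRED_CUDA and GPU_ARCH_VERSION in lock-step. An excerpt showing how one configuration is spelled out, with the values copied from the hunk above; CPU configurations drop GPU_ARCH_VERSION and set GPU_ARCH_TYPE: cpu instead:

```yaml
wheel-py3_10-cuda11_5-build:   # job name encodes PACKAGE_TYPE, DESIRED_PYTHON and DESIRED_CUDA
  runs-on: windows.4xlarge
  timeout-minutes: 240
  env:
    PYTORCH_ROOT: ${{ github.workspace }}/pytorch
    BUILDER_ROOT: ${{ github.workspace }}/builder
    PACKAGE_TYPE: wheel
    DESIRED_CUDA: cu115        # legacy spelling of the CUDA version
    GPU_ARCH_VERSION: 11.5
    GPU_ARCH_TYPE: cuda
    SKIP_ALL_TESTS: 1
    DESIRED_PYTHON: "3.10"
```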
@@ -3106,10 +4316,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_10-cuda11_3 + name: wheel-py3_10-cuda11_5 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -3126,9 +4336,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_3-test: # Testing + wheel-py3_10-cuda11_5-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_3-build + needs: wheel-py3_10-cuda11_5-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -3137,8 +4347,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" @@ -3156,10 +4366,21 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + # Since it's just a defensive command, the workflow should continue even the command fails + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
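The NOTE above explains why these paths are exported from a step rather than at the workflow level: workflow-level env cannot see runner.temp. The export itself is the short bash step that opens the next hunk; a consolidated sketch of it follows (the step name is illustrative, and the third line is redirected to $GITHUB_ENV here for consistency, which the raw hunk omits):

```yaml
- name: Populate binary env   # illustrative step name
  shell: bash
  run: |
    # Appending to $GITHUB_ENV makes these variables visible to every later step in the job.
    echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
    echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
    echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
```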
@@ -3169,10 +4390,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cuda11_3 + name: wheel-py3_10-cuda11_5 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3218,46 +4439,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_3-upload: # Uploading + wheel-py3_10-cuda11_5-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_3-test + needs: wheel-py3_10-cuda11_5-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu115 + GPU_ARCH_VERSION: 11.5 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3274,15 +4475,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cuda11_3 + name: wheel-py3_10-cuda11_5 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} @@ -3335,7 +4533,8 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - wheel-py3_10-cuda11_5-build: + wheel-py3_10-cuda11_6-build: + if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 env: @@ -3344,8 +4543,8 @@ jobs: PACKAGE_TYPE: wheel # 
TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" @@ -3363,10 +4562,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. @@ -3407,10 +4616,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v3 + - uses: seemethere/upload-artifact-s3@v4 if: always() with: - name: wheel-py3_10-cuda11_5 + name: wheel-py3_10-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -3427,9 +4636,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_5-test: # Testing + wheel-py3_10-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_5-build + needs: wheel-py3_10-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -3438,8 +4647,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" @@ -3457,10 +4666,20 @@ jobs: echo "ami-id: $(get_ec2_metadata ami-id)" echo "instance-id: $(get_ec2_metadata instance-id)" echo "instance-type: $(get_ec2_metadata instance-type)" + echo "system info $(uname -a)" - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 + - name: Enable long paths on Windows + shell: powershell + run: | + Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 + - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. + shell: powershell + run: | + Set-MpPreference -ExclusionPath $(Get-Location).tostring() # NOTE: These environment variables are put here so that they can be applied on every job equally # They are also here because setting them at a workflow level doesn't give us access to the # runner.temp variable, which we need. 
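For reference, here is the "Display EC2 information" step as it reads once these hunks land, reassembled from the context lines plus the added `uname -a` echo; only this step changes, the surrounding job is untouched:

```yaml
- name: Display EC2 information
  shell: bash
  run: |
    set -euo pipefail
    function get_ec2_metadata() {
      # Pulled from the EC2 instance metadata endpoint, see
      # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
      category=$1
      curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
    }
    echo "ami-id: $(get_ec2_metadata ami-id)"
    echo "instance-id: $(get_ec2_metadata instance-id)"
    echo "instance-type: $(get_ec2_metadata instance-type)"
    echo "system info $(uname -a)"   # the line added by this change
```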
@@ -3470,10 +4689,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cuda11_5 + name: wheel-py3_10-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -3519,46 +4738,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_5-upload: # Uploading + wheel-py3_10-cuda11_6-upload: # Uploading runs-on: linux.2xlarge # self hosted runner to download ec2 artifacts if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_5-test + needs: wheel-py3_10-cuda11_6-test env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu115 - GPU_ARCH_VERSION: 11.5 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.10" steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - - name: Log in to ECR - env: - AWS_RETRY_MODE: standard - AWS_MAX_ATTEMPTS: 5 - run: | - AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ - --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux - name: Chown workspace run: | retry () { @@ -3575,15 +4774,12 @@ jobs: uses: seemethere/add-github-ssh-key@v1 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Preserve github env variables for use in docker - run: | - env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}" - name: Clone pytorch/pytorch uses: actions/checkout@v2 - - uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b + - uses: seemethere/download-artifact-s3@v3 name: Download Build Artifacts with: - name: wheel-py3_10-cuda11_5 + name: wheel-py3_10-cuda11_6 path: "${{ runner.temp }}/artifacts/" - name: Set DRY_RUN (only for tagged pushes) if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index d98a81da5e9b13..05317c5a92875c 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -8,6 +8,7 @@ on: jobs: quick-checks: + name: quick-checks runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -15,8 +16,9 @@ jobs: with: python-version: 
3.x architecture: x64 + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Clean PyTorch checkout run: | # Remove any artifacts from the previous checkouts @@ -113,6 +115,7 @@ jobs: .github/scripts/lint_test_ownership.py clang-format: + name: clang-format runs-on: ubuntu-18.04 if: ${{ github.event_name == 'pull_request' }} steps: @@ -121,10 +124,10 @@ jobs: with: python-version: 3.x architecture: x64 - - name: Fetch PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - fetch-depth: 0 # deep clone, to allow us to use git merge-base + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0 required to use git merge-base) + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Run clang-format env: BASE_SHA: ${{ github.event.pull_request.base.sha }} @@ -153,6 +156,7 @@ jobs: exit 1 py2-setup-validate-errormsg: + name: py2-setup-validate-errormsg runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -160,8 +164,9 @@ jobs: with: python-version: 2.x architecture: x64 + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Attempt to run setup.py run: | if ! python2 setup.py | grep -q "Python 2 has reached end-of-life and is no longer supported by PyTorch."; then @@ -172,6 +177,7 @@ jobs: run: python2 -m py_compile torch/utils/collect_env.py shellcheck: + name: shellcheck runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -179,8 +185,9 @@ jobs: with: python-version: 3.x architecture: x64 + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Install requirements id: requirements run: | @@ -188,8 +195,9 @@ jobs: - name: Install Jinja2 run: | pip3 install Jinja2==3.0.1 --user + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Regenerate workflows id: generate_workflows run: .github/scripts/generate_ci_workflows.py @@ -251,6 +259,7 @@ jobs: rm actionlint toc: + name: toc runs-on: ubuntu-18.04 # https://github.com/actions/virtual-environments/issues/599#issuecomment-602754687 env: @@ -258,8 +267,9 @@ jobs: steps: - name: Setup Node uses: actions/setup-node@v2 + # [see note: pytorch repo ref] - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Install markdown-toc run: npm install -g markdown-toc - name: Regenerate ToCs and check that they didn't change @@ -287,6 +297,7 @@ jobs: fi flake8-py3: + name: flake8-py3 runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -294,10 +305,10 @@ jobs: with: python-version: 3.x architecture: x64 - - name: Fetch PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - fetch-depth: 2 # to allow us to use github.event.pull_request.head.sha + # [see note: pytorch repo ref] + # fetch-depth 2 required to allow us to use github.event.pull_request.head.sha + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Prepare output dir 
with HEAD commit SHA env: HEAD_SHA: ${{ github.event.pull_request.head.sha }} @@ -347,7 +358,8 @@ jobs: mode: json clang-tidy: - runs-on: linux.2xlarge + name: clang-tidy + runs-on: [self-hosted, linux.2xlarge] container: # ubuntu20.04-cuda11.2-py3.8-tidy11 image: ghcr.io/pytorch/cilint-clang-tidy:d8f0c777964d0dd8a147360de80aed1a13eb613a @@ -356,10 +368,12 @@ jobs: run: | rm -rf "${GITHUB_WORKSPACE}" mkdir "${GITHUB_WORKSPACE}" + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) to allow tools/linter/clang_tidy.py to do its thing - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master with: - fetch-depth: 0 # to allow tools/linter/clang_tidy.py to do its thing + no-sudo: true - name: Prepare output dir with HEAD commit SHA env: HEAD_SHA: ${{ github.event.pull_request.head.sha }} @@ -398,10 +412,12 @@ jobs: python3 -m tools.linter.clang_tidy \ --paths \ + torch/csrc/cuda \ torch/csrc/fx \ torch/csrc/utils \ torch/csrc/generic \ torch/csrc/deploy \ + torch/csrc/onnx \ torch/csrc/tensor \ --clang-tidy-exe "$(which clang-tidy)" \ --disable-progress-bar 2>&1 | tee -a "${GITHUB_WORKSPACE}"/clang-tidy-output.txt @@ -440,6 +456,7 @@ jobs: mode: json cmakelint: + name: cmakelint runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -447,8 +464,9 @@ jobs: with: python-version: 3.x architecture: x64 - - name: Fetch PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Install dependencies run: | set -eux @@ -462,6 +480,7 @@ jobs: xargs -0 cmakelint --config=.cmakelintrc --spaces=2 --quiet mypy: + name: mypy runs-on: ubuntu-18.04 steps: - name: Setup Python @@ -469,8 +488,9 @@ jobs: with: python-version: 3.8 architecture: x64 - - name: Fetch PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Install dependencies run: | set -eux @@ -502,6 +522,64 @@ jobs: false fi + test-tools: + name: Test tools + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: ubuntu-18.04 + steps: + - name: Setup Python + uses: actions/setup-python@v2 + with: + python-version: 3.8 + architecture: x64 + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required, to allow us to use git log + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Install dependencies + # mypy and boto3 versions copied from + # .circleci/docker/common/install_conda.sh + run: | + set -eux + python3 -mpip install -r requirements.txt + python3 -mpip install boto3==1.16.34 + pip3 install typing-extensions==3.10 --user + pip3 install -r requirements-flake8.txt --user + python3 -mpip install -r requirements.txt --user + python3 -mpip install mypy==0.960 --user + make setup_lint + - name: Test tools + run: | + python3 -m unittest discover -vs tools/test -p 'test_*.py' + python3 -m unittest discover -vs .github/scripts -p 'test_*.py' + + test_collect_env: + if: ${{ github.repository == 'pytorch/pytorch' }} + name: Test collect_env + runs-on: ubuntu-18.04 + strategy: + matrix: + with_torch: [with_torch, without_torch] + steps: + - name: Setup Python + uses: actions/setup-python@v2 + with: + python-version: 3.8 + architecture: x64 + # [see note: pytorch repo ref] + # deep 
clone (fetch-depth 0) required, to allow us to use git log + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Install torch + if: matrix.with_torch == 'with_torch' + run: | + # Doesn't really matter what torch version, we just need ANY torch installed + pip install 'torch==1.*' + - name: Run collect_env.py + run: | + # All we need to see is that it passes + python3 torch/utils/collect_env.py + concurrency: - group: lint-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} cancel-in-progress: true diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml new file mode 100644 index 00000000000000..3322b2097a17dd --- /dev/null +++ b/.github/workflows/nightly.yml @@ -0,0 +1,33 @@ +name: nightly + +on: + schedule: + - cron: 0 0 * * * + push: + tags: + - ciflow/nightly/* + workflow_dispatch: + + +concurrency: + group: ${{ github.workflow }}--${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + docs-build: + name: docs build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-gcc5.4 + docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + + docs-push: + name: docs push + uses: ./.github/workflows/_docs.yml + needs: docs-build + with: + build-environment: linux-xenial-py3.7-gcc5.4 + docker-image: ${{ needs.docs-build.outputs.docker-image }} + push: true + secrets: + GH_PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} diff --git a/.github/workflows/periodic.yml b/.github/workflows/periodic.yml new file mode 100644 index 00000000000000..972041d24c13da --- /dev/null +++ b/.github/workflows/periodic.yml @@ -0,0 +1,206 @@ +name: periodic + +on: + schedule: + - cron: 45 0,4,8,12,16,20 * * * + push: + tags: + - ciflow/periodic/* + - ciflow/all/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}--${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + linux-bionic-cuda11_5-py3_7-gcc7-build: + name: linux-bionic-cuda11.5-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.5-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7 + + linux-bionic-cuda11_5-py3_7-gcc7-test: + name: linux-bionic-cuda11.5-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_5-py3_7-gcc7-build + with: + build-environment: linux-bionic-cuda11.5-py3.7-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_5-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-bionic-cuda11_6-py3_7-gcc7-build: + name: linux-bionic-cuda11.6-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + + linux-bionic-cuda11_6-py3_7-gcc7-test: + name: linux-bionic-cuda11.6-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_7-gcc7-build + with: + build-environment: linux-bionic-cuda11.6-py3.7-gcc7 + docker-image: 
${{ needs.linux-bionic-cuda11_6-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + libtorch-linux-bionic-cuda11_5-py3_7-gcc7-build: + name: libtorch-linux-bionic-cuda11.5-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-bionic-cuda11.5-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7 + build-generates-artifacts: false + + libtorch-linux-bionic-cuda11_6-py3_7-gcc7-build: + name: libtorch-linux-bionic-cuda11.6-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-bionic-cuda11.6-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + build-generates-artifacts: false + + linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build: + name: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck + docker-image-name: pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 + + linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-test: + name: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build + with: + build-environment: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck + docker-image: ${{ needs.linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-xenial-cuda11_3-py3_7-gcc7-debug-build: + name: linux-xenial-cuda11.3-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-debug + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + build-with-debug: true + + linux-xenial-cuda11_3-py3_7-gcc7-debug-test: + name: linux-xenial-cuda11.3-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-cuda11_3-py3_7-gcc7-debug-build + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-debug + docker-image: ${{ needs.linux-xenial-cuda11_3-py3_7-gcc7-debug-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + win-vs2019-cuda11_5-py3-build: + name: win-vs2019-cuda11.5-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.5-py3 + cuda-version: "11.5" + + win-vs2019-cuda11_5-py3-test: + name: win-vs2019-cuda11.5-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_5-py3-build + with: + build-environment: win-vs2019-cuda11.5-py3 + cuda-version: "11.5" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + + win-vs2019-cuda11_6-py3-build: + name: win-vs2019-cuda11.6-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: 
win-vs2019-cuda11.6-py3 + cuda-version: "11.6" + + win-vs2019-cuda11_6-py3-test: + name: win-vs2019-cuda11.6-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_6-py3-build + with: + build-environment: win-vs2019-cuda11.6-py3 + cuda-version: "11.6" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + + ios-12-5-1-arm64: + name: ios-12-5-1-arm64 + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-arm64 + ios-platform: OS + ios-arch: arm64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} + + ios-12-5-1-arm64-coreml: + name: ios-12-5-1-arm64-coreml + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-arm64-coreml + ios-platform: OS + ios-arch: arm64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} + + ios-12-5-1-arm64-custom-ops: + name: ios-12-5-1-arm64-custom-ops + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-arm64-custom-ops + ios-platform: OS + ios-arch: arm64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} + + ios-12-5-1-arm64-metal: + name: ios-12-5-1-arm64-metal + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-arm64-metal + ios-platform: OS + ios-arch: arm64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} diff --git a/.github/workflows/pull.yml b/.github/workflows/pull.yml new file mode 100644 index 00000000000000..fb2c96fa56efbe --- /dev/null +++ b/.github/workflows/pull.yml @@ -0,0 +1,320 @@ +name: pull + +on: + pull_request: + push: + branches: + - master + - main + - release/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + linux-xenial-py3_7-gcc5_4-build: + name: linux-xenial-py3.7-gcc5.4 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-gcc5.4 + docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + + linux-xenial-py3_7-gcc5_4-test: + name: linux-xenial-py3.7-gcc5.4 + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-py3_7-gcc5_4-build + with: + build-environment: linux-xenial-py3.7-gcc5.4 + docker-image: ${{ needs.linux-xenial-py3_7-gcc5_4-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "docs_test", shard: 1, num_shards: 1, runner: 
"linux.2xlarge" }, + { config: "backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-docs: + name: linux-docs + uses: ./.github/workflows/_docs.yml + needs: linux-xenial-py3_7-gcc5_4-build + with: + build-environment: linux-xenial-py3.7-gcc5.4 + docker-image: ${{ needs.linux-xenial-py3_7-gcc5_4-build.outputs.docker-image }} + + linux-xenial-py3_7-gcc7-build: + name: linux-xenial-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-gcc7 + docker-image-name: pytorch-linux-xenial-py3.7-gcc7 + + linux-xenial-py3_7-gcc7-test: + name: linux-xenial-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-py3_7-gcc7-build + with: + build-environment: linux-xenial-py3.7-gcc7 + docker-image: ${{ needs.linux-xenial-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} + + linux-xenial-py3_7-clang7-asan-build: + name: linux-xenial-py3.7-clang7-asan + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-clang7-asan + docker-image-name: pytorch-linux-xenial-py3-clang7-asan + + linux-xenial-py3_7-clang7-asan-test: + name: linux-xenial-py3.7-clang7-asan + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-py3_7-clang7-asan-build + with: + build-environment: linux-xenial-py3.7-clang7-asan + docker-image: ${{ needs.linux-xenial-py3_7-clang7-asan-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" }, + { config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" }, + ]} + + linux-xenial-py3_7-gcc7-no-ops: + name: linux-xenial-py3.7-gcc7-no-ops + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-gcc7-no-ops + docker-image-name: pytorch-linux-xenial-py3.7-gcc7 + + linux-xenial-py3_7-clang7-onnx-build: + name: linux-xenial-py3.7-clang7-onnx + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-clang7-onnx + docker-image-name: pytorch-linux-xenial-py3-clang7-onnx + + linux-xenial-py3_7-clang7-onnx-test: + name: linux-xenial-py3.7-clang7-onnx + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-py3_7-clang7-onnx-build + with: + build-environment: linux-xenial-py3.7-clang7-onnx + docker-image: ${{ needs.linux-xenial-py3_7-clang7-onnx-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} + + linux-bionic-py3_7-clang9-build: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-bionic-py3_7-clang9-test: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang9-build + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + 
{ config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "noarch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-vulkan-bionic-py3_7-clang9-build: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-vulkan-bionic-py3_7-clang9-test: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-vulkan-bionic-py3_7-clang9-build + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-vulkan-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-xenial-cuda11_3-py3_7-gcc7-build: + name: linux-xenial-cuda11.3-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7 + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + linux-xenial-cuda11_3-py3_7-gcc7-test: + name: linux-xenial-cuda11.3-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-xenial-cuda11_3-py3_7-gcc7-build + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7 + docker-image: ${{ needs.linux-xenial-cuda11_3-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.8xlarge.nvidia.gpu" }, + ]} + + linux-bionic-rocm5_0-py3_7-build: + name: linux-bionic-rocm5.0-py3.7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-rocm5.0-py3.7 + docker-image-name: pytorch-linux-bionic-rocm5.0-py3.7 + + linux-bionic-rocm5_0-py3_7-test: + name: linux-bionic-rocm5.0-py3.7 + uses: ./.github/workflows/_rocm-test.yml + needs: linux-bionic-rocm5_0-py3_7-build + with: + build-environment: linux-bionic-rocm5.0-py3.7 + docker-image: ${{ needs.linux-bionic-rocm5_0-py3_7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" }, + ]} + + linux-xenial-py3-clang5-mobile-build: + name: linux-xenial-py3-clang5-mobile-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-build + docker-image-name: pytorch-linux-xenial-py3-clang5-asan + build-generates-artifacts: false + + linux-xenial-py3-clang5-mobile-custom-build-static: + name: linux-xenial-py3-clang5-mobile-custom-build-static + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-custom-build-static + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-generates-artifacts: false + + pytorch-xla-linux-bionic-py3_7-clang8-build: + name: pytorch-xla-linux-bionic-py3.7-clang8 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: pytorch-xla-linux-bionic-py3.7-clang8 + docker-image-name: xla_base + + pytorch-xla-linux-bionic-py3_7-clang8-test: + name: pytorch-xla-linux-bionic-py3.7-clang8 + uses: ./.github/workflows/_linux-test.yml + needs: pytorch-xla-linux-bionic-py3_7-clang8-build + with: + build-environment: 
pytorch-xla-linux-bionic-py3.7-clang8 + docker-image: ${{ needs.pytorch-xla-linux-bionic-py3_7-clang8-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "xla", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + win-vs2019-cpu-py3-build: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + + win-vs2019-cpu-py3-test: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cpu-py3-build + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.4xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.4xlarge" }, + ]} + + win-vs2019-cuda11_3-py3-build: + name: win-vs2019-cuda11.3-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.3-py3 + cuda-version: "11.3" + + win-vs2019-cuda11_3-py3-test: + name: win-vs2019-cuda11.3-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_3-py3-build + with: + build-environment: win-vs2019-cuda11.3-py3 + cuda-version: "11.3" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + + linux-xenial-cuda11_3-py3_7-gcc7-bazel-test: + name: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + uses: ./.github/workflows/_bazel-build-test.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single: + name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit: + name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + linux-xenial-py3_7-gcc5_4-mobile-lightweight-dispatch-build: + name: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build + docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + build-generates-artifacts: false + + deploy-linux-xenial-cuda11_3-py3_7-gcc7-build: + name: deploy-linux-xenial-cuda11.3-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: deploy-linux-xenial-cuda11.3-py3.7-gcc7 + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + deploy-linux-xenial-cuda11_3-py3_7-gcc7-test: + name: linux-xenial-cuda11.3-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: deploy-linux-xenial-cuda11_3-py3_7-gcc7-build + with: + build-environment: deploy-linux-xenial-cuda11.3-py3.7-gcc7 + docker-image: ${{ 
needs.deploy-linux-xenial-cuda11_3-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + ]} diff --git a/.github/workflows/push_nightly_docker_ghcr.yml b/.github/workflows/push_nightly_docker_ghcr.yml index 3a2ce8d6bcde20..ca30c9651ff8f3 100644 --- a/.github/workflows/push_nightly_docker_ghcr.yml +++ b/.github/workflows/push_nightly_docker_ghcr.yml @@ -1,22 +1,30 @@ -name: Build PyTorch nightly Docker image and push to GitHub Container Registry +name: docker-release-builds on: schedule: # Push the nightly docker daily at 1 PM UTC - cron: '0 13 * * *' + # Trigger when we modify something related to these images + pull_request: + paths: + - .github/scripts/build_publish_nightly_docker.sh + - .github/workflows/push_nightly_docker_ghcr.yml + - Dockerfile + - docker.Makefile # Have the ability to trigger this job manually using the API as well workflow_dispatch: jobs: - build-publish-docker: + docker-release-build: if: ${{ github.repository == 'pytorch/pytorch' }} runs-on: linux.2xlarge env: GHCR_PAT: ${{ secrets.GHCR_PAT }} + WITH_PUSH: ${{ github.event_name == 'schedule' }} steps: - - name: Checkout + - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: - ref: master + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a name: Build and upload nightly docker with: @@ -25,3 +33,7 @@ jobs: command: | set -ex bash .github/scripts/build_publish_nightly_docker.sh + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true diff --git a/.github/workflows/revert.yml b/.github/workflows/revert.yml index fa5451d9695119..22e0508d88b8f8 100644 --- a/.github/workflows/revert.yml +++ b/.github/workflows/revert.yml @@ -27,6 +27,12 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} + COMMENT_ID: ${{ github.event.client_payload.comment_id }} GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} run: | - python3 .github/scripts/trymerge.py --revert "${PR_NUM}" + set -ex + if [ -n "${COMMENT_ID}" ]; then + python3 .github/scripts/trymerge.py --revert --comment-id "${COMMENT_ID}" "${PR_NUM}" + else + python3 .github/scripts/trymerge.py --revert "${PR_NUM}" + fi diff --git a/.github/workflows/run_android_tests.yml b/.github/workflows/run_android_tests.yml new file mode 100644 index 00000000000000..85cef5623d7ed9 --- /dev/null +++ b/.github/workflows/run_android_tests.yml @@ -0,0 +1,67 @@ +name: android-tests + +on: + push: + tags: + # Trigger on release candidate builds + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + - 'ciflow/trunk/*' + - 'ciflow/android/*' + branches: + - master + - main + - release/* + workflow_dispatch: + +concurrency: + group: run-android-tests-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +defaults: + run: + shell: bash -e -l {0} + +jobs: + + build-and-test: + runs-on: ubuntu-latest + env: + JOB_BASE_NAME: ubuntu-latest-android-tests + steps: + - name: Setup miniconda + uses: conda-incubator/setup-miniconda@v2 + with: + auto-update-conda: true + python-version: 3.8 + activate-environment: build + 
+ - name: Install dependencies + run: | + conda install -y \ + cffi \ + cmake \ + mkl \ + mkl-include \ + ninja \ + numpy \ + pyyaml \ + requests \ + setuptools \ + typing_extensions + + # [see note: pytorch repo ref] + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - name: Build PyTorch Android + run: | + export ANDROID_NDK="${ANDROID_SDK_ROOT}/ndk-bundle" + echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" + ./scripts/build_pytorch_android.sh x86 + + - name: Run tests + uses: reactivecircus/android-emulator-runner@v2 + with: + api-level: 25 + script: ./android/run_tests.sh diff --git a/.github/workflows/run_torchbench.yml b/.github/workflows/run_torchbench.yml index 5fe6cb772a6a58..d84a32ca318e1c 100644 --- a/.github/workflows/run_torchbench.yml +++ b/.github/workflows/run_torchbench.yml @@ -36,10 +36,15 @@ jobs: # shellcheck disable=SC1091 . "${HOME}"/anaconda3/etc/profile.d/conda.sh conda activate pr-ci - conda install -y numpy requests ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions \ + # pin cmake version to 3.22 since 3.23 breaks pytorch build + # see details at: https://github.com/pytorch/pytorch/issues/74985 + conda install -y numpy requests ninja pyyaml mkl mkl-include setuptools cmake=3.22 cffi typing_extensions \ future six dataclasses pillow pytest tabulate gitpython git-lfs tqdm psutil # install magma conda install -y -c pytorch "${MAGMA_VERSION}" + # install ffmpeg-4.4.1 + # torchvision doesn't compile on ffmpeg-5: https://github.com/pytorch/vision/issues/5616 + conda install -y ffmpeg=4.4.1 - name: Setup TorchBench branch run: | # shellcheck disable=SC1091 @@ -84,5 +89,5 @@ jobs: path: ~/.torchbench/bisection/pr${{ github.event.number }} concurrency: - group: run-torchbench-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} cancel-in-progress: true diff --git a/.github/workflows/test_tools.yml b/.github/workflows/test_tools.yml deleted file mode 100644 index 18e8339fb92b24..00000000000000 --- a/.github/workflows/test_tools.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: Test tools - -on: - push: - branches: - - master - - main - pull_request: - -jobs: - test: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: ubuntu-18.04 - steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - fetch-depth: 0 # deep clone, to allow us to use git log - - name: Install dependencies - # mypy and boto3 versions copied from - # .circleci/docker/common/install_conda.sh - run: | - set -eux - python3 -mpip install -r requirements.txt - python3 -mpip install boto3==1.16.34 - make setup_lint - - name: Test tools - run: | - python3 -m unittest discover -vs tools/test -p 'test_*.py' - python3 -m unittest discover -vs .github/scripts -p 'test_*.py' - -concurrency: - group: test-tools-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/trunk.yml b/.github/workflows/trunk.yml new file mode 100644 index 00000000000000..e7f051effdd9db --- /dev/null +++ b/.github/workflows/trunk.yml @@ -0,0 +1,222 @@ +name: trunk + +on: + push: + branches: + - 
master + - main + - release/* + tags: + - ciflow/trunk/* + - ciflow/all/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + parallelnative-linux-xenial-py3_7-gcc5_4-build: + name: parallelnative-linux-xenial-py3.7-gcc5.4 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: parallelnative-linux-xenial-py3.7-gcc5.4 + docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + + parallelnative-linux-xenial-py3_7-gcc5_4-test: + name: parallelnative-linux-xenial-py3.7-gcc5.4 + uses: ./.github/workflows/_linux-test.yml + needs: parallelnative-linux-xenial-py3_7-gcc5_4-build + with: + build-environment: parallelnative-linux-xenial-py3.7-gcc5.4 + docker-image: ${{ needs.parallelnative-linux-xenial-py3_7-gcc5_4-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} + + # Build PyTorch with BUILD_CAFFE2=ON + caffe2-linux-xenial-py3_7-gcc5_4-build: + name: caffe2-linux-xenial-py3.7-gcc5.4 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: caffe2-linux-xenial-py3.7-gcc5.4 + docker-image-name: pytorch-linux-xenial-py3.7-gcc5.4 + + linux-bionic-cuda10_2-py3_9-gcc7-build: + name: linux-bionic-cuda10.2-py3.9-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda10.2-py3.9-gcc7 + docker-image-name: pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 + + linux-bionic-cuda10_2-py3_9-gcc7-test: + name: linux-bionic-cuda10.2-py3.9-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda10_2-py3_9-gcc7-build + with: + build-environment: linux-bionic-cuda10.2-py3.9-gcc7 + docker-image: ${{ needs.linux-bionic-cuda10_2-py3_9-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "slow", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "nogpu_NO_AVX", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "multigpu", shard: 1, num_shards: 1, runner: "linux.16xlarge.nvidia.gpu" }, + ]} + + libtorch-linux-xenial-cuda10_2-py3_7-gcc7-build: + name: libtorch-linux-xenial-cuda10.2-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-xenial-cuda10.2-py3.7-gcc7 + docker-image-name: pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 + build-generates-artifacts: false + + libtorch-linux-xenial-cuda11_3-py3_7-gcc7-build: + name: libtorch-linux-xenial-cuda11.3-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-xenial-cuda11.3-py3.7-gcc7 + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + build-generates-artifacts: false + + # no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated + linux-xenial-cuda11_3-py3_7-gcc7-no-ops-build: + name: linux-xenial-cuda11.3-py3.7-gcc7-no-ops + uses: 
./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-no-ops + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + linux-bionic-rocm4_5-py3_7-distributed-build: + name: linux-bionic-rocm5.0-py3.7-distributed + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-rocm5.0-py3.7 + docker-image-name: pytorch-linux-bionic-rocm5.0-py3.7 + + linux-bionic-rocm4_5-py3_7-distributed-test: + name: linux-bionic-rocm5.0-py3.7-distributed + uses: ./.github/workflows/_rocm-test.yml + needs: linux-bionic-rocm4_5-py3_7-distributed-build + with: + build-environment: linux-bionic-rocm5.0-py3.7 + docker-image: ${{ needs.linux-bionic-rocm4_5-py3_7-distributed-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" }, + ]} + + pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build: + name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build + uses: ./.github/workflows/_android-full-build-test.yml + with: + build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + secrets: + SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} + SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} + ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} + ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} + SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} + + linux-bionic-py3_7-clang9-slow-build: + name: linux-bionic-py3.7-clang9-slow + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3.7-clang9-slow + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-bionic-py3_7-clang9-slow-test: + name: linux-bionic-py3.7-clang9-slow + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang9-slow-build + with: + build-environment: linux-bionic-py3.7-clang9-slow + docker-image: ${{ needs.linux-bionic-py3_7-clang9-slow-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "slow", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + ios-12-5-1-x86-64: + name: ios-12-5-1-x86-64 + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-x86-64 + ios-platform: SIMULATOR + ios-arch: x86_64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} + + ios-12-5-1-x86-64-coreml: + name: ios-12-5-1-x86-64-coreml + uses: ./.github/workflows/_ios-build-test.yml + with: + build-environment: ios-12-5-1-x86-64-coreml + ios-platform: SIMULATOR + ios-arch: x86_64 + secrets: + IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} + IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} + IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} + IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} + + macos-11-py3-x86-64-build: + name: macos-11-py3-x86-64 + uses: ./.github/workflows/_mac-build.yml + with: + build-environment: macos-11-py3-x86-64 + xcode-version: "12.4" + runner-type: macos-11 + build-generates-artifacts: true + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} + + macos-11-py3-x86-64-test: + name: macos-11-py3-x86-64 + uses: 
./.github/workflows/_mac-test.yml + needs: macos-11-py3-x86-64-build + with: + build-environment: macos-11-py3-x86-64 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "macos-11" }, + { config: "default", shard: 2, num_shards: 2, runner: "macos-11" }, + ]} + secrets: + AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} + + macos-10-15-py3-lite-interpreter-x86-64: + name: macos-10-15-py3-lite-interpreter-x86-64 + uses: ./.github/workflows/_mac-build.yml + with: + build-environment: macos-10-15-py3-lite-interpreter-x86-64 + xcode-version: "12" + runner-type: macos-10.15 + build-generates-artifacts: false + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} + + macos-10-15-py3-arm64: + name: macos-10-15-py3-arm64 + uses: ./.github/workflows/_mac-build.yml + with: + build-environment: macos-10-15-py3-arm64 + runner-type: macos-10.15 + build-generates-artifacts: false + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} diff --git a/.github/workflows/trymerge.yml b/.github/workflows/trymerge.yml index ae29ab82462a65..6da9e872ce46e8 100644 --- a/.github/workflows/trymerge.yml +++ b/.github/workflows/trymerge.yml @@ -28,5 +28,10 @@ jobs: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} + FORCE: ${{ github.event.client_payload.force}} run: | - python3 .github/scripts/trymerge.py "${PR_NUM}" + if [ -n "${FORCE}" ]; then + python3 .github/scripts/trymerge.py --force "${PR_NUM}" + else + python3 .github/scripts/trymerge.py "${PR_NUM}" + fi diff --git a/.github/workflows/update_pytorch_labels.yml b/.github/workflows/update_pytorch_labels.yml index 82061efa3c3caf..f19347070ecef7 100644 --- a/.github/workflows/update_pytorch_labels.yml +++ b/.github/workflows/update_pytorch_labels.yml @@ -17,8 +17,8 @@ jobs: uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - name: Update PyTorch labels list in S3 env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }} + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} run: | python3 -m pip install boto3==1.19.12 .github/scripts/export_pytorch_labels.py diff --git a/.github/workflows/upload-test-stats.yml b/.github/workflows/upload-test-stats.yml new file mode 100644 index 00000000000000..bfed85e5131e19 --- /dev/null +++ b/.github/workflows/upload-test-stats.yml @@ -0,0 +1,35 @@ +name: Upload test stats + +on: + workflow_run: + workflows: [pull, trunk, periodic] + types: + - completed + +jobs: + upload-test-stats: + if: github.event.workflow_run.conclusion == 'success' || github.event.workflow_run.conclusion == 'failure' + runs-on: [self-hosted, linux.2xlarge] + + steps: + - name: Print workflow information + env: + TRIGGERING_WORKFLOW: ${{ toJSON(github.event.workflow_run) }} + run: echo "${TRIGGERING_WORKFLOW}" + + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + + - 
run: | + pip3 install requests==2.26 + pip3 install rockset==0.8.3 + pip3 install boto3==1.19.12 + pip3 install six==1.16.0 + + - name: Upload test stats + env: + ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }} + WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }} + run: python3 tools/stats/upload_test_stats.py --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" diff --git a/.gitignore b/.gitignore index 4a332afb8d0e04..b95fc1a1d9dae6 100644 --- a/.gitignore +++ b/.gitignore @@ -35,6 +35,7 @@ aten/src/ATen/cuda/CUDAConfig.h benchmarks/.data caffe2/cpp_test/ dist/ +docs/build/ docs/cpp/src docs/src/**/* docs/cpp/build @@ -66,8 +67,11 @@ torch/_C/__init__.pyi torch/_C/_nn.pyi torch/_C/_VariableFunctions.pyi torch/_VF.pyi +torch/return_types.pyi torch/nn/functional.pyi +torch/utils/data/datapipes/datapipe.pyi torch/csrc/autograd/generated/* +torch/csrc/lazy/generated/* # Listed manually because some files in this directory are not generated torch/testing/_internal/generated/annotated_fn_args.py torch/testing/_internal/data/*.pt @@ -137,6 +141,7 @@ scripts/release_notes/*.json compile_commands.json *.egg-info/ docs/source/scripts/activation_images/ +docs/source/scripts/quantization_backend_configs/ ## General @@ -307,7 +312,7 @@ bazel-* *.zip # core dump files -core.* +**/core.[1-9]* # Generated if you use the pre-commit script for clang-tidy pr.diff diff --git a/.gitmodules b/.gitmodules index 9c9373ef7229ae..c3c93bb76584c8 100644 --- a/.gitmodules +++ b/.gitmodules @@ -9,7 +9,7 @@ [submodule "third_party/eigen"] ignore = dirty path = third_party/eigen - url = https://github.com/eigenteam/eigen-git-mirror.git + url = https://gitlab.com/libeigen/eigen.git [submodule "third_party/googletest"] ignore = dirty path = third_party/googletest diff --git a/.jenkins/caffe2/test.sh b/.jenkins/caffe2/test.sh index fd626d09c3e221..17a5cf796deb0b 100755 --- a/.jenkins/caffe2/test.sh +++ b/.jenkins/caffe2/test.sh @@ -134,19 +134,15 @@ if [[ $BUILD_ENVIRONMENT == *-rocm* ]]; then rocm_ignore_test+=("--ignore $caffe2_pypath/python/ideep/pool_op_test.py") fi -# NB: Warnings are disabled because they make it harder to see what -# the actual erroring test is echo "Running Python tests.." 
-if [[ "$BUILD_ENVIRONMENT" == *py3* ]]; then - # locale setting is required by click package with py3 - for loc in "en_US.utf8" "C.UTF-8"; do - if locale -a | grep "$loc" >/dev/null 2>&1; then - export LC_ALL="$loc" - export LANG="$loc" - break; - fi - done -fi +# locale setting is required by click package +for loc in "en_US.utf8" "C.UTF-8"; do + if locale -a | grep "$loc" >/dev/null 2>&1; then + export LC_ALL="$loc" + export LANG="$loc" + break; + fi +done # Some Caffe2 tests fail when run using AVX512 ISA, see https://github.com/pytorch/pytorch/issues/66111 export DNNL_MAX_CPU_ISA=AVX2 @@ -154,6 +150,8 @@ export DNNL_MAX_CPU_ISA=AVX2 # Should still run even in the absence of SHARD_NUMBER if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then pip install --user pytest-sugar + # NB: Warnings are disabled because they make it harder to see what + # the actual erroring test is "$PYTHON" \ -m pytest \ -x \ @@ -170,18 +168,18 @@ if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then "${EXTRA_TESTS[@]}" fi -##################### -# torchvision tests # -##################### +############## +# ONNX tests # +############## if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then # Check out torch/vision at 0.9.0-rc1 commit # This hash must match one in .jenkins/pytorch/test.sh pip install -q --user git+https://github.com/pytorch/vision.git@8a2dc6f22ac4389ccba8859aa1e1cb14f1ee53db - pip install -q --user ninja + pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.11.0 + # numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21. + # We don't actually need it for our tests, but it's imported if it's present, so uninstall. + pip uninstall -q --yes numba # JIT C++ extensions require ninja, so put it into PATH. export PATH="/var/lib/jenkins/.local/bin:$PATH" - if [[ "$BUILD_ENVIRONMENT" == *py3* ]]; then - pip install -q --user flatbuffers==2.0 onnxruntime==1.9.0 - fi "$ROOT_DIR/scripts/onnx/test.sh" fi diff --git a/.jenkins/pytorch/build.sh b/.jenkins/pytorch/build.sh index 01faa947634d60..977b977609eff6 100755 --- a/.jenkins/pytorch/build.sh +++ b/.jenkins/pytorch/build.sh @@ -20,7 +20,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@" fi -if [[ "$BUILD_ENVIRONMENT" == *linux-xenial-cuda11.3* || "$BUILD_ENVIRONMENT" == *linux-bionic-cuda11.5* ]]; then +if [[ "$BUILD_ENVIRONMENT" == *linux-xenial-cuda11.3* || "$BUILD_ENVIRONMENT" == *linux-bionic-cuda11.5* || "$BUILD_ENVIRONMENT" == *linux-bionic-cuda11.6* ]]; then # Enabling DEPLOY build (embedded torch python interpreter, experimental) # only on one config for now, can expand later export USE_DEPLOY=ON diff --git a/.jenkins/pytorch/common.sh b/.jenkins/pytorch/common.sh index be5245bf19bc97..e8ce4b2ecb4d31 100644 --- a/.jenkins/pytorch/common.sh +++ b/.jenkins/pytorch/common.sh @@ -8,6 +8,13 @@ set -ex # Save the SCRIPT_DIR absolute path in case later we chdir (as occurs in the gpu perf test) SCRIPT_DIR="$( cd "$(dirname "${BASH_SOURCE[0]}")" ; pwd -P )" +if [[ "${BUILD_ENVIRONMENT}" == *linux* ]]; then + # TODO: Remove this once nvidia package repos are back online + # Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968 + # shellcheck disable=SC2046 + sudo sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list") +fi + # Required environment variables: # $BUILD_ENVIRONMENT (should be set by your Docker image) @@ -145,7 +152,8 @@ fi # export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which 
conda))/../"} if [[ "${TEST_CONFIG:-}" == *xla* ]] || \ [[ "$BUILD_ENVIRONMENT" == *centos* ]] || \ - [[ "$BUILD_ENVIRONMENT" == *linux-bionic* ]]; then + [[ "$BUILD_ENVIRONMENT" == *linux-bionic* ]] || \ + [[ "$BUILD_ENVIRONMENT" == *linux-focal* ]]; then if ! which conda; then echo "Expected ${BUILD_ENVIRONMENT} to use conda, but 'which conda' returns empty" exit 1 diff --git a/.jenkins/pytorch/common_utils.sh b/.jenkins/pytorch/common_utils.sh index 54bd44d3ccc6de..4169f6a2cb8c79 100644 --- a/.jenkins/pytorch/common_utils.sh +++ b/.jenkins/pytorch/common_utils.sh @@ -60,19 +60,18 @@ function get_pr_change_files() { set -e } -function file_diff_from_base() { - # The fetch may fail on Docker hosts, this fetch is necessary for GHA - set +e - git fetch origin master --quiet - set -e - git diff --name-only "$(git merge-base origin/master HEAD)" > "$1" -} - function get_bazel() { - # download bazel version - wget https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -O tools/bazel - # verify content - echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | sha256sum --quiet -c + if [[ $(uname) == "Darwin" ]]; then + # download bazel version + curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel + # verify content + echo '74d93848f0c9d592e341e48341c53c87e3cb304a54a2a1ee9cff3df422f0b23c tools/bazel' | shasum -a 256 -c >/dev/null + else + # download bazel version + curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel + # verify content + echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | shasum -a 256 -c >/dev/null + fi chmod +x tools/bazel } diff --git a/.jenkins/pytorch/macos-test.sh b/.jenkins/pytorch/macos-test.sh index 28f86c6e6e5dae..63e90c05bdd5eb 100755 --- a/.jenkins/pytorch/macos-test.sh +++ b/.jenkins/pytorch/macos-test.sh @@ -10,7 +10,9 @@ conda install -y six pip install -q hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" # TODO move this to docker -pip install unittest-xml-reporting pytest +# Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 +pip install "unittest-xml-reporting<=3.2.0,>=2.0.0" \ + pytest if [ -z "${IN_CI}" ]; then rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch* diff --git a/.jenkins/pytorch/multigpu-test.sh b/.jenkins/pytorch/multigpu-test.sh index 2d119d09a70c07..481619a8dc314d 100755 --- a/.jenkins/pytorch/multigpu-test.sh +++ b/.jenkins/pytorch/multigpu-test.sh @@ -13,7 +13,8 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh" echo "Testing pytorch (distributed only)" if [ -n "${IN_CI}" ]; then # TODO move this to docker - pip_install unittest-xml-reporting + # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 + pip_install "unittest-xml-reporting<=3.2.0,>=2.0.0" fi # Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015 diff --git a/.jenkins/pytorch/short-perf-test-cpu.sh b/.jenkins/pytorch/short-perf-test-cpu.sh index f2e02b52974c69..ff9ef7a84eee75 100755 --- a/.jenkins/pytorch/short-perf-test-cpu.sh +++ b/.jenkins/pytorch/short-perf-test-cpu.sh @@ -17,14 +17,15 @@ pip install -q awscli # Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read # More info at https://github.com/aws/aws-cli/issues/2321 aws configure 
set default.s3.multipart_threshold 5GB +UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')" -if [[ "$COMMIT_SOURCE" == master ]]; then - # Get current master commit hash - MASTER_COMMIT_ID=$(git log --format="%H" -n 1) - export MASTER_COMMIT_ID +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then + # Get current default branch commit hash + DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1) + export DEFAULT_BRANCH_COMMIT_ID fi -# Find the master commit to test against +# Find the default branch commit to test against git remote add upstream https://github.com/pytorch/pytorch.git git fetch upstream IFS=$'\n' @@ -33,13 +34,13 @@ while IFS='' read -r commit_id; do LATEST_TESTED_COMMIT=${commit_id} break fi -done < <(git rev-list upstream/master) +done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH") aws s3 cp s3://ossci-perf-test/pytorch/cpu_runtime/"${LATEST_TESTED_COMMIT}".json cpu_runtime.json -if [[ "$COMMIT_SOURCE" == master ]]; then +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then # Prepare new baseline file cp cpu_runtime.json new_cpu_runtime.json - python update_commit_hash.py new_cpu_runtime.json "${MASTER_COMMIT_ID}" + python update_commit_hash.py new_cpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}" fi # Include tests @@ -54,7 +55,7 @@ fi # Run tests export TEST_MODE="compare_with_baseline" -if [[ "$COMMIT_SOURCE" == master ]]; then +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then export TEST_MODE="compare_and_update" fi @@ -66,8 +67,8 @@ run_test test_cpu_speed_torch_tensor ${TEST_MODE} run_test test_cpu_speed_mini_sequence_labeler 20 ${TEST_MODE} run_test test_cpu_speed_mnist 20 ${TEST_MODE} -if [[ "$COMMIT_SOURCE" == master ]]; then - # This could cause race condition if we are testing the same master commit twice, +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then + # This could cause race condition if we are testing the same default branch commit twice, # but the chance of them executing this line at the same time is low. 
- aws s3 cp new_cpu_runtime.json s3://ossci-perf-test/pytorch/cpu_runtime/"${MASTER_COMMIT_ID}".json --acl public-read + aws s3 cp new_cpu_runtime.json s3://ossci-perf-test/pytorch/cpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read fi diff --git a/.jenkins/pytorch/short-perf-test-gpu.sh b/.jenkins/pytorch/short-perf-test-gpu.sh index 4d8efee8dc2019..bde8ca5c9dd311 100755 --- a/.jenkins/pytorch/short-perf-test-gpu.sh +++ b/.jenkins/pytorch/short-perf-test-gpu.sh @@ -17,14 +17,15 @@ pip install -q awscli --ignore-installed PyYAML # Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read # More info at https://github.com/aws/aws-cli/issues/2321 aws configure set default.s3.multipart_threshold 5GB +UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')" -if [[ "$COMMIT_SOURCE" == master ]]; then - # Get current master commit hash - MASTER_COMMIT_ID=$(git log --format="%H" -n 1) - export MASTER_COMMIT_ID +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then + # Get current default branch commit hash + DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1) + export DEFAULT_BRANCH_COMMIT_ID fi -# Find the master commit to test against +# Find the default branch commit to test against git remote add upstream https://github.com/pytorch/pytorch.git git fetch upstream IFS=$'\n' @@ -33,13 +34,13 @@ while IFS='' read -r commit_id; do LATEST_TESTED_COMMIT=${commit_id} break fi -done < <(git rev-list upstream/master) +done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH") aws s3 cp s3://ossci-perf-test/pytorch/gpu_runtime/"${LATEST_TESTED_COMMIT}".json gpu_runtime.json -if [[ "$COMMIT_SOURCE" == master ]]; then +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then # Prepare new baseline file cp gpu_runtime.json new_gpu_runtime.json - python update_commit_hash.py new_gpu_runtime.json "${MASTER_COMMIT_ID}" + python update_commit_hash.py new_gpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}" fi # Include tests @@ -55,7 +56,7 @@ fi . ./test_gpu_speed_mlstm.sh # Run tests -if [[ "$COMMIT_SOURCE" == master ]]; then +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then run_test test_gpu_speed_mnist 20 compare_and_update run_test test_gpu_speed_word_language_model 20 compare_and_update run_test test_gpu_speed_cudnn_lstm 20 compare_and_update @@ -69,10 +70,10 @@ else run_test test_gpu_speed_mlstm 20 compare_with_baseline fi -if [[ "$COMMIT_SOURCE" == master ]]; then - # This could cause race condition if we are testing the same master commit twice, +if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then + # This could cause race condition if we are testing the same default branch commit twice, # but the chance of them executing this line at the same time is low. 
- aws s3 cp new_gpu_runtime.json s3://ossci-perf-test/pytorch/gpu_runtime/"${MASTER_COMMIT_ID}".json --acl public-read + aws s3 cp new_gpu_runtime.json s3://ossci-perf-test/pytorch/gpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read fi popd diff --git a/.jenkins/pytorch/test.sh b/.jenkins/pytorch/test.sh index 4514aa86330522..b4353c55c10bc1 100755 --- a/.jenkins/pytorch/test.sh +++ b/.jenkins/pytorch/test.sh @@ -77,6 +77,7 @@ fi if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then # Print GPU info + rocminfo rocminfo | grep -E 'Name:.*\sgfx|Marketing' # Manually set NUM_TEST_SHARDS since Jenkins doesn't do it @@ -274,6 +275,14 @@ test_libtorch() { else "$TORCH_BIN_DIR"/test_jit --gtest_filter='-*CUDA' --gtest_output=xml:$TEST_REPORTS_DIR/test_jit.xml fi + + # Run Lazy Tensor cpp tests + if [[ "$BUILD_ENVIRONMENT" == *cuda* && "$BUILD_ENVIRONMENT" != *nogpu* ]]; then + LTC_TS_CUDA=1 "$TORCH_BIN_DIR"/test_lazy --gtest_output=xml:$TEST_REPORTS_DIR/test_lazy.xml + else + "$TORCH_BIN_DIR"/test_lazy --gtest_output=xml:$TEST_REPORTS_DIR/test_lazy.xml + fi + python test/cpp/jit/tests_setup.py shutdown # Wait for background download to finish wait @@ -518,7 +527,7 @@ test_torch_deploy() { ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR" ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR" "$TORCH_BIN_DIR"/test_deploy - "$TORCH_BIN_DIR"/test_api --gtest_filter='IMethodTest.*' + "$TORCH_BIN_DIR"/test_deploy_gpu assert_git_not_dirty } @@ -530,8 +539,9 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze (cd test && python -c "import torch; print(torch.__config__.show())") (cd test && python -c "import torch; print(torch.__config__.parallel_info())") fi - -if [[ "${BUILD_ENVIRONMENT}" == *backward* ]]; then +if [[ "${BUILD_ENVIRONMENT}" == *deploy* ]]; then + test_torch_deploy +elif [[ "${BUILD_ENVIRONMENT}" == *backward* ]]; then test_forward_backward_compatibility # Do NOT add tests after bc check tests, see its comment. 
elif [[ "${TEST_CONFIG}" == *xla* ]]; then @@ -544,9 +554,6 @@ elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then # TODO: run some C++ tests echo "no-op at the moment" elif [[ "${BUILD_ENVIRONMENT}" == *-test1 || "${JOB_BASE_NAME}" == *-test1 || ("${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1) ]]; then - if [[ "${BUILD_ENVIRONMENT}" == *linux-xenial-cuda11.1*-test1* ]]; then - test_torch_deploy - fi test_without_numpy install_torchvision test_python_shard 1 diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat index 20b3b4db4c0256..65784863124529 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat @@ -22,7 +22,7 @@ if "%INSTALL_FRESH_CONDA%"=="1" ( call conda install -y -q python=%PYTHON_VERSION% numpy cffi pyyaml boto3 libuv if errorlevel 1 exit /b if not errorlevel 0 exit /b - call conda install -y -q -c conda-forge cmake + call conda install -y -q -c conda-forge cmake=3.22.3 if errorlevel 1 exit /b if not errorlevel 0 exit /b ) diff --git a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat index 0ad44db5b47dde..c7f3e1b6a6140c 100644 --- a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat +++ b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat @@ -34,7 +34,9 @@ popd :: The version is fixed to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136 ======= -pip install "ninja==1.10.0.post1" future "hypothesis==4.53.2" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow unittest-xml-reporting pytest +:: Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 + +pip install "ninja==1.10.0.post1" future "hypothesis==4.53.2" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/BUILD.bazel b/BUILD.bazel index 686e798d9765cd..197592f81e0d14 100644 --- a/BUILD.bazel +++ b/BUILD.bazel @@ -3,7 +3,7 @@ load("@pybind11_bazel//:build_defs.bzl", "pybind_extension") load("@rules_proto//proto:defs.bzl", "proto_library") load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_library", "cc_proto_library", "cc_test") load("//third_party:substitution.bzl", "header_template_rule") -load("//:tools/build_variables.bzl", "jit_core_sources", "libtorch_core_sources", "libtorch_cuda_sources", "libtorch_distributed_sources", "libtorch_extra_sources", "libtorch_nvfuser_generated_headers", "libtorch_nvfuser_runtime_sources", "libtorch_python_core_sources", "torch_cpp_srcs") +load("//:tools/build_variables.bzl", "jit_core_sources", "libtorch_core_sources", "libtorch_cuda_sources", "libtorch_distributed_sources", "libtorch_extra_sources", "libtorch_nvfuser_generated_headers", "libtorch_nvfuser_runtime_sources", "libtorch_python_core_sources", "torch_cpp_srcs", "lazy_tensor_ts_sources") load("//tools/rules:cu.bzl", "cu_library") load("//tools/config:defs.bzl", "if_cuda") load("//:aten.bzl", "intern_build_aten_ops", "generate_aten", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cuda_sources") @@ -25,16 +25,6 @@ COMMON_COPTS = [ "-DUSE_CUDNN", ]) -# TODO: refactor this into its own library (but how to make -# a binary based off of a module in a 
library?) -py_binary( - name = "gen", - srcs = ["tools/setup_helpers/gen.py"], - deps = [ - ":tools_codegen" - ], -) - aten_generation_srcs = ["aten/src/ATen/native/native_functions.yaml"] + glob(["aten/src/ATen/templates/**"]) generated_cpu_cpp = [ @@ -102,37 +92,14 @@ generate_aten( aten_ufunc_generated_cuda_sources("aten/src/ATen/{}") + ["aten/src/ATen/Declarations.yaml"] ), - generator=":gen", -) - -py_library( - name = "tools_codegen", - srcs = glob(["tools/codegen/**/*.py"]), -) - -py_library( - name = "tools_autograd", - srcs = glob(["tools/autograd/*.py"]), - data = glob([ - "tools/autograd/*.yaml", - "tools/autograd/templates/*", - ]), - deps = [":tools_codegen"], + generator = "//tools/codegen:gen", ) py_library( name = "tools_jit", srcs = glob(["tools/jit/*.py"]), data = glob(["tools/jit/templates/*"]), -) - -py_binary( - name = "generate_code", - srcs = ["tools/setup_helpers/generate_code.py"], - deps = [ - ":tools_autograd", - ":tools_jit", - ], + visibility = ["//tools/setup_helpers:__pkg__"], ) libtorch_cpp_generated_sources = [ @@ -155,6 +122,11 @@ libtorch_cpp_generated_sources = [ "torch/csrc/autograd/generated/Functions.h", "torch/csrc/autograd/generated/Functions.cpp", "torch/csrc/autograd/generated/variable_factories.h", + "torch/csrc/lazy/generated/LazyIr.h", + "torch/csrc/lazy/generated/LazyNativeFunctions.h", + "torch/csrc/lazy/generated/LazyNativeFunctions.cpp", + "torch/csrc/lazy/generated/RegisterAutogradLazy.cpp", + "torch/csrc/lazy/generated/RegisterLazy.cpp", ] libtorch_python_generated_sources = [ @@ -180,10 +152,17 @@ genrule( name = "all_generated_code", srcs = [ "aten/src/ATen/native/native_functions.yaml", + "aten/src/ATen/native/ts_native_functions.yaml", + "torch/csrc/lazy/core/shape_inference.h", + "torch/csrc/lazy/ts_backend/ts_native_functions.cpp", + "aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp", + "aten/src/ATen/templates/DispatchKeyNativeFunctions.h", + "aten/src/ATen/templates/RegisterDispatchKey.cpp", + "aten/src/ATen/templates/LazyIr.h", ], outs = libtorch_cpp_generated_sources + libtorch_python_generated_sources, - cmd = "$(location :generate_code) --install_dir `dirname $(location torch/csrc/autograd/generated/variable_factories.h)`/../.. --native-functions-path $(location aten/src/ATen/native/native_functions.yaml) --nn-path aten/src", - tools = [":generate_code"], + cmd = "$(location //tools/setup_helpers:generate_code) --install_dir `dirname $(location torch/csrc/autograd/generated/variable_factories.h)`/../.. 
--native-functions-path $(location aten/src/ATen/native/native_functions.yaml) --gen_lazy_ts_backend", + tools = ["//tools/setup_helpers:generate_code"], ) filegroup( @@ -1368,7 +1347,7 @@ cc_library( py_binary( name = "gen_op", srcs = ["caffe2/contrib/aten/gen_op.py"], - deps = [":tools_codegen"], + deps = ["//tools/codegen"], ) genrule( @@ -1636,17 +1615,12 @@ cc_library( ) # torch -py_binary( - name = "gen_version_header", - srcs = ["tools/setup_helpers/gen_version_header.py"], -) - genrule( name = "version_h", srcs = ["torch/csrc/api/include/torch/version.h.in", "version.txt"], outs = ["torch/csrc/api/include/torch/version.h"], - cmd = "$(location :gen_version_header) --template-path $(location torch/csrc/api/include/torch/version.h.in) --version-path $(location version.txt) --output-path $@", - tools = [':gen_version_header'], + cmd = "$(location //tools/setup_helpers:gen_version_header) --template-path $(location torch/csrc/api/include/torch/version.h.in) --version-path $(location version.txt) --output-path $@", + tools = ['//tools/setup_helpers:gen_version_header'], ) py_binary( @@ -1732,7 +1706,7 @@ cc_library( "torch/csrc/cuda/nccl.cpp", "torch/csrc/distributed/c10d/quantization/quantization_gpu.cu", ], - )) + libtorch_core_sources + libtorch_distributed_sources + torch_cpp_srcs + libtorch_extra_sources + jit_core_sources + [ + )) + libtorch_core_sources + libtorch_distributed_sources + torch_cpp_srcs + libtorch_extra_sources + jit_core_sources + lazy_tensor_ts_sources +[ ":cpp_generated_code", "torch/csrc/jit/serialization/flatbuffer_serializer.cpp", "torch/csrc/jit/mobile/flatbuffer_loader.cpp" @@ -1915,6 +1889,11 @@ cc_test( srcs = glob([ "test/cpp/lazy/*.cpp", "test/cpp/lazy/*.h", + ], exclude=[ + # skip these since they depend on generated LazyIr.h which isn't available in bazel yet + "test/cpp/lazy/test_ir.cpp", + "test/cpp/lazy/test_lazy_ops.cpp", + "test/cpp/lazy/test_lazy_ops_util.cpp", ]), linkstatic = True, tags = [ diff --git a/CMakeLists.txt b/CMakeLists.txt index 8b2e50ce52e7d5..c5c1aeb0b636ea 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -209,7 +209,7 @@ cmake_dependent_option( option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON) option(USE_KINETO "Use Kineto profiling library" ON) option(USE_BREAKPAD "Use breakpad crash dump library" ON) -option(USE_CUPTI_SO "Use CUPTI as a shared library" OFF) +option(USE_CUPTI_SO "Use CUPTI as a shared library" ON) option(USE_FAKELOWP "Use FakeLowp operators" OFF) option(USE_FFMPEG "Use ffmpeg" OFF) option(USE_GFLAGS "Use GFLAGS" OFF) @@ -304,6 +304,7 @@ set(MKLDNN_ENABLE_CONCURRENT_EXEC ${USE_MKLDNN}) cmake_dependent_option( USE_MKLDNN_CBLAS "Use CBLAS in MKLDNN" OFF "USE_MKLDNN" OFF) +option(USE_STATIC_MKL "Prefer to link with MKL statically (Unix only)" OFF) option(USE_DISTRIBUTED "Use distributed" ON) cmake_dependent_option( USE_MPI "Use MPI for Caffe2. Only available if USE_DISTRIBUTED is on." ON @@ -312,12 +313,15 @@ cmake_dependent_option( USE_GLOO "Use Gloo. Only available if USE_DISTRIBUTED is on." ON "USE_DISTRIBUTED" OFF) cmake_dependent_option( - USE_GLOO_WITH_OPENSSL "Use Gloo with OpenSSL. Only available if USE_GLOO is on." OFF + USE_GLOO_WITH_OPENSSL "Use Gloo with OpenSSL. Only available if USE_GLOO is on." 
OFF "USE_GLOO AND LINUX AND NOT INTERN_BUILD_MOBILE" OFF) cmake_dependent_option( USE_C10D_GLOO "USE C10D GLOO" ON "USE_DISTRIBUTED;USE_GLOO" OFF) cmake_dependent_option( USE_C10D_NCCL "USE C10D NCCL" ON "USE_DISTRIBUTED;USE_NCCL" OFF) +cmake_dependent_option( + USE_NCCL_WITH_UCC "Enable UCC support for ProcessGroupNCCL. Only available if USE_C10D_NCCL is on." OFF + "USE_C10D_NCCL" OFF) cmake_dependent_option( USE_C10D_MPI "USE C10D MPI" ON "USE_DISTRIBUTED;USE_MPI" OFF) cmake_dependent_option( @@ -336,6 +340,9 @@ cmake_dependent_option(USE_CCACHE "Attempt using CCache to wrap the compilation" option(WERROR "Build with -Werror supported by the compiler" OFF) option(USE_COREML_DELEGATE "Use the CoreML backend through delegate APIs" OFF) option(USE_PER_OPERATOR_HEADERS "Whether ATen should generate separate headers for each operator" ON) +cmake_dependent_option( + BUILD_LAZY_TS_BACKEND "Build the lazy Torchscript backend, not compatible with mobile builds" ON + "NOT INTERN_BUILD_MOBILE" OFF) if(USE_CCACHE) @@ -550,6 +557,8 @@ endif(NOT MSVC) # purpose. if(ANDROID OR IOS OR DEFINED ENV{BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN}) set(INTERN_BUILD_MOBILE ON) + message(WARNING "INTERN_BUILD_MOBILE is on, disabling BUILD_LAZY_TS_BACKEND") + set(BUILD_LAZY_TS_BACKEND OFF) if(DEFINED ENV{BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN}) # C10_MOBILE is derived from Android/iOS toolchain macros in @@ -789,6 +798,8 @@ if(NOT MSVC) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") string(APPEND CMAKE_CXX_FLAGS " -Wno-range-loop-analysis") string(APPEND CMAKE_CXX_FLAGS " -Wno-pass-failed") + # sign-compare is not part of -Wall, see https://godbolt.org/z/s1YczM41T + string(APPEND CMAKE_CXX_FLAGS " -Wsign-compare") endif() if(CMAKE_COMPILER_IS_GNUCXX AND NOT (CMAKE_CXX_COMPILER_VERSION VERSION_LESS 7.0.0)) string(APPEND CMAKE_CXX_FLAGS " -Wno-stringop-overflow") diff --git a/CODEOWNERS b/CODEOWNERS index e1d2bf0154b069..dd88eac8c2bb09 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -12,6 +12,7 @@ /torch/optim/ @albanD /test/test_public_bindings.py @albanD /docs/source/conf.py @albanD +/aten/src/ATen/native/native_functions.yaml @bdhirsh # Tensorpipe RPC Agent. /torch/csrc/distributed/rpc/tensorpipe_agent.cpp @jiayisuse @osalpekar @lw @beauby @@ -20,15 +21,15 @@ # Distributed package # This list is mostly if you'd like to be tagged as reviewer, feel free to add # or remove yourself from it. -/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @bowangbj -/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @bowangbj -/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @bowangbj +/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu +/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu +/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu # Distributed tests # This list is mostly if you'd like to be tagged as reviewer, feel free to add # or remove yourself from it. 
-/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @bowangbj -/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @bowangbj +/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu +/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu # ONNX Export /torch/csrc/jit/passes/onnx.h @bowenbao @shubhambhokare1 @@ -46,9 +47,9 @@ /.github/ @seemethere @janeyx99 @atalman # Custom Test Infrastructure -/test/run_test.py @pytorch-dev-infra +/test/run_test.py @pytorch/pytorch-dev-infra /torch/testing/_internal/common_device_type.py @mruberry -/torch/testing/_internal/common_utils.py @pytorch-dev-infra +/torch/testing/_internal/common_utils.py @pytorch/pytorch-dev-infra # Parametrizations /torch/nn/utils/parametriz*.py @lezcano diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 59b7ae8a488f5e..b20ecd3ffcb9d9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -512,7 +512,7 @@ missing file warnings but will still complete. For example, to work on `jit.rst` ```bash cd docs/source -ls | grep rst | grep -v index | grep -v jit | xargs rm +find . -type f | grep rst | grep -v index | grep -v jit | xargs rm # Make your changes, build the docs, etc. @@ -1098,8 +1098,7 @@ This internally invokes our driver script and closely mimics how clang-tidy is r ## Pre-commit tidy/linting hook -We use clang-tidy and flake8 (installed with flake8-bugbear, -flake8-comprehensions, flake8-pyi, and others) to perform additional +We use clang-tidy to perform additional formatting and semantic checking of code. We provide a pre-commit git hook for performing these checks, before a commit is created: @@ -1107,18 +1106,18 @@ performing these checks, before a commit is created: ln -s ../../tools/git-pre-commit .git/hooks/pre-commit ``` -You'll need to install an appropriately configured flake8; see -[Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type) -for documentation on how to do this. - -If you haven't set up the pre-commit hook and have already committed files and +If you have already committed files and CI reports `flake8` errors, you can run the check locally in your PR branch with: ```bash flake8 $(git diff --name-only $(git merge-base --fork-point master)) ``` -fix the code so that no errors are reported when you re-run the above check again, +You'll need to install an appropriately configured flake8; see +[Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type) +for documentation on how to do this. + +Fix the code so that no errors are reported when you re-run the above check again, and then commit the fix. ## Building PyTorch with ASAN @@ -1245,39 +1244,17 @@ Once you submit a PR or push a new commit to a branch that is in an active PR, CI jobs will be run automatically. Some of these may fail and you will need to find out why, by looking at the logs. -Fairly often, a CI failure might be unrelated to your changes. In this case, you +Fairly often, a CI failure might be unrelated to your changes. You can +confirm by going to our [HUD](hud.pytorch.org) and seeing if the CI job +is failing upstream already. In this case, you can usually ignore the failure. See [the following subsection](#which-commit-is-used-in-ci) for more details. Some failures might be related to specific hardware or environment -configurations. 
In this case, if the job is run by CircleCI, you can -ssh into the job's session to perform manual debugging using the -following steps: - -1. In the CircleCI page for the failed job, make sure you are logged in - and then click the `Rerun` actions dropdown button on the top right. - Click `Rerun Job with SSH`. - -2. When the job reruns, a new step will be added in the `STEPS` tab - labelled `Set up SSH`. Inside that tab will be an ssh command that - you can execute in a shell. - -3. Once you are connected through ssh, you may need to enter a docker - container. Run `docker ps` to check if there are any docker - containers running. Note that your CI job might be in the process - of initiating a docker container, which means it will not show up - yet. It is best to wait until the CI job reaches a step where it is - building pytorch or running pytorch tests. If the job does have a - docker container, run `docker exec -it IMAGE_ID /bin/bash` to - connect to it. - -4. Now you can find the pytorch working directory, which could be - `~/workspace` or `~/project`, and run commands locally to debug - the failure. - -For certain Windows failures, it may be useful to have a full [Remote -Desktop](https://docs.microsoft.com/en-us/windows-server/remote/remote-desktop-services/clients/remote-desktop-clients) connection. See detailed instructions [here](https://github.com/pytorch/pytorch/wiki/Debugging-Windows-with-Remote-Desktop-or-CDB-(CLI-windbg)-on-CircleCI) -for how to set that up after rerunning the job. +configurations. In this case, if you're a Meta employee, you can ssh into +the job's session to perform manual debugging following the instructions in +our [CI wiki](https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions). + ### Which commit is used in CI? diff --git a/Dockerfile b/Dockerfile index e5065cd6524b09..a8dc7f141685d6 100644 --- a/Dockerfile +++ b/Dockerfile @@ -32,7 +32,7 @@ RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Mini chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ - /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda-build pyyaml numpy ipython&& \ + /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda-build pyyaml numpy ipython && \ /opt/conda/bin/conda clean -ya FROM dev-base as submodule-update diff --git a/README.md b/README.md index 88a77f04b34555..9105b1d35f3101 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,8 @@ PyTorch is a Python package that provides two high-level features: You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed. +Our trunk health (Continuous Integration signals) can be found at [hud.pytorch.org](https://hud.pytorch.org/ci/pytorch/pytorch/master). + - [More About PyTorch](#more-about-pytorch) @@ -39,18 +41,6 @@ You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to -| System | 3.7 | 3.8 | -| :---: | :---: | :--: | -| Linux CPU | [![Build Status](https://ci.pytorch.org/jenkins/job/pytorch-master/badge/icon)](https://ci.pytorch.org/jenkins/job/pytorch-master/) |
| -| Linux GPU | [![Build Status](https://ci.pytorch.org/jenkins/job/pytorch-master/badge/icon)](https://ci.pytorch.org/jenkins/job/pytorch-master/) |
| -| Windows CPU / GPU | [![Build Status](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-win-ws2016-cuda9-cudnn7-py3-trigger/badge/icon)](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-win-ws2016-cuda9-cudnn7-py3-trigger/) |
| -| Linux (ppc64le) CPU | [![Build Status](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/badge/icon)](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/) |
| -| Linux (ppc64le) GPU | [![Build Status](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/badge/icon)](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/) |
| -| Linux (aarch64) CPU | [![Build Status](http://openlabtesting.org:15000/badge?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py37)](https://status.openlabtesting.org/builds/builds?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py37) | [![Build Status](http://openlabtesting.org:15000/badge?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py38)](https://status.openlabtesting.org/builds/builds?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py38) | - -See also the [CI HUD at hud.pytorch.org](https://hud.pytorch.org/ci/pytorch/pytorch/master). - - ## More About PyTorch At a granular level, PyTorch is a library that consists of the following components: diff --git a/RELEASE.md b/RELEASE.md index 1c95ea1b5328c4..e84ccbc159627d 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -3,13 +3,25 @@ - [General Overview](#general-overview) + - [Cutting a release branch preparations](#cutting-a-release-branch-preparations) - [Cutting release branches](#cutting-release-branches) + - [`pytorch/pytorch`](#pytorchpytorch) + - [`pytorch/builder` / PyTorch domain libraries](#pytorchbuilder--pytorch-domain-libraries) - [Making release branch specific changes](#making-release-branch-specific-changes) - [Getting CI signal on release branches:](#getting-ci-signal-on-release-branches) - [Drafting RCs (Release Candidates)](#drafting-rcs-release-candidates) - [Release Candidate Storage](#release-candidate-storage) - [Cherry Picking Fixes](#cherry-picking-fixes) - [Promoting RCs to Stable](#promoting-rcs-to-stable) + - [Additional Steps to prepare for release day](#additional-steps-to-prepare-for-release-day) + - [Modify release matrix](#modify-release-matrix) + - [Open Google Colab issue](#open-google-colab-issue) +- [Patch Releases](#patch-releases) + - [Patch Release Criteria](#patch-release-criteria) + - [Patch Release Process](#patch-release-process) + - [Triage](#triage) + - [Building a release schedule / cherry picking](#building-a-release-schedule--cherry-picking) + - [Building Binaries / Promotion to Stable](#building-binaries--promotion-to-stable) - [Special Topics](#special-topics) - [Updating submodules for a release](#updating-submodules-for-a-release) @@ -19,32 +31,55 @@ Releasing a new version of PyTorch generally entails 3 major steps: +0. Cutting a release branch preparations 1. Cutting a release branch and making release branch specific changes 2. Drafting RCs (Release Candidates), and merging cherry picks -3. Promoting RCs to stable +3. Promoting RCs to stable and performing release day tasks + +## Cutting a release branch preparations + +The following requirements need to be met prior to the final RC cut: + +* Resolve all outstanding issues in the milestones (for example [1.11.0](https://github.com/pytorch/pytorch/milestone/28)) before the first RC cut is completed. After the RC cut is completed, the following script should be executed from the builder repo in order to validate the presence of the fixes in the release branch: +``` python github_analyze.py --repo-path ~/local/pytorch --remote upstream --branch release/1.11 --milestone-id 26 --missing-in-branch ``` +* Validate that all new workflows have been created in the PyTorch and domain libraries included in the release. Validate it against all dimensions of the release matrix, including operating systems (Linux, MacOS, Windows), Python versions, as well as CPU architectures (x86 and arm) and accelerator versions (CUDA, ROCm). +* All the nightly jobs for PyTorch and domain libraries should be green. 
Validate this using the following HUD links: + * [PyTorch](https://hud.pytorch.org/hud/pytorch/pytorch/nightly) + * [TorchVision](https://hud.pytorch.org/hud/pytorch/vision/nightly) + * [TorchAudio](https://hud.pytorch.org/hud/pytorch/audio/nightly) + * [TorchText](https://hud.pytorch.org/hud/pytorch/text/nightly) ## Cutting release branches +### `pytorch/pytorch` + Release branches are typically cut from the branch [`viable/strict`](https://github.com/pytorch/pytorch/tree/viable/strict) so as to ensure that tests are passing on the release branch. -Release branches *should* be prefixed like so: -``` -release/{MAJOR}.{MINOR} -``` +There's a convenience script to create release branches from the current `viable/strict` (run from the root of `pytorch/pytorch`): -An example of this would look like: +```bash +DRY_RUN=disabled scripts/release/cut-release-branch.sh ``` -release/1.8 + +This script should create 2 branches: +* `release/{MAJOR}.{MINOR}` +* `orig/release/{MAJOR}.{MINOR}` + +### `pytorch/builder` / PyTorch domain libraries + +The convenience script can also be used for the domain libraries as well as `pytorch/builder` + +> NOTE: RELEASE_VERSION only needs to be specified if version.txt is not available in the root directory + +```bash +DRY_RUN=disabled GIT_BRANCH_TO_CUT_FROM=main RELEASE_VERSION=1.11 scripts/release/cut-release-branch.sh ``` -Please make sure to create branch that pins divergent point of release branch from the main branch, i.e. `orig/release/{MAJOR}.{MINOR}` ### Making release branch specific changes These are examples of changes that should be made to release branches so that CI / tooling can function normally on them: -* Update target determinator to use release branch: - * Example: https://github.com/pytorch/pytorch/pull/40712 * Update backwards compatibility tests to use RC binaries instead of nightlies * Example: https://github.com/pytorch/pytorch/pull/40706 * Release branches should also be created in the [`pytorch/xla`](https://github.com/pytorch/xla) and [`pytorch/builder`](https://github.com/pytorch/builder) repos and pinned in `pytorch/pytorch` @@ -57,6 +92,7 @@ These are examples of changes that should be made to the *default* branch after * Example: https://github.com/pytorch/pytorch/pull/65435 ### Getting CI signal on release branches: + Create a PR from `release/{MAJOR}.{MINOR}` to `orig/release/{MAJOR}.{MINOR}` in order to start CI testing for cherry-picks into the release branch. Example: @@ -99,8 +135,11 @@ For fixes that are to go into a release after the release branch has been cut we An example of this would look like: * https://github.com/pytorch/pytorch/issues/51886 +Please also make sure to add a milestone target to the PR/issue, especially if it needs to be considered for inclusion into the dot release. + **NOTE**: The cherry pick process is not an invitation to add new features, it is mainly there to fix regressions + ## Promoting RCs to Stable Promotion of RCs to stable is done with this script: @@ -114,6 +153,69 @@ Promotion should occur in two steps: **NOTE**: The promotion of wheels to PyPI can only be done once, so take caution when attempting to promote wheels to PyPI (see https://github.com/pypa/warehouse/issues/726 for a discussion on potential draft releases within PyPI) +## Additional Steps to prepare for release day + +The following should be prepared for release day: + +### Modify release matrix + +The release matrix for the get-started page needs to be modified. See the following [PR](https://github.com/pytorch/pytorch.github.io/pull/959) as a reference; a sketch of the local workflow is shown below. 
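A rough sketch of that local workflow, assuming a standard clone of `pytorch.github.io` (the branch name and commit message below are placeholders, not part of the official process):

```bash
# Hypothetical example: prepare the get-started release-matrix update locally.
git clone https://github.com/pytorch/pytorch.github.io
cd pytorch.github.io
git checkout -b update-release-matrix

# Edit published_versions.json to describe the new release, then regenerate the
# quick-start module that the get-started page reads (same command as below):
python3 scripts/gen_quick_start_module.py > assets/quick-start-module.js

git commit -am "Update release matrix"
```

The resulting PR is the one that must be kept free of failures and merged on release day, as described next.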
+ +After modifying published_versions.json you will need to regenerate the quick-start-module.js file. Run the following command: +``` +python3 scripts/gen_quick_start_module.py >assets/quick-start-module.js +``` +Please note: This PR needs to be merged on the release day and hence it should be absolutely free of any failures. To test this PR, open another test PR pointing to the release candidate location described above in [Release Candidate Storage](RELEASE.md#release-candidate-storage) + +### Open Google Colab issue + +This is normally done right after the release is completed. We would need to create a Google Colab issue; see the following [issue](https://github.com/googlecolab/colabtools/issues/2372) + +# Patch Releases + +A patch release is a maintenance release of PyTorch that includes fixes for regressions found in a previous minor release. Patch releases typically bump the `patch` version from semver (i.e. `[major].[minor].[patch]`, e.g. `1.9.0` to `1.9.1`) + +## Patch Release Criteria + +Patch releases should be considered if a regression meets the following criteria: + +1. Does the regression break core functionality (stable / beta features) including functionality in first party domain libraries? + * First party domain libraries: + * [pytorch/vision](https://github.com/pytorch/vision) + * [pytorch/audio](https://github.com/pytorch/audio) + * [pytorch/text](https://github.com/pytorch/text) +2. Is there not a viable workaround? + * Can the regression be solved simply, or is it impossible to work around? + +> *NOTE*: Patch releases should only be considered when functionality is broken; documentation issues do not typically fall within this category + +## Patch Release Process + +### Triage + +> Main POC: Triage Reviewers + +1. Tag issues / pull requests that are candidates for a potential patch release with `triage review` + * ![adding triage review label](https://user-images.githubusercontent.com/1700823/132589089-a9210a14-6159-409d-95e5-f79067f6fa38.png) +2. Triage reviewers will then check if the regression / fix identified fits within the above-mentioned [Patch Release Criteria](#patch-release-criteria) +3. Triage reviewers will then add the issue / pull request to the related milestone (e.g. `1.9.1`) if the regression is found to be within the [Patch Release Criteria](#patch-release-criteria) + * ![adding to milestone](https://user-images.githubusercontent.com/1700823/131175980-148ff38d-44c3-4611-8a1f-cd2fd1f4c49d.png) + +### Building a release schedule / cherry picking + +> Main POC: Patch Release Managers + +1. After regressions / fixes have been triaged, Patch Release Managers will work together to build / announce a schedule for the patch release + * *NOTE*: Ideally this should be ~2-3 weeks after a regression has been identified to allow other regressions to be identified +2. Patch Release Managers will work with the authors of the regressions / fixes to cherry pick their change into the related release branch (e.g. `release/1.9` for `1.9.1`) + +### Building Binaries / Promotion to Stable + +> Main POC: Patch Release Managers + +1. Patch Release Managers will follow the process of [Drafting RCs (Release Candidates)](#drafting-rcs-release-candidates) +2. 
Patch Release Managers will follow the process of [Promoting RCs to Stable](#promoting-rcs-to-stable) + # Special Topics ## Updating submodules for a release diff --git a/android/pytorch_android/src/androidTest/assets/activation_ops.ptl b/android/pytorch_android/src/androidTest/assets/activation_ops.ptl new file mode 100644 index 00000000000000..179f426ae7cdf6 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/activation_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/android_api_module.ptl b/android/pytorch_android/src/androidTest/assets/android_api_module.ptl new file mode 100644 index 00000000000000..df62dd86208811 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/android_api_module.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/blas_lapack_ops.ptl b/android/pytorch_android/src/androidTest/assets/blas_lapack_ops.ptl new file mode 100644 index 00000000000000..fea933ee644fd4 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/blas_lapack_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/comparison_ops.ptl b/android/pytorch_android/src/androidTest/assets/comparison_ops.ptl new file mode 100644 index 00000000000000..01b1c153e7515a Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/comparison_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/convolution_ops.ptl b/android/pytorch_android/src/androidTest/assets/convolution_ops.ptl new file mode 100644 index 00000000000000..db253a207a33d0 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/convolution_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/distance_function_ops.ptl b/android/pytorch_android/src/androidTest/assets/distance_function_ops.ptl new file mode 100644 index 00000000000000..cc4d994f440a4d Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/distance_function_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/dropout_ops.ptl b/android/pytorch_android/src/androidTest/assets/dropout_ops.ptl new file mode 100644 index 00000000000000..422c2f60e6be25 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/dropout_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/dynamic_quant_ops.ptl b/android/pytorch_android/src/androidTest/assets/dynamic_quant_ops.ptl new file mode 100644 index 00000000000000..0bbbce9671c3c4 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/dynamic_quant_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/fused_quant_ops.ptl b/android/pytorch_android/src/androidTest/assets/fused_quant_ops.ptl new file mode 100644 index 00000000000000..9d2b3f9dde1a71 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/fused_quant_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/general_quant_ops.ptl b/android/pytorch_android/src/androidTest/assets/general_quant_ops.ptl new file mode 100644 index 00000000000000..7d4888e0bc817e Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/general_quant_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/linear_ops.ptl b/android/pytorch_android/src/androidTest/assets/linear_ops.ptl new file mode 100644 index 00000000000000..ca9066c03dc4f3 Binary files /dev/null and 
b/android/pytorch_android/src/androidTest/assets/linear_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/loss_function_ops.ptl b/android/pytorch_android/src/androidTest/assets/loss_function_ops.ptl new file mode 100644 index 00000000000000..4c0592e5485afa Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/loss_function_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/mobilenet_v2.ptl b/android/pytorch_android/src/androidTest/assets/mobilenet_v2.ptl new file mode 100644 index 00000000000000..9b8297a250d35d Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/mobilenet_v2.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/nn_utils_ops.ptl b/android/pytorch_android/src/androidTest/assets/nn_utils_ops.ptl new file mode 100644 index 00000000000000..5d008eab03b9b8 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/nn_utils_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/normalization_ops.ptl b/android/pytorch_android/src/androidTest/assets/normalization_ops.ptl new file mode 100644 index 00000000000000..d85bd06c763bc7 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/normalization_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/other_math_ops.ptl b/android/pytorch_android/src/androidTest/assets/other_math_ops.ptl new file mode 100644 index 00000000000000..7209c3b3bd1fdd Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/other_math_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/padding_ops.ptl b/android/pytorch_android/src/androidTest/assets/padding_ops.ptl new file mode 100644 index 00000000000000..02e57ba207129c Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/padding_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/pointwise_ops.ptl b/android/pytorch_android/src/androidTest/assets/pointwise_ops.ptl new file mode 100644 index 00000000000000..948ed4832660ae Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/pointwise_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/pooling_ops.ptl b/android/pytorch_android/src/androidTest/assets/pooling_ops.ptl new file mode 100644 index 00000000000000..df051163413f5a Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/pooling_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/recurrent_ops.ptl b/android/pytorch_android/src/androidTest/assets/recurrent_ops.ptl new file mode 100644 index 00000000000000..245ceb454d5387 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/recurrent_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/reduction_ops.ptl b/android/pytorch_android/src/androidTest/assets/reduction_ops.ptl new file mode 100644 index 00000000000000..13771302c66802 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/reduction_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/sampling_ops.ptl b/android/pytorch_android/src/androidTest/assets/sampling_ops.ptl new file mode 100644 index 00000000000000..416be7cb127953 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/sampling_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/shuffle_ops.ptl 
b/android/pytorch_android/src/androidTest/assets/shuffle_ops.ptl new file mode 100644 index 00000000000000..5e5520118764ef Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/shuffle_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/sparse_ops.ptl b/android/pytorch_android/src/androidTest/assets/sparse_ops.ptl new file mode 100644 index 00000000000000..a16f68f8f95ff8 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/sparse_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/spectral_ops.ptl b/android/pytorch_android/src/androidTest/assets/spectral_ops.ptl new file mode 100644 index 00000000000000..9828dd2ba9013a Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/spectral_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/static_quant_ops.ptl b/android/pytorch_android/src/androidTest/assets/static_quant_ops.ptl new file mode 100644 index 00000000000000..d0a0a254d1efe1 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/static_quant_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/tensor_creation_ops.ptl b/android/pytorch_android/src/androidTest/assets/tensor_creation_ops.ptl new file mode 100644 index 00000000000000..d897b43cd36ca9 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/tensor_creation_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/tensor_general_ops.ptl b/android/pytorch_android/src/androidTest/assets/tensor_general_ops.ptl new file mode 100644 index 00000000000000..6f2855ea83eaa5 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/tensor_general_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/tensor_indexing_ops.ptl b/android/pytorch_android/src/androidTest/assets/tensor_indexing_ops.ptl new file mode 100644 index 00000000000000..ac9cb8c4b94add Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/tensor_indexing_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/tensor_typing_ops.ptl b/android/pytorch_android/src/androidTest/assets/tensor_typing_ops.ptl new file mode 100644 index 00000000000000..3e2f4d8cc68922 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/tensor_typing_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/tensor_view_ops.ptl b/android/pytorch_android/src/androidTest/assets/tensor_view_ops.ptl new file mode 100644 index 00000000000000..5e2dc829484265 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/tensor_view_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/torchscript_builtin_ops.ptl b/android/pytorch_android/src/androidTest/assets/torchscript_builtin_ops.ptl new file mode 100644 index 00000000000000..2d2532df2fd257 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/torchscript_builtin_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/torchscript_collection_ops.ptl b/android/pytorch_android/src/androidTest/assets/torchscript_collection_ops.ptl new file mode 100644 index 00000000000000..ce434b3b4210d5 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/torchscript_collection_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/transformer_ops.ptl b/android/pytorch_android/src/androidTest/assets/transformer_ops.ptl new file 
mode 100644 index 00000000000000..ebb2bd693604a7 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/transformer_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/assets/vision_function_ops.ptl b/android/pytorch_android/src/androidTest/assets/vision_function_ops.ptl new file mode 100644 index 00000000000000..c9c45655e2bca9 Binary files /dev/null and b/android/pytorch_android/src/androidTest/assets/vision_function_ops.ptl differ diff --git a/android/pytorch_android/src/androidTest/java/org/pytorch/PytorchTestBase.java b/android/pytorch_android/src/androidTest/java/org/pytorch/PytorchTestBase.java index 5a1405e679bcfb..9abcbcbda8a6ca 100644 --- a/android/pytorch_android/src/androidTest/java/org/pytorch/PytorchTestBase.java +++ b/android/pytorch_android/src/androidTest/java/org/pytorch/PytorchTestBase.java @@ -12,7 +12,7 @@ import org.junit.Test; public abstract class PytorchTestBase { - private static final String TEST_MODULE_ASSET_NAME = "test.pt"; + private static final String TEST_MODULE_ASSET_NAME = "android_api_module.ptl"; @Test public void testForwardNull() throws IOException { @@ -377,6 +377,186 @@ public void testChannelsLastConv2d() throws IOException { new long[] {2, 11, -101, 4, 12, -102, 6, 13, -103, 8, 14, -104}); } + @Test + public void testMobileNetV2() throws IOException { + try { + final Module module = loadModel("mobilenet_v2.ptl"); + final IValue inputs = module.runMethod("get_all_bundled_inputs"); + assertTrue(inputs.isList()); + final IValue input = inputs.toList()[0]; + assertTrue(input.isTuple()); + module.forward(input.toTuple()[0]); + assertTrue(true); + } catch (Exception ex) { + assertTrue("failed to run MobileNetV2 " + ex.getMessage(), false); + } + } + + @Test + public void testPointwiseOps() throws IOException { + runModel("pointwise_ops"); + } + + @Test + public void testReductionOps() throws IOException { + runModel("reduction_ops"); + } + + @Test + public void testComparisonOps() throws IOException { + runModel("comparison_ops"); + } + + @Test + public void testOtherMathOps() throws IOException { + runModel("other_math_ops"); + } + + @Test + public void testSpectralOps() throws IOException { + runModel("spectral_ops"); + } + + @Test + public void testBlasLapackOps() throws IOException { + runModel("blas_lapack_ops"); + } + + @Test + public void testSamplingOps() throws IOException { + runModel("sampling_ops"); + } + + @Test + public void testTensorOps() throws IOException { + runModel("tensor_general_ops"); + } + + @Test + public void testTensorCreationOps() throws IOException { + runModel("tensor_creation_ops"); + } + + @Test + public void testTensorIndexingOps() throws IOException { + runModel("tensor_indexing_ops"); + } + + @Test + public void testTensorTypingOps() throws IOException { + runModel("tensor_typing_ops"); + } + + @Test + public void testTensorViewOps() throws IOException { + runModel("tensor_view_ops"); + } + + @Test + public void testConvolutionOps() throws IOException { + runModel("convolution_ops"); + } + + @Test + public void testPoolingOps() throws IOException { + runModel("pooling_ops"); + } + + @Test + public void testPaddingOps() throws IOException { + runModel("padding_ops"); + } + + @Test + public void testActivationOps() throws IOException { + runModel("activation_ops"); + } + + @Test + public void testNormalizationOps() throws IOException { + runModel("normalization_ops"); + } + + @Test + public void testRecurrentOps() throws IOException { + runModel("recurrent_ops"); + } + + @Test + 
public void testTransformerOps() throws IOException { + runModel("transformer_ops"); + } + + @Test + public void testLinearOps() throws IOException { + runModel("linear_ops"); + } + + @Test + public void testDropoutOps() throws IOException { + runModel("dropout_ops"); + } + + @Test + public void testSparseOps() throws IOException { + runModel("sparse_ops"); + } + + @Test + public void testDistanceFunctionOps() throws IOException { + runModel("distance_function_ops"); + } + + @Test + public void testLossFunctionOps() throws IOException { + runModel("loss_function_ops"); + } + + @Test + public void testVisionFunctionOps() throws IOException { + runModel("vision_function_ops"); + } + + @Test + public void testShuffleOps() throws IOException { + runModel("shuffle_ops"); + } + + @Test + public void testNNUtilsOps() throws IOException { + runModel("nn_utils_ops"); + } + + @Test + public void testQuantOps() throws IOException { + runModel("general_quant_ops"); + } + + @Test + public void testDynamicQuantOps() throws IOException { + runModel("dynamic_quant_ops"); + } + + @Test + public void testStaticQuantOps() throws IOException { + runModel("static_quant_ops"); + } + + @Test + public void testFusedQuantOps() throws IOException { + runModel("fused_quant_ops"); + } + + @Test + public void testTorchScriptBuiltinQuantOps() throws IOException { + runModel("torchscript_builtin_ops"); + } + + @Test + public void testTorchScriptCollectionQuantOps() throws IOException { + runModel("torchscript_collection_ops"); + } + static void assertIValueTensor( final IValue ivalue, final MemoryFormat memoryFormat, @@ -389,5 +569,15 @@ static void assertIValueTensor( assertArrayEquals(expectedData, t.getDataAsLongArray()); } + void runModel(final String name) throws IOException { + final Module storage_module = loadModel(name + ".ptl"); + storage_module.forward(); + + // TODO enable this once the on-the-fly script is ready + // final Module on_the_fly_module = loadModel(name + "_temp.ptl"); + // on_the_fly_module.forward(); + assertTrue(true); + } + protected abstract Module loadModel(String assetName) throws IOException; } diff --git a/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp b/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp index 8094f7bdc97415..5ed0c9978e8346 100644 --- a/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp +++ b/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp @@ -223,7 +223,8 @@ class TensorHybrid : public facebook::jni::HybridClass { } else { facebook::jni::throwNewJavaException( facebook::jni::gJavaLangIllegalArgumentException, - "at::Tensor scalar type is not supported on java side"); + "at::Tensor scalar type %s is not supported on java side", + c10::toString(scalarType)); } const auto& tensorShape = tensor.sizes(); diff --git a/aten/src/ATen/BatchingRegistrations.cpp b/aten/src/ATen/BatchingRegistrations.cpp index 0eb0d697078ea1..c7c95cf92c9fcb 100644 --- a/aten/src/ATen/BatchingRegistrations.cpp +++ b/aten/src/ATen/BatchingRegistrations.cpp @@ -1105,6 +1105,7 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("select.int", select_batching_rule); m.impl("slice.Tensor", slice_batching_rule); m.impl("split.Tensor", split_batching_rule); + m.impl("split.sizes", split_with_sizes_batching_rule); m.impl("split_with_sizes", split_with_sizes_batching_rule); m.impl("squeeze", squeeze_batching_rule); m.impl("squeeze.dim", squeeze_dim_batching_rule); diff --git a/aten/src/ATen/Context.cpp b/aten/src/ATen/Context.cpp index 98590b266be402..8712fe203d1e1e 
100644 --- a/aten/src/ATen/Context.cpp +++ b/aten/src/ATen/Context.cpp @@ -236,6 +236,10 @@ const std::vector& Context::supportedQEngines() { engines.push_back(at::kNoQEngine); #endif // C10_MOBILE +#if AT_MKLDNN_ENABLED() + engines.push_back(at::kONEDNN); +#endif + #ifdef USE_FBGEMM if (fbgemm::fbgemmSupportedCPU()) { engines.push_back(at::kFBGEMM); @@ -293,6 +297,20 @@ bool NoTF32Guard::should_disable_tf32() { return override_allow_tf32_flag; } +thread_local bool BackwardPassGuard::is_backward_pass_; + +BackwardPassGuard::BackwardPassGuard() { + is_backward_pass_ = true; +} + +BackwardPassGuard::~BackwardPassGuard() { + is_backward_pass_ = false; +} + +bool BackwardPassGuard::is_backward_pass() { + return is_backward_pass_; +} + bool Context::areVmapFallbackWarningsEnabled() const { return display_vmap_fallback_warnings_; } diff --git a/aten/src/ATen/Context.h b/aten/src/ATen/Context.h index 88cbc3ec0bb3a1..1a90a7e0f1047d 100644 --- a/aten/src/ATen/Context.h +++ b/aten/src/ATen/Context.h @@ -80,6 +80,9 @@ class TORCH_API Context { static bool hasHIP() { return detail::getHIPHooks().hasHIP(); } + static bool hasIPU() { + return c10::impl::hasDeviceGuardImpl(at::DeviceType::IPU); + } static bool hasXLA() { return c10::impl::hasDeviceGuardImpl(at::DeviceType::XLA); } @@ -295,6 +298,10 @@ static inline bool hasHIP() { return globalContext().hasHIP(); } +static inline bool hasIPU() { + return globalContext().hasIPU(); +} + static inline bool hasXLA() { return globalContext().hasXLA(); } @@ -387,4 +394,12 @@ struct TORCH_API NoTF32Guard { bool changed = false; }; +struct TORCH_API BackwardPassGuard { + BackwardPassGuard(); + ~BackwardPassGuard(); + static bool is_backward_pass(); +private: + static thread_local bool is_backward_pass_; +}; + } // namespace at diff --git a/aten/src/ATen/Dispatch.h b/aten/src/ATen/Dispatch.h index 1bd78db594e51e..e0d66934883fc7 100644 --- a/aten/src/ATen/Dispatch.h +++ b/aten/src/ATen/Dispatch.h @@ -513,6 +513,22 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {} } \ }() +#define AT_DISPATCH_QINT_BYTE_TYPES(TYPE, NAME, ...) \ + [&] { \ + const auto& the_type = TYPE; \ + /* don't use TYPE again in case it is an expensive or side-effect op */ \ + at::ScalarType _st = ::detail::scalar_type(the_type); \ + RECORD_KERNEL_FUNCTION_DTYPE(NAME, _st); \ + switch (_st) { \ + AT_QINT_PRIVATE_CASE_TYPE( \ + NAME, at::kQInt8, at::qint8, at::kChar, int8_t, __VA_ARGS__) \ + AT_QINT_PRIVATE_CASE_TYPE( \ + NAME, at::kQUInt8, at::quint8, at::kByte, uint8_t, __VA_ARGS__) \ + default: \ + AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ + } \ + }() + #define AT_DISPATCH_QINT_AND_SUB_BYTE_TYPES(TYPE, NAME, ...) \ [&] { \ const auto& the_type = TYPE; \ @@ -753,6 +769,56 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {} } \ }() +#define AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( \ + SCALARTYPE1, SCALARTYPE2, SCALARTYPE3, SCALARTYPE4, TYPE, NAME, ...) 
\ + [&] { \ + const auto& the_type = TYPE; \ + /* don't use TYPE again in case it is an expensive or side-effect op*/ \ + at::ScalarType _st = ::detail::scalar_type(the_type); \ + RECORD_KERNEL_FUNCTION_DTYPE(NAME, _st); \ + switch (_st) { \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Byte, uint8_t, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Char, int8_t, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Double, double, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Float, float, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Int, int32_t, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Long, int64_t, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE(NAME, at::ScalarType::Short, int16_t, __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + at::ScalarType::ComplexFloat, \ + c10::complex, \ + __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + at::ScalarType::ComplexDouble, \ + c10::complex, \ + __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + SCALARTYPE1, \ + decltype(c10::impl::ScalarTypeToCPPType::t), \ + __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + SCALARTYPE2, \ + decltype(c10::impl::ScalarTypeToCPPType::t), \ + __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + SCALARTYPE3, \ + decltype(c10::impl::ScalarTypeToCPPType::t), \ + __VA_ARGS__) \ + AT_PRIVATE_CASE_TYPE( \ + NAME, \ + SCALARTYPE4, \ + decltype(c10::impl::ScalarTypeToCPPType::t), \ + __VA_ARGS__) \ + default: \ + AT_ERROR(#NAME, " not implemented for '", toString(_st), "'"); \ + } \ + }() + #define AT_DISPATCH_INDEX_TYPES(TYPE, NAME, ...) \ [&] { \ const auto& the_index_type = TYPE; \ diff --git a/aten/src/ATen/DynamicLibrary.cpp b/aten/src/ATen/DynamicLibrary.cpp index f380fb6c35dd6a..f3287121b2e267 100644 --- a/aten/src/ATen/DynamicLibrary.cpp +++ b/aten/src/ATen/DynamicLibrary.cpp @@ -20,7 +20,7 @@ namespace at { static void* checkDL(void* x) { if (!x) { - AT_ERROR("Error in dlopen or dlsym: ", dlerror()); + TORCH_CHECK_WITH(DynamicLibraryError, false, "Error in dlopen or dlsym: ", dlerror()); } return x; @@ -32,10 +32,10 @@ DynamicLibrary::DynamicLibrary(const char* name, const char* alt_name, bool leak if (alt_name) { handle = dlopen(alt_name, RTLD_LOCAL | RTLD_NOW); if (!handle) { - AT_ERROR("Error in dlopen for library ", name, "and ", alt_name); + TORCH_CHECK_WITH(DynamicLibraryError, false, "Error in dlopen for library ", name, "and ", alt_name); } } else { - AT_ERROR("Error in dlopen: ", dlerror()); + TORCH_CHECK_WITH(DynamicLibraryError, false, "Error in dlopen: ", dlerror()); } } } @@ -84,7 +84,7 @@ DynamicLibrary::DynamicLibrary(const char* name, const char* alt_name, bool leak FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS, NULL, dw, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), buf, (sizeof(buf) / sizeof(char)), NULL); - AT_ERROR("error in LoadLibrary for ", name, ". WinError ", dw, ": ", buf); + TORCH_CHECK_WITH(DynamicLibraryError, false, "error in LoadLibrary for ", name, ". 
WinError ", dw, ": ", buf); } } @@ -92,7 +92,7 @@ void* DynamicLibrary::sym(const char* name) { AT_ASSERT(handle); FARPROC procAddress = GetProcAddress((HMODULE)handle, name); if (!procAddress) { - AT_ERROR("error in GetProcAddress"); + TORCH_CHECK_WITH(DynamicLibraryError, false, "error in GetProcAddress"); } return (void*)procAddress; } diff --git a/aten/src/ATen/DynamicLibrary.h b/aten/src/ATen/DynamicLibrary.h index 9e7ade53cf96f8..8f65dd5b494f76 100644 --- a/aten/src/ATen/DynamicLibrary.h +++ b/aten/src/ATen/DynamicLibrary.h @@ -1,8 +1,17 @@ #pragma once #include +#include #include +namespace c10 { + +class DynamicLibraryError : public Error { + using Error::Error; +}; + +} // namespace c10 + namespace at { struct DynamicLibrary { diff --git a/aten/src/ATen/EmptyTensor.cpp b/aten/src/ATen/EmptyTensor.cpp index 5e21a2f52d187f..5a72a09d1841c7 100644 --- a/aten/src/ATen/EmptyTensor.cpp +++ b/aten/src/ATen/EmptyTensor.cpp @@ -2,31 +2,93 @@ #include #include #include +#include + +#include namespace at { namespace detail { - -static c10::Allocator* GetCPUAllocatorMaybePinned(bool pin_memory) { +namespace { +c10::Allocator* GetCPUAllocatorMaybePinned(bool pin_memory) { if (pin_memory) { return at::detail::getCUDAHooks().getPinnedMemoryAllocator(); } return c10::GetCPUAllocator(); } +constexpr uint64_t storage_max() { + // int64_t and size_t are used somewhat inconsistently throughout ATen. + // To be safe, storage size calculations must fit in both types. + constexpr auto int64_max = static_cast( + std::numeric_limits::max()); + constexpr auto size_max = static_cast( + std::numeric_limits::max()); + return std::min(int64_max, size_max); +} + +} // namespace (anonymous) + +size_t computeStorageNbytesContiguous( + IntArrayRef sizes, + size_t itemsize_bytes, + size_t storage_offset + ) { + // Ignore overflow checks on mobile +#ifndef C10_MOBILE + uint64_t size = 1; + bool overflowed = c10::safe_multiplies_u64(sizes, &size); + overflowed |= c10::add_overflows(size, storage_offset, &size); + overflowed |= c10::mul_overflows(size, itemsize_bytes, &size); + overflowed |= size > storage_max(); + TORCH_CHECK(!overflowed, + "Storage size calculation overflowed with sizes=", sizes); + return static_cast(size); +#else + const auto numel = c10::multiply_integers(sizes); + return itemsize_bytes * (storage_offset + numel); +#endif +} + size_t computeStorageNbytes( IntArrayRef sizes, IntArrayRef strides, - size_t itemsize_bytes) { + size_t itemsize_bytes, + size_t storage_offset + ) { + // Ignore overflow checks on mobile +#ifndef C10_MOBILE // size of the underlying storage is 1 bigger than the offset // of the last element according to stride - size_t size = 1; + uint64_t size = storage_offset + 1; + bool overflowed = false; for (const auto i : c10::irange(sizes.size())) { - if(sizes[i] == 0) { + if (sizes[i] == 0) { return 0; } - size += strides[i]*(sizes[i]-1); + + uint64_t strided_size; + overflowed |= c10::mul_overflows(strides[i], sizes[i] - 1, &strided_size); + overflowed |= c10::add_overflows(size, strided_size, &size); } - return size * itemsize_bytes; + overflowed |= c10::mul_overflows(size, itemsize_bytes, &size); + overflowed |= size > storage_max(); + TORCH_CHECK(!overflowed, + "Storage size calculation overflowed with sizes=", + sizes, " and strides=", strides); + return static_cast(size); +#else + // size of the underlying storage is 1 bigger than the offset + // of the last element according to stride + uint64_t size = 1; + for (const auto i : c10::irange(sizes.size())) { + if (sizes[i] == 
0) { + return 0; + } + + size += strides[i] * (sizes[i] - 1); + } + return itemsize_bytes * (storage_offset + size); +#endif } TensorBase empty_generic( @@ -37,9 +99,8 @@ TensorBase empty_generic( c10::optional memory_format_opt) { at::detail::check_size_nonnegative(size); - int64_t nelements = c10::multiply_integers(size); caffe2::TypeMeta dtype = scalarTypeToTypeMeta(scalar_type); - int64_t size_bytes = nelements * dtype.itemsize(); + size_t size_bytes = computeStorageNbytesContiguous(size, dtype.itemsize()); auto storage_impl = c10::make_intrusive( c10::StorageImpl::use_byte_size_t(), size_bytes, @@ -73,7 +134,7 @@ TensorBase empty_strided_generic( at::detail::check_size_nonnegative(size); caffe2::TypeMeta dtype = scalarTypeToTypeMeta(scalar_type); - int64_t size_bytes = computeStorageNbytes(size, stride, dtype.itemsize()); + size_t size_bytes = computeStorageNbytes(size, stride, dtype.itemsize()); auto storage_impl = c10::make_intrusive( c10::StorageImpl::use_byte_size_t(), size_bytes, @@ -176,13 +237,11 @@ struct MetaAllocator final : public at::Allocator { static MetaAllocator g_meta_alloc; -at::Allocator* GetMetaAllocator() { - return &g_meta_alloc; -} +REGISTER_ALLOCATOR(kMeta, &g_meta_alloc); TensorBase empty_meta(IntArrayRef size, ScalarType dtype, c10::optional memory_format_opt) { - auto *allocator = GetMetaAllocator(); + auto *allocator = GetAllocator(kMeta); constexpr c10::DispatchKeySet meta_dks(c10::DispatchKey::Meta); return at::detail::empty_generic( size, allocator, meta_dks, dtype, memory_format_opt); @@ -222,7 +281,7 @@ TensorBase empty_meta( TensorBase empty_strided_meta(IntArrayRef size, IntArrayRef stride, ScalarType dtype) { - auto *allocator = GetMetaAllocator(); + auto *allocator = GetAllocator(kMeta); constexpr c10::DispatchKeySet meta_dks(c10::DispatchKey::Meta); return at::detail::empty_strided_generic( size, stride, allocator, meta_dks, dtype); diff --git a/aten/src/ATen/EmptyTensor.h b/aten/src/ATen/EmptyTensor.h index a49b3e909d6e80..895bcc8e177970 100644 --- a/aten/src/ATen/EmptyTensor.h +++ b/aten/src/ATen/EmptyTensor.h @@ -10,8 +10,11 @@ inline void check_size_nonnegative(IntArrayRef size) { } } +TORCH_API size_t computeStorageNbytesContiguous( + IntArrayRef sizes, size_t itemsize, size_t storage_offset=0); TORCH_API size_t computeStorageNbytes( - IntArrayRef sizes, IntArrayRef strides, size_t itemsize); + IntArrayRef sizes, IntArrayRef strides, + size_t itemsize, size_t storage_offset=0); TORCH_API TensorBase empty_generic( IntArrayRef size, diff --git a/aten/src/ATen/FunctionalTensorWrapper.cpp b/aten/src/ATen/FunctionalTensorWrapper.cpp index 5f99e377479866..13cc746246a774 100644 --- a/aten/src/ATen/FunctionalTensorWrapper.cpp +++ b/aten/src/ATen/FunctionalTensorWrapper.cpp @@ -322,6 +322,57 @@ void sync(const c10::List> t_list) { } } +bool isFunctionalTensor(const at::Tensor& tensor) { + return tensor.unsafeGetTensorImpl()->key_set().has(c10::DispatchKey::Functionalize); +} + +bool isFunctionalTensor(const c10::optional& t) { + if (t.has_value()) { + return isFunctionalTensor(*t); + } else { + return false; + } +} + +bool isFunctionalTensor(const c10::List& t_list) { + if (t_list.size() == 0) return false; + bool any_functional = isFunctionalTensor(t_list[0]); + for (const auto i : c10::irange(1, t_list.size())) { + auto curr_functional = isFunctionalTensor(t_list[i]); + TORCH_INTERNAL_ASSERT( + curr_functional == any_functional, + "Functionalization encountered a list of tensors where some are functional", + "and some are not, which is not currently 
unsupported."); + } + return any_functional; +} + +bool isFunctionalTensor(const c10::List>& t_list) { + if (t_list.size() == 0) return false; + bool any_functional = isFunctionalTensor(t_list[0]); + for (const auto i : c10::irange(1, t_list.size())) { + auto curr_functional = isFunctionalTensor(t_list[i]); + TORCH_INTERNAL_ASSERT( + curr_functional == any_functional, + "Functionalization encountered a list of tensors where some are functional", + "and some are not, which is not currently unsupported."); + } + return any_functional; +} + +bool isFunctionalTensor(const c10::ArrayRef t_list) { + if (t_list.size() == 0) return false; + bool any_functional = isFunctionalTensor(t_list[0]); + for (const auto i : c10::irange(1, t_list.size())) { + auto curr_functional = isFunctionalTensor(t_list[i]); + TORCH_INTERNAL_ASSERT( + curr_functional == any_functional, + "Functionalization encountered a list of tensors where some are functional", + "and some are not, which is not currently unsupported."); + } + return any_functional; +} + Tensor create_functional_tensor_with_view_meta(const at::Tensor& view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta, int64_t out_idx) { TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(view_to_wrap)); TORCH_INTERNAL_ASSERT(at::functionalization::impl::isFunctionalTensor(base)); diff --git a/aten/src/ATen/FunctionalTensorWrapper.h b/aten/src/ATen/FunctionalTensorWrapper.h index 1696b41f1543c7..1f0988c4a07b18 100644 --- a/aten/src/ATen/FunctionalTensorWrapper.h +++ b/aten/src/ATen/FunctionalTensorWrapper.h @@ -117,9 +117,11 @@ TORCH_API inline FunctionalTensorWrapper* unsafeGetFunctionalWrapper(const Tenso return functional_impl; } -TORCH_API inline bool isFunctionalTensor(const at::Tensor& tensor) { - return tensor.unsafeGetTensorImpl()->key_set().has(c10::DispatchKey::Functionalize); -} +TORCH_API bool isFunctionalTensor(const at::Tensor& tensor); +TORCH_API bool isFunctionalTensor(const c10::optional& t); +TORCH_API bool isFunctionalTensor(const c10::List& t_list); +TORCH_API bool isFunctionalTensor(const c10::List>& t_list); +TORCH_API bool isFunctionalTensor(const c10::ArrayRef t_list); TORCH_API Tensor to_functional_tensor(const Tensor& tensor); TORCH_API c10::List to_functional_tensor(const c10::List& t_list); diff --git a/aten/src/ATen/FunctionalizeFallbackKernel.cpp b/aten/src/ATen/FunctionalizeFallbackKernel.cpp index f130fc7cdbd4df..f63f4bdcd79912 100644 --- a/aten/src/ATen/FunctionalizeFallbackKernel.cpp +++ b/aten/src/ATen/FunctionalizeFallbackKernel.cpp @@ -12,23 +12,36 @@ namespace { const auto arguments_begin = stack->size() - num_arguments; auto arguments = torch::jit::last(stack, num_arguments); + auto any_functional_inputs = false; + auto any_tensor_inputs = false; for (uint64_t idx = 0; idx < num_arguments; ++idx) { const auto& ivalue = arguments[idx]; if (ivalue.isTensor()) { + any_tensor_inputs = true; auto t = ivalue.toTensor(); - at::functionalization::impl::sync(t); - auto t_new = c10::IValue(at::functionalization::impl::from_functional_tensor(t)); - (*stack)[arguments_begin + idx] = t_new; + if (at::functionalization::impl::isFunctionalTensor(t)) { + any_functional_inputs = true; + at::functionalization::impl::sync(t); + auto t_new = c10::IValue(at::functionalization::impl::from_functional_tensor(t)); + (*stack)[arguments_begin + idx] = t_new; + } } else if (ivalue.isTensorList()) { + any_tensor_inputs = true; auto tensors = ivalue.toTensorList(); - at::functionalization::impl::sync(tensors); - auto t_new 
= c10::IValue(at::functionalization::impl::from_functional_tensor(tensors)); - (*stack)[arguments_begin + idx] = t_new; + if (at::functionalization::impl::isFunctionalTensor(tensors)) { + any_functional_inputs = true; + at::functionalization::impl::sync(tensors); + auto t_new = c10::IValue(at::functionalization::impl::from_functional_tensor(tensors)); + (*stack)[arguments_begin + idx] = t_new; + } } } + // we should wrap the output if any inputs were wrapped, + // OR if we're hitting a factory function (with no tensor inputs) + auto should_wrap_outputs = !any_tensor_inputs || any_functional_inputs; { at::AutoDispatchSkipFunctionalize guard; - op.redispatchBoxed(dispatchKeySet & c10::after_func_keyset, stack); + op.callBoxed(stack); } const auto num_returns = schema.returns().size(); const auto returns_begin = stack->size() - num_returns; @@ -36,11 +49,11 @@ namespace { for (const auto idx : c10::irange(num_returns)) { const auto& ivalue = returns[idx]; - if (ivalue.isTensor()) { + if (ivalue.isTensor() && should_wrap_outputs) { auto t = ivalue.toTensor(); auto t_new = c10::IValue(at::functionalization::impl::to_functional_tensor(t)); (*stack)[returns_begin + idx] = t_new; - } else if (ivalue.isTensorList()) { + } else if (ivalue.isTensorList() && should_wrap_outputs) { auto tensors = ivalue.toTensorList(); auto t_new = c10::IValue(at::functionalization::impl::to_functional_tensor(tensors)); (*stack)[returns_begin + idx] = t_new; diff --git a/aten/src/ATen/NestedTensorImpl.cpp b/aten/src/ATen/NestedTensorImpl.cpp index 51e93fc86c5d19..a7b6d97b2cee31 100644 --- a/aten/src/ATen/NestedTensorImpl.cpp +++ b/aten/src/ATen/NestedTensorImpl.cpp @@ -30,6 +30,7 @@ NestedTensorImpl::NestedTensorImpl( key_set_ = key_set_ - c10::DispatchKeySet({c10::DispatchKey::ADInplaceOrView}); refresh_dim(); + set_sizes_customization_policy(CustomizableMethodPolicy::NotSupported); } void NestedTensorImpl::refresh_dim() { @@ -38,5 +39,8 @@ void NestedTensorImpl::refresh_dim() { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(dim() == my_dim); } +const char* NestedTensorImpl::tensorimpl_type_name() const { + return "NestedTensorImpl"; +} } // namespace native } // namespace at diff --git a/aten/src/ATen/NestedTensorImpl.h b/aten/src/ATen/NestedTensorImpl.h index 4598a45c3c44fe..5b7757e66dd363 100644 --- a/aten/src/ATen/NestedTensorImpl.h +++ b/aten/src/ATen/NestedTensorImpl.h @@ -29,7 +29,7 @@ struct NestedTensorImpl : public c10::TensorImpl { // TODO: don't expose private implementation details like this; in // particular, resizing this tensor will mess up our dim() and // callers cannot fix it. - const Tensor& get_nested_size_tensor() { + const Tensor& get_nested_size_tensor() const { return nested_size_tensor_; } #ifndef C10_DISABLE_TENSORIMPL_EXTENSIBILITY @@ -53,6 +53,9 @@ struct NestedTensorImpl : public c10::TensorImpl { return buffer_; } + protected: + const char* tensorimpl_type_name() const override; + private: // Must be called after any changes to our dim() to sync the state // to TensorImpl. 
@@ -62,5 +65,29 @@ struct NestedTensorImpl : public c10::TensorImpl { const at::Tensor nested_size_tensor_; }; +inline NestedTensorImpl* get_nested_tensor_impl_or_null(const at::Tensor& tensor) { + if (tensor.is_nested()) { + return static_cast(tensor.unsafeGetTensorImpl()); + } + return nullptr; +} + +inline NestedTensorImpl* get_nested_tensor_impl( + const at::Tensor& tensor) { + TORCH_CHECK( + tensor.is_nested(), + "get_nested_tensor_impl requires a NestedTensor."); + return static_cast( + tensor.unsafeGetTensorImpl()); +} + + +// TODO: real implementation once we support strides. +inline bool nested_tensor_impl_is_contiguous( + const NestedTensorImpl* nt, + at::MemoryFormat memory_format = MemoryFormat::Contiguous) { + return memory_format == MemoryFormat::Contiguous; +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/OpMathType.h b/aten/src/ATen/OpMathType.h index b58d4779ac7a47..7b8ad97d3150ab 100644 --- a/aten/src/ATen/OpMathType.h +++ b/aten/src/ATen/OpMathType.h @@ -1,7 +1,9 @@ #pragma once +#include #include #include +#include namespace at { @@ -13,4 +15,21 @@ template<> struct OpMathType { using type = float; }; template using opmath_type = typename OpMathType::type; +namespace { + +c10::ScalarType toOpMathType(const c10::ScalarType type) { + switch (type) { +#define DEFINE_CASE(scalar_t, TypeNum) \ + case ScalarType::TypeNum: \ + return CppTypeToScalarType>::value; + + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(DEFINE_CASE) +#undef DEFINE_CASE + + default: TORCH_INTERNAL_ASSERT(false, "Unrecognized ScalarType: ", type); + } +} + +} + } // namespace at diff --git a/aten/src/ATen/PythonTorchFunctionTLS.cpp b/aten/src/ATen/PythonTorchFunctionTLS.cpp new file mode 100644 index 00000000000000..ae9f722de60ac6 --- /dev/null +++ b/aten/src/ATen/PythonTorchFunctionTLS.cpp @@ -0,0 +1,38 @@ +#include +#include + +namespace at { +namespace impl { + +static thread_local PythonTorchFunctionTLS pythonTorchFunctionState; + +void PythonTorchFunctionTLS::set_mode(std::shared_ptr mode) { + pythonTorchFunctionState.mode_ = std::move(mode); +} + +const std::shared_ptr& PythonTorchFunctionTLS::get_mode() { + return pythonTorchFunctionState.mode_; +} + +void PythonTorchFunctionTLS::swap_mode(std::shared_ptr& mode) { + pythonTorchFunctionState.mode_.swap(mode); +} + +void PythonTorchFunctionTLS::set_disabled(bool disabled) { + pythonTorchFunctionState.disabled_ = disabled; +} + +bool PythonTorchFunctionTLS::is_disabled() { + return pythonTorchFunctionState.disabled_; +} + +void PythonTorchFunctionTLS::set_state(const PythonTorchFunctionTLS& state) { + pythonTorchFunctionState = state; +} + +const PythonTorchFunctionTLS& PythonTorchFunctionTLS::get_state() { + return pythonTorchFunctionState; +} + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/PythonTorchFunctionTLS.h b/aten/src/ATen/PythonTorchFunctionTLS.h new file mode 100644 index 00000000000000..64256d2f7c21d4 --- /dev/null +++ b/aten/src/ATen/PythonTorchFunctionTLS.h @@ -0,0 +1,26 @@ +#pragma once + +#include +#include + +namespace at { +namespace impl { + +struct TORCH_API PythonTorchFunctionTLS { + static void set_disabled(bool); + static bool is_disabled(); + + static void set_mode(std::shared_ptr); + static const std::shared_ptr& get_mode(); + static void swap_mode(std::shared_ptr&); + + static void set_state(const PythonTorchFunctionTLS& state); + static const PythonTorchFunctionTLS& get_state(); + +private: + bool disabled_; + std::shared_ptr mode_; +}; + +} // namespace impl +} // 
namespace at diff --git a/aten/src/ATen/ScalarOps.cpp b/aten/src/ATen/ScalarOps.cpp index 8eb10266d78fe7..98a38023f9b4f1 100644 --- a/aten/src/ATen/ScalarOps.cpp +++ b/aten/src/ATen/ScalarOps.cpp @@ -15,8 +15,8 @@ inline void fill_inplace(Tensor& self, const Scalar& value_scalar) { namespace detail { Tensor& scalar_fill(Tensor& self, const Scalar& value) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kHalf, kBool, kBFloat16, self.scalar_type(), "fill_out", [&]() { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, kHalf, kBool, kBFloat16, self.scalar_type(), "fill_out", [&]() { fill_inplace(self, value); }); return self; diff --git a/aten/src/ATen/SparseCsrTensorImpl.cpp b/aten/src/ATen/SparseCsrTensorImpl.cpp index 2029189912e6b2..e2f565b6efef12 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.cpp +++ b/aten/src/ATen/SparseCsrTensorImpl.cpp @@ -57,20 +57,31 @@ SparseCsrTensorImpl::SparseCsrTensorImpl( col_indices_(std::move(col_indices)), values_(std::move(values)) { set_storage_access_should_throw(); + is_non_overlapping_and_dense_ = false; + set_has_contiguity_policy(HasContiguityPolicy::ContiguityNotSupported); +} + +const char* SparseCsrTensorImpl::tensorimpl_type_name() const { + return "SparseCsrTensorImpl"; } void SparseCsrTensorImpl::resize_(int64_t nnz, IntArrayRef size) { - auto rows = size[0]; - auto cols = size[1]; + auto rows = size[size.size() - 2]; + auto cols = size[size.size() - 1]; auto old_crow_indices_size = crow_indices_.size(-1); - crow_indices_.resize_({rows + 1}); + + auto new_crow_indices_size = DimVector(size.slice(0, size.size() - 2)); + new_crow_indices_size.push_back(rows + 1); + crow_indices_.resize_(new_crow_indices_size); if (rows + 1 >= old_crow_indices_size) { crow_indices_.narrow(-1, old_crow_indices_size, rows + 1 - old_crow_indices_size).fill_(nnz); } else { crow_indices_.narrow(-1, rows, 1).fill_(std::min(nnz, rows*cols)); } - col_indices_.resize_({std::min(nnz, rows*cols)}); - values_.resize_({std::min(nnz, rows*cols)}); + auto col_indices_values_size = DimVector(size.slice(0, size.size() - 2)); + col_indices_values_size.push_back(std::min(nnz, rows*cols)); + col_indices_.resize_(col_indices_values_size); + values_.resize_(col_indices_values_size); sizes_and_strides_.set_sizes(size); } @@ -113,4 +124,21 @@ void SparseCsrTensorImpl::set_member_tensors( sizes_and_strides_.set_sizes(size); refresh_numel(); } + +IntArrayRef SparseCsrTensorImpl::strides() const { + TORCH_CHECK(false, "Sparse CSR tensors do not have strides."); +} +int64_t SparseCsrTensorImpl::stride(int64_t d) const { + TORCH_CHECK(false, "Sparse CSR tensors do not have strides."); +} +void SparseCsrTensorImpl::set_size(int64_t dim, int64_t new_size) { + TORCH_CHECK(false, "Sparse CSR tensors do not have set_size."); +} +void SparseCsrTensorImpl::set_stride(int64_t dim, int64_t new_stride) { + TORCH_CHECK(false, "Sparse CSR tensors do not have set_stride."); +} +void SparseCsrTensorImpl::set_storage_offset(int64_t storage_offset) { + TORCH_CHECK(false, "Sparse CSR tensors do not have set_storage_offset."); +} + } // namespace at diff --git a/aten/src/ATen/SparseCsrTensorImpl.h b/aten/src/ATen/SparseCsrTensorImpl.h index 850e0a02a44857..ea308f6891a2c0 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.h +++ b/aten/src/ATen/SparseCsrTensorImpl.h @@ -43,7 +43,13 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { const Tensor& crow_indices() const { return crow_indices_; } const Tensor& col_indices() const { return col_indices_; } const Tensor& values() const { return values_; } - 
int nnz() { return values_.size(0); } + int nnz() { return col_indices_.size(-1); } + + IntArrayRef strides() const override; + int64_t stride(int64_t d) const override; + void set_size(int64_t dim, int64_t new_size) override; + void set_stride(int64_t dim, int64_t new_stride) override; + void set_storage_offset(int64_t storage_offset) override; /** * Return a TensorImpl that is a shallow-copy of this TensorImpl. @@ -91,6 +97,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { at::Tensor col_indices, at::Tensor values); + const char* tensorimpl_type_name() const override; + /** * Copy the tensor metadata fields (e.g. sizes / strides / storage pointer / storage_offset) * from one TensorImpl to another TensorImpl. diff --git a/aten/src/ATen/SparseTensorUtils.cpp b/aten/src/ATen/SparseTensorUtils.cpp index d5811b933e7ca5..712e85e851be91 100644 --- a/aten/src/ATen/SparseTensorUtils.cpp +++ b/aten/src/ATen/SparseTensorUtils.cpp @@ -30,7 +30,7 @@ Tensor flatten_indices(const Tensor& indices, IntArrayRef full_size, bool force_ } } else { std::vector indices_mult_cpu_vec; - indices_mult_cpu_vec.reserve(sparse_dim); + indices_mult_cpu_vec.resize(sparse_dim); int64_t mult = 1; for (int64_t i = sparse_dim - 1; i >= 0; i--) { indices_mult_cpu_vec[i] = mult; diff --git a/aten/src/ATen/TensorIterator.cpp b/aten/src/ATen/TensorIterator.cpp index 6c9e03d044ef98..f79dd3066b78ee 100644 --- a/aten/src/ATen/TensorIterator.cpp +++ b/aten/src/ATen/TensorIterator.cpp @@ -745,7 +745,7 @@ void TensorIteratorBase::for_each(loop2d_t loop, int64_t grain_size) { int64_t numel = this->numel(); if (numel == 0) { return; - } else if (numel < grain_size || at::get_num_threads() == 1) { + } else if (numel < internal::GRAIN_SIZE || at::get_num_threads() == 1) { return serial_for_each(loop, {0, numel}); } else { at::parallel_for(0, numel, grain_size, [&](int64_t begin, int64_t end) { @@ -1493,8 +1493,10 @@ void TensorIteratorBase::build(TensorIteratorConfig& config) { // Nothing beyond this point is important for meta functions, so it's fine to exit early here. // Extend the condition to ORT tesnors as ORT tensors also don't have storage. 
if (common_device_.type() == DeviceType::XLA || + common_device_.type() == DeviceType::IPU || common_device_.type() == DeviceType::Lazy || - common_device_.type() == DeviceType::ORT) return; + common_device_.type() == DeviceType::ORT || + common_device_.type() == DeviceType::HPU) return; for (auto& op : operands_) { TORCH_INTERNAL_ASSERT(op.tensor_base().defined()); diff --git a/aten/src/ATen/TensorSubclassLikeUtils.h b/aten/src/ATen/TensorSubclassLikeUtils.h index 7f5517bc08114a..e9f5e7d26e112c 100644 --- a/aten/src/ATen/TensorSubclassLikeUtils.h +++ b/aten/src/ATen/TensorSubclassLikeUtils.h @@ -28,8 +28,7 @@ constexpr auto kFunctorchWrappedTensors = DispatchKeySet({ constexpr auto kTensorSubclassLike = kFunctorchWrappedTensors | DispatchKeySet({ DispatchKey::Batched, - DispatchKey::SparseCPU, - DispatchKey::SparseCUDA, + DispatchKey::Sparse, DispatchKey::SparseCsrCPU, DispatchKey::SparseCsrCUDA, DispatchKey::Meta, diff --git a/aten/src/ATen/ThreadLocalState.cpp b/aten/src/ATen/ThreadLocalState.cpp index 3e3d4d6a957371..fdbd8b1699ba6f 100644 --- a/aten/src/ATen/ThreadLocalState.cpp +++ b/aten/src/ATen/ThreadLocalState.cpp @@ -13,7 +13,8 @@ ThreadLocalState::ThreadLocalState() : dispatch_key_(c10::impl::tls_local_dispatch_key_set()), debug_info_(c10::ThreadLocalDebugInfo::current()), functorch_tls_(functorch::getCopyOfFuncTorchTLS()), - autograd_tls_(c10::AutogradState::get_tls_state()) { + autograd_tls_(c10::AutogradState::get_tls_state()), + python_torch_function_state_(at::impl::PythonTorchFunctionTLS::get_state()) { rf_tls_ = at::get_record_function_tls_(); saved_tensors_default_hooks_ = at::SavedTensorDefaultHooks::get_stack(); @@ -35,6 +36,8 @@ void ThreadLocalState::setThreadLocalState( at::impl::PythonModeTLS::set_state(state.python_mode_state_); + at::impl::PythonTorchFunctionTLS::set_state(state.python_torch_function_state_); + at::set_record_function_tls_(state.rf_tls_); at::SavedTensorDefaultHooks::set_stack(state.saved_tensors_default_hooks_); diff --git a/aten/src/ATen/ThreadLocalState.h b/aten/src/ATen/ThreadLocalState.h index c5f14518f42281..7599c16ad4c802 100644 --- a/aten/src/ATen/ThreadLocalState.h +++ b/aten/src/ATen/ThreadLocalState.h @@ -10,6 +10,7 @@ #include #include #include +#include namespace at { @@ -53,7 +54,11 @@ class TORCH_API ThreadLocalState { // TLS for AutogradModes AutogradState autograd_tls_; - std::shared_ptr python_mode_state_; + // TLS for enable_python_mode (__torch_dispatch__) + std::shared_ptr python_mode_state_; + + // TLS for __torch_function__ (mode and disable_torch_function) + at::impl::PythonTorchFunctionTLS python_torch_function_state_; // TLS for saved tensors default hooks std::stack> saved_tensors_default_hooks_; diff --git a/aten/src/ATen/autocast_mode.cpp b/aten/src/ATen/autocast_mode.cpp index bd9da6a4593502..d2c2232cc6d4be 100644 --- a/aten/src/ATen/autocast_mode.cpp +++ b/aten/src/ATen/autocast_mode.cpp @@ -325,6 +325,7 @@ TORCH_LIBRARY_IMPL(aten, Autocast, m) { KERNEL(ADD_NS(addmv), "addmv", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) KERNEL(ADD_NS(addr), "addr", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) KERNEL(ADD_NS(matmul), "matmul", Tensor (const Tensor &, const Tensor &), lower_precision_fp) + KERNEL(ADD_NS(einsum), "einsum", Tensor (c10::string_view, TensorList), lower_precision_fp) KERNEL(ADD_NS(mm), "mm", Tensor (const Tensor &, const Tensor &), lower_precision_fp) KERNEL(ADD_NS(mv), "mv", Tensor 
(const Tensor &, const Tensor &), lower_precision_fp) KERNEL(ADD_NS(linear), "linear", Tensor (const Tensor &, const Tensor &, const c10::optional&), lower_precision_fp) @@ -487,23 +488,23 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(avg_pool3d), "avg_pool3d", Tensor (const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef, bool, bool, c10::optional), fp32) KERNEL_CPU(ADD_NS(gelu), "gelu", Tensor (const Tensor &, c10::string_view), fp32) KERNEL_CPU(ADD_NS(upsample_nearest1d), "upsample_nearest1d", Tensor (const Tensor &, IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(upsample_nearest1d), "upsample_nearest1d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_nearest1d), "upsample_nearest1d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(_upsample_nearest_exact1d), "_upsample_nearest_exact1d", Tensor (const Tensor &, IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(_upsample_nearest_exact1d), "_upsample_nearest_exact1d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(_upsample_nearest_exact1d), "_upsample_nearest_exact1d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(upsample_nearest2d), "upsample_nearest2d", Tensor (const Tensor &, IntArrayRef, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(upsample_nearest2d), "upsample_nearest2d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_nearest2d), "upsample_nearest2d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(_upsample_nearest_exact2d), "_upsample_nearest_exact2d", Tensor (const Tensor &, IntArrayRef, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(_upsample_nearest_exact2d), "_upsample_nearest_exact2d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(_upsample_nearest_exact2d), "_upsample_nearest_exact2d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(upsample_nearest3d), "upsample_nearest3d", Tensor (const Tensor &, IntArrayRef, c10::optional, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(upsample_nearest3d), "upsample_nearest3d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_nearest3d), "upsample_nearest3d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(_upsample_nearest_exact3d), "_upsample_nearest_exact3d", Tensor (const Tensor &, IntArrayRef, c10::optional, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(_upsample_nearest_exact3d), "_upsample_nearest_exact3d.vec", Tensor (const Tensor &, c10::optional, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(_upsample_nearest_exact3d), "_upsample_nearest_exact3d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, c10::optional>), fp32) KERNEL_CPU(ADD_NS(upsample_linear1d), "upsample_linear1d", Tensor (const Tensor &, IntArrayRef, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(upsample_linear1d), "upsample_linear1d.vec", Tensor (const Tensor &, c10::optional, bool, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_linear1d), "upsample_linear1d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, bool, c10::optional>), fp32) KERNEL_CPU(ADD_NS(upsample_bilinear2d), "upsample_bilinear2d", Tensor (const Tensor &, IntArrayRef, bool, c10::optional, c10::optional), fp32) - 
KERNEL_CPU(ADD_NS(upsample_bilinear2d), "upsample_bilinear2d.vec", Tensor (const Tensor &, c10::optional, bool, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_bilinear2d), "upsample_bilinear2d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, bool, c10::optional>), fp32) KERNEL_CPU(ADD_NS(upsample_trilinear3d), "upsample_trilinear3d", Tensor (const Tensor &, IntArrayRef, bool, c10::optional, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(upsample_trilinear3d), "upsample_trilinear3d.vec", Tensor (const Tensor &, c10::optional, bool, c10::optional>), fp32) + KERNEL_CPU(ADD_NS(upsample_trilinear3d), "upsample_trilinear3d.vec", Tensor (const Tensor &, at::OptionalIntArrayRef, bool, c10::optional>), fp32) KERNEL_CPU(ADD_NS(binary_cross_entropy), "binary_cross_entropy", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t), fp32) KERNEL_CPU(ADD_NS(binary_cross_entropy_with_logits), "binary_cross_entropy_with_logits", Tensor (const Tensor &, const Tensor &, const c10::optional&, const c10::optional&, int64_t), fp32) @@ -522,6 +523,7 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(nanquantile), "nanquantile", Tensor(const Tensor &, const Tensor &, c10::optional, bool, c10::string_view), fp32) KERNEL_CPU(ADD_NS(nanquantile), "nanquantile.scalar", Tensor(const Tensor &, double, c10::optional, bool, c10::string_view), fp32) KERNEL_CPU(ADD_NS(stft), "stft", Tensor(const Tensor &, int64_t, c10::optional, c10::optional, const c10::optional &, bool, c10::optional, c10::optional), fp32) + KERNEL_CPU(ADD_NS(stft), "stft.center", Tensor(const Tensor &, int64_t, c10::optional, c10::optional, const c10::optional &, bool, c10::string_view, bool, c10::optional, c10::optional), fp32) KERNEL_CPU(ADD_NS(cdist), "cdist", Tensor(const Tensor &, const Tensor &, double, c10::optional), fp32) KERNEL_CPU(ADD_NS(cross), "cross", Tensor(const Tensor &, const Tensor &, c10::optional), fp32) KERNEL_CPU(ADD_NS(cumprod), "cumprod", Tensor(const Tensor &, int64_t, c10::optional), fp32) @@ -580,16 +582,16 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(multilabel_margin_loss), "multilabel_margin_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) KERNEL_CPU(ADD_NS(fft_fft), "fft_fft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_ifft), "fft_ifft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_fft2), "fft_fft2", Tensor(const Tensor &, c10::optional, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ifft2), "fft_ifft2", Tensor(const Tensor &, c10::optional, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_fftn), "fft_fftn", Tensor(const Tensor &, c10::optional, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ifftn), "fft_ifftn", Tensor(const Tensor &, c10::optional, c10::optional, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_fft2), "fft_fft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_ifft2), "fft_ifft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_fftn), "fft_fftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_ifftn), "fft_ifftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_rfft), "fft_rfft", Tensor(const Tensor &, c10::optional, int64_t, 
c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_irfft), "fft_irfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_rfft2), "fft_rfft2", Tensor(const Tensor &, c10::optional, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_irfft2), "fft_irfft2", Tensor(const Tensor &, c10::optional, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_rfftn), "fft_rfftn", Tensor(const Tensor &, c10::optional, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_irfftn), "fft_irfftn", Tensor(const Tensor &, c10::optional, c10::optional, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_rfft2), "fft_rfft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_irfft2), "fft_irfft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_rfftn), "fft_rfftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) + KERNEL_CPU(ADD_NS(fft_irfftn), "fft_irfftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_hfft), "fft_hfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_ihfft), "fft_ihfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(conv_tbc), "conv_tbc", Tensor(const Tensor &, const Tensor &, const Tensor &, int64_t), fp32) @@ -607,7 +609,7 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(linalg_inv), "linalg_inv", Tensor(const Tensor &), fp32) KERNEL_CPU(ADD_NS(linalg_householder_product), "linalg_householder_product", Tensor(const Tensor &, const Tensor &), fp32) KERNEL_CPU(ADD_NS(linalg_tensorinv), "linalg_tensorinv", Tensor(const Tensor &, int64_t), fp32) - KERNEL_CPU(ADD_NS(linalg_tensorsolve), "linalg_tensorsolve", Tensor(const Tensor &, const Tensor &, c10::optional), fp32) + KERNEL_CPU(ADD_NS(linalg_tensorsolve), "linalg_tensorsolve", Tensor(const Tensor &, const Tensor &, at::OptionalIntArrayRef), fp32) KERNEL_CPU(ADD_NS(fake_quantize_per_tensor_affine), "fake_quantize_per_tensor_affine", Tensor (const Tensor &, double, int64_t, int64_t, int64_t), fp32) KERNEL_CPU(ADD_NS(glu), "glu", Tensor (const Tensor &, int64_t), fp32) diff --git a/aten/src/ATen/core/Formatting.cpp b/aten/src/ATen/core/Formatting.cpp index f3122daf2cc6d6..832059ed198077 100644 --- a/aten/src/ATen/core/Formatting.cpp +++ b/aten/src/ATen/core/Formatting.cpp @@ -12,6 +12,28 @@ namespace c10 { std::ostream& operator<<(std::ostream & out, Backend b) { return out << toString(b); } + +std::ostream& operator<<(std::ostream & out, Scalar s) { + if (s.isFloatingPoint()) { + return out << s.toDouble(); + } + if (s.isComplex()) { + return out << s.toComplexDouble(); + } + if (s.isBoolean()) { + return out << (s.toBool() ? 
"true" : "false"); + } + if (s.isIntegral(false)) { + return out << s.toLong(); + } + throw std::logic_error("Unknown type in Scalar"); +} + +std::string toString(Scalar s) { + std::stringstream out; + out << s; + return out.str(); +} } namespace at { diff --git a/aten/src/ATen/core/Formatting.h b/aten/src/ATen/core/Formatting.h index 55cfe7b3bdf7e9..6dcfc6c7b3cd15 100644 --- a/aten/src/ATen/core/Formatting.h +++ b/aten/src/ATen/core/Formatting.h @@ -1,12 +1,15 @@ #pragma once -#include -#include #include +#include +#include +#include namespace c10 { TORCH_API std::ostream& operator<<(std::ostream& out, Backend b); +TORCH_API std::ostream& operator<<(std::ostream & out, Scalar s); +TORCH_API std::string toString(Scalar s); } namespace at { @@ -19,21 +22,4 @@ static inline std::ostream& operator<<(std::ostream & out, const Tensor & t) { return print(out,t,80); } TORCH_API void print(const Tensor & t, int64_t linesize=80); - -static inline std::ostream& operator<<(std::ostream & out, Scalar s) { - if (s.isFloatingPoint()) { - return out << s.toDouble(); - } - if (s.isComplex()) { - return out << s.toComplexDouble(); - } - if (s.isBoolean()) { - return out << (s.toBool() ? "true" : "false"); - } - if (s.isIntegral(false)) { - return out << s.toLong(); - } - throw std::logic_error("Unknown type in Scalar"); -} - } diff --git a/aten/src/ATen/core/ITensorListRef.h b/aten/src/ATen/core/ITensorListRef.h new file mode 100644 index 00000000000000..aaa128b7f2e5ee --- /dev/null +++ b/aten/src/ATen/core/ITensorListRef.h @@ -0,0 +1,445 @@ +#pragma once + +#include +#include +#include + +#include +#include +#include +#include + +namespace at { +class Tensor; +} + +namespace c10 { +class ITensorListRef; +class ITensorListRefIterator; + +// Applies arbitrary macros to each `ITensorListRefTag`. +#define TORCH_ITENSORLISTREF_FORALL_TAGS(_, ...) \ + _(Unboxed, ##__VA_ARGS__) \ + _(Boxed, ##__VA_ARGS__) + +// Builds the name of the implementation class for `TAG`. +#define TORCH_ITENSORLISTREF_IMPL(TAG) \ + c10::detail::ITensorListRefTagImpl + +// Defines a "switch-case" for `TAG`. Inside, it executes `BODY`, +// while bringing to scope: +// - `ImplT`: the implementation class for `TAG` +// - `this_`: the result of unwrapping `this` +#define TORCH_ITENSORLISTREF_UNWRAP_CASE(TAG, BODY) \ + case c10::ITensorListRefTag::TAG: { \ + using ImplT = TORCH_ITENSORLISTREF_IMPL(TAG); \ + auto& this_ = ImplT::unwrap(*this); \ + BODY \ + } break; + +// Dispatches the unwrap call, depending on `TAG`, followed by +// the execution of `BODY`. It aborts if `TAG` is not a `ITensorListRefTag`. +#define TORCH_ITENSORLISTREF_UNWRAP(TAG, BODY) \ + switch (TAG) { \ + TORCH_ITENSORLISTREF_FORALL_TAGS(TORCH_ITENSORLISTREF_UNWRAP_CASE, BODY) \ + default: \ + TORCH_INTERNAL_ASSERT(false, "invalid ITensorListRef tag."); \ + } + +enum class ITensorListRefTag { +#define DEFINE_TAG(tag, ...) tag, + TORCH_ITENSORLISTREF_FORALL_TAGS(DEFINE_TAG) +#undef DEFINE_TAG + None +}; + +namespace detail { +using ITensorListRefConstRef = + typename detail::ivalue_to_const_ref_overload_return::type; + +/* + * Interface that implements key functions for each `ITensorListRefTag` type. + * + * You should create an specialization of this class for each + * possible `ITensorListRefTag` type (except `None`). 
+ * + * Specializations of this class should, at least, define: + * - a type `list_type` + * - 1 function `unwrap` for getting the actual `list_type` + * - 2 functions `unwrap` (const and non-const overloads) for getting + * iterators of `list_type` + * - a function `iterator_get` + * + * See the examples below. + */ +template +class ITensorListRefTagImpl {}; + +template <> +class ITensorListRefTagImpl { + public: + using list_type = at::ArrayRef; + + // Unwraps an `ITensorListRef` into a const-ref of type `list_type`. + static const list_type& unwrap(const ITensorListRef& ilist); + + // Unwraps an `ITensorListRefIterator` into a (const) ref of type + // `list_type::const_iterator`. Has overload for const. + static list_type::const_iterator& unwrap(ITensorListRefIterator& it); + static const list_type::const_iterator& unwrap(const ITensorListRefIterator& it); + + // Accesses the element referenced by the unwrapped iterator `it`. + static ITensorListRefConstRef iterator_get(const list_type::const_iterator& it); +}; + +template <> +class ITensorListRefTagImpl { + public: + using list_type = List; + static const list_type& unwrap(const ITensorListRef& ilist); + static list_type::const_iterator& unwrap(ITensorListRefIterator& it); + static const list_type::const_iterator& unwrap(const ITensorListRefIterator& it); + static ITensorListRefConstRef iterator_get(const list_type::const_iterator& it); +}; +} // namespace detail + +/* + * Materialized list for `ITensorListRef`. + * + * Container that groups `Tensor` references together. This exchanges the + * overhead of every method call from `ITensorListRef` for a dynamic allocation. + * + * You should use this container instead of `ITensorListRef` if: + * + * - You are going to iterate the list of tensors more than once + * - You need to repeatedly access arbitrary elements (using `operator[]`) + */ +using MaterializedITensorListRef = + std::vector>; + +/* + * Wrapper around both boxed and unboxed iterators. + * + * Currently, a `std::bidirectional_iterator` that wraps those defined for + * each of the `ITensorListRefTag`. + * + * One should be able to use it as if it were the unwrapped iterators + * themselves. + * + * [Note: MSVC Iterator Debug] + * =========================== + * MSVC `vector::iterator` implementation (used in the boxed variant) + * makes it so this union's destructor, copy-constructor (assignment), and + * move-constructor (assignment) are implicitly deleted. + * + * Therefore, we need to define them explicitly as needed. Below is a list + * of the places where these are needed, and why: + * + * - `Payload` destructor: + * it is deleted only if the macro `_ITERATOR_DEBUG_LEVEL` is set to 2. + * + * - `ITensorListRefIterator` destructor: + * same as above. However, we also need to call the variant's + * destructor explicitly. + * + * - `ITensorListRefIterator` copy-constructor: + * it is deleted only if the macro `_ITERATOR_DEBUG_LEVEL` is different + * from 0. + */ +class ITensorListRefIterator + : public std::iterator< + std::bidirectional_iterator_tag, + detail::ITensorListRefConstRef, + ptrdiff_t, + std::add_pointer, + std::add_rvalue_reference> { + private: +#define DEFINE_FRIEND_CLASS(TAG, ...) 
friend class TORCH_ITENSORLISTREF_IMPL(TAG); + TORCH_ITENSORLISTREF_FORALL_TAGS(DEFINE_FRIEND_CLASS) +#undef DEFINE_FRIEND_CLASS + + using unboxed_iterator_type = + TORCH_ITENSORLISTREF_IMPL(Unboxed)::list_type::const_iterator; + using boxed_iterator_type = + TORCH_ITENSORLISTREF_IMPL(Boxed)::list_type::const_iterator; + + union Payload { + boxed_iterator_type boxed_iterator; + unboxed_iterator_type unboxed_iterator; + void* _init_ptr; + Payload() : _init_ptr(nullptr) {} +#if defined(_MSC_VER) && _ITERATOR_DEBUG_LEVEL == 2 + // See [Note: MSVC Iterator Debug] + ~Payload() {} +#endif + }; + + public: + ITensorListRefIterator() : tag_(ITensorListRefTag::None) {} + +#if defined(_MSC_VER) && _ITERATOR_DEBUG_LEVEL != 0 + // See [Note: MSVC Iterator Debug] + ITensorListRefIterator(const ITensorListRefIterator& iterator) + : tag_(iterator.tag_) { + switch (tag_) { + case ITensorListRefTag::Boxed: + payload_.boxed_iterator = iterator.payload_.boxed_iterator; + case ITensorListRefTag::Unboxed: + payload_.unboxed_iterator = iterator.payload_.unboxed_iterator; + default: + TORCH_INTERNAL_ASSERT(false, "invalid ITensorListRef tag."); + } + } +#endif + +#if defined(_MSC_VER) && _ITERATOR_DEBUG_LEVEL == 2 + // See [Note: MSVC Iterator Debug] + ~ITensorListRefIterator() { + switch (tag_) { + case ITensorListRefTag::Boxed: + payload_.boxed_iterator.~boxed_iterator_type(); + case ITensorListRefTag::Unboxed: + payload_.unboxed_iterator.~unboxed_iterator_type(); + default: + TORCH_INTERNAL_ASSERT(false, "invalid ITensorListRef tag."); + } + } +#endif + + ITensorListRefIterator(boxed_iterator_type boxed) : tag_(ITensorListRefTag::Boxed) { + payload_.boxed_iterator = boxed; + } + + ITensorListRefIterator(unboxed_iterator_type unboxed) + : tag_(ITensorListRefTag::Unboxed) { + payload_.unboxed_iterator = unboxed; + } + + detail::ITensorListRefConstRef operator*() const { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { return ImplT::iterator_get(this_); }); + } + + ITensorListRefIterator& operator++() { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { ++this_; }); + return *this; + } + + ITensorListRefIterator operator++(int) { + auto old = *this; + TORCH_ITENSORLISTREF_UNWRAP(tag_, { ++this_; }); + return old; + } + + ITensorListRefIterator& operator--() { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { --this_; }); + return *this; + } + + ITensorListRefIterator operator--(int) { + auto old = *this; + TORCH_ITENSORLISTREF_UNWRAP(tag_, { --this_; }); + return old; + } + + bool operator==(const ITensorListRefIterator& rhs) const { + if (tag_ != rhs.tag_) { + return false; + } + TORCH_ITENSORLISTREF_UNWRAP(tag_, { + auto& rhs_it = ImplT::unwrap(rhs); + return this_ == rhs_it; + }); + } + + bool operator!=(const ITensorListRefIterator& rhs) const { + return !(*this == rhs); + } + + private: + Payload payload_; + ITensorListRefTag tag_; +}; + +/* + * [Note: ITensorListRef] + * Wrapper around boxed and unboxed API containers. + * + * Tagged union of both API containers: + * - `TensorList`, a.k.a. `ArrayRef` (the unboxed API container) + * - `List` (the boxed API container) + * + * This container wraps around these two, without incurring in extra overhead + * for converting from one to another. + * + * Note that `ITensorListRef` is a view type. Meaning that it won't own the + * tensors it holds. If you need it to last longer, make sure that there is + * actually a non-temporary list of tensors (e.g. `vector`) that owns + * them and outlives the `ITensorListRef` instance. 
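+ *
+ * A minimal illustrative sketch (hypothetical caller code, not taken from
+ * this patch): a function accepting an `ITensorListRef` can be handed either
+ * API container without a conversion copy, as long as the owning container
+ * outlives the call:
+ *
+ *   void consume(at::ITensorListRef tensors) {
+ *     for (const at::Tensor& t : tensors) {
+ *       // use `t` ...
+ *     }
+ *   }
+ *
+ *   std::vector<at::Tensor> owned = {at::zeros({2})};  // unboxed owner
+ *   consume(owned);                 // wrapped as an ArrayRef<Tensor> view
+ *   c10::List<at::Tensor> boxed(owned);                // boxed owner
+ *   consume(boxed);                 // wrapped as a reference to the List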
+ * + * (see https://github.com/pytorch/pytorch/issues/66328) + */ +class ITensorListRef { + private: +#define DEFINE_FRIEND_CLASS(TAG, ...) friend class TORCH_ITENSORLISTREF_IMPL(TAG); + TORCH_ITENSORLISTREF_FORALL_TAGS(DEFINE_FRIEND_CLASS) +#undef DEFINE_FRIEND_CLASS + + using unboxed_type = TORCH_ITENSORLISTREF_IMPL(Unboxed)::list_type; + using boxed_type = TORCH_ITENSORLISTREF_IMPL(Boxed)::list_type; + + union Payload { + const boxed_type* boxed; + unboxed_type unboxed; + Payload() : boxed(nullptr) {} + ~Payload() {}; + }; + + public: + using iterator = ITensorListRefIterator; + using const_iterator = ITensorListRefIterator; + using value_type = typename iterator::value_type; + + ITensorListRef() : tag_(ITensorListRefTag::None) {} + + ITensorListRef(const std::initializer_list& list) + : tag_(ITensorListRefTag::Unboxed) { + payload_.unboxed = at::ArrayRef(list); + } + + ITensorListRef(const boxed_type& boxed) : tag_(ITensorListRefTag::Boxed) { + payload_.boxed = &boxed; + } + + ITensorListRef(const unboxed_type& unboxed) : tag_(ITensorListRefTag::Unboxed) { + payload_.unboxed = unboxed; + } + + template < + typename... UnboxedConstructorArgs, + typename = std::enable_if_t< + std::is_constructible::value>> + ITensorListRef(UnboxedConstructorArgs&&... args) + : tag_(ITensorListRefTag::Unboxed) { + payload_.unboxed = unboxed_type(std::forward(args)...); + } + + size_t size() const { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { return this_.size(); }); + } + + bool empty() const { + return size() == 0; + } + + iterator begin() const { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { return this_.begin(); }); + } + + iterator end() const { + TORCH_ITENSORLISTREF_UNWRAP(tag_, { return this_.end(); }); + } + + MaterializedITensorListRef materialize() const { + MaterializedITensorListRef materialized; + materialized.reserve(size()); + for (const auto& t : *this) { + materialized.emplace_back(t); + } + return materialized; + } + +#define DEFINE_CHECK(TAG, ...) \ + bool is##TAG() const { \ + return tag_ == ITensorListRefTag::TAG; \ + } + TORCH_ITENSORLISTREF_FORALL_TAGS(DEFINE_CHECK); +#undef DEFINE_CHECK + + bool isNone() const { + return tag_ == ITensorListRefTag::None; + } + +#define DEFINE_CASTING(TAG, ...) 
\ + const typename TORCH_ITENSORLISTREF_IMPL(TAG)::list_type& to##TAG() const { \ + TORCH_INTERNAL_ASSERT(is##TAG()); \ + return TORCH_ITENSORLISTREF_IMPL(TAG)::unwrap(*this); \ + } + TORCH_ITENSORLISTREF_FORALL_TAGS(DEFINE_CASTING); +#undef DEFINE_CASTING + + private: + Payload payload_; + ITensorListRefTag tag_; +}; + +} // namespace c10 + +inline +const TORCH_ITENSORLISTREF_IMPL(Unboxed)::list_type& +TORCH_ITENSORLISTREF_IMPL(Unboxed)::unwrap( + const c10::ITensorListRef& ilist +) { + return ilist.payload_.unboxed; +} + +inline +TORCH_ITENSORLISTREF_IMPL(Unboxed)::list_type::const_iterator& +TORCH_ITENSORLISTREF_IMPL(Unboxed)::unwrap( + c10::ITensorListRefIterator& it +) { + return it.payload_.unboxed_iterator; +} + +inline +const TORCH_ITENSORLISTREF_IMPL(Unboxed)::list_type::const_iterator& +TORCH_ITENSORLISTREF_IMPL(Unboxed)::unwrap( + const c10::ITensorListRefIterator& it +) { + return it.payload_.unboxed_iterator; +} + +inline +c10::detail::ITensorListRefConstRef +TORCH_ITENSORLISTREF_IMPL(Unboxed)::iterator_get( + const list_type::const_iterator& it +) { + return *it; +} + +inline +const TORCH_ITENSORLISTREF_IMPL(Boxed)::list_type& +TORCH_ITENSORLISTREF_IMPL(Boxed)::unwrap( + const c10::ITensorListRef& ilist +) { + return *ilist.payload_.boxed; +} + +inline +TORCH_ITENSORLISTREF_IMPL(Boxed)::list_type::const_iterator& +TORCH_ITENSORLISTREF_IMPL(Boxed)::unwrap( + c10::ITensorListRefIterator& it +) { + return it.payload_.boxed_iterator; +} + +inline +const TORCH_ITENSORLISTREF_IMPL(Boxed)::list_type::const_iterator& +TORCH_ITENSORLISTREF_IMPL(Boxed)::unwrap( + const c10::ITensorListRefIterator& it +) { + return it.payload_.boxed_iterator; +} + +inline +c10::detail::ITensorListRefConstRef +TORCH_ITENSORLISTREF_IMPL(Boxed)::iterator_get( + const list_type::const_iterator& it +) { + return (*it).get().toTensor(); +} + +namespace at { +using ITensorListRef = c10::ITensorListRef; +using ITensorListRefIterator = c10::ITensorListRefIterator; +using MaterializedITensorListRef = c10::MaterializedITensorListRef; +} // namespace at diff --git a/aten/src/ATen/core/ITensorListRef_test.cpp b/aten/src/ATen/core/ITensorListRef_test.cpp new file mode 100644 index 00000000000000..679ccea5865ffa --- /dev/null +++ b/aten/src/ATen/core/ITensorListRef_test.cpp @@ -0,0 +1,188 @@ +#include +#include +#include + +using namespace c10; + +static std::vector get_tensor_vector() { + std::vector boxed; + const size_t SIZE = 5; + for (size_t i = 0; i < SIZE; i++) { + boxed.push_back(at::empty({0})); + } + return boxed; +} + +template +void check_elements_same(ITensorListRef list, const T& thing, int use_count) { + EXPECT_EQ(thing.size(), list.size()); + size_t i = 0; + for (const auto& t : list) { + const at::Tensor& other = thing[i]; + EXPECT_EQ(other.use_count(), use_count); + EXPECT_TRUE(other.is_same(t)); + i++; + } +} + +TEST(ITensorListRefTest, CtorEmpty_IsNone_Throws) { + ITensorListRef list; + EXPECT_TRUE(list.isNone()); + // NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto) + EXPECT_THROW(list.size(), c10::Error); +} + +TEST(ITensorListRefTest, CtorBoxed_IsBoxed) { + auto vec = get_tensor_vector(); + List boxed(vec); + ITensorListRef list(boxed); + EXPECT_TRUE(list.isBoxed()); +} + +TEST(ITensorListRefTest, CtorUnboxed_IsUnboxed) { + auto vec = get_tensor_vector(); + at::ArrayRef unboxed(vec); + ITensorListRef list(unboxed); + EXPECT_TRUE(list.isUnboxed()); +} + +TEST(ITensorListRefTest, CtorUnboxedIndirect_IsUnboxed) { + auto vec = get_tensor_vector(); + auto check_is_unboxed = 
[](ITensorListRef list) { + EXPECT_TRUE(list.isUnboxed()); + }; + check_is_unboxed(vec[0]); + check_is_unboxed({vec.data(), vec.size()}); + check_is_unboxed({&*vec.begin(), &*vec.end()}); + check_is_unboxed(vec); +} + +TEST(ITensorListRefTest, CtorTemp_IsUnboxed) { + auto check_is_unboxed = [](ITensorListRef list) { + EXPECT_TRUE(list.isUnboxed()); + }; + + auto vec = get_tensor_vector(); + check_is_unboxed({vec[0], vec[1]}); +} + +TEST(ITensorListRefTest, Boxed_GetConstRefTensor) { + auto vec = get_tensor_vector(); + // We need 'boxed' to be 'const' here (and some other tests below) + // because 'List::operator[]' returns a 'ListElementReference' + // instead of returning a 'Tensor'. On the other hand, + // 'List::operator[] const' returns a 'const Tensor &'. + const List boxed(vec); + ITensorListRef list(boxed); + static_assert( + std::is_same::value, + "Accessing elements from List through a ITensorListRef should be const references."); + EXPECT_TRUE(boxed[0].is_same(*list.begin())); + EXPECT_TRUE(boxed[1].is_same(*(++list.begin()))); +} + +TEST(ITensorListRefTest, Unboxed_GetConstRefTensor) { + auto vec = get_tensor_vector(); + ITensorListRef list(vec); + static_assert( + std::is_same::value, + "Accessing elements from ArrayRef through a ITensorListRef should be const references."); + EXPECT_TRUE(vec[0].is_same(*list.begin())); + EXPECT_TRUE(vec[1].is_same(*(++list.begin()))); +} + +TEST(ITensorListRefTest, Boxed_Equal) { + auto vec = get_tensor_vector(); + List boxed(vec); + check_elements_same(boxed, vec, /* use_count= */ 2); +} + +TEST(ITensorListRefTest, Unboxed_Equal) { + auto vec = get_tensor_vector(); + check_elements_same(at::ArrayRef(vec), vec, /* use_count= */ 1); +} + +TEST(ITensorListRefTest, UnboxedIndirect_Equal) { + auto vec = get_tensor_vector(); + check_elements_same(vec[0], std::vector{vec[0]}, /* use_count= */ 3); + check_elements_same({vec.data(), vec.size()}, vec, /* use_count= */ 1); + check_elements_same({&*vec.begin(), &*vec.end()}, vec, /* use_count= */ 1); + check_elements_same(vec, vec, /* use_count= */ 1); +} + +TEST(ITensorListRefTest, BoxedMaterialize_Equal) { + auto vec = get_tensor_vector(); + List boxed(vec); + ITensorListRef list(boxed); + auto materialized = list.materialize(); + check_elements_same(list, vec, 2); + check_elements_same(list, materialized, 2); +} + +TEST(ITensorListRefTest, UnboxedMaterialize_Equal) { + auto vec = get_tensor_vector(); + at::ArrayRef unboxed(vec); + ITensorListRef list(unboxed); + auto materialized = list.materialize(); + check_elements_same(list, vec, 1); + check_elements_same(list, materialized, 1); +} + +TEST(ITensorListRefIteratorTest, CtorEmpty_ThrowsError) { + ITensorListRefIterator it; + // NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto) + EXPECT_THROW(*it, c10::Error); +} + +TEST(ITensorListRefIteratorTest, Boxed_GetFirstElement) { + auto vec = get_tensor_vector(); + const List boxed(vec); + ITensorListRef list(boxed); + EXPECT_TRUE(boxed[0].is_same(*list.begin())); +} + +TEST(ITensorListRefIteratorTest, Unboxed_GetFirstElement) { + auto vec = get_tensor_vector(); + ITensorListRef list(vec); + EXPECT_TRUE(vec[0].is_same(*list.begin())); +} + +TEST(ITensorListRefIteratorTest, Boxed_Equality) { + auto vec = get_tensor_vector(); + List boxed(vec); + ITensorListRef list(boxed); + EXPECT_EQ(list.begin(), list.begin()); + EXPECT_NE(list.begin(), list.end()); + EXPECT_NE(list.end(), list.begin()); + EXPECT_EQ(list.end(), list.end()); +} + +TEST(ITensorListRefIteratorTest, Unboxed_Equality) { + auto vec = 
get_tensor_vector(); + ITensorListRef list(vec); + EXPECT_EQ(list.begin(), list.begin()); + EXPECT_NE(list.begin(), list.end()); + EXPECT_NE(list.end(), list.begin()); + EXPECT_EQ(list.end(), list.end()); +} + +TEST(ITensorListRefIteratorTest, Boxed_Iterate) { + auto vec = get_tensor_vector(); + const List boxed(vec); + ITensorListRef list(boxed); + size_t i = 0; + for (const auto& t : list) { + EXPECT_TRUE(boxed[i++].is_same(t)); + } + EXPECT_EQ(i, list.size()); +} + +TEST(ITensorListRefIteratorTest, Unboxed_Iterate) { + auto vec = get_tensor_vector(); + ITensorListRef list(vec); + size_t i = 0; + for (const auto& t : list) { + EXPECT_TRUE(vec[i++].is_same(t)); + } + EXPECT_EQ(i, list.size()); +} diff --git a/aten/src/ATen/core/List.h b/aten/src/ATen/core/List.h index b042fab24f7d8c..0785a6941affda 100644 --- a/aten/src/ATen/core/List.h +++ b/aten/src/ATen/core/List.h @@ -78,6 +78,10 @@ class ListElementReference final { // assigning another ref to this assigns the underlying value ListElementReference& operator=(ListElementReference&& rhs) &&; + const IValue& get() const& { + return *iterator_; + } + friend void swap(ListElementReference&& lhs, ListElementReference&& rhs); private: @@ -235,6 +239,7 @@ class List final { using value_type = T; using size_type = typename c10::detail::ListImpl::list_type::size_type; using iterator = impl::ListIterator; + using const_iterator = impl::ListIterator; using reverse_iterator = impl::ListIterator; /** diff --git a/aten/src/ATen/core/PythonFallbackKernel.cpp b/aten/src/ATen/core/PythonFallbackKernel.cpp index 37766077287b54..41becc56735496 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.cpp +++ b/aten/src/ATen/core/PythonFallbackKernel.cpp @@ -1,28 +1,65 @@ -#include -#include #include +#include +#include +#include #include namespace { -// TLS saving the state of the include/exclude sets on entry to the dispatcher -// This is set in the pythonTLSSnapshot fallback and used by the Python fallback. -thread_local std::stack tls_on_entry; +// This TLS is used to track the state of the dispatcher to be able to restore +// it when calling back into python. +// It has the following invariant: +// - It must be empty while python code is executed. +// - It should only be set once even for multiple dispatcher calls that do not come +// back to python. +// To achieve this, we ensure that the tls is empty by default and emptied again both when +// we call into user torch_dispatch or returning back to python after this call. -struct StashTLSStateGuard { - public: - StashTLSStateGuard(const c10::impl::LocalDispatchKeySet& key_set) { - tls_on_entry.push(key_set); +thread_local c10::optional tls_on_entry; + +// RAII guard to make working with the above TLS safer. +struct MaybeSetTLSOnEntryGuard { +public: + MaybeSetTLSOnEntryGuard() { + if (tls_on_entry.has_value()) { + value_set_ = false; + } else { + value_set_ = true; + tls_on_entry = c10::impl::tls_local_dispatch_key_set(); + } } - ~StashTLSStateGuard() { - tls_on_entry.pop(); + ~MaybeSetTLSOnEntryGuard() { + if (value_set_) { + TORCH_INTERNAL_ASSERT(tls_on_entry.has_value()); + tls_on_entry = c10::nullopt; + } } + +private: + bool value_set_; +}; + +// This guard assumes that tls_on_entry has a value. 
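+// It is instantiated in `pythonFallback` (below) right before calling back
+// into Python, while `MaybeSetTLSOnEntryGuard` is instantiated in
+// `pythonTLSSnapshotFallback`, so the dispatcher TLS is captured at most once
+// per entry from non-Python code and is emptied again while user
+// __torch_dispatch__ code runs.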
+struct StashTLSOnEntryGuard { +public: + StashTLSOnEntryGuard(): saved_(tls_on_entry.value()) { + tls_on_entry = c10::nullopt; + } + + ~StashTLSOnEntryGuard() { + TORCH_INTERNAL_ASSERT(!tls_on_entry.has_value()); + tls_on_entry = saved_; + } + +private: + c10::impl::LocalDispatchKeySet saved_; }; void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - TORCH_INTERNAL_ASSERT(tls_on_entry.size() > 0); - c10::impl::ForceDispatchKeyGuard guard(tls_on_entry.top()); + TORCH_INTERNAL_ASSERT(tls_on_entry.has_value()); + c10::impl::ForceDispatchKeyGuard dispatcher_guard(tls_on_entry.value()); + StashTLSOnEntryGuard stash_guard; // If Python Mode is active, use its PyInterpreter for dispatch const auto& maybe_python_mode_state = at::impl::PythonModeTLS::get_state(); @@ -63,10 +100,9 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { void pythonTLSSnapshotFallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatch_keys, torch::jit::Stack* stack) { // It is ok for the tls to be already set here. - // A CompositeImplicitAutograd function may have been called just before this and so the tls here were never cleared - // This is also why we don't need an RAII to ensure the tls is reset when exceptions happen - - StashTLSStateGuard guard(c10::impl::tls_local_dispatch_key_set()); + // It means that there are multiple calls into the dispatcher not originating from python code. + // The guard below will properly ignore such calls. + MaybeSetTLSOnEntryGuard guard; op.redispatchBoxed(dispatch_keys & c10::DispatchKeySet(c10::DispatchKeySet::FULL_AFTER, c10::DispatchKey::PythonTLSSnapshot), stack); } diff --git a/aten/src/ATen/core/PythonModeTLS.cpp b/aten/src/ATen/core/PythonModeTLS.cpp index 97892fcf5d3742..2382c77c220e40 100644 --- a/aten/src/ATen/core/PythonModeTLS.cpp +++ b/aten/src/ATen/core/PythonModeTLS.cpp @@ -1,25 +1,26 @@ #include +#include namespace at { namespace impl { -thread_local std::shared_ptr pythonModeState; +thread_local std::shared_ptr pythonModeState; -void PythonModeTLS::set_state(const std::shared_ptr& state) { - pythonModeState = state; +void PythonModeTLS::set_state(std::shared_ptr state) { if (state) { c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, true); } else { PythonModeTLS::reset_state(); } + pythonModeState = std::move(state); } -const std::shared_ptr& PythonModeTLS::get_state() { +const std::shared_ptr& PythonModeTLS::get_state() { return pythonModeState; } void PythonModeTLS::reset_state() { - pythonModeState.reset((TorchDispatchTypeObject*)nullptr); + pythonModeState.reset(); c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, false); } diff --git a/aten/src/ATen/core/PythonModeTLS.h b/aten/src/ATen/core/PythonModeTLS.h index be52b182c659b2..9794090de1715b 100644 --- a/aten/src/ATen/core/PythonModeTLS.h +++ b/aten/src/ATen/core/PythonModeTLS.h @@ -8,8 +8,8 @@ namespace at { namespace impl { struct TORCH_API PythonModeTLS { - static void set_state(const std::shared_ptr& state); - static const std::shared_ptr& get_state(); + static void set_state(std::shared_ptr state); + static const std::shared_ptr& get_state(); static void reset_state(); }; diff --git a/aten/src/ATen/core/QuantizerBase.h b/aten/src/ATen/core/QuantizerBase.h index e11d8d6e049c16..922ea8a38f50d0 100644 --- a/aten/src/ATen/core/QuantizerBase.h +++ 
b/aten/src/ATen/core/QuantizerBase.h @@ -55,7 +55,7 @@ struct TORCH_API Quantizer : public c10::intrusive_ptr_target { */ virtual QScheme qscheme() const = 0; - ScalarType scalar_type() { + ScalarType scalar_type() const { return scalar_type_; } @@ -77,7 +77,7 @@ struct TORCH_API Quantizer : public c10::intrusive_ptr_target { /** * Compare against `other` for equality. */ - virtual bool equalTo(QuantizerPtr other) = 0; + virtual bool equalTo(QuantizerPtr other) const = 0; }; } // namespace at diff --git a/aten/src/ATen/core/SymInt.h b/aten/src/ATen/core/SymInt.h new file mode 100644 index 00000000000000..5cebf357dbfd83 --- /dev/null +++ b/aten/src/ATen/core/SymInt.h @@ -0,0 +1,60 @@ +#pragma once + +#include +#include + +namespace c10 { + +// `SymInt` is a C++ wrapper class around int64_t data_ which is used to +// represent concrete dimension values. +// +// `SymInt` is also a data type in PyTorch that can be used in function schemas +// to enable tracing. +// +// `SymInt` is introduced to enable tracing arithmetic +// operations on symbolic integers (e.g. sizes). Tracing symbolic sizes will +// allow LTC and AOTAutograd to represent dynamic shapes in expression graphs +// faithfully without baking in concrete dimension values. +// +// To trace the operations, SymInt will overload arithmetic operators (e.g. +, -, *) +// and will provide overloads taking SymInt for commonly used math functions. +// +// SymInt will be extended to represent a union structure Union[int64_t, SymbolicIntNode*] +// which will be implemented as a single packed int64_t field named data_. +// +// data_ can be either a plain int64_t or (1 << 63 | `index`). `index` points to +// SymbolicIntNode* that will be responsible for constructing an IR node for +// a traced operation to represent it in LTC or Fx graphs. +class TORCH_API SymInt { + public: + SymInt(int64_t d): + data_(d) {}; + + int64_t expect_int() const { + // we are dealing with concrete ints only for now + return data_; + } + + bool is_symbolic() const { + return false; + } + + bool operator==(const SymInt& p2) const + { + return data_ == p2.data_; + } + + SymInt operator+(SymInt sci) const { + return data_ + sci.data_; + } + + int64_t data() const { + return data_; + } + + private: + int64_t data_; +}; + +TORCH_API std::ostream& operator<<(std::ostream& os, SymInt s); +} diff --git a/aten/src/ATen/core/TensorBase.h b/aten/src/ATen/core/TensorBase.h index 0af9513eaa57f7..0ba95383f4447b 100644 --- a/aten/src/ATen/core/TensorBase.h +++ b/aten/src/ATen/core/TensorBase.h @@ -156,15 +156,17 @@ class TORCH_API TensorBase { } int64_t size(int64_t dim) const { + const auto sizes = this->sizes(); + const auto ndim = static_cast(sizes.size()); // false is passed to maybe_wrap_dim so behavior is identical to array access (but with wrapping) - dim = c10::maybe_wrap_dim(dim, this->dim(), false); - return sizes()[dim]; + return sizes[c10::maybe_wrap_dim(dim, ndim, /*wrap_scalar=*/false)]; } int64_t stride(int64_t dim) const { + const auto strides = this->strides(); + const auto ndim = static_cast(strides.size()); // false is passed to maybe_wrap_dim so behavior is identical to array access (but with wrapping) - dim = c10::maybe_wrap_dim(dim, this->dim(), false); - return strides()[dim]; + return strides[c10::maybe_wrap_dim(dim, ndim, /*wrap_scalar=*/false)]; } TensorImpl * unsafeGetTensorImpl() const { @@ -370,6 +372,12 @@ class TORCH_API TensorBase { return impl_->is_cuda(); } + /// Returns if a `Tensor` has IPU backend. 
+ bool is_ipu() const { + // NB: this is not a native function to avoid dispatching overhead. + return impl_->is_ipu(); + } + /// Returns if a `Tensor` has XPU backend. bool is_xpu() const { // NB: this is not a native function to avoid dispatching overhead. @@ -462,6 +470,11 @@ class TORCH_API TensorBase { return impl_->is_inference(); } + // Returns if a `Tensor` is a NestedTensor. + bool is_nested() const { + return impl_->is_nested(); + } + /// If a tensor is a quantized tensor, returns its quantizer /// TODO: it's not in native_functions.yaml yet as it's not exposed to python QuantizerPtr quantizer() const; diff --git a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h index f48246c02fd682..87c5c33bdeeabf 100644 --- a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h +++ b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h @@ -180,6 +180,13 @@ namespace impl { "You tried to register a kernel with an unsupported input type: ArrayRef. Please use List, List or Tensor instead."); }; + template + struct assert_is_valid_input_type, AllowDeprecatedTypes> + : assert_is_valid_input_type { + static_assert(!std::is_same::value, + "You tried to register a kernel with an unsupported input type: OptionalArrayRef. Please use List, List or Tensor instead."); + }; + template struct assert_is_valid_input_type, AllowDeprecatedTypes> : assert_is_valid_input_type { @@ -233,6 +240,10 @@ namespace impl { struct assert_is_valid_output_type, AllowDeprecatedTypes> : assert_is_valid_output_type {}; + template + struct assert_is_valid_output_type, AllowDeprecatedTypes> + : assert_is_valid_output_type {}; + template struct assert_is_valid_output_type, AllowDeprecatedTypes> : assert_is_valid_output_type { @@ -361,13 +372,24 @@ namespace impl { template struct ivalue_to_arg>, AllowDeprecatedTypes> final { // If an argument is optional>, convert the IValue to an optional> and pass that - // to the operator. OptionalArray is basically a optional> but impliticly convertible + // to the operator. OptionalArray is basically a optional> but implicitly convertible // to optional>. static OptionalArray call(IValue& v) { return ivalue_to_arg, AllowDeprecatedTypes>::call(v); } }; + template + struct ivalue_to_arg, AllowDeprecatedTypes> final { + // If an argument is OptionalArrayRef, convert the IValue to an + // optional> and pass that to the operator. 
OptionalArray + // is basically a optional> but implicitly convertible to + // OptionalArrayRef + static OptionalArray call(IValue& v) { + return ivalue_to_arg, AllowDeprecatedTypes>::call(v); + } + }; + // return_to_ivalue template struct return_to_ivalue final {}; diff --git a/aten/src/ATen/core/builtin_function.h b/aten/src/ATen/core/builtin_function.h index 3c6fd0c77cadf1..6f1e9e75ea3e29 100644 --- a/aten/src/ATen/core/builtin_function.h +++ b/aten/src/ATen/core/builtin_function.h @@ -62,7 +62,7 @@ struct BuiltinOpFunction : public Function { return *this; } - bool call(Stack& stack, size_t, c10::function_ref) override { + bool call(Stack& stack, c10::optional, c10::function_ref) override { run(stack); return false; } diff --git a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.cpp b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.cpp index a930edc2db6328..9180d0d19e6449 100644 --- a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.cpp +++ b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.cpp @@ -6,11 +6,52 @@ namespace c10 { void DispatchKeyExtractor::setOperatorHasFallthroughForKey(DispatchKey k, bool has_fallthrough) { + // (1) update nonFallthroughKeys_ if (has_fallthrough) { nonFallthroughKeys_ = nonFallthroughKeys_.remove(k); } else { nonFallthroughKeys_ = nonFallthroughKeys_.add(k); } + // (2) update nonFallthroughKeysPerBackend_ + if (isPerBackendFunctionalityKey(toFunctionalityKey(k))) { + // This is a per-backend functionality key. + // We need to figure out what the current backend is, + // and only update the bitset for that backend. + // subtracting 1 because the first backend should have index 0 (CPU), + // But the enum starts with BackendComponent::InvalidBit. + auto backend_idx = static_cast(toBackendComponent(k)) - 1; + TORCH_INTERNAL_ASSERT(backend_idx >= 0 && static_cast(backend_idx) < nonFallthroughKeysPerBackend_.size()); + if (has_fallthrough) { + nonFallthroughKeysPerBackend_[backend_idx] = nonFallthroughKeysPerBackend_[backend_idx].remove(k); + } else { + nonFallthroughKeysPerBackend_[backend_idx] = nonFallthroughKeysPerBackend_[backend_idx].add(k); + } + + // Set requiresBitsetPerBackend_ accordingly + for (const auto i : c10::irange(nonFallthroughKeysPerBackend_.size() - 1)) { + if (nonFallthroughKeysPerBackend_[i] != nonFallthroughKeysPerBackend_[i+1]) { + requiresBitsetPerBackend_ = true; + return; + } + } + requiresBitsetPerBackend_ = false; + return; + } else { + // Otherwise, if a fallthrough is set for a functionality that isn't per backend, + // Then we update the fallthrough bitset for EVERY backend. 
+ // TODO: we could probably optimize this by only lazily updating these values + // the first time that we see requiresBitsetPerBackend_ = true + // (which should almost never happen) + if (has_fallthrough) { + for (const auto i : c10::irange(nonFallthroughKeysPerBackend_.size())) { + nonFallthroughKeysPerBackend_[i] = nonFallthroughKeysPerBackend_[i].remove(k); + } + } else { + for (const auto i : c10::irange(nonFallthroughKeysPerBackend_.size())) { + nonFallthroughKeysPerBackend_[i] = nonFallthroughKeysPerBackend_[i].add(k); + } + } + } } std::string DispatchKeyExtractor::dumpState() const { diff --git a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h index 53e348d6b99ea9..d5345b28e7149f 100644 --- a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h +++ b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h @@ -156,14 +156,24 @@ struct TORCH_API DispatchKeyExtractor final { } }); // Keys that are fallthrough should be skipped - return impl::computeDispatchKeySet(ks, nonFallthroughKeys_); + if (requiresBitsetPerBackend_) { + auto backend_idx = ks.getBackendIndex(); + return impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); + } else { + return impl::computeDispatchKeySet(ks, nonFallthroughKeys_); + } } template DispatchKeySet getDispatchKeySetUnboxed(const Args&... args) const { auto ks = detail::multi_dispatch_key_set(args...); // Keys that are fallthrough should be skipped - return impl::computeDispatchKeySet(ks, nonFallthroughKeys_); + if (requiresBitsetPerBackend_) { + auto backend_idx = ks.getBackendIndex(); + return impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); + } else { + return impl::computeDispatchKeySet(ks, nonFallthroughKeys_); + } } void setOperatorHasFallthroughForKey(DispatchKey k, bool has_fallthrough); @@ -193,7 +203,12 @@ struct TORCH_API DispatchKeyExtractor final { explicit DispatchKeyExtractor(c10::utils::bitset dispatch_arg_indices_reverse) : dispatch_arg_indices_reverse_(dispatch_arg_indices_reverse) - , nonFallthroughKeys_(DispatchKeySet::FULL) {} + , nonFallthroughKeys_(DispatchKeySet::FULL) + , requiresBitsetPerBackend_(false) { + for (const auto i : c10::irange(nonFallthroughKeysPerBackend_.size())) { + nonFallthroughKeysPerBackend_[i] = DispatchKeySet::FULL; + } + } // this is a bitset that has ones for each argument index which has to be // considered for dispatch. This avoids having to iterate over the stack @@ -205,8 +220,14 @@ struct TORCH_API DispatchKeyExtractor final { // fallthrough c10::utils::bitset dispatch_arg_indices_reverse_; - // Set of keys for which the operator does NOT have fallthrough kernel. + // Set of functionality keys for which the operator does NOT have fallthrough kernel. DispatchKeySet nonFallthroughKeys_; + // Set of functionality keys for which the operator does NOT have fallthrough kernel, defined PER BACKEND. + // This is only needed if we know that the operator has a different set of fallthroughs defined for some backends. 
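  // Illustrative example (not from the patch): if an operator registers a
  // fallthrough for DispatchKey::AutogradCUDA but a real kernel for
  // DispatchKey::AutogradCPU, the Autograd bit must be skipped for CUDA
  // arguments only; the per-backend bitsets then differ and
  // requiresBitsetPerBackend_ is set to true.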
+ std::array nonFallthroughKeysPerBackend_; + // Flag to tell us if we can use the single set of nonFallthroughKeys_ (fast path), + // or if we need to fall back to the slower path and check nonFallthroughKeysPerBackend_ + bool requiresBitsetPerBackend_; }; } diff --git a/aten/src/ATen/core/dispatch/Dispatcher.cpp b/aten/src/ATen/core/dispatch/Dispatcher.cpp index 3dccc4645a824c..86960634e46133 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.cpp +++ b/aten/src/ATen/core/dispatch/Dispatcher.cpp @@ -267,14 +267,16 @@ void Dispatcher::cleanup(const OperatorHandle& op, const OperatorName& op_name) RegistrationHandleRAII Dispatcher::registerFallback(DispatchKey dispatchKey, KernelFunction kernel, std::string debug) { std::lock_guard lock(mutex_); + auto idx = getDispatchTableIndexForDispatchKey(dispatchKey); + TORCH_CHECK(idx >= 0 && static_cast(idx) < backendFallbackKernels_.size(), "idx=", idx); TORCH_CHECK( - !backendFallbackKernels_[static_cast(dispatchKey)].kernel.isValid(), + !backendFallbackKernels_[idx].kernel.isValid(), "Tried to register multiple backend fallbacks for the same dispatch key ", dispatchKey, "; previous registration ", - backendFallbackKernels_[static_cast(dispatchKey)].debug, ", new registration ", debug + backendFallbackKernels_[idx].debug, ", new registration ", debug ); // NB: inferred function schema is always nullptr for fallbacks, as fallbacks // cannot be unobxed - backendFallbackKernels_[static_cast(dispatchKey)] = impl::AnnotatedKernel(std::move(kernel), nullptr, std::move(debug)); + backendFallbackKernels_[idx] = impl::AnnotatedKernel(std::move(kernel), nullptr, std::move(debug)); for (auto& op : operators_) { op.op.updateFallback(*this, dispatchKey); @@ -288,7 +290,8 @@ RegistrationHandleRAII Dispatcher::registerFallback(DispatchKey dispatchKey, Ker void Dispatcher::deregisterFallback_(DispatchKey dispatchKey) { std::lock_guard lock(mutex_); - backendFallbackKernels_[static_cast(dispatchKey)] = {}; + auto idx = getDispatchTableIndexForDispatchKey(dispatchKey); + backendFallbackKernels_[idx] = {}; for (auto& op : operators_) { op.op.updateFallback(*this, dispatchKey); diff --git a/aten/src/ATen/core/dispatch/Dispatcher.h b/aten/src/ATen/core/dispatch/Dispatcher.h index 14ffa2f94c9c8c..8108c3c1928b81 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.h +++ b/aten/src/ATen/core/dispatch/Dispatcher.h @@ -291,7 +291,7 @@ class TORCH_API Dispatcher final { // Map from namespace to debug string (saying, e.g., where the library was defined) ska::flat_hash_map libraries_; - std::array(DispatchKey::NumDispatchKeys)> backendFallbackKernels_; + std::array backendFallbackKernels_; std::unique_ptr listeners_; std::mutex mutex_; @@ -531,8 +531,7 @@ C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorH detail::unused_arg_(args...); // workaround for a false-positive warning about unused parameters in gcc 5 auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor() .template getDispatchKeySetUnboxed(args...); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId())); - const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId()); + const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet); #ifndef PYTORCH_DISABLE_PER_OP_PROFILING // By default, when there're no high-frequency or non-sampled callbacks, // RecordFunction is pre-sampled as a perf optimization; @@ -553,7 +552,7 @@ template inline Return Dispatcher::redispatch(const 
TypedOperatorHandle& op, DispatchKeySet currentDispatchKeySet, Args... args) const { detail::unused_arg_(args...); // workaround for a false-positive warning about unused parameters in gcc 5 // do not use RecordFunction on redispatch - const KernelFunction& kernel = op.operatorDef_->op.lookup(currentDispatchKeySet.highestPriorityTypeId()); + const KernelFunction& kernel = op.operatorDef_->op.lookup(currentDispatchKeySet); return kernel.template call(op, currentDispatchKeySet, std::forward(args)...); } @@ -561,7 +560,7 @@ inline void Dispatcher::callBoxed(const OperatorHandle& op, Stack* stack) const // note: this doesn't need the mutex because write operations on the list keep iterators intact. const auto& entry = op.operatorDef_->op; auto dispatchKeySet = entry.dispatchKeyExtractor().getDispatchKeySetBoxed(stack); - const auto& kernel = entry.lookup(dispatchKeySet.highestPriorityTypeId()); + const auto& kernel = entry.lookup(dispatchKeySet); #ifndef PYTORCH_DISABLE_PER_OP_PROFILING bool pre_sampled = false; if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) { @@ -593,7 +592,7 @@ inline void Dispatcher::callBoxed(const OperatorHandle& op, Stack* stack) const inline void Dispatcher::redispatchBoxed(const OperatorHandle& op, DispatchKeySet dispatchKeySet, Stack* stack) const { // note: this doesn't need the mutex because write operations on the list keep iterators intact. const auto& entry = op.operatorDef_->op; - const auto& kernel = entry.lookup(dispatchKeySet.highestPriorityTypeId()); + const auto& kernel = entry.lookup(dispatchKeySet); return kernel.callBoxed(op, dispatchKeySet, stack); } diff --git a/aten/src/ATen/core/dispatch/ObservedOperators.cpp b/aten/src/ATen/core/dispatch/ObservedOperators.cpp index 1d1ed4c1926a48..65545a221f9cb8 100644 --- a/aten/src/ATen/core/dispatch/ObservedOperators.cpp +++ b/aten/src/ATen/core/dispatch/ObservedOperators.cpp @@ -15,6 +15,7 @@ std::unordered_set& ObservedOperators::getUnobservedOperatorList() "aten::_version", "aten::is_complex", "profiler::_record_function_enter", + "profiler::_record_function_enter_new", "profiler::_record_function_exit", }; return not_observed_ops; diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.cpp b/aten/src/ATen/core/dispatch/OperatorEntry.cpp index d4d997fde69aef..d5cc6d45933fa2 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.cpp +++ b/aten/src/ATen/core/dispatch/OperatorEntry.cpp @@ -283,7 +283,10 @@ std::pair OperatorEntry::computeDispatchTab } // 3. Backend fallback - auto dispatch_ix = static_cast(dispatch_key); + auto dispatch_ix = getDispatchTableIndexForDispatchKey(dispatch_key); + if (dispatch_ix < 0) { + return {missingKernel(), "backend fallback not registered on mobile"}; + } if (dispatcher.backendFallbackKernels_[dispatch_ix].kernel.isValid()) { return {dispatcher.backendFallbackKernels_[dispatch_ix], "backend fallback"}; } @@ -299,7 +302,7 @@ std::pair OperatorEntry::computeDispatchTab // or alias keys and their associated keysets). 
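Editor's note: several hunks above and below replace static_cast indexing with getDispatchTableIndexForDispatchKey, which returns -1 for keys that get no slot in this build (notably trimmed mobile builds). A rough standalone sketch of that sparse-key-to-dense-index pattern, with hypothetical key names rather than the real c10 mapping:

#include <array>

enum class Key { CPU, CUDA, XLA, Autograd };

// Only a subset of keys gets a slot in this (hypothetical) build's dispatch table.
inline int table_index_for(Key k) {
  switch (k) {
    case Key::CPU:      return 0;
    case Key::Autograd: return 1;
    default:            return -1;   // the key exists, but this build never dispatches on it
  }
}

template <typename Kernel>
const Kernel* lookup(const std::array<Kernel, 2>& table, Key k) {
  const int idx = table_index_for(k);
  if (idx < 0) {
    return nullptr;   // caller reports e.g. "backend fallback not registered on mobile"
  }
  return &table[idx];
}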
// This function should be considered a private helper for updateDispatchTable_() void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) { - const auto dispatch_ix = c10::getDispatchTableIndexForDispatchKey(dispatch_key); + const auto dispatch_ix = getDispatchTableIndexForDispatchKey(dispatch_key); if (C10_UNLIKELY(dispatch_ix == -1)) { return; } @@ -329,8 +332,12 @@ void OperatorEntry::updateDispatchTable_(const c10::Dispatcher& dispatcher, Disp } // Note [Refresh Runtime Autograd entries in dispatchTable_] // Registering to backend key might affect computed entry at its Autograd backend key due to (2.1) & (2.3). + // In theory, we should only have to check if the given runtime key has "dense" functionality, + // e.g. DispatchKey::CPU (which is composed of DispatchKey::Dense and BackendComponent::CPUBit). + // However, there are some backends that should be included in this set that don't have the dense key set. + // E.g. DispatchKey::Meta, DispatchKey::ORT. if (c10::isBackendDispatchKey(dispatch_key)) { - DispatchKey autograd_key = getAutogradKeyFromBackend(dispatch_key); + DispatchKey autograd_key = getAutogradKeyFromBackend(toBackendComponent(dispatch_key)); updateDispatchTableEntry_(dispatcher, autograd_key); } } @@ -357,8 +364,9 @@ void OperatorEntry::updateDispatchTableFull_(const c10::Dispatcher& dispatcher) // catchAll. After catchAllKernel_ is removed, Undefined now can get a kernel from either CompositeExplicitAutograd // or CompositeImplicitAutograd alias key so that we don't break the support. Ideally isIncludedInAlias(Undefined, CompositeImplicitAutograd) // should return true, it returns false because Undefined cannot be represented in a DispatchKeySet. - for (uint8_t iter = 0; iter != static_cast(DispatchKey::NumDispatchKeys); ++iter) { - updateDispatchTable_(dispatcher, static_cast(iter)); + updateDispatchTable_(dispatcher, DispatchKey::Undefined); + for (auto k : DispatchKeySet(DispatchKeySet::FULL)) { + updateDispatchTable_(dispatcher, k); } } @@ -371,9 +379,13 @@ void OperatorEntry::checkInvariants() const { for (const auto& kv : kernels_) { TORCH_INTERNAL_ASSERT(kv.second.size() > 0, dumpState()); } - for (uint8_t iter = 0; iter != static_cast(DispatchKey::NumDispatchKeys); ++iter) { - auto expected_k = computeDispatchTableEntry(c10::Dispatcher::singleton(), static_cast(iter)); - TORCH_INTERNAL_ASSERT(expected_k._equalsBoxedAndUnboxed(dispatchTable_[iter]), + for (auto k : DispatchKeySet(DispatchKeySet::FULL)) { + auto expected_k = computeDispatchTableEntry(c10::Dispatcher::singleton(), k); + auto idx = getDispatchTableIndexForDispatchKey(k); + if (C10_UNLIKELY(idx == -1)) { + continue; + } + TORCH_INTERNAL_ASSERT(expected_k._equalsBoxedAndUnboxed(dispatchTable_[idx]), "Canonical state\n~~~~~~~~~~~\n", dumpState(), "\n\n" "Computed table:\n~~~~~~~~~~~\n", dumpComputedTable()); } @@ -384,8 +396,9 @@ std::string OperatorEntry::listAllDispatchKeys() const { str << "["; bool has_kernels = false; - for (uint8_t iter = 0; iter != static_cast(DispatchKey::NumDispatchKeys); ++iter) { - if (!dispatchTable_[iter].isValid()) { + for (auto k : DispatchKeySet(DispatchKeySet::FULL)) { + auto iter = getDispatchTableIndexForDispatchKey(k); + if (iter == -1 || !dispatchTable_[iter].isValid()) { continue; } if (has_kernels) { @@ -443,8 +456,12 @@ void OperatorEntry::reportError(DispatchKey dispatchKey) const { // updateDispatchTableFull_ would update the dispatch table to be) std::string OperatorEntry::dumpComputedTable() const { 
std::ostringstream oss; - for (uint8_t i = 0; i < static_cast(DispatchKey::NumDispatchKeys); i++) { - auto k = static_cast(i); + // Need to handle Undefined separately, because its a runtime key that can't be represented + // in a DispatchKeySet. + std::vector runtime_keys = {DispatchKey::Undefined}; + for (auto k : DispatchKeySet(DispatchKeySet::FULL)) runtime_keys.push_back(k); + + for (auto k : runtime_keys) { auto kernel_prov = computeDispatchTableEntryWithDebug(c10::Dispatcher::singleton(), k); if (kernel_prov.first.kernel.isValid()) { oss << toString(k) << ": " diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.h b/aten/src/ATen/core/dispatch/OperatorEntry.h index d98bd6bc69041a..c0f90808280a8e 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.h +++ b/aten/src/ATen/core/dispatch/OperatorEntry.h @@ -173,10 +173,10 @@ class TORCH_API OperatorEntry final { [[noreturn]] void reportError(DispatchKey dispatchKey) const; - const KernelFunction& lookup(DispatchKey k) const { - const auto idx = getDispatchTableIndexForDispatchKey(k); + const KernelFunction& lookup(DispatchKeySet ks) const { + const auto idx = ks.getDispatchTableIndexForDispatchKeySet(); if (C10_UNLIKELY(idx == -1)) { - reportError(k); + reportError(ks.highestPriorityTypeId()); } const auto& kernel = dispatchTable_[idx]; // A valid kernel *always* has a boxed kernel and *may* have an @@ -187,7 +187,7 @@ class TORCH_API OperatorEntry final { // in the common case. if (C10_UNLIKELY(!kernel.isValidUnboxed())) { if (!kernel.isValid()) { - reportError(k); + reportError(ks.highestPriorityTypeId()); } } return kernel; @@ -211,7 +211,7 @@ class TORCH_API OperatorEntry final { OperatorName name_; c10::optional schema_; - std::array dispatchTable_; + std::array dispatchTable_; DispatchKeyExtractor dispatchKeyExtractor_; // kernels_ stores all registered kernels for the corresponding dispatch key diff --git a/aten/src/ATen/core/dynamic_type.cpp b/aten/src/ATen/core/dynamic_type.cpp index 95050da593eb01..051b859d98158a 100644 --- a/aten/src/ATen/core/dynamic_type.cpp +++ b/aten/src/ATen/core/dynamic_type.cpp @@ -227,6 +227,8 @@ TypePtr DynamicType::fallback() const { return BoolType::get(); case Tag::Int: return IntType::get(); + case Tag::SymInt: + return SymIntType::get(); case Tag::Float: return FloatType::get(); case Tag::Complex: @@ -320,6 +322,8 @@ DynamicType::Ptr IValue::TagType::get(const c10::IValue& v) { return DynamicTypeTrait::getBaseType(); case Tag::Int: return DynamicTypeTrait::getBaseType(); + case Tag::SymInt: + return DynamicTypeTrait::getBaseType(); case Tag::Bool: return DynamicTypeTrait::getBaseType(); case Tag::String: diff --git a/aten/src/ATen/core/dynamic_type.h b/aten/src/ATen/core/dynamic_type.h index d5551c9a5e511c..7be10d810e42a1 100644 --- a/aten/src/ATen/core/dynamic_type.h +++ b/aten/src/ATen/core/dynamic_type.h @@ -16,6 +16,7 @@ constexpr DynamicTypeBits kDynamicAnyTypeBit = DYNAMIC_TYPE_BIT(30); constexpr DynamicTypeBits kDynamicNoneTypeBit = DYNAMIC_TYPE_BIT(1); constexpr DynamicTypeBits kDynamicIntTypeBit = DYNAMIC_TYPE_BIT(3); +constexpr DynamicTypeBits kDynamicSymIntTypeBit = DYNAMIC_TYPE_BIT(23); constexpr DynamicTypeBits kDynamicFloatTypeBit = DYNAMIC_TYPE_BIT(4); constexpr DynamicTypeBits kDynamicComplexTypeBit = DYNAMIC_TYPE_BIT(5); constexpr DynamicTypeBits kDynamicListTypeBit = DYNAMIC_TYPE_BIT(7); @@ -28,6 +29,7 @@ constexpr DynamicTypeBits kDynamicClassTypeBit = DYNAMIC_TYPE_BIT(10); _(Bool, DYNAMIC_TYPE_BIT(2), 1) \ _(Int, kDynamicIntTypeBit, 1) \ _(Float, kDynamicFloatTypeBit, 1) \ 
+ _(SymInt, kDynamicSymIntTypeBit, 1) \ _(Complex, kDynamicComplexTypeBit, 1) \ _(Number, \ (kDynamicIntTypeBit | kDynamicFloatTypeBit | kDynamicComplexTypeBit), \ @@ -159,7 +161,7 @@ class DynamicType : public SharedType { const Arguments& arguments() const { return arguments_; } - TypeKind dynamicKind() const; + TORCH_API TypeKind dynamicKind() const; // Should be used only on the server side to restore static type information. #ifndef C10_MOBILE diff --git a/aten/src/ATen/core/function.h b/aten/src/ATen/core/function.h index b0c02041affcbb..881efb1a4ff046 100644 --- a/aten/src/ATen/core/function.h +++ b/aten/src/ATen/core/function.h @@ -90,7 +90,7 @@ struct TORCH_API Function { // call() returns false. // Overload for server interpreter, a bailout size is needed for graph executor. - virtual bool call(Stack&, size_t, c10::function_ref) { + virtual bool call(Stack&, c10::optional, c10::function_ref) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false); return false; } diff --git a/aten/src/ATen/core/interned_strings.h b/aten/src/ATen/core/interned_strings.h index 88f275093d1e93..46b43aecce2850 100644 --- a/aten/src/ATen/core/interned_strings.h +++ b/aten/src/ATen/core/interned_strings.h @@ -64,6 +64,8 @@ namespace c10 { _(prim, PadPacked) /* onnx */ \ _(prim, Placeholder) /* debug */ \ _(prim, Print) \ + _(prim, EmptyListLiteral) \ + _(prim, LegacyTypedConstructor) \ _(prim, PythonOp) \ _(prim, IgnoredPythonOp) \ _(prim, Reverse) \ @@ -107,7 +109,6 @@ namespace c10 { _(aten, Complex) \ _(aten, str) \ _(aten, Delete) \ - _(aten, gelu_) \ _(prim, device) \ _(prim, dtype) \ _(prim, layout) \ @@ -302,6 +303,7 @@ namespace c10 { _(attr, transA) \ _(attr, transB) \ _(attr, name) \ + _(attr, module) \ _(attr, beg) \ _(attr, idx) \ _(attr, split) \ diff --git a/aten/src/ATen/core/ivalue.cpp b/aten/src/ATen/core/ivalue.cpp index 85117e345e30fa..cd980e84df1698 100644 --- a/aten/src/ATen/core/ivalue.cpp +++ b/aten/src/ATen/core/ivalue.cpp @@ -91,6 +91,8 @@ c10::TypePtr IValue::TagType::get(const IValue& v) { return ComplexType::get(); case Tag::Int: return IntType::get(); + case Tag::SymInt: + return c10::SymIntType::get(); case Tag::Bool: return BoolType::get(); case Tag::String: @@ -298,6 +300,8 @@ IValue IValue::equals(const IValue& rhs) const { return rhs.isComplexDouble() && lhs.toComplexDouble() == rhs.toComplexDouble(); case Tag::Int: return rhs.isInt() && lhs.toInt() == rhs.toInt(); + case Tag::SymInt: + return rhs.isSymInt() && lhs.toSymInt() == rhs.toSymInt(); case Tag::Bool: return rhs.isBool() && lhs.toBool() == rhs.toBool(); case Tag::String: @@ -349,6 +353,8 @@ size_t IValue::hash(const IValue& v) { return c10::get_hash(v.payload.u.as_int); case Tag::Int: return c10::get_hash(v.payload.u.as_int); + case Tag::SymInt: + return c10::get_hash(v.payload.u.as_int); case Tag::String: return c10::get_hash(v.toStringRef()); case Tag::Tuple: @@ -567,6 +573,8 @@ std::ostream& IValue::repr( } case IValue::Tag::Int: return out << v.toInt(); + case IValue::Tag::SymInt: + return out << v.toSymInt(); case IValue::Tag::Bool: return out << (v.toBool() ? "True" : "False"); case IValue::Tag::Tuple: { @@ -753,6 +761,8 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) { return printComplex(out, v); } case IValue::Tag::Int: return out << v.toInt(); + case IValue::Tag::SymInt: + return out << v.toSymInt(); case IValue::Tag::Bool: return out << (v.toBool() ? 
"True" : "False"); case IValue::Tag::Tuple: { @@ -886,6 +896,7 @@ IValue IValue::deepcopy( case IValue::Tag::None: case IValue::Tag::Double: case IValue::Tag::Int: + case IValue::Tag::SymInt: case IValue::Tag::Bool: case IValue::Tag::Device: case IValue::Tag::Uninitialized: { diff --git a/aten/src/ATen/core/ivalue.h b/aten/src/ATen/core/ivalue.h index 81867348450d48..dbb4f08739ff8c 100644 --- a/aten/src/ATen/core/ivalue.h +++ b/aten/src/ATen/core/ivalue.h @@ -92,12 +92,29 @@ struct OptionalArray { return *this; } + // Used when saving an argument for the backwards pass. + OptionalArray& operator=(c10::OptionalArrayRef ref) { + if (ref) { + list = std::vector(ref->begin(), ref->end()); + } else { + list = nullopt; + } + return *this; + } + operator c10::optional>() { if (!list) { return nullopt; } return *list; } + + operator c10::OptionalArrayRef() { + if (!list) { + return nullopt; + } + return *list; + } }; // Capsule is an internal implementation detail of custom C++ classes. We @@ -127,6 +144,7 @@ struct Capsule { _(Double) \ _(ComplexDouble) \ _(Int) \ + _(SymInt) \ _(Bool) \ _(Tuple) \ _(String) \ @@ -543,6 +561,18 @@ struct TORCH_API IValue final { payload.u.as_int = i; } + IValue(c10::SymInt i) : tag(Tag::SymInt), is_intrusive_ptr(false) { + payload.u.as_int = i.data(); + } + + bool isSymInt() const { + return Tag::SymInt == tag; + } + + c10::SymInt toSymInt() const { + return c10::SymInt(payload.u.as_int); + } + // allow you to pass literals (3, 4) without ambiguity IValue(int32_t i) : IValue(static_cast(i)) {} @@ -666,6 +696,8 @@ struct TORCH_API IValue final { template = nullptr> IValue(c10::optional v); + template = nullptr> + IValue(c10::OptionalArrayRef v); IValue(c10::nullopt_t); // ClassType diff --git a/aten/src/ATen/core/ivalue_inl.h b/aten/src/ATen/core/ivalue_inl.h index 24e904e4444e52..57d9ed8d5ed330 100644 --- a/aten/src/ATen/core/ivalue_inl.h +++ b/aten/src/ATen/core/ivalue_inl.h @@ -1584,6 +1584,7 @@ DEFINE_TO(at::MemoryFormat, toMemoryFormat) DEFINE_TO(at::QScheme, toQScheme) DEFINE_TO(at::Dimname, toDimname) DEFINE_TO(at::Generator, toGenerator) +DEFINE_TO(c10::SymInt, toSymInt) template struct _fake_type {}; @@ -1981,6 +1982,13 @@ inline IValue::IValue(const std::vector& v) : IValue(c10::List()) { list.push_back(e); } } +template > +inline IValue::IValue(c10::OptionalArrayRef v) : IValue() { + if (v.has_value()) { + *this = IValue(std::move(*v)); + } +} + template inline IValue::IValue(std::array v) : IValue(c10::List()) { auto list = to>(); diff --git a/aten/src/ATen/core/jit_type.h b/aten/src/ATen/core/jit_type.h index cbeb8154774a72..4956ad426fb96c 100644 --- a/aten/src/ATen/core/jit_type.h +++ b/aten/src/ATen/core/jit_type.h @@ -435,6 +435,17 @@ struct TORCH_API SymbolicShape { return dims_; } + c10::optional> symbolicDims() const { + if (!dims_) { + return c10::nullopt; + } + auto symbolic_dims = std::vector(); + for (const ShapeSymbol& s : *dims_) { + symbolic_dims.push_back(!s.is_static()); + } + return symbolic_dims; + } + // Checks whether the shape is fully defined/complete, ie. rank and sizes // of every dimension are known. 
bool isComplete() const { @@ -866,7 +877,11 @@ struct TORCH_API DictType : public SharedType { static const TypeKind Kind = TypeKind::DictType; static DictTypePtr create(TypePtr key, TypePtr value) { - switch (key->kind()) { + auto kind = key->kind(); + if (auto dyn = key->castRaw()) { + kind = dyn->dynamicKind(); + } + switch (kind) { case TypeKind::AnyType: case TypeKind::IntType: case TypeKind::BoolType: @@ -1232,6 +1247,31 @@ struct TORCH_API ComplexType : public NumberType { } }; +// We need to introduce `SymIntType` to represent the `SymInt` type +// used in function schemas e.g. `aten::narrow_copy(... SymInt length) +// `SymInt` will be used to enable tracing arithmetic operations on +// dimension values. Please see [SymInt.h] for more information +struct SymIntType; +using SymIntTypePtr = SingletonTypePtr; +struct TORCH_API SymIntType : public Type { + bool equals(const Type& rhs) const override { + return rhs.kind() == kind(); + } + std::string str() const override { + return "SymInt"; + } + std::string annotation_str_impl(TypePrinter printer = nullptr) const override { + // TODO: will become a Union[SymbolicIntNode|int] in the near future + return "int"; + } + static const TypeKind Kind = TypeKind::SymIntType; + // global singleton + static SymIntTypePtr get(); + + private: + SymIntType() : Type(TypeKind::SymIntType) {} +}; + struct IntType; using IntTypePtr = SingletonTypePtr; // This type represents a Python int number @@ -1693,6 +1733,13 @@ struct getTypePtr_ final { return IntType::get(); } }; + +template <> +struct getTypePtr_ final { + static decltype(auto) call() { + return SymIntType::get(); + } +}; template <> struct getTypePtr_ final { static decltype(auto) call() { @@ -1812,6 +1859,15 @@ struct getTypePtr_> final { return type; } }; + +template<> +struct getTypePtr_ final { + static const auto& call() { + static auto type = OptionalType::create(getTypePtr_::call()); + return type; + } +}; + template struct getTypePtr_> final { static const auto& call() { diff --git a/aten/src/ATen/core/jit_type_base.h b/aten/src/ATen/core/jit_type_base.h index 21a17c9ec6693e..f7a95402ca39ee 100644 --- a/aten/src/ATen/core/jit_type_base.h +++ b/aten/src/ATen/core/jit_type_base.h @@ -6,6 +6,7 @@ #include #include +#include #include #include #include @@ -48,6 +49,7 @@ namespace c10 { _(AnyListType) \ _(AnyTupleType) \ _(AnyClassType) \ + _(SymIntType) \ _(UnionType) \ _(DynamicType) diff --git a/aten/src/ATen/core/library.cpp b/aten/src/ATen/core/library.cpp index ba16a5bf10c129..ba608e98ad53a8 100644 --- a/aten/src/ATen/core/library.cpp +++ b/aten/src/ATen/core/library.cpp @@ -235,6 +235,9 @@ Library& Library::_fallback(CppFunction&& f) & { // Note if dispatch_key is DispatchKey::Undefined, it'll be ignored here since Undefined // isn't a runtime key, you shouldn't register anything to it at all. for (auto k : c10::getRuntimeDispatchKeySet(*dispatch_key)) { + // mobile doesn't use all dispatch keys, so skip any fallback registrations for the unused keys. 
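Editor's note: the loop here expands an alias dispatch key into its runtime keys and now skips any key that gets no dispatch-table slot on this build. A small standalone sketch of that registration-time filtering, with hypothetical key names and a callback in place of the real dispatcher:

#include <vector>

enum class RtKey { AutogradCPU, AutogradCUDA, AutogradXLA };

// Hypothetical expansion of an "Autograd" alias key into runtime keys.
inline std::vector<RtKey> runtime_keys_for_autograd() {
  return {RtKey::AutogradCPU, RtKey::AutogradCUDA, RtKey::AutogradXLA};
}

// -1 models "no dispatch-table slot in this build" (e.g. trimmed out on mobile).
inline int table_index_for(RtKey k) {
  return k == RtKey::AutogradXLA ? -1 : static_cast<int>(k);
}

// Register a fallback only for keys this build can actually dispatch on.
template <typename RegisterFn>
void register_fallback_for_autograd_alias(RegisterFn do_register) {
  for (RtKey k : runtime_keys_for_autograd()) {
    if (table_index_for(k) < 0) {
      continue;   // unused key on this build: nothing to register
    }
    do_register(k);
  }
}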
+ auto idx = getDispatchTableIndexForDispatchKey(k); + if (idx < 0) continue; registrars_.emplace_back( c10::Dispatcher::singleton().registerFallback( k, diff --git a/aten/src/ATen/core/op_registration/op_registration_test.cpp b/aten/src/ATen/core/op_registration/op_registration_test.cpp index 0a3f9236b75522..05294c25548eb1 100644 --- a/aten/src/ATen/core/op_registration/op_registration_test.cpp +++ b/aten/src/ATen/core/op_registration/op_registration_test.cpp @@ -284,7 +284,8 @@ TEST(OperatorRegistrationTest, whenRegisteringMultipleKernelsInSameOpCallAndCall EXPECT_FALSE(called_kernel1); EXPECT_TRUE(called_kernel2); - for (c10::DispatchKey key : {c10::DispatchKey::XLA, c10::DispatchKey::Lazy}) { + // Test for out of tree lazy backends- ::Lazy key is now registered to TS backend in tree + for (c10::DispatchKey key : {c10::DispatchKey::XLA}) { std::string expectMessage = expectedMessageForBackend(key); expectThrows([&] { callOp(*op, dummyTensor(key)); @@ -591,7 +592,7 @@ TEST(OperatorRegistrationTest, AutogradBackendOverridesAutogradKernel) { void LazyBackendsAutogradOverridesAutogradKernel(DispatchKey key) { auto registrar = c10::RegisterOperators().op("_test::dummy(Tensor dummy) -> ()", c10::RegisterOperators::options() - .kernel(c10::getAutogradKeyFromBackend(key)) + .kernel(c10::getAutogradKeyFromBackend(toBackendComponent(key))) .kernel(DispatchKey::Autograd)); auto op = Dispatcher::singleton().findSchema({"_test::dummy", ""}); @@ -613,14 +614,13 @@ void LazyBackendsAutogradOverridesAutogradKernel(DispatchKey key) { EXPECT_FALSE(called_nonautograd); } +// no longer test ::Lazy key here +// since it is now registered to TS backend in-tree and thus behaves differently, +// does not throw the expected 'could not run..' messages TEST(OperatorRegistrationTest, AutogradXLAOverridesAutogradKernel) { LazyBackendsAutogradOverridesAutogradKernel(DispatchKey::XLA); } -TEST(OperatorRegistrationTest, AutogradLazyOverridesAutogradKernel) { - LazyBackendsAutogradOverridesAutogradKernel(DispatchKey::Lazy); -} - void whenRegisterWithLazyBackendsAndCatchAll_AutogradLazyBackendsIsNotFilled(DispatchKey key) { { auto registrar = c10::RegisterOperators().op("_test::dummy(Tensor dummy) -> ()", c10::RegisterOperators::options() @@ -1791,22 +1791,22 @@ TEST(NewOperatorRegistrationTest, dispatchAutogradPrecedence) { TEST(NewOperatorRegistrationTest, throwsWhenRegisterToBackendMapsToAutogradOther) { // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - bool sparsecpu_called, math_called = false; + bool fpga_called, math_called = false; auto m = MAKE_TORCH_LIBRARY(test); - m.def("fn", torch::dispatch(c10::DispatchKey::SparseCPU, [&](const Tensor& x) { sparsecpu_called = true; return x; })); + m.def("fn", torch::dispatch(c10::DispatchKey::FPGA, [&](const Tensor& x) { fpga_called = true; return x; })); m.impl("fn", c10::DispatchKey::CompositeImplicitAutograd, [&](const Tensor& x) { math_called = true; return x; }); auto op = Dispatcher::singleton().findSchema({"test::fn", ""}); ASSERT_TRUE(op.has_value()); { - callOp(*op, dummyTensor(c10::DispatchKey::SparseCPU)); - ASSERT_TRUE(sparsecpu_called); + callOp(*op, dummyTensor(c10::DispatchKey::FPGA)); + ASSERT_TRUE(fpga_called); } { expectThrows([&] { - callOp(*op, dummyTensor(c10::DispatchKey::SparseCPU, /*requires_grad=*/true)); + callOp(*op, dummyTensor(c10::DispatchKey::FPGA, /*requires_grad=*/true)); }, "test::fn has kernels registered to both CompositeImplicitAutograd and a backend mapped to AutogradOther."); } } @@ -1849,18 +1849,15 @@ 
TEST(NewOperatorRegistrationTest, dispatchMultipleTensors) { } { - // TODO(#43908): currently this will fallthrough AutogradPrivateUse1 then call catchall kernel - // at AutogradCPU, while backend extenders are indeed expecting to call PrivateUse1 kernel. - // This confusing behavior is caused by we registering fallthrough as backend fallback for - // Autograd keys. Note users could always work around this by registering the same kernel to - // AutogradPrivateUse1 as shown below until we support it. auto op = Dispatcher::singleton().findOp({"test::fn", ""}); ASSERT_TRUE(op.has_value()); catchall_called = false; + privateuse1_called = false; callOp(*op, dummyTensor(c10::DispatchKey::PrivateUse1, /*requires_grad=*/true), dummyTensor(c10::DispatchKey::CPU, /*requires_grad=*/true)); - ASSERT_TRUE(catchall_called); + ASSERT_FALSE(catchall_called); + ASSERT_TRUE(privateuse1_called); } m.impl("fn", c10::DispatchKey::AutogradPrivateUse1, [&](const Tensor& x, const Tensor& y) { privateuse1_called = true; return x; }); @@ -1876,6 +1873,27 @@ TEST(NewOperatorRegistrationTest, dispatchMultipleTensors) { } } +TEST(NewOperatorRegistrationTest, registerCompositeImplicitAutogradWithCPUKernel_andCallAutogradOtherKernel_callsComposite) { + bool math_called = false; + bool cpu_called = false; + auto m = MAKE_TORCH_LIBRARY(test); + m.def("fn(Tensor dummy) -> Tensor"); + m.impl("fn", c10::DispatchKey::CPU, [&](const Tensor& x) { cpu_called = true; return x; }); + m.impl("fn", c10::DispatchKey::CompositeImplicitAutograd, [&](const Tensor& x) { math_called = true; return x; }); + + auto op = Dispatcher::singleton().findSchema({"test::fn", ""}); + ASSERT_TRUE(op.has_value()); + + { + math_called = cpu_called = false; + // Meta should redispatch to the AutogradOther backend, + // which the composite kernel should be registered to. + callOp(*op, dummyTensor(c10::DispatchKey::Meta, /*requires_grad=*/true)); + ASSERT_TRUE(math_called); + ASSERT_FALSE(cpu_called); + } +} + TEST(NewOperatorRegistrationTest, dispatchMultiple) { bool cpu_called = false; bool cuda_called = false; diff --git a/aten/src/ATen/core/tensor_type.cpp b/aten/src/ATen/core/tensor_type.cpp index cb7b6cc2766753..664aa301f0a463 100644 --- a/aten/src/ATen/core/tensor_type.cpp +++ b/aten/src/ATen/core/tensor_type.cpp @@ -3,6 +3,40 @@ namespace c10 { +namespace { + +// The idea is to only mark possible overlap across dimensions. We want to +// return false for expanded tensors and permuted tensors, for which dimensional +// collapsing is safe. 
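// Editor's note (not part of the patch): two worked cases for the check defined below,
// tracing the same size/stride ordering it uses.
// - Expanded input, sizes {4, 4}, strides {0, 1}: the broadcast dimension sorts first
//   (stride 0), and the test 1 < 4 * 0 is false, so no cross-dimension overlap is
//   reported and dimension collapsing remains allowed.
// - Genuinely overlapping input, sizes {2, 3}, strides {1, 1}: both dimensions step
//   through the same addresses; the test 1 < 3 * 1 holds, so the function returns true
//   and contiguity is not assumed.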
+bool possible_cross_dimension_overlap(c10::IntArrayRef sizes, c10::IntArrayRef strides) { + int n_dim = static_cast(sizes.size()); + std::vector stride_indices(n_dim); + std::iota(stride_indices.rbegin(), stride_indices.rend(), 0); + + // sort indices going with ascending strides + for (int i = 1; i < n_dim; i++) { + auto c = i; + for (int j = i - 1; j >= 0; j--) { + if (strides[stride_indices[j]] > strides[stride_indices[c]]) { + std::swap(stride_indices[j], stride_indices[c]); + c = j; + } + } + } + + for (const auto i : c10::irange(1, n_dim)) { + if (i != 0) { + // we are being conservative on checking for memory overlap + if (sizes[stride_indices[i]] != 1 && strides[stride_indices[i]] < sizes[stride_indices[i-1]] * strides[stride_indices[i-1]]) { + return true; + } + } + } + return false; +} + +} + const TensorTypePtr& TensorType::get() { static auto value = TensorType::create( {}, {}, SymbolicShape(), VaryingShape{}, {}); @@ -115,6 +149,10 @@ VaryingShape TensorType::computeStrideProps( bool tensor_contiguity) { int n_dim = static_cast(sizes.size()); std::vector stride_indices(n_dim); + // default has_overlap to false as we only compute overlap when: + // 1. input sizes/strides fails format check; + // 2. tensor_contiguity are not set. + bool has_overlap = false; // Sorting strides in ascending order // Example: @@ -173,21 +211,35 @@ VaryingShape TensorType::computeStrideProps( } } } + // conveniently is_contiguous_strides/is_contiguous_strides only returns + // true when there's no memory overlap, so we only re-compute has_overlap + // in the last branch when both returns false + if (!tensor_contiguity) { + // trust tensor_contiguity and only computes overlap when it is not set + has_overlap = possible_cross_dimension_overlap(sizes, strides); + } } std::vector stride_properties; + + for (size_t i = 0; i < stride_indices.size(); i++) { bool contiguous_ = tensor_contiguity; if (!contiguous_) { - // innermost stride expected to be 1 - // TODO: turn contiguous_ into an enum CONTIGUOUS, NONCONTIGUOUS, - // BROADCASTED - if (i == 0) { - contiguous_ = strides[stride_indices[i]] == 1; + if (!has_overlap) { + // innermost stride expected to be 1 + // TODO: turn contiguous_ into an enum CONTIGUOUS, NONCONTIGUOUS, + // BROADCASTED + if (i == 0) { + contiguous_ = strides[stride_indices[i]] == 1; + } else { + contiguous_ = strides[stride_indices[i]] == 1 || + (strides[stride_indices[i]] != 0 && + strides[stride_indices[i]] == + strides[stride_indices[i - 1]] * sizes[stride_indices[i - 1]]); + } } else { - contiguous_ = strides[stride_indices[i]] == 1 || - (strides[stride_indices[i]] != 0 && - strides[stride_indices[i]] == - strides[stride_indices[i - 1]] * sizes[stride_indices[i - 1]]); + // leaving this assign statement for readability; + contiguous_ = false; } } stride_properties.emplace_back(stride_indices[i], contiguous_, strides[stride_indices[i]]); diff --git a/aten/src/ATen/core/type.cpp b/aten/src/ATen/core/type.cpp index a3f0451dc61cb9..5d981f31f8a5bb 100644 --- a/aten/src/ATen/core/type.cpp +++ b/aten/src/ATen/core/type.cpp @@ -143,6 +143,11 @@ std::ostream& operator<<(std::ostream & out, const Type & t) { return out; } +std::ostream& operator<<(std::ostream& os, SymInt s) { + os << "SymInt(" << s.data() << ")"; + return os; +} + AnyTypePtr AnyType::get() { static AnyTypePtr value(new AnyType()); return value; @@ -257,6 +262,11 @@ AnyEnumTypePtr AnyEnumType::get() { return value; } +SymIntTypePtr SymIntType::get() { + static SymIntTypePtr value(new SymIntType()); + return value; +} + 
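Editor's note: SymIntType::get() above follows the same pattern as the other singleton type getters in this file: a function-local static constructed on first use and intentionally never destroyed. A minimal sketch of that pattern with a hypothetical type:

struct MyType {
  static MyType* get() {
    // Constructed once on first call; deliberately leaked so the singleton
    // is immune to static destruction order issues at shutdown.
    static MyType* value = new MyType();
    return value;
  }
 private:
  MyType() = default;   // only get() can create the instance
};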
c10::optional unifyTypesImpl(const TypePtr& t1, const TypePtr& t2, bool default_to_union=false, TypePtr type_hint=nullptr) { // check direct subtyping relation if (t1->isSubtypeOf(*t2)) { diff --git a/aten/src/ATen/cuda/Atomic.cuh b/aten/src/ATen/cuda/Atomic.cuh index cd002414687a34..2bd8364ebf8a4a 100644 --- a/aten/src/ATen/cuda/Atomic.cuh +++ b/aten/src/ATen/cuda/Atomic.cuh @@ -298,7 +298,7 @@ static inline __device__ void gpuAtomicAddNoReturn(at::BFloat16 *address, at::BF static inline __device__ void gpuAtomicAddNoReturn(double *address, double val) { gpuAtomicAdd(address, val); } /* Special case fp32 atomic. */ -#if defined(USE_ROCM) && defined(__gfx908__) +#if defined(USE_ROCM) static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { atomicAddNoRet(address, val); } #else static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { gpuAtomicAdd(address, val); } @@ -344,3 +344,83 @@ inline __device__ float gpuAtomicMul (float * address, float val) { return __int_as_float(old); } + +// Atomic maximum implementation. + +inline __device__ at::Half gpuAtomicMax(at::Half * address, at::Half val) { + return AtomicFPOp()(address, val, + [](at::Half bsum, at::Half val) { + return max(bsum, val); + }); +} + +inline __device__ at::BFloat16 gpuAtomicMax(at::BFloat16 * address, at::BFloat16 val) { + return AtomicFPOp()(address, val, + [](at::BFloat16 bsum, at::BFloat16 val) { + return max(bsum, val); + }); +} + +inline __device__ double gpuAtomicMax(double * address, double val) { + return AtomicFPOp()(address, val, + [](double val, unsigned long long int assumed) { + return __double_as_longlong(max(val, __longlong_as_double(assumed))); + }); +} + +// Dont use a templated function for this since the addition function defaults to the CUDA built-in. +inline __device__ float gpuAtomicMax(float * address, float val) { + unsigned int* address_as_ull = (unsigned int*)address; + unsigned int old = *address_as_ull; + unsigned int assumed; + + do { + assumed = old; + old = atomicCAS(address_as_ull, assumed, + __float_as_int(max(val, __int_as_float(assumed)))); + + // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN) + } while (assumed != old); + + return __int_as_float(old); +} + +// Atomic minimum implementation. + +inline __device__ at::Half gpuAtomicMin(at::Half * address, at::Half val) { + return AtomicFPOp()(address, val, + [](at::Half bsum, at::Half val) { + return min(bsum, val); + }); +} + +inline __device__ at::BFloat16 gpuAtomicMin(at::BFloat16 * address, at::BFloat16 val) { + return AtomicFPOp()(address, val, + [](at::BFloat16 bsum, at::BFloat16 val) { + return min(bsum, val); + }); +} + +inline __device__ double gpuAtomicMin(double * address, double val) { + return AtomicFPOp()(address, val, + [](double val, unsigned long long int assumed) { + return __double_as_longlong(min(val, __longlong_as_double(assumed))); + }); +} + +// Dont use a templated function for this since the addition function defaults to the CUDA built-in. 
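Editor's note: a sketch of how the new float gpuAtomicMax helper might be used from a kernel, assuming ATen/cuda/Atomic.cuh is on the include path; illustrative only, not part of the patch:

#include <ATen/cuda/Atomic.cuh>

// Each thread folds one element into a single global maximum.
// *result is assumed to be initialized to -INFINITY (or the first element) beforehand.
__global__ void global_max_kernel(const float* data, int n, float* result) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    gpuAtomicMax(result, data[i]);
  }
}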
+inline __device__ float gpuAtomicMin(float * address, float val) { + unsigned int* address_as_ull = (unsigned int*)address; + unsigned int old = *address_as_ull; + unsigned int assumed; + + do { + assumed = old; + old = atomicCAS(address_as_ull, assumed, + __float_as_int(min(val, __int_as_float(assumed)))); + + // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN) + } while (assumed != old); + + return __int_as_float(old); +} diff --git a/aten/src/ATen/cuda/CUDABlas.cpp b/aten/src/ATen/cuda/CUDABlas.cpp index 5e795396d7dbe5..ec023f27e89d28 100644 --- a/aten/src/ATen/cuda/CUDABlas.cpp +++ b/aten/src/ATen/cuda/CUDABlas.cpp @@ -15,6 +15,11 @@ #include #endif +#ifdef USE_ROCM +#define PYTORCH_ROCBLAS_VERSION_DECIMAL (ROCBLAS_VERSION_MAJOR * 100 + ROCBLAS_VERSION_MINOR) +#define USE_GEMM_FLAGS_FP16_ALT_IMPL (PYTORCH_ROCBLAS_VERSION_DECIMAL >= 242) +#endif + #define CUDABLAS_POSINT_CHECK(FD, X) \ TORCH_CHECK( \ (X > 0 && X <= INT_MAX), \ @@ -246,13 +251,17 @@ void bgemm(CUDABLAS_BGEMM_ARGTYPES(at::Half)) { float falpha = alpha; float fbeta = beta; #ifdef USE_ROCM + int flag = 0; +#if USE_GEMM_FLAGS_FP16_ALT_IMPL + flag = at::BackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0; +#endif TORCH_CUDABLAS_CHECK(rocblas_gemm_strided_batched_ex(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, rocblas_datatype_f16_r, (int)lda, stridea, b, rocblas_datatype_f16_r, (int)ldb, strideb, (void*)&fbeta, c, rocblas_datatype_f16_r, (int)ldc, stridec, c, rocblas_datatype_f16_r, (int)ldc, stridec, (int) num_batches, rocblas_datatype_f32_r, rocblas_gemm_algo_standard, - 0, 0)); + 0, flag)); #else #if defined(CUDA_VERSION) && CUDA_VERSION < 11000 // On CUDA versions prior to 11, users are required to set the math mode to CUBLAS_TENSOR_OP_MATH @@ -392,6 +401,10 @@ void gemm(CUDABLAS_GEMM_ARGTYPES(at::Half)) { _cublasAdjustLdLevel3(transa, transb, m, n, k, &lda, &ldb, &ldc); GEMM_CHECK_ARGVALUES(at::Half); #ifdef USE_ROCM + int flag = 0; +#if USE_GEMM_FLAGS_FP16_ALT_IMPL + flag = at::BackwardPassGuard::is_backward_pass() ? 
rocblas_gemm_flags_fp16_alt_impl : 0; +#endif TORCH_CUDABLAS_CHECK(rocblas_gemm_ex( handle, opa, @@ -416,7 +429,7 @@ void gemm(CUDABLAS_GEMM_ARGTYPES(at::Half)) { rocblas_datatype_f32_r, rocblas_gemm_algo_standard, 0, - 0)); + flag)); #else cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties(); if (prop->major >= 5) { @@ -634,7 +647,8 @@ void gemm_and_bias( int64_t mat2_ld, const Dtype* bias, Dtype* result_ptr, - int64_t result_ld) { + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation) { using opmath_t = at::opmath_type; opmath_t beta_val = 0; // bias is added in epilogue @@ -670,6 +684,13 @@ void gemm_and_bias( &transb, sizeof(transb))); cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_BIAS; + if (activation == GEMMAndBiasActivationEpilogue::RELU) { + epilogue = CUBLASLT_EPILOGUE_RELU_BIAS; + } else if (activation == GEMMAndBiasActivationEpilogue::GELU) { +#if CUDA_VERSION >= 11040 + epilogue = CUBLASLT_EPILOGUE_GELU_BIAS; +#endif + } TORCH_CUDABLAS_CHECK(cublasLtMatmulDescSetAttribute( computeDesc.descriptor(), CUBLASLT_MATMUL_DESC_EPILOGUE, @@ -752,7 +773,8 @@ template void gemm_and_bias( int64_t mat2_ld, const double* bias, double* result_ptr, - int64_t result_ld); + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation); template void gemm_and_bias( bool transpose_mat1, @@ -767,7 +789,8 @@ template void gemm_and_bias( int64_t mat2_ld, const float* bias, float* result_ptr, - int64_t result_ld); + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation); template void gemm_and_bias( bool transpose_mat1, @@ -782,7 +805,8 @@ template void gemm_and_bias( int64_t mat2_ld, const at::Half* bias, at::Half* result_ptr, - int64_t result_ld); + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation); template void gemm_and_bias( bool transpose_mat1, @@ -797,7 +821,8 @@ template void gemm_and_bias( int64_t mat2_ld, const at::BFloat16* bias, at::BFloat16* result_ptr, - int64_t result_ld); + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation); #endif // defined(CUDA_VERSION) && CUDA_VERSION >= 11000 && !defined(_MSC_VER) template <> diff --git a/aten/src/ATen/cuda/CUDABlas.h b/aten/src/ATen/cuda/CUDABlas.h index 72d0abe40ca49d..10e589ecd6c9d0 100644 --- a/aten/src/ATen/cuda/CUDABlas.h +++ b/aten/src/ATen/cuda/CUDABlas.h @@ -71,6 +71,14 @@ void gemm(CUDABLAS_GEMM_ARGTYPES(at::BFloat16)); #endif #if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 && !defined(_MSC_VER) +enum GEMMAndBiasActivationEpilogue { + None, + RELU, + GELU, +}; + +// NOTE: GELU activation is not supported prior to CUDA 11.4 and will +// do nothing if passed in that case. 
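Editor's note: the ROCm hunks above select rocblas_gemm_flags_fp16_alt_impl via at::BackwardPassGuard::is_backward_pass(). A guard like that is typically a thread-local flag flipped by an RAII object; a minimal sketch under that assumption (hypothetical class, not the actual at::BackwardPassGuard):

// Marks the current thread as "inside the backward pass" for the guard's lifetime.
struct BackwardPassGuardSketch {
  BackwardPassGuardSketch()  : prev_(tls_in_backward_) { tls_in_backward_ = true; }
  ~BackwardPassGuardSketch() { tls_in_backward_ = prev_; }
  static bool is_backward_pass() { return tls_in_backward_; }
 private:
  bool prev_;
  static thread_local bool tls_in_backward_;
};
thread_local bool BackwardPassGuardSketch::tls_in_backward_ = false;

// Usage when picking the rocBLAS flag:
//   int flag = BackwardPassGuardSketch::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;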
template void gemm_and_bias( bool transpose_mat1, @@ -85,7 +93,8 @@ void gemm_and_bias( int64_t mat2_ld, const Dtype* bias, Dtype* result_ptr, - int64_t result_ld); + int64_t result_ld, + GEMMAndBiasActivationEpilogue activation = GEMMAndBiasActivationEpilogue::None); #endif #define CUDABLAS_BGEMM_ARGTYPES(Dtype) \ diff --git a/aten/src/ATen/cuda/CUDAEvent.h b/aten/src/ATen/cuda/CUDAEvent.h index deaebd3583d670..f07daeb979b9ea 100644 --- a/aten/src/ATen/cuda/CUDAEvent.h +++ b/aten/src/ATen/cuda/CUDAEvent.h @@ -32,15 +32,11 @@ struct TORCH_CUDA_CPP_API CUDAEvent { CUDAEvent( DeviceIndex device_index, const cudaIpcEventHandle_t* handle) { - #if !defined(USE_ROCM) device_index_ = device_index; CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaIpcOpenEventHandle(&event_, *handle)); is_created_ = true; - #else - AT_ERROR("cuIpcOpenEventHandle with HIP is not supported"); - #endif } // Note: event destruction done on creating device to avoid creating a @@ -148,7 +144,6 @@ struct TORCH_CUDA_CPP_API CUDAEvent { // Note: cudaIpcGetEventHandle must be called on the same device as the event void ipc_handle(cudaIpcEventHandle_t * handle) { - #if !defined(USE_ROCM) if (!is_created_) { // this CUDAEvent object was initially constructed from flags but event_ // is not created yet. @@ -156,9 +151,6 @@ struct TORCH_CUDA_CPP_API CUDAEvent { } CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaIpcGetEventHandle(handle, event_)); - #else - AT_ERROR("cuIpcGetEventHandle with HIP is not supported"); - #endif } private: diff --git a/aten/src/ATen/cuda/cub.cuh b/aten/src/ATen/cuda/cub.cuh index 2011ad097c4a72..abe2e9272014ff 100644 --- a/aten/src/ATen/cuda/cub.cuh +++ b/aten/src/ATen/cuda/cub.cuh @@ -6,6 +6,8 @@ #include #include +#include + #include #if USE_GLOBAL_CUB_WRAPPED_NAMESPACE() @@ -161,6 +163,34 @@ inline void segmented_sort_pairs( } } +#if CUB_SUPPORTS_UNIQUE_BY_KEY() +template +inline void unique_by_key( + KeysInputIteratorT keys_in, ValuesInputIteratorT values_in, + KeysOutputIteratorT keys_out, ValuesOutputIteratorT values_out, + NumSelectedIteratorT num_selected, int64_t num_input_items) +{ + // TODO: use thrust::discard_iterator to handle null keys_out when https://github.com/NVIDIA/cub/issues/406 is fixed. 
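Editor's note: the wrapper below uses c10::guts::if_constexpr to allocate a scratch key buffer only when the caller passes NullType::type for keys_out. With C++17 the same compile-time branch can be written as a plain if constexpr; a rough standalone sketch of that shape, with simplified types rather than the real iterator machinery:

#include <cstdint>
#include <memory>
#include <type_traits>

template <typename KeysOutT>
void unique_by_key_sketch(KeysOutT keys_out, int64_t n) {
  float* keys_out_ = nullptr;            // assume float keys for the sketch
  std::unique_ptr<float[]> scratch;
  if constexpr (std::is_same<KeysOutT, std::nullptr_t>::value) {
    scratch.reset(new float[n]);         // caller discards keys: give the routine a scratch buffer
    keys_out_ = scratch.get();
  } else {
    keys_out_ = keys_out;                // caller-provided output buffer
  }
  // ... hand keys_out_ to the device-side selection routine ...
  (void)keys_out_;
}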
+ constexpr bool null_keys_out = std::is_same::value; + using KeyT = typename std::iterator_traits::value_type; + using RealKeysOutputIteratorT = typename std::conditional::type; + RealKeysOutputIteratorT keys_out_; + auto allocator = c10::cuda::CUDACachingAllocator::get(); + c10::DataPtr keys_out_owner; + c10::guts::if_constexpr( + [&](auto _) { + keys_out_owner = allocator->allocate(num_input_items * sizeof(KeyT)); + keys_out_ = static_cast(keys_out_owner.get()); + }, + [&](auto _) { + keys_out_ = keys_out; + } + ); + CUB_WRAPPER(NO_ROCM(at_cuda_detail)::cub::DeviceSelect::UniqueByKey, + keys_in, values_in, keys_out_, values_out, num_selected, num_input_items, c10::cuda::getCurrentCUDAStream()); +} +#endif + namespace impl { template diff --git a/aten/src/ATen/cuda/cub_definitions.cuh b/aten/src/ATen/cuda/cub_definitions.cuh index e464b19e57d511..a3d551673558f7 100644 --- a/aten/src/ATen/cuda/cub_definitions.cuh +++ b/aten/src/ATen/cuda/cub_definitions.cuh @@ -18,7 +18,7 @@ #define CUB_SUPPORTS_NV_BFLOAT16() false #endif -// cub sort support for CUB_WRAPPED_NAMESPACE is added to cub 1.13.1 in: +// cub support for CUB_WRAPPED_NAMESPACE is added to cub 1.13.1 in: // https://github.com/NVIDIA/cub/pull/326 // CUB_WRAPPED_NAMESPACE is defined globally in cmake/Dependencies.cmake // starting from CUDA 11.5 @@ -28,6 +28,14 @@ #define USE_GLOBAL_CUB_WRAPPED_NAMESPACE() false #endif +// cub support for UniqueByKey is added to cub 1.16 in: +// https://github.com/NVIDIA/cub/pull/405 +#if CUB_VERSION >= 101600 +#define CUB_SUPPORTS_UNIQUE_BY_KEY() true +#else +#define CUB_SUPPORTS_UNIQUE_BY_KEY() false +#endif + // cub support for scan by key is added to cub 1.15 // in https://github.com/NVIDIA/cub/pull/376 #if CUB_VERSION >= 101500 diff --git a/aten/src/ATen/cuda/detail/CUDAHooks.cpp b/aten/src/ATen/cuda/detail/CUDAHooks.cpp index 4efe2ec4c33f36..5a444376cc8f66 100644 --- a/aten/src/ATen/cuda/detail/CUDAHooks.cpp +++ b/aten/src/ATen/cuda/detail/CUDAHooks.cpp @@ -139,16 +139,14 @@ bool CUDAHooks::hasCuSOLVER() const { #endif } -#if !defined(USE_ROCM) #if defined(USE_DIRECT_NVRTC) static std::pair, at::cuda::NVRTC*> load_nvrtc() { return std::make_pair(nullptr, at::cuda::load_nvrtc()); } -#else +#elif !defined(USE_ROCM) static std::pair, at::cuda::NVRTC*> load_nvrtc() { return std::make_pair(nullptr, &at::cuda::detail::lazyNVRTC); } -#endif #else static std::pair, at::cuda::NVRTC*> load_nvrtc() { #if defined(_WIN32) @@ -293,10 +291,22 @@ std::string CUDAHooks::showConfig() const { cudaRuntimeGetVersion(&runtimeVersion); auto printCudaStyleVersion = [&](int v) { +#ifdef USE_ROCM + // HIP_VERSION value format was changed after ROCm v4.2 to include the patch number + if(v < 500) { + // If major=xx, minor=yy then format -> xxyy + oss << (v / 100) << "." << (v % 10); + } + else { + // If major=xx, minor=yy & patch=zzzzz then format -> xxyyzzzzz + oss << (v / 10000000) << "." << (v / 100000 % 100) << "." << (v % 100000); + } +#else oss << (v / 1000) << "." << (v / 10 % 100); if (v % 10 != 0) { oss << "." 
<< (v % 10); } +#endif }; #if !defined(USE_ROCM) diff --git a/aten/src/ATen/cuda/llvm_complex.cpp b/aten/src/ATen/cuda/llvm_complex.cpp index 00339bdac0fb69..4cceb11b3eeda1 100644 --- a/aten/src/ATen/cuda/llvm_complex.cpp +++ b/aten/src/ATen/cuda/llvm_complex.cpp @@ -724,6 +724,16 @@ log10(const complex<_Tp>& __x) return log(__x) / log(_Tp(10)); } +// log2 + +template +inline +complex<_Tp> +log2(const complex<_Tp>& __x) +{ + return log(__x) / log(_Tp(2)); +} + // sqrt template diff --git a/aten/src/ATen/cudnn/Descriptors.cpp b/aten/src/ATen/cudnn/Descriptors.cpp index a5e8dc0a245315..f954bbf5623ad9 100644 --- a/aten/src/ATen/cudnn/Descriptors.cpp +++ b/aten/src/ATen/cudnn/Descriptors.cpp @@ -22,6 +22,8 @@ inline cudnnDataType_t getDataType(const at::Tensor& t) { #if defined(CUDNN_VERSION) && CUDNN_VERSION >= 8200 else if (scalar_type == at::kBFloat16) { return CUDNN_DATA_BFLOAT16; + } else if (scalar_type == at::kQInt8) { + return CUDNN_DATA_INT8; } #endif throw std::runtime_error("TensorDescriptor only supports double, float and half tensors"); diff --git a/aten/src/ATen/cudnn/Types.cpp b/aten/src/ATen/cudnn/Types.cpp index 4771f9bf2165b8..215d42fcd23f84 100644 --- a/aten/src/ATen/cudnn/Types.cpp +++ b/aten/src/ATen/cudnn/Types.cpp @@ -5,7 +5,9 @@ namespace at { namespace native { cudnnDataType_t getCudnnDataTypeFromScalarType(const at::ScalarType dtype) { - if (dtype == at::kFloat) { + if (dtype == c10::kQInt8) { + return CUDNN_DATA_INT8; + } else if (dtype == at::kFloat) { return CUDNN_DATA_FLOAT; } else if (dtype == at::kDouble) { return CUDNN_DATA_DOUBLE; diff --git a/aten/src/ATen/jiterator_macros.h b/aten/src/ATen/jiterator_macros.h new file mode 100644 index 00000000000000..2769537346c873 --- /dev/null +++ b/aten/src/ATen/jiterator_macros.h @@ -0,0 +1,38 @@ +#pragma once +#include +#include + +#define JITERATOR_HOST_DEVICE C10_HOST_DEVICE +#if defined(_MSC_VER) && defined(__CUDACC__) +// NVRTC on Windows errors if __host__ __device__ attribute is +// present on kernel. +// error: attribute "__host__" does not apply here +// error: attribute "__device__" does not apply here +#define JITERATOR_HOST_DEVICE +#endif + +// jiterator_also_stringify_as macro is used to define code (for CPU/ROCm) +// and generate code string for `jiterator` (only when compiling for CUDA). +// Usage : +// jiterator_also_stringify_as( +// jiterator_code(template T identity(T x) { return x; }), +// identity_string); +// This will define the template `identity` as present in code and +// also define `std::string identity_string` with the code as the string +// if this is being compiled for CUDA. + +// `jiterator_code` macro is to deal with `,` in the kernel code. +// These `,`s confuse the preprocessor into thinking we are passing +// multiple arguments to the macro. +#define jiterator_code(...) __VA_ARGS__ +#if defined(__CUDACC__) + // CPU and CUDA case + #define stringify_code(...) 
#__VA_ARGS__ + #define jiterator_also_stringify_as(code, str_name) \ + code /* define the function */ \ + const std::string str_name = std::string(stringify_code(code)); +#else + // CPU only or CPU and ROCm case + // Only needs the function + #define jiterator_also_stringify_as(code, str_name) code +#endif diff --git a/aten/src/ATen/mkl/SparseDescriptors.h b/aten/src/ATen/mkl/SparseDescriptors.h index 46d656898a8d0a..2c152e0b2b725c 100644 --- a/aten/src/ATen/mkl/SparseDescriptors.h +++ b/aten/src/ATen/mkl/SparseDescriptors.h @@ -101,7 +101,7 @@ class MklSparseCsrDescriptor sparse_matrix_t raw_descriptor; // Assuming that the last two dimensions are block elements of the matrix - if (values.dim() == 3) { + if (values.dim() == 3 && crow_indices.dim() == 1 && col_indices.dim() == 1) { TORCH_CHECK( values.size(-1) == values.size(-2), "MKL Sparse doesn't support matrices with non-square blocks."); diff --git a/aten/src/ATen/native/BatchLinearAlgebra.cpp b/aten/src/ATen/native/BatchLinearAlgebra.cpp index 5fc486c44f5c60..33f325267884a3 100644 --- a/aten/src/ATen/native/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebra.cpp @@ -952,8 +952,8 @@ static Tensor& linalg_solve_out_info(Tensor& result, Tensor& infos, const Tensor // _linalg_broadcast_batch_dims also includes linearSolveCheckInputs // it checks for squareness of 'input' and 'shape' compatibility of 'other' and 'input' - Tensor other_broadcasted, input_broadcasted; - std::tie(other_broadcasted, input_broadcasted) = _linalg_broadcast_batch_dims(other_, input, "linalg.solve"); + Tensor other_broadcasted; + std::tie(other_broadcasted, std::ignore) = _linalg_broadcast_batch_dims(other_, input, "linalg.solve"); auto squeezed_other_broadcasted = at::squeeze(other_broadcasted, -1); auto squeezed_result_shape = squeezed_other_broadcasted.sizes(); @@ -989,18 +989,17 @@ static Tensor& linalg_solve_out_info(Tensor& result, Tensor& infos, const Tensor // lu_factor_stub+lu_solve_stub perform calculations in-place and 'result' must be a copy of 'other_broadcasted' result.copy_(other_broadcasted); - auto input_working_copy = cloneBatchedColumnMajor(input_broadcasted); - TORCH_INTERNAL_ASSERT(infos.scalar_type() == kInt); TORCH_INTERNAL_ASSERT(infos.device() == input.device()); - infos.resize_({std::max(1, batchCount(input_broadcasted))}); + infos.resize_({std::max(1, batchCount(input))}); // if input is empty infos might not get filled; make sure infos doesn't contain garbage then if (input.numel() == 0) { infos.fill_(0); } // compute the LU factorization of 'input_working_copy' - auto pivots_shape = IntArrayRef(input_broadcasted.sizes().data(), input_broadcasted.dim() - 2).vec(); // input_broadcasted.shape[:-2] + auto input_working_copy = cloneBatchedColumnMajor(input); + auto pivots_shape = IntArrayRef(input.sizes().data(), input.dim() - 2).vec(); // input.shape[:-2] pivots_shape.push_back(std::min(input.size(-2), input.size(-1))); Tensor pivots = at::empty(pivots_shape, input.options().dtype(kInt)); lu_factor_stub(input.device().type(), input_working_copy, pivots, infos, /*compute_pivots=*/true); @@ -1023,8 +1022,7 @@ Tensor& linalg_solve_out(const Tensor& input, const Tensor& other, Tensor& resul // Now check LAPACK/MAGMA error codes // _linalg_check_errors calls 'infos = infos.to(kCPU)' - bool vector_case = linalg_solve_is_vector_rhs(input, other); - at::_linalg_check_errors(infos, "linalg.solve", vector_case ? 
result.dim() == 1 : result.dim() == 2); + at::_linalg_check_errors(infos, "linalg.solve", input.dim() == 2); return result; } diff --git a/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp b/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp index 117bbdb90935d5..84759dce1acc99 100644 --- a/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp @@ -908,8 +908,8 @@ void apply_lu_solve(const Tensor& b, const Tensor& lu, const Tensor& pivots, Tra const auto trans = to_blas(transpose); auto pivots_data = pivots.data_ptr(); auto b_stride = matrixStride(b); - auto lu_stride = matrixStride(lu); - auto pivots_stride = pivots.size(-1); + auto lu_stride = lu.dim() > 2 ? lu.stride(-3) : 0; + auto pivots_stride = pivots.dim() > 1 ? pivots.stride(-2) : 0; auto batch_size = batchCount(b); auto n = lu.size(-2); @@ -917,10 +917,19 @@ void apply_lu_solve(const Tensor& b, const Tensor& lu, const Tensor& pivots, Tra auto leading_dimension = std::max(1, n); int info = 0; + + // lu and pivots tensors can be broadcast to b + // here we construct a helper indexing tensor to linearly index into lu and pivots + IntArrayRef lu_batch_shape(lu.sizes().data(), lu.dim() - 2); + IntArrayRef b_batch_shape(b.sizes().data(), b.dim() - 2); + BroadcastLinearIndices lu_index( + batchCount(lu), lu_batch_shape, b_batch_shape); + for (const auto i : c10::irange(batch_size)) { + int64_t lu_index_i = lu_index(i); scalar_t* b_working_ptr = &b_data[i * b_stride]; - scalar_t* lu_working_ptr = &lu_data[i * lu_stride]; - int* pivots_working_ptr = &pivots_data[i * pivots_stride]; + scalar_t* lu_working_ptr = &lu_data[lu_index_i * lu_stride]; + int* pivots_working_ptr = &pivots_data[lu_index_i * pivots_stride]; lapackLuSolve(trans, n, nrhs, lu_working_ptr, leading_dimension, pivots_working_ptr, b_working_ptr, leading_dimension, &info); diff --git a/aten/src/ATen/native/BinaryOps.cpp b/aten/src/ATen/native/BinaryOps.cpp index 437835d7a86657..5b6ead4ff5a5a8 100644 --- a/aten/src/ATen/native/BinaryOps.cpp +++ b/aten/src/ATen/native/BinaryOps.cpp @@ -618,6 +618,11 @@ Tensor& mul_(Tensor& self, const Scalar& other) { return at::mul_out(self, wrapped_scalar_tensor(other), self); // redispatch! } +Tensor& mul__scalar_sparse_csr(Tensor& self, const Scalar& other) { + self.values().mul_(other); + return self; +} + Device correct_out_device(const Tensor& self, const Tensor& other) { if (self.device() == at::kCPU){ return other.device(); diff --git a/aten/src/ATen/native/ConstantPadNd.cpp b/aten/src/ATen/native/ConstantPadNd.cpp deleted file mode 100644 index f7a2d76ed52280..00000000000000 --- a/aten/src/ATen/native/ConstantPadNd.cpp +++ /dev/null @@ -1,87 +0,0 @@ -#include - -#include - -namespace at { namespace native { - -Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value) { - TORCH_CHECK(pad.size() % 2 == 0, "Length of pad must be even but instead it equals ", - pad.size()); - - auto input_sizes = self.sizes(); - auto l_inp = self.dim(); - - auto l_pad = pad.size() / 2; - auto l_diff = l_inp - l_pad; - TORCH_CHECK(l_inp >= (int64_t)l_pad, "Length of pad should be no more than twice the number of " - "dimensions of the input. 
Pad length is ", pad.size(), "while the input has ", - l_inp, "dimensions."); - - std::vector new_shape; - - bool all_pads_non_positive = true; - - auto c_input = self; - for (const auto i : c10::irange(l_diff, l_inp)) { - auto pad_idx = 2 * (l_inp - i - 1); - if (pad[pad_idx] < 0) { - c_input = c_input.narrow(i, -pad[pad_idx], c_input.size(i) + pad[pad_idx]); - } else if (pad[pad_idx] != 0) { - all_pads_non_positive = false; - } - if (pad[pad_idx + 1] < 0) { - c_input = c_input.narrow(i, 0, c_input.size(i) + pad[pad_idx + 1]); - } else if (pad[pad_idx + 1] != 0) { - all_pads_non_positive = false; - } - } - - // if none of the pads are positive we can optimize and just return the result - // of calling .narrow() on the input - if (all_pads_non_positive) { - return c_input.clone(); - } - - - for (size_t i = 0; i < (size_t)l_diff; i ++) { - new_shape.emplace_back(input_sizes[i]); - } - - for (const auto i : c10::irange((size_t)l_pad)) { - auto pad_idx = pad.size() - ((i + 1) * 2); - auto new_dim = input_sizes[l_diff + i] + pad[pad_idx] + pad[pad_idx + 1]; - TORCH_CHECK(new_dim > 0, "The input size ", input_sizes[l_diff + i], ", plus negative padding ", - pad[pad_idx], " and ", pad[pad_idx + 1], " resulted in a negative output size, " - "which is invalid. Check dimension ", l_diff + i, " of your input."); - new_shape.emplace_back(new_dim); - } - - at::Tensor output; - const auto memory_format = self.suggest_memory_format(); - if (self.is_quantized()) { - const auto qscheme = self.qscheme(); - TORCH_CHECK(qscheme == kPerTensorAffine || qscheme == kPerTensorSymmetric, - "Only per-tensor padding is supported."); - output = at::_empty_affine_quantized( - new_shape, self.options().memory_format(memory_format), - self.q_scale(), self.q_zero_point(), c10::nullopt); - } else { - output = at::empty(new_shape, self.options().memory_format(memory_format)); - } - output.fill_(value); - - auto c_output = output; - for (const auto i : c10::irange(l_diff, l_inp)) { - auto pad_idx = 2 * (l_inp - i - 1); - if (pad[pad_idx] > 0) { - c_output = c_output.narrow(i, pad[pad_idx], c_output.size(i) - pad[pad_idx]); - } - if (pad[pad_idx + 1] > 0) { - c_output = c_output.narrow(i, 0, c_output.size(i) - pad[pad_idx + 1]); - } - } - c_output.copy_(c_input); - return output; -} - -}} // namespace at::native diff --git a/aten/src/ATen/native/ConvUtils.h b/aten/src/ATen/native/ConvUtils.h index f54103372e3a01..54a4b5d14a5ab5 100644 --- a/aten/src/ATen/native/ConvUtils.h +++ b/aten/src/ATen/native/ConvUtils.h @@ -104,7 +104,7 @@ struct ConvParams { bool use_mkldnn(const at::Tensor& input, const at::Tensor& weight) const; bool use_nnpack(const at::Tensor& input, const at::Tensor& weight) const; bool use_xnnpack(const at::Tensor& input, const at::Tensor& weight, - const c10::optional bias_sizes_opt) const; + const at::OptionalIntArrayRef bias_sizes_opt) const; bool is_depthwise(const at::Tensor& input, const at::Tensor& weight) const; }; @@ -139,7 +139,7 @@ enum class ConvBackend { TORCH_API ConvBackend select_conv_backend( const Tensor& input, const Tensor& weight, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, const bool need_backward, const ConvParams& params); diff --git a/aten/src/ATen/native/Convolution.cpp b/aten/src/ATen/native/Convolution.cpp index e4e051025239b4..02b179480cc5d5 100644 --- a/aten/src/ATen/native/Convolution.cpp +++ b/aten/src/ATen/native/Convolution.cpp @@ -267,7 +267,7 @@ auto ConvParams::use_nnpack(const at::Tensor& input, const at::Tensor& weight) c auto 
ConvParams::use_xnnpack( const at::Tensor& input, const at::Tensor& weight, - const c10::optional bias_sizes_opt) const -> bool { + const at::OptionalIntArrayRef bias_sizes_opt) const -> bool { #if defined(C10_MOBILE) if (!transposed) { return (input.size(1) == groups) && @@ -652,6 +652,88 @@ static at::Tensor subtensor(at::Tensor& tensor, int dim, int groups, int g) { return tensor.narrow(dim, n * g, n).contiguous(); } +namespace { + +std::pair complex_to_real(const Tensor& inp) { + auto inp_view_as_complex = at::view_as_real(inp); + auto dim_i = inp_view_as_complex.dim() - 1; + auto i_r = inp_view_as_complex.select(dim_i, 0); + auto i_i = inp_view_as_complex.select(dim_i, 1); + return std::make_pair(i_r, i_i); +} + +at::Tensor complex_convolution( + const Tensor& input, + const Tensor& weight, + const Tensor& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + IntArrayRef output_padding, + int64_t groups) { + check_input_same_type_as_parameters(input, weight, bias); + Tensor i_r, i_i, w_r, w_i; + std::tie(i_r, i_i) = complex_to_real(input.resolve_conj()); + std::tie(w_r, w_i) = complex_to_real(weight.resolve_conj()); + + // [NOTE] Complex Convolution + // conv(W, x, b) = conv(Wr, xr, br) - conv(Wi, xi, 0) + i(conv(Wi, xr, bi) + conv(Wr, xi, 0)) + // where W, x and b are all complex inputs. + // With Gauss Trick: + // a = conv(Wr, xr, br), + // b = conv(Wi, xi, 0), + // c = conv(Wr + Wi, xr + xi, bi + br) + // conv(W, x, b) = a - b + i(c - a - b) + Tensor a, b, c; + if (!bias.defined()) { + a = at::convolution(i_r, w_r, bias, stride, padding, dilation, false, output_padding, groups); + b = at::convolution(i_i, w_i, bias, stride, padding, dilation, false, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, false, output_padding, groups); + } else { + Tensor b_r, b_i; + std::tie(b_r, b_i) = complex_to_real(bias.resolve_conj()); + a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, false, output_padding, groups); + b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, false, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, false, output_padding, groups); + } + + auto i = c10::Scalar(c10::complex(0, 1)); + return a - b + i * (c - a - b); +} + +at::Tensor complex_convolution_mode( + const at::Tensor& input, + const at::Tensor& weight, + const c10::optional& bias_opt, + at::IntArrayRef stride, + c10::string_view padding, + at::IntArrayRef dilation, + int64_t groups) { + auto bias = bias_opt.value_or(Tensor()); + check_input_same_type_as_parameters(input, weight, bias); + Tensor i_r, i_i, w_r, w_i; + std::tie(i_r, i_i) = complex_to_real(input.resolve_conj()); + std::tie(w_r, w_i) = complex_to_real(weight.resolve_conj()); + + // See [NOTE] Complex Convolution + Tensor a, b, c; + if (!bias.defined()) { + a = at::_convolution_mode(i_r, w_r, bias, stride, padding, dilation, groups); + b = at::_convolution_mode(i_i, w_i, bias, stride, padding, dilation, groups); + c = at::_convolution_mode(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, groups); + } else { + Tensor b_r, b_i; + std::tie(b_r, b_i) = complex_to_real(bias.resolve_conj()); + a = at::_convolution_mode(i_r, w_r, b_r, stride, padding, dilation, groups); + b = at::_convolution_mode(i_i, w_i, Tensor(), stride, padding, dilation, groups); + c = at::_convolution_mode(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, groups); + } + + auto i = c10::Scalar(c10::complex(0, 1)); + 
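// Editor's note (not part of the patch): a quick check of the Gauss-trick algebra used
// here and in complex_convolution above. With W = Wr + i*Wi, x = xr + i*xi, bias = br + i*bi:
//   a = conv(Wr, xr) + br
//   b = conv(Wi, xi)
//   c = conv(Wr + Wi, xr + xi) + br + bi = a + b + conv(Wr, xi) + conv(Wi, xr) + bi
// so a - b = conv(Wr, xr) - conv(Wi, xi) + br is the real part, and
// c - a - b = conv(Wr, xi) + conv(Wi, xr) + bi is the imaginary part, i.e.
// a - b + i*(c - a - b) equals conv(W, x) + bias using three real convolutions instead of four.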
return a - b + i * (c - a - b); +} + +} // namespace at::Tensor conv1d( const Tensor& input_, const Tensor& weight, const c10::optional& bias_opt, @@ -663,7 +745,12 @@ at::Tensor conv1d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv1d"); - auto output = at::convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution(input, weight, bias, stride, padding, dilation, {0}, groups); + } else { + output = at::convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); + } return is_batched ? output : output.squeeze(0); } @@ -677,7 +764,12 @@ at::Tensor conv2d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d"); - auto output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0}}, groups); + } else { + output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); + } return is_batched ? output : output.squeeze(0); } @@ -691,7 +783,12 @@ at::Tensor conv3d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv3d"); - auto output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0, 0}}, groups); + } else { + output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); + } return is_batched ? output : output.squeeze(0); } @@ -787,8 +884,12 @@ at::Tensor conv1d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv1d"); - auto output = at::_convolution_mode( - input, weight, bias, stride, std::move(padding), dilation, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } else { + output = at::_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } return is_batched ? output : output.squeeze(0); } @@ -799,8 +900,12 @@ at::Tensor conv2d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d"); - auto output = at::_convolution_mode( - input, weight, bias, stride, std::move(padding), dilation, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } else { + output = at::_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } return is_batched ? 
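
The [NOTE] Complex Convolution comment above reduces one complex convolution to three real ones via Gauss's trick. Because convolution is bilinear in the weight and the input, the identity can be sanity-checked on plain complex scalars; below is a minimal standalone C++ sketch (not ATen code) verifying that (a - b) + i(c - a - b) reproduces the complex product. The bias-defined branch above follows the same algebra with br folded into a and br + bi into c.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <iostream>

// Gauss's trick: a complex product W * x (the scalar analogue of a complex
// convolution, which is bilinear in W and x) needs three real products
// instead of four:
//   a = Wr * xr,  b = Wi * xi,  c = (Wr + Wi) * (xr + xi)
//   W * x = (a - b) + i * (c - a - b)
int main() {
  std::complex<double> W{1.5, -2.0}, x{0.75, 3.25};

  double a = W.real() * x.real();
  double b = W.imag() * x.imag();
  double c = (W.real() + W.imag()) * (x.real() + x.imag());

  std::complex<double> via_gauss{a - b, c - a - b};
  std::complex<double> direct = W * x;

  assert(std::abs(via_gauss - direct) < 1e-12);
  std::cout << "Gauss trick: " << via_gauss << "  direct: " << direct << "\n";
  return 0;
}
```
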
output : output.squeeze(0); } @@ -811,8 +916,12 @@ at::Tensor conv3d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv3d"); - auto output = at::_convolution_mode( - input, weight, bias, stride, std::move(padding), dilation, groups); + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } else { + output = at::_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups); + } return is_batched ? output : output.squeeze(0); } @@ -933,7 +1042,7 @@ ConvBackend select_conv_backend( ConvBackend select_conv_backend( const Tensor& input, const Tensor& weight, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, const bool need_backward, const ConvParams& params) { @@ -1565,7 +1674,7 @@ std::tuple _convolution_backward_nogroup_bac // output_mask: 3-dim boolean array specifying which gradients to compute in input, weight, bias order std::tuple convolution_backward( const Tensor& grad_output_, const Tensor& input_, const Tensor& weight_, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool transposed, IntArrayRef output_padding, int64_t groups, std::array output_mask) { auto grad_output = grad_output_; diff --git a/aten/src/ATen/native/Copy.cpp b/aten/src/ATen/native/Copy.cpp index 5496facf847c7b..c93d517b7b78d2 100644 --- a/aten/src/ATen/native/Copy.cpp +++ b/aten/src/ATen/native/Copy.cpp @@ -52,7 +52,7 @@ void copy_same_type_transpose_(Tensor& self, const Tensor& src) { // The code below is implemented with the assumption that sizes are equal TORCH_INTERNAL_ASSERT_DEBUG_ONLY(self.sizes().equals(src.sizes())); - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, kBFloat16, self.scalar_type(), "copy_", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(kHalf, kBool, kBFloat16, kComplexHalf, self.scalar_type(), "copy_", [&] { scalar_t* sp = src.data_ptr(); scalar_t* rp = self.data_ptr(); scalar_t* bp = buf.data_ptr(); diff --git a/aten/src/ATen/native/DilatedConvolutionUtils.h b/aten/src/ATen/native/DilatedConvolutionUtils.h index 2d4815799b10f2..51b30a9bc77aed 100644 --- a/aten/src/ATen/native/DilatedConvolutionUtils.h +++ b/aten/src/ATen/native/DilatedConvolutionUtils.h @@ -4,7 +4,7 @@ #include #include -#include +#include #include #define TORCH_CHECK_DIM_SIZE(T, DIM, DIM_SIZE, SIZE) \ diff --git a/aten/src/ATen/native/EmbeddingBag.cpp b/aten/src/ATen/native/EmbeddingBag.cpp index e6f88f556c8258..32eb95d50fadc1 100644 --- a/aten/src/ATen/native/EmbeddingBag.cpp +++ b/aten/src/ATen/native/EmbeddingBag.cpp @@ -10,6 +10,7 @@ #ifdef USE_FBGEMM #include +#include #else #include #endif @@ -60,14 +61,14 @@ std::pair promoteIndicesAndOffsets( // is only applicable if special conditions are met template bool is_fast_path_index_select(const Tensor& src, Tensor& output, index_t padding_idx) { - return src.scalar_type() == kFloat && src.strides()[1] == 1 && output.strides()[1] == 1 && padding_idx < static_cast(0); + return (src.scalar_type() == kFloat || src.scalar_type() == kHalf) && src.strides()[1] == 1 && output.strides()[1] == 1 && padding_idx < static_cast(0); } // Determines if we can use a fast implementation for index_select_scale_add, // which is only applicable if special conditions are met template bool is_fast_path_index_select_scale(const Tensor& src, const Tensor& 
scale, Tensor& output, index_t padding_idx) { - return src.scalar_type() == kFloat && src.strides()[1] == 1 && output.strides()[1] == 1 && scale.strides()[0] == 1 && padding_idx < static_cast(0); + return (src.scalar_type() == kFloat || src.scalar_type() == kHalf) && src.strides()[1] == 1 && output.strides()[1] == 1 && scale.strides()[0] == 1 && padding_idx < static_cast(0); } template @@ -81,7 +82,7 @@ bool is_fast_path(const Tensor& src, const c10::optional& scale, Tensor& // index_add (using add_indices as the index), without creating an intermediary // tensor to hold the selected embeddings template -typename std::enable_if::value, void>::type +typename std::enable_if::value && !std::is_same::value, void>::type index_select_add(const Tensor &select_indices, const Tensor &add_indices, const Tensor &src, @@ -96,12 +97,12 @@ index_select_add(const Tensor &select_indices, auto* src_data = src.data_ptr(); auto* output_data = output.data_ptr(); // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - index_t* bag_size_data; + index_t* bag_size_data = nullptr; if (bag_size.defined()) { bag_size_data = bag_size.data_ptr(); } auto numel = add_indices.numel(); - int64_t ddim = src.sizes()[1]; + int64_t ddim = src.size(1); auto vocab_size = src.size(0); auto src_stride0 = src.strides()[0]; auto src_stride1 = src.strides()[1]; @@ -157,6 +158,155 @@ void fbgemm_spmdm_report_error_( } } // namespace +template +typename std::enable_if::value, void>::type +index_select_add(const Tensor &select_indices, + const Tensor &add_indices, + const Tensor &src, + Tensor &output, + const Tensor& offsets, + bool include_last_offset, + Tensor &bag_size, + index_t padding_idx) { + int64_t ddim = src.size(1); + auto* select_indices_data = select_indices.data_ptr(); + auto* output_data = output.data_ptr(); + + if (is_fast_path_index_select(src, output, padding_idx)) { + auto src_contig = src.contiguous(); + auto* src_data = src_contig.data_ptr(); + int64_t output_size = offsets.numel() - 1; + auto* offsets_data = offsets.data_ptr(); + std::vector offsets_include_last; + + if (include_last_offset) { + output_size = offsets.numel() - 1; + } else { + output_size = offsets.numel(); + offsets_include_last.resize(offsets.numel() + 1); + if (offsets.numel() > 0) { + std::memcpy( + offsets_include_last.data(), + offsets.data_ptr(), + sizeof(index_t) * offsets.numel()); + } + offsets_include_last[offsets.numel()] = select_indices.numel(); + offsets_data = offsets_include_last.data(); + } + +#ifdef USE_FBGEMM + using float16 = uint16_t; + auto kernel_fp16_index_t = + fbgemm::GenerateEmbeddingSpMDM( + /* block_size */ddim, + /* has_weight */false, + /* normalize_by_lengths */false, + /* prefetch */16, + /* is_weight_positional */false, + /* use_offsets */true + ); +#else + // Initialize the intermediate output buffer to be 0. 
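
The widened is_fast_path_index_select / is_fast_path_index_select_scale predicates above now accept kHalf as well as kFloat, and still require unit stride along the embedding dimension plus a negative padding_idx (i.e. padding disabled), presumably because the fast FBGEMM / EmbeddingLookupIdx kernels do not perform padding handling. A standalone sketch of the same decision, with a hypothetical TensorMeta stand-in for the few tensor properties the check reads:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical, simplified stand-in for the properties the predicate inspects;
// the real code reads them from at::Tensor.
enum class Dtype { Float, Half };

struct TensorMeta {
  Dtype dtype;
  int64_t inner_stride;  // stride of dim 1, the embedding dimension
};

// Mirrors the shape of is_fast_path_index_select: fp32/fp16 data, contiguous
// along the embedding dimension, and padding_idx < 0 (padding disabled).
bool fast_path_index_select(const TensorMeta& src, const TensorMeta& out,
                            int64_t padding_idx) {
  const bool dtype_ok = src.dtype == Dtype::Float || src.dtype == Dtype::Half;
  return dtype_ok && src.inner_stride == 1 && out.inner_stride == 1 &&
         padding_idx < 0;
}

int main() {
  TensorMeta src{Dtype::Half, 1}, out{Dtype::Half, 1};
  std::cout << std::boolalpha
            << fast_path_index_select(src, out, /*padding_idx=*/-1) << "\n"   // true
            << fast_path_index_select(src, out, /*padding_idx=*/0) << "\n";   // false
  return 0;
}
```
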
+ Tensor output_fp32 = at::zeros({output_size, ddim}, output.options().dtype(at::kFloat)); + auto* output_data_fp32 = output_fp32.data_ptr(); +#endif + at::parallel_for( + 0, output_size, 1, [&](index_t start_idx, index_t end_idx) { +#ifdef USE_FBGEMM + bool success = kernel_fp16_index_t( + /* output_size */end_idx - start_idx, + /* index_size */offsets_data[end_idx] - offsets_data[start_idx], + /* data_size */src.size(0), + /* input */reinterpret_cast(src_data), + /* indices */select_indices_data + offsets_data[start_idx], + /* offsets_or_lengths */offsets_data + start_idx, + /* weights */nullptr, + /* output */reinterpret_cast(output_data + start_idx * ddim)); + if (!success) { + fbgemm_spmdm_report_error_( + end_idx - start_idx, + offsets_data[end_idx] - offsets_data[start_idx], + src.size(0), + offsets_data + start_idx, + select_indices_data + offsets_data[start_idx]); + } +#else + caffe2::EmbeddingLookupIdx( + /*block_size=*/ddim, + /*output_size=*/end_idx - start_idx, + /*index_size=*/offsets_data[end_idx] - offsets_data[start_idx], + /*data_size=*/src.size(0), + /*input=*/src_data, + /*indices=*/select_indices_data + offsets_data[start_idx], + /*offsets=*/offsets_data + start_idx, + /*weights=*/nullptr, + /*scale_bias=*/nullptr, + /*normalize_by_lengths=*/false, + /*out=*/output_data_fp32 + start_idx * ddim); + for (const auto i : c10::irange(output_size)) { + // Convert FP32 intermediate buffer result back to FP16 for output dtype + for (const auto d : c10::irange(ddim)) { + (output_data + i * ddim)[d] = static_cast((output_data_fp32 + ddim * i)[d]); + } + } +#endif + }); + + } else { + TORCH_CHECK(select_indices.numel() == add_indices.numel()); + auto* src_data = src.data_ptr(); + auto* add_indices_data = add_indices.data_ptr(); + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + index_t* bag_size_data = nullptr; + if (bag_size.defined()) { + bag_size_data = bag_size.data_ptr(); + } + auto vocab_size = src.size(0); + auto src_stride0 = src.strides()[0]; + auto src_stride1 = src.strides()[1]; + auto output_stride0 = output.strides()[0]; + auto output_stride1 = output.strides()[1]; + auto numel = add_indices.numel(); + + Tensor src_fp32 = at::empty({ddim}, src.options().dtype(at::kFloat)); + auto* src_data_fp32 = src_fp32.data_ptr(); + + // Initialize the intermediate output buffer to be 0. 
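
When include_last_offset is false, the fast path above appends select_indices.numel() as a final boundary so that offsets[i+1] - offsets[i] is always the size of bag i. A small standalone sketch of that normalization (with_last_offset is a hypothetical helper name):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Normalize EmbeddingBag offsets so that offsets[i+1] - offsets[i] is always
// the length of bag i, by appending the total index count as a final boundary.
// This mirrors the offsets_include_last handling in the fast path above.
std::vector<int64_t> with_last_offset(const std::vector<int64_t>& offsets,
                                      int64_t num_indices) {
  std::vector<int64_t> out(offsets);
  out.push_back(num_indices);
  return out;
}

int main() {
  // Three bags over 7 indices: bag sizes 2, 3 and 2.
  std::vector<int64_t> offsets{0, 2, 5};
  auto padded = with_last_offset(offsets, /*num_indices=*/7);  // {0, 2, 5, 7}

  for (size_t b = 0; b + 1 < padded.size(); ++b) {
    std::cout << "bag " << b << " has " << padded[b + 1] - padded[b]
              << " indices\n";
  }
  return 0;
}
```
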
+ Tensor output_fp32 = at::zeros({output.size(0), ddim}, output.options().dtype(at::kFloat)); + auto* output_data_fp32 = output_fp32.data_ptr(); + + for (const auto i : c10::irange(numel)) { + // We can skip indices equal to padding_idx so they are not included in + // the reduction + auto idx = select_indices_data[i]; + TORCH_CHECK( + idx >= 0 && idx < vocab_size, + "embedding_bag: Expected idx >= 0 && idx < num_embeddings but found idx to be ", + idx); + if (idx != padding_idx) { + // Copy src_data + src_stride0 * idx to src_data_fp32 + for (const auto d : c10::irange(ddim)) { + src_data_fp32[d] = static_cast((src_data + src_stride0 * idx)[d * src_stride1]); + } + at::native::cpublas::axpy(ddim, 1, + src_data_fp32, 1, + output_data_fp32 + ddim * add_indices_data[i], 1); + + } else if (bag_size.defined()) { + // Decrement bag_size to reflect that the index is padded + // NOLINTNEXTLINE(clang-analyzer-core.NullDereference) + bag_size_data[add_indices_data[i]]--; + } + } + for (const auto i : c10::irange(output.size(0))) { + // Convert FP32 intermediate buffer result back to FP16 for output dtype + for (const auto d : c10::irange(ddim)) { + (output_data + output_stride0 * i)[d * output_stride1] = static_cast((output_data_fp32 + ddim * i)[d]); + } + } + } +} + template typename std::enable_if::value, void>::type index_select_add(const Tensor &select_indices, @@ -167,7 +317,7 @@ index_select_add(const Tensor &select_indices, bool include_last_offset, Tensor &bag_size, index_t padding_idx) { - int64_t ddim = src.sizes()[1]; + int64_t ddim = src.size(1); auto* select_indices_data = select_indices.data_ptr(); auto* output_data = output.data_ptr(); @@ -210,7 +360,7 @@ index_select_add(const Tensor &select_indices, bool success = kernel_fp32_index_t( /* output_size */end_idx - start_idx, /* index_size */offsets_data[end_idx] - offsets_data[start_idx], - /* data_size */src.sizes()[0], + /* data_size */src.size(0), /* input */src_data, /* indices */select_indices_data + offsets_data[start_idx], /* offsets_or_lengths */offsets_data + start_idx, @@ -220,7 +370,7 @@ index_select_add(const Tensor &select_indices, fbgemm_spmdm_report_error_( end_idx - start_idx, offsets_data[end_idx] - offsets_data[start_idx], - src.sizes()[0], + src.size(0), offsets_data + start_idx, select_indices_data + offsets_data[start_idx]); } @@ -229,7 +379,7 @@ index_select_add(const Tensor &select_indices, /*block_size=*/ddim, /*output_size=*/end_idx - start_idx, /*index_size=*/offsets_data[end_idx] - offsets_data[start_idx], - /*data_size=*/src.sizes()[0], + /*data_size=*/src.size(0), /*input=*/src_data, /*indices=*/select_indices_data + offsets_data[start_idx], /*offsets=*/offsets_data + start_idx, @@ -244,7 +394,7 @@ index_select_add(const Tensor &select_indices, auto* src_data = src.data_ptr(); auto* add_indices_data = add_indices.data_ptr(); // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - index_t* bag_size_data; + index_t* bag_size_data = nullptr; if (bag_size.defined()) { bag_size_data = bag_size.data_ptr(); } @@ -284,7 +434,7 @@ index_select_add(const Tensor &select_indices, // mul (scaling by per_sample_weights) // index_add (using add_indices as the index) template -static typename std::enable_if::value, void>::type +static typename std::enable_if::value && !std::is_same::value, void>::type index_select_scale_add(const Tensor &select_indices, const Tensor &add_indices, const Tensor &scale, @@ -300,7 +450,7 @@ index_select_scale_add(const Tensor &select_indices, auto* src_data = src.data_ptr(); auto* output_data = 
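
On the non-FBGEMM slow path above, FP16 rows are widened to FP32, accumulated per bag with axpy, and only the finished bag sums are narrowed back to FP16, so rounding error does not compound across a large bag. A standalone sketch of that accumulation pattern over plain arrays, with float standing in for both the storage and accumulation types:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sum-mode EmbeddingBag over plain arrays: accumulate each selected embedding
// row into an FP32 buffer for its bag, then write the bag sums out. In the
// real kernel the rows are at::Half and the narrowing happens at the end.
std::vector<float> embedding_bag_sum(
    const std::vector<float>& weight,          // [vocab, dim], row-major
    int64_t dim,
    const std::vector<int64_t>& indices,       // flattened bag contents
    const std::vector<int64_t>& offset2bag) {  // bag id for each index
  int64_t num_bags = 0;
  for (int64_t b : offset2bag) num_bags = std::max(num_bags, b + 1);

  std::vector<float> acc(num_bags * dim, 0.0f);  // FP32 accumulator
  for (size_t i = 0; i < indices.size(); ++i) {
    const float* row = weight.data() + indices[i] * dim;
    float* bag = acc.data() + offset2bag[i] * dim;
    for (int64_t d = 0; d < dim; ++d) bag[d] += row[d];  // axpy with alpha = 1
  }
  return acc;  // the real code would cast each element back to FP16 here
}

int main() {
  const int64_t dim = 2;
  std::vector<float> weight{1, 1, 2, 2, 3, 3};   // 3 embeddings of size 2
  std::vector<int64_t> indices{0, 2, 1};         // bag 0: rows 0 and 2; bag 1: row 1
  std::vector<int64_t> offset2bag{0, 0, 1};

  auto out = embedding_bag_sum(weight, dim, indices, offset2bag);
  std::cout << out[0] << " " << out[1] << " | " << out[2] << " " << out[3] << "\n";
  // prints: 4 4 | 2 2
  return 0;
}
```
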
output.data_ptr(); // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - index_t* bag_size_data; + index_t* bag_size_data = nullptr; if (bag_size.defined()) { bag_size_data = bag_size.data_ptr(); } @@ -338,6 +488,158 @@ index_select_scale_add(const Tensor &select_indices, } } +template +typename std::enable_if::value, void>::type +index_select_scale_add(const Tensor &select_indices, + const Tensor &add_indices, + const Tensor &scale, + const Tensor &src, + Tensor &output, + const Tensor& offsets, + bool include_last_offset, + Tensor &bag_size, + index_t padding_idx) { + int64_t ddim = src.size(1); + auto* scale_data = scale.data_ptr(); + auto* select_indices_data = select_indices.data_ptr(); + auto* output_data = output.data_ptr(); + + if (is_fast_path_index_select_scale(src, scale, output, padding_idx)) { + auto src_contig = src.contiguous(); + auto* src_data = src_contig.data_ptr(); + int64_t output_size = offsets.numel() - 1; + auto* offsets_data = offsets.data_ptr(); + std::vector offsets_include_last; + + if (include_last_offset) { + output_size = offsets.numel() - 1; + } else { + output_size = offsets.numel(); + offsets_include_last.resize(offsets.numel() + 1); + std::memcpy( + offsets_include_last.data(), + offsets.data_ptr(), + sizeof(index_t) * offsets.numel()); + offsets_include_last[offsets.numel()] = select_indices.numel(); + offsets_data = offsets_include_last.data(); + } + + Tensor scale_fp32 = at::empty(scale.sizes(), scale.options().dtype(at::kFloat)); + auto* scale_data_fp32 = scale_fp32.data_ptr(); + +#ifdef USE_FBGEMM + using float16 = uint16_t; + fbgemm::Float16ToFloat_simd(reinterpret_cast(scale_data), scale_data_fp32, scale_fp32.numel()); + auto kernel_fp16_index_t = + fbgemm::GenerateEmbeddingSpMDM( + /* block_size */ddim, + /* has_weight */true, + /* normalize_by_lengths */false, + /* prefetch */16, + /* is_weight_positional */false, + /* use_offsets */true + ); +#else + // Initialize the intermediate output buffer to be 0. 
+ Tensor output_fp32 = at::zeros({output_size, ddim}, output.options().dtype(at::kFloat)); + auto* output_data_fp32 = output_fp32.data_ptr(); + for (const auto i : c10::irange(scale.numel())) { + scale_data_fp32[i] = static_cast(scale_data[i]); + } +#endif + at::parallel_for( + 0, output_size, 1, [&](index_t start_idx, index_t end_idx) { +#ifdef USE_FBGEMM + bool success = kernel_fp16_index_t( + /* output_size */end_idx - start_idx, + /* index_size */offsets_data[end_idx] - offsets_data[start_idx], + /* data_size */src.size(0), + /* input */reinterpret_cast(src_data), + /* indices */select_indices_data + offsets_data[start_idx], + /* offsets_or_lengths */offsets_data + start_idx, + /* weights */scale_data_fp32 + offsets_data[start_idx], + /* output */reinterpret_cast(output_data + start_idx * ddim)); + if (!success) { + fbgemm_spmdm_report_error_( + end_idx - start_idx, + offsets_data[end_idx] - offsets_data[start_idx], + src.size(0), + offsets_data + start_idx, + select_indices_data + offsets_data[start_idx]); + } +#else + caffe2::EmbeddingLookupIdx( + /*block_size=*/ddim, + /*output_size=*/end_idx - start_idx, + /*index_size=*/offsets_data[end_idx] - offsets_data[start_idx], + /*data_size=*/src.size(0), + /*input=*/src_data, + /*indices=*/select_indices_data + offsets_data[start_idx], + /*offsets=*/offsets_data + start_idx, + /*weights=*/scale_data_fp32 + offsets_data[start_idx], + /*scale_bias=*/nullptr, + /*normalize_by_lengths=*/false, + /*out=*/output_data_fp32 + start_idx * ddim); + for (const auto i : c10::irange(output_size)) { + // Convert FP32 intermediate buffer result back to FP16 for output dtype + for (const auto d : c10::irange(ddim)) { + (output_data + i * ddim)[d] = static_cast((output_data_fp32 + ddim * i)[d]); + } + } +#endif + }); + } else { + AT_ASSERT(select_indices.numel() == add_indices.numel()); + auto* src_data = src.data_ptr(); + auto* add_indices_data = add_indices.data_ptr(); + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + index_t* bag_size_data = nullptr; + if (bag_size.defined()) { + bag_size_data = bag_size.data_ptr(); + } + auto vocab_size = src.size(0); + auto src_stride0 = src.strides()[0]; + auto src_stride1 = src.strides()[1]; + auto output_stride0 = output.strides()[0]; + auto output_stride1 = output.strides()[1]; + auto scale_stride = scale.strides()[0]; + auto numel = add_indices.numel(); + + // Initialize the intermediate output buffer to be 0. 
+ Tensor output_fp32 = at::zeros({output.size(0), ddim}, output.options().dtype(at::kFloat)); + auto* output_data_fp32 = output_fp32.data_ptr(); + + for (const auto i : c10::irange(numel)) { + // We can skip indices equal to padding_idx so they are not included in + // the reduction + auto idx = select_indices_data[i]; + TORCH_CHECK( + idx >= 0 && idx < vocab_size, + "embedding_bag: Expected idx >= 0 && idx < num_embeddings but found idx to be ", + idx); + if (idx != padding_idx) { + + auto* src_base = src_data + src_stride0 * idx; + auto* output_base_fp32 = output_data_fp32 + ddim * add_indices_data[i]; + auto scale = scale_data[i * scale_stride]; + for (const auto j : c10::irange(ddim)) { + output_base_fp32[j] += static_cast(src_base[j * src_stride1]) * static_cast(scale); + } + } else if (bag_size.defined()) { + // Decrement bag_size to reflect that the index is padded + // NOLINTNEXTLINE(clang-analyzer-core.NullDereference) + bag_size_data[add_indices_data[i]]--; + } + } + for (const auto i : c10::irange(output.size(0))) { + // Convert FP32 intermediate buffer result back to FP16 for output dtype + for (const auto d : c10::irange(ddim)) { + (output_data + output_stride0 * i)[d * output_stride1] = static_cast((output_data_fp32 + ddim * i)[d]); + } + } + } +} + template typename std::enable_if::value, void>::type index_select_scale_add(const Tensor &select_indices, @@ -349,7 +651,7 @@ index_select_scale_add(const Tensor &select_indices, bool include_last_offset, Tensor &bag_size, index_t padding_idx) { - int64_t ddim = src.sizes()[1]; + int64_t ddim = src.size(1); auto* scale_data = scale.data_ptr(); auto* select_indices_data = select_indices.data_ptr(); auto* output_data = output.data_ptr(); @@ -391,7 +693,7 @@ index_select_scale_add(const Tensor &select_indices, bool success = kernel_fp32_index_t( /* output_size */end_idx - start_idx, /* index_size */offsets_data[end_idx] - offsets_data[start_idx], - /* data_size */src.sizes()[0], + /* data_size */src.size(0), /* input */src_data, /* indices */select_indices_data + offsets_data[start_idx], /* offsets_or_lengths */offsets_data + start_idx, @@ -401,7 +703,7 @@ index_select_scale_add(const Tensor &select_indices, fbgemm_spmdm_report_error_( end_idx - start_idx, offsets_data[end_idx] - offsets_data[start_idx], - src.sizes()[0], + src.size(0), offsets_data + start_idx, select_indices_data + offsets_data[start_idx]); } @@ -410,7 +712,7 @@ index_select_scale_add(const Tensor &select_indices, /*block_size=*/ddim, /*output_size=*/end_idx - start_idx, /*index_size=*/offsets_data[end_idx] - offsets_data[start_idx], - /*data_size=*/src.sizes()[0], + /*data_size=*/src.size(0), /*input=*/src_data, /*indices=*/select_indices_data + offsets_data[start_idx], /*offsets=*/offsets_data + start_idx, @@ -425,7 +727,7 @@ index_select_scale_add(const Tensor &select_indices, auto* src_data = src.data_ptr(); auto* add_indices_data = add_indices.data_ptr(); // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - index_t* bag_size_data; + index_t* bag_size_data = nullptr; if (bag_size.defined()) { bag_size_data = bag_size.data_ptr(); } @@ -477,17 +779,17 @@ void check_arguments( checkScalarTypes("embedding_bag", offsets_arg, {kLong, kInt}); checkSameType("embedding_bag", indices_arg, offsets_arg); auto weight_arg = TensorArg(weight, "weight", 1); - checkScalarTypes("embedding_bag", weight_arg, {kFloat, kDouble}); + checkScalarTypes("embedding_bag", weight_arg, {kHalf, kFloat, kDouble}); AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "_embedding_bag_cpu_impl", [&]() 
{ - if (offsets.sizes()[0] > 0) { + if (offsets.size(0) > 0) { index_t offset_0 = offsets.data_ptr()[0]; - index_t offset_n = offsets.data_ptr()[offsets.sizes()[0]-1]; + index_t offset_n = offsets.data_ptr()[offsets.size(0)-1]; TORCH_CHECK(offset_0 == 0, "offsets[0] has to be 0, i.e., the first sequence " "in the mini-batch has to start from position 0. " "However, got ", offsets[0]); - TORCH_CHECK(offset_n <= indices.sizes()[0], "offsets[-1] can not " - "be greater than input's length ", indices.sizes()[0], " but got offsets[-1] of ", + TORCH_CHECK(offset_n <= indices.size(0), "offsets[-1] can not " + "be greater than input's length ", indices.size(0), " but got offsets[-1] of ", offset_n); } }); @@ -504,7 +806,7 @@ void check_arguments( if (include_last_offset) { TORCH_CHECK( - offsets.sizes()[0] >= 1, + offsets.size(0) >= 1, "include_last_offset: number of offset should be at least 1"); } } @@ -517,16 +819,16 @@ void make_bag_size_out( const bool include_last_offset, const bool requires_grad) { if (requires_grad || mode == MODE_MEAN || mode == MODE_MAX) { - auto num_bags = offsets.sizes()[0] - (include_last_offset ? 1 : 0); + auto num_bags = offsets.size(0) - (include_last_offset ? 1 : 0); at::native::resize_(bag_size_out, {num_bags}, c10::nullopt); // Compute this for MODE_MEAN and MODE_MAX (latter needed for backwards) if (num_bags != 1) { - bag_size_out.slice(0, 0, bag_size_out.sizes()[0] - 1, 1) = + bag_size_out.slice(0, 0, bag_size_out.size(0) - 1, 1) = offsets.slice(0, 1, num_bags, 1) - offsets.slice(0, 0, num_bags - 1, 1); } if (num_bags > 0) { - bag_size_out[-1] = indices.sizes()[0] - offsets[num_bags - 1]; + bag_size_out[-1] = indices.size(0) - offsets[num_bags - 1]; } } else { at::native::resize_(bag_size_out, offsets.sizes(), c10::nullopt); @@ -541,7 +843,7 @@ void make_max_indices_out( const Tensor& bag_size, const int64_t mode, bool include_last_offset) { - int64_t numBags = offsets.sizes()[0]; + int64_t numBags = offsets.size(0); if (mode == MODE_MAX) { if (include_last_offset) { TORCH_CHECK( @@ -569,13 +871,11 @@ void make_offset2bag_out( bool fast_path_sum = is_fast_path(weight, per_sample_weights, output, padding_idx); if (mode == MODE_MEAN || mode == MODE_MAX || !fast_path_sum) { - at::native::resize_(offset2bag, {indices.sizes()[0] + 1}, c10::nullopt); + at::native::resize_(offset2bag, {indices.size(0) + 1}, c10::nullopt); at::native::zero_(offset2bag); - } - if (mode == MODE_MEAN || mode == MODE_MAX || !fast_path_sum) { make_offset2bag(offsets, offset2bag); - at::native::resize_(offset2bag, {indices.sizes()[0]}, c10::nullopt); + at::native::resize_(offset2bag, {indices.size(0)}, c10::nullopt); // only initialize output in slow path at::native::zero_(output); } @@ -711,7 +1011,7 @@ void _embedding_bag_cpu_impl_out(Tensor& output, Tensor& offset2bag, const c10::optional& per_sample_weights, bool include_last_offset, int64_t padding_idx) { if (mode == MODE_MEAN || mode == MODE_SUM) { - AT_DISPATCH_FLOATING_TYPES(weight.scalar_type(), "embedding_bag_no_grad_cpu_out", + AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, weight.scalar_type(), "embedding_bag_no_grad_cpu_out", [&indices, &offset2bag, &per_sample_weights, &weight, &output, &offsets, &include_last_offset, &mode, &bag_size, &padding_idx]() { AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "embedding_bag_no_grad_cpu_out", [&indices, &offset2bag, &per_sample_weights, &weight, &output, &offsets, &include_last_offset, &mode, &bag_size, &padding_idx]() { @@ -756,7 +1056,7 @@ std::tuple 
_embedding_bag_cpu_impl( check_arguments(weight, indices, offsets, mode, per_sample_weights, include_last_offset); Tensor output = at::empty( - {include_last_offset ? offsets.sizes()[0] - 1 : offsets.sizes()[0], + {include_last_offset ? offsets.size(0) - 1 : offsets.size(0), weight.sizes()[1]}, weight.options()); @@ -894,10 +1194,10 @@ Tensor _embedding_bag_backward(const Tensor &grad, const Tensor &indices_, Tensor offset2bag_; if (indices.numel() != 0 && offset2bag.numel() == 0) { offset2bag_ = at::zeros( - {indices.sizes()[0] + 1}, offsets.options()); // offset2bag = [0 0 0 0 0] + {indices.size(0) + 1}, offsets.options()); // offset2bag = [0 0 0 0 0] make_offset2bag(offsets, offset2bag_); - offset2bag_.resize_({indices.sizes()[0]}); + offset2bag_.resize_({indices.size(0)}); } else { auto offset2bag_arg = TensorArg(offset2bag, "offset2bag", 1); checkScalarTypes("embedding_bag", offset2bag_arg, {kLong, kInt}); @@ -1081,7 +1381,7 @@ Tensor _embedding_bag_dense_backward_cpu(const Tensor &grad_, const Tensor &indi // for more details. auto grad = grad_.contiguous(); auto grad_arg = TensorArg(grad, "grad_", 1); - checkScalarTypes("embedding_bag", grad_arg, {kFloat, kDouble}); + checkScalarTypes("embedding_bag", grad_arg, {kHalf, kFloat, kDouble}); if (mode == MODE_MAX) { return _embedding_bag_dense_backward_cpu_max( @@ -1092,12 +1392,24 @@ Tensor _embedding_bag_dense_backward_cpu(const Tensor &grad_, const Tensor &indi auto index_grad_weight = at::zeros({num_weights, grad.sizes()[1]}, grad.options()); - AT_DISPATCH_FLOATING_TYPES(grad.scalar_type(), "embedding_bag_backward", [&] { - _embedding_bag_dense_backward_cpu_sum_mean( - grad, indices_, offset2bag__, bag_size_, num_weights, - scale_grad_by_freq, mode, per_sample_weights_, index_grad_weight, - padding_idx); - }); + AT_DISPATCH_FLOATING_TYPES_AND2( + at::ScalarType::Half, + at::ScalarType::BFloat16, + grad.scalar_type(), + "embedding_bag_backward", + [&] { + _embedding_bag_dense_backward_cpu_sum_mean( + grad, + indices_, + offset2bag__, + bag_size_, + num_weights, + scale_grad_by_freq, + mode, + per_sample_weights_, + index_grad_weight, + padding_idx); + }); return index_grad_weight; } @@ -1120,7 +1432,7 @@ Tensor _embedding_bag_per_sample_weights_backward_cpu_template( Tensor indices, offsets; std::tie(indices, offsets) = promoteIndicesAndOffsets(indices_, offsets_); AT_ASSERT(indices.dim() == 1); - auto num_samples = indices.sizes()[0]; + auto num_samples = indices.size(0); AT_ASSERT(weight.dim() == 2); AT_ASSERT(weight.sizes()[1] == embedding_features); @@ -1134,11 +1446,11 @@ Tensor _embedding_bag_per_sample_weights_backward_cpu_template( Tensor offset2bag_; if (indices.numel() != 0 && offset2bag.numel() == 0) { offset2bag_ = at::zeros( - {indices.sizes()[0] + 1}, offset2bag.options()); // offset2bag = [0 0 0 0 0] + {indices.size(0) + 1}, offset2bag.options()); // offset2bag = [0 0 0 0 0] make_offset2bag(offsets, offset2bag_); - at::native::resize_(offset2bag_, {indices.sizes()[0]}, c10::nullopt); + at::native::resize_(offset2bag_, {indices.size(0)}, c10::nullopt); } else { auto offset2bag_arg = TensorArg(offset2bag, "offset2bag", 1); checkScalarTypes("embedding_bag", offset2bag_arg, {kLong, kInt}); @@ -1194,12 +1506,16 @@ Tensor _embedding_bag_per_sample_weights_backward_cpu( const Tensor& offset2bag, int64_t mode, int64_t padding_idx) { - return AT_DISPATCH_FLOATING_TYPES( - grad.scalar_type(), "_embedding_bag_per_sample_weights_backward_cpu", [&]() { - return _embedding_bag_per_sample_weights_backward_cpu_template( - grad, weight, 
indices, offsets, offset2bag, mode, padding_idx); - } - ); + return AT_DISPATCH_FLOATING_TYPES_AND2( + at::ScalarType::Half, + at::ScalarType::BFloat16, + grad.scalar_type(), + "_embedding_bag_per_sample_weights_backward_cpu", + [&]() { + return _embedding_bag_per_sample_weights_backward_cpu_template< + scalar_t>( + grad, weight, indices, offsets, offset2bag, mode, padding_idx); + }); } Tensor _embedding_bag_sparse_backward( @@ -1229,6 +1545,5 @@ Tensor _embedding_bag_sparse_backward( return native::embedding_backward(index_grad, indices, num_weights, padding_idx, scale_grad_by_freq, true); } - } } // namespace at::native diff --git a/aten/src/ATen/native/ForeachUtils.h b/aten/src/ATen/native/ForeachUtils.h index 8855fd313a5623..033052f401f6bd 100644 --- a/aten/src/ATen/native/ForeachUtils.h +++ b/aten/src/ATen/native/ForeachUtils.h @@ -126,19 +126,11 @@ bool check_fast_path_restrictions( bool can_use_fast_route(ArrayRef tensorLists, ArrayRef scalarList = {}, bool does_op_promote_integer_inputs_to_float = false) { -#if defined(USE_ROCM) - return false; -#else return check_fast_path_restrictions(tensorLists, scalarList, does_op_promote_integer_inputs_to_float); -#endif } bool can_use_fast_route(TensorList tensors1, TensorList tensors2, bool does_op_promote_integer_inputs_to_float = false) { -#if defined(USE_ROCM) - return false; -#else return can_use_fast_route({tensors1, tensors2}, {}, does_op_promote_integer_inputs_to_float); -#endif } } diff --git a/aten/src/ATen/native/GridSampler.cpp b/aten/src/ATen/native/GridSampler.cpp index 54002dbd8f8fec..8b044061022609 100644 --- a/aten/src/ATen/native/GridSampler.cpp +++ b/aten/src/ATen/native/GridSampler.cpp @@ -1,4 +1,5 @@ #include +#include #include #include #include @@ -23,6 +24,12 @@ namespace { GridSamplerInterpolation interpolation_mode, GridSamplerPadding padding_mode, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_3d( + input, grid, static_cast(interpolation_mode)); + int64_t N = input.size(0); int64_t C = input.size(1); int64_t inp_D = input.size(2); @@ -179,6 +186,12 @@ namespace { GridSamplerInterpolation interpolation_mode, GridSamplerPadding padding_mode, bool align_corners, std::array output_mask) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_3d( + input, grid, static_cast(interpolation_mode)); + auto input_requires_grad = output_mask[0]; Tensor grad_input = ([&]() { if (input_requires_grad) { @@ -411,6 +424,11 @@ Tensor _grid_sampler_2d_cpu_quantized( int64_t interpolation_mode_, int64_t padding_mode_, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + auto interpolation_mode = static_cast(interpolation_mode_); /* Bilinear interpolation is supported using the fact that we can perform @@ -515,6 +533,11 @@ Tensor _grid_sampler_2d_cpu_fallback(const Tensor& input, const Tensor& grid, int64_t interpolation_mode_, int64_t padding_mode_, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. 
+ check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + auto interpolation_mode = static_cast(interpolation_mode_); auto padding_mode = static_cast(padding_mode_); using scalar_t = float; @@ -663,6 +686,11 @@ _grid_sampler_2d_cpu_fallback_backward(const Tensor& grad_output, int64_t interpolation_mode_, int64_t padding_mode_, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + const auto interpolation_mode = static_cast(interpolation_mode_); const auto padding_mode = static_cast(padding_mode_); using scalar_t = float; @@ -856,10 +884,14 @@ _grid_sampler_2d_cpu_fallback_backward(const Tensor& grad_output, return std::make_tuple(grad_input, grad_grid); } -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. Tensor grid_sampler_2d_cpu(const Tensor& input, const Tensor& grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + if (input.scalar_type() == kQUInt8) { return native::_grid_sampler_2d_cpu_quantized( input, grid, interpolation_mode, padding_mode, align_corners); @@ -896,10 +928,14 @@ Tensor grid_sampler_2d_cpu(const Tensor& input, const Tensor& grid, DEFINE_DISPATCH(grid_sampler_2d_cpu_kernel); -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. Tensor grid_sampler_3d_cpu(const Tensor& input, const Tensor& grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_3d(input, grid, interpolation_mode); + return AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "grid_sampler3d_cpu", [&] { return grid_sampler_3d_cpu_impl( input, grid, static_cast(interpolation_mode), @@ -907,11 +943,14 @@ Tensor grid_sampler_3d_cpu(const Tensor& input, const Tensor& grid, }); } -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. std::tuple grid_sampler_2d_backward_cpu(const Tensor& grad_output, const Tensor& input, const Tensor& grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners, std::array output_mask) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); // AVX gather instructions use signed 32-bit offsets to gather float values. // Check for possible overflow and fallback to scalar implementation @@ -953,11 +992,14 @@ grid_sampler_2d_backward_cpu(const Tensor& grad_output, const Tensor& input, con DEFINE_DISPATCH(grid_sampler_2d_backward_cpu_kernel); -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. std::tuple grid_sampler_3d_backward_cpu(const Tensor& grad_output, const Tensor& input, const Tensor& grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners, std::array output_mask) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. 
+ check_grid_sampler_common(input, grid); + check_grid_sampler_3d(input, grid, interpolation_mode); return AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "grid_sampler_3d_backward_cpu", [&] { return grid_sampler_3d_backward_cpu_impl( @@ -968,62 +1010,29 @@ grid_sampler_3d_backward_cpu(const Tensor& grad_output, const Tensor& input, con }); } -Tensor grid_sampler(const Tensor& input, const Tensor& grid, - int64_t interpolation_mode, int64_t padding_mode, - bool align_corners) { - TORCH_CHECK( - input.defined() && grid.defined(), - "grid_sampler(): expected input and grid to not be undefined, but input " - "is ", input, " and grid is ", grid); - auto input_opt = input.options(); - auto grid_opt = grid.options(); - TORCH_CHECK( - input_opt.device() == grid_opt.device(), - "grid_sampler(): expected input and grid to be on same device, but input " - "is on ", input_opt.device(), " and grid is on ", grid_opt.device()); - TORCH_CHECK( - input_opt.layout() == kStrided && grid_opt.layout() == kStrided, - "grid_sampler(): expected input and grid to have torch.strided layout, but " - "input has ", input_opt.layout(), " and grid has ", grid_opt.layout()); - TORCH_CHECK( - (input.dim() == 4 || input.dim() == 5) && input.dim() == grid.dim(), - "grid_sampler(): expected 4D or 5D input and grid with same number of " - "dimensions, but got input with sizes ", input.sizes(), - " and grid with sizes ", grid.sizes()); - TORCH_CHECK( - input.size(0) == grid.size(0), - "grid_sampler(): expected grid and input to have same batch size, but got " - "input with sizes ", input.sizes(), " and grid with sizes ", grid.sizes()); - TORCH_CHECK( - grid.size(-1) == input.dim() - 2, - "grid_sampler(): expected grid to have size ", input.dim() - 2, " in last " - "dimension, but got grid with sizes ", grid.sizes()); - TORCH_CHECK( - !(input.dim() == 5 && static_cast(interpolation_mode) == GridSamplerInterpolation::Bicubic), - "grid_sampler(): bicubic interpolation only supports 4D input" - ); - for (const auto i : c10::irange(2, input.dim())) { - TORCH_CHECK(input.size(i) > 0, - "grid_sampler(): expected input to have non-empty spatial dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " - "empty"); - } - // cudnn does not support inputs larger than 1024 - if (at::native::cudnn_is_acceptable(input) && - at::native::cudnn_is_acceptable(grid) && - at::native::canUse32BitIndexMath(input) && - at::native::canUse32BitIndexMath(grid) && - static_cast(interpolation_mode) == GridSamplerInterpolation::Bilinear && - static_cast(padding_mode) == GridSamplerPadding::Zeros && - align_corners && - input.dim() == 4 && - input.size(1) <= 1024) { +// See NOTE [ grid_sampler Native Functions ]. 
+Tensor grid_sampler( + const Tensor& input, + const Tensor& grid, + int64_t interpolation_mode, + int64_t padding_mode, + bool align_corners +) { + if (cond_cudnn_grid_sampler(input, grid) && + static_cast(interpolation_mode) == + GridSamplerInterpolation::Bilinear && + static_cast(padding_mode) == + GridSamplerPadding::Zeros && + align_corners) { return cudnn_grid_sampler(input, grid); } + if (input.dim() == 4) { - return at::grid_sampler_2d(input, grid, interpolation_mode, padding_mode, align_corners); + return at::grid_sampler_2d( + input, grid, interpolation_mode, padding_mode, align_corners); } else { - return at::grid_sampler_3d(input, grid, interpolation_mode, padding_mode, align_corners); + return at::grid_sampler_3d( + input, grid, interpolation_mode, padding_mode, align_corners); } } diff --git a/aten/src/ATen/native/GridSampler.h b/aten/src/ATen/native/GridSampler.h index 412465937aa015..f4a735032430a1 100644 --- a/aten/src/ATen/native/GridSampler.h +++ b/aten/src/ATen/native/GridSampler.h @@ -5,14 +5,9 @@ #include #include -namespace at { namespace native { - -namespace detail { +#include - enum class GridSamplerInterpolation {Bilinear, Nearest, Bicubic}; - enum class GridSamplerPadding {Zeros, Border, Reflection}; - -} // namespace detail +namespace at { namespace native { using detail::GridSamplerInterpolation; using detail::GridSamplerPadding; diff --git a/aten/src/ATen/native/GridSamplerUtils.h b/aten/src/ATen/native/GridSamplerUtils.h new file mode 100644 index 00000000000000..0b6f29de8c4273 --- /dev/null +++ b/aten/src/ATen/native/GridSamplerUtils.h @@ -0,0 +1,109 @@ +#pragma once + +// See NOTE: [Tensor vs. TensorBase] +// https://github.com/pytorch/pytorch/pull/66979 +#include +#include +#include + +namespace at { namespace native { + +namespace detail { + +enum class GridSamplerInterpolation {Bilinear, Nearest, Bicubic}; +enum class GridSamplerPadding {Zeros, Border, Reflection}; + +} // namespace detail + +using detail::GridSamplerInterpolation; +using detail::GridSamplerPadding; + +namespace { + +// See NOTE [ grid_sampler Native Functions ]. 
+void check_grid_sampler_common( + const TensorBase& input, + const TensorBase& grid +) { + auto input_opt = input.options(); + auto grid_opt = grid.options(); + + TORCH_CHECK( + input.defined(), + "grid_sampler(): expected input to not be undefined"); + TORCH_CHECK( + grid.defined(), + "grid_sampler(): expected grid to not be undefined"); + TORCH_CHECK( + input_opt.device() == grid_opt.device(), + "grid_sampler(): expected input and grid to be on same device, but input " + "is on ", input_opt.device(), " and grid is on ", grid_opt.device()); + TORCH_CHECK( + input_opt.layout() == kStrided && grid_opt.layout() == kStrided, + "grid_sampler(): expected input and grid to have torch.strided layout, but " + "input has ", input_opt.layout(), " and grid has ", grid_opt.layout()); + TORCH_CHECK( + input.size(0) == grid.size(0), + "grid_sampler(): expected grid and input to have same batch size, but got " + "input with sizes ", input.sizes(), " and grid with sizes ", grid.sizes()); + TORCH_CHECK( + grid.size(-1) == input.dim() - 2, + "grid_sampler(): expected grid to have size ", input.dim() - 2, " in last " + "dimension, but got grid with sizes ", grid.sizes()); + + for (const auto i : c10::irange(2, input.dim())) { + TORCH_CHECK(input.size(i) > 0, + "grid_sampler(): expected input to have non-empty spatial dimensions, " + "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "empty"); + } +} + +// See NOTE [ grid_sampler Native Functions ]. +void check_grid_sampler_2d( + const TensorBase& input, + const TensorBase& grid +) { + TORCH_CHECK( + input.dim() == 4 && input.dim() == grid.dim(), + "grid_sampler(): expected 4D input and grid with same number of " + "dimensions, but got input with sizes ", input.sizes(), + " and grid with sizes ", grid.sizes()); +} + +// See NOTE [ grid_sampler Native Functions ]. +void check_grid_sampler_3d( + const TensorBase& input, + const TensorBase& grid, + int64_t interpolation_mode +) { + TORCH_CHECK( + input.dim() == 5 && input.dim() == grid.dim(), + "grid_sampler(): expected 5D input and grid with same number of " + "dimensions, but got input with sizes ", input.sizes(), + " and grid with sizes ", grid.sizes()); + TORCH_CHECK( + !(input.dim() == 5 && + static_cast(interpolation_mode) == + GridSamplerInterpolation::Bicubic), + "grid_sampler(): bicubic interpolation only supports 4D input"); +} + +// See NOTE [ grid_sampler Native Functions ]. +// cudnn does not support inputs larger than 1024. 
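
The helpers above split the old monolithic validation into a common check (matching devices, strided layout, equal batch sizes, grid's last dimension equal to the number of spatial dimensions, non-empty spatial extents) plus a rank check specific to the 2-D and 3-D kernels. A shape-only sketch of which input/grid sizes pass, using plain size vectors instead of tensors:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Shape-only version of the grid_sampler checks: input is NCHW (4-D) or
// NCDHW (5-D), grid is N x (spatial dims...) x K where K = input.dim() - 2.
bool grid_shapes_ok(const std::vector<int64_t>& input,
                    const std::vector<int64_t>& grid) {
  if (input.size() != 4 && input.size() != 5) return false;
  if (grid.size() != input.size()) return false;
  if (input[0] != grid[0]) return false;                         // same batch size
  if (grid.back() != static_cast<int64_t>(input.size()) - 2) return false;
  for (size_t i = 2; i < input.size(); ++i)
    if (input[i] <= 0) return false;                             // non-empty spatial dims
  return true;
}

int main() {
  // 2-D case: input [N, C, H, W], grid [N, H_out, W_out, 2]
  assert(grid_shapes_ok({8, 3, 32, 32}, {8, 16, 16, 2}));
  // 3-D case: input [N, C, D, H, W], grid [N, D_out, H_out, W_out, 3]
  assert(grid_shapes_ok({8, 3, 8, 32, 32}, {8, 4, 16, 16, 3}));
  // Mismatched batch size fails the common check.
  assert(!grid_shapes_ok({8, 3, 32, 32}, {4, 16, 16, 2}));
  return 0;
}
```
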
+bool cond_cudnn_grid_sampler( + const TensorBase& input, + const TensorBase& grid +) { + return ( + at::native::cudnn_is_acceptable(input) && + at::native::cudnn_is_acceptable(grid) && + at::native::canUse32BitIndexMath(input) && + at::native::canUse32BitIndexMath(grid) && + input.dim() == 4 && + input.size(1) <= 1024); +} + +} // anonymous namespace + +}} // namespace at::native diff --git a/aten/src/ATen/native/Histogram.cpp b/aten/src/ATen/native/Histogram.cpp index abd1ae32ded110..c3a007f2c2dcba 100644 --- a/aten/src/ATen/native/Histogram.cpp +++ b/aten/src/ATen/native/Histogram.cpp @@ -407,4 +407,28 @@ Tensor histogram_histc_cpu(const Tensor& self, int64_t bin_ct, return histogram_histc_cpu_out(self, bin_ct, min, max, hist); } +std::tuple> histogramdd( + const Tensor &self, TensorList bins, c10::optional> /*range*/, + const c10::optional &weight, bool density) { + auto hist = at::_histogramdd_from_bin_tensors(self, bins, weight, density); + return std::tuple>{ + std::move(hist), bins.vec()}; +} + +std::tuple> histogramdd( + const Tensor &self, IntArrayRef bins, c10::optional> range, + const c10::optional &weight, bool density) { + auto bin_edges = at::_histogramdd_bin_edges(self, bins, range, weight, density); + auto hist = at::_histogramdd_from_bin_cts(self, bins, range, weight, density); + return std::tuple>{ + std::move(hist), std::move(bin_edges)}; +} + +std::tuple> histogramdd( + const Tensor &self, int64_t bins, c10::optional> range, + const c10::optional &weight, bool density) { + DimVector bins_v(self.size(-1), bins); + return at::native::histogramdd(self, bins_v, range, weight, density); +} + }} // namespace at::native diff --git a/aten/src/ATen/native/Itertools.cpp b/aten/src/ATen/native/Itertools.cpp index d1117b8c1d4d56..bd5fa0fa359549 100644 --- a/aten/src/ATen/native/Itertools.cpp +++ b/aten/src/ATen/native/Itertools.cpp @@ -46,7 +46,10 @@ Tensor cartesian_prod(TensorList tensors) { Tensor combinations(const Tensor& self, int64_t r, bool with_replacement) { TORCH_CHECK(self.dim() == 1, "Expect a 1D vector, but got shape ", self.sizes()); - TORCH_CHECK(r > 0, "Expect a positive number, but got ", r); + TORCH_CHECK(r >= 0, "Expect a non-negative number, but got ", r); + if (r == 0) { + return at::empty({0}, self.options()); + } int64_t num_elements = self.numel(); std::vector grids = at::meshgrid(std::vector(r, self)); Tensor mask = _triu_mask(num_elements, r, with_replacement, self.options()); diff --git a/aten/src/ATen/native/Linear.cpp b/aten/src/ATen/native/Linear.cpp index 3a4a8e1fd7f2d3..847a2dab5e838f 100644 --- a/aten/src/ATen/native/Linear.cpp +++ b/aten/src/ATen/native/Linear.cpp @@ -34,6 +34,12 @@ Tensor linear(const Tensor& input, const Tensor& weight, const c10::optionaldefined() && input.is_contiguous()) { + // Also hit the fused path for contiguous 3D input. 
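
The scalar-bins histogramdd overload above simply repeats the bin count once per sample dimension (self.size(-1) entries) and forwards to the per-dimension overload. A tiny sketch of that expansion (expand_bins is a hypothetical name):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// histogramdd(self, /*bins=*/10, ...) over D-dimensional samples behaves like
// histogramdd with a list of D identical bin counts; this mirrors
// DimVector bins_v(self.size(-1), bins) above.
std::vector<int64_t> expand_bins(int64_t bins, int64_t num_dims) {
  return std::vector<int64_t>(num_dims, bins);
}

int main() {
  // Samples of shape [N, 3] -> three per-dimension bin counts.
  for (int64_t b : expand_bins(/*bins=*/10, /*num_dims=*/3)) std::cout << b << " ";
  std::cout << "\n";  // prints: 10 10 10
  return 0;
}
```
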
+ const auto input_sizes = input.sizes(); + const auto result = at::addmm(*bias, input.view({input_sizes[0] * input_sizes[1], input_sizes[2]}), weight.t()); + return result.view({input_sizes[0], input_sizes[1], result.size(1)}); + } auto output = at::matmul(input, weight.t()); if (bias->defined()) { output.add_(*bias); diff --git a/aten/src/ATen/native/LinearAlgebra.cpp b/aten/src/ATen/native/LinearAlgebra.cpp index 926dfc04759e9b..e7a67822068a7f 100644 --- a/aten/src/ATen/native/LinearAlgebra.cpp +++ b/aten/src/ATen/native/LinearAlgebra.cpp @@ -29,15 +29,23 @@ namespace at { namespace meta { -TORCH_META_FUNC(addmm)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha) { - TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); - TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); - TORCH_CHECK( - mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (", - mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); - auto names = at::namedinference::propagate_names_for_addmm(mat1, mat2, self); +#define ADDMM_META() \ + TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); \ + TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); \ + TORCH_CHECK( \ + mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (", \ + mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); \ + \ + auto names = at::namedinference::propagate_names_for_addmm(mat1, mat2, self); \ set_output(0, {mat1.sizes()[0], mat2.sizes()[1]}, {}, self.options(), names); + +TORCH_META_FUNC(addmm)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha) { + ADDMM_META(); +} + +TORCH_META_FUNC(_addmm_activation)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, bool use_gelu) { + ADDMM_META(); } TORCH_META_FUNC(mm)(const Tensor & self, const Tensor & mat2) { @@ -1126,6 +1134,19 @@ static void addmm_impl_cpu_( return; } + // Some paths in the code below do not handle multiplications of the form [a, 0] x [0, b] + if (m1_sizes[1] == 0) { + if (beta.toComplexDouble() == 0.0) { + result.zero_(); + } else { + if (!self.is_same(result)) { + result.copy_(self); + } + result.mul_(beta); + } + return; + } + if (beta.toComplexDouble() != 0.0 && !self.is_same(result)) { result.copy_(self); } @@ -1290,6 +1311,19 @@ TORCH_IMPL_FUNC(addmm_out_cpu)(const Tensor& self, const Tensor& mat1, const Ten } } +TORCH_IMPL_FUNC(addmm_activation_out_cpu)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, bool use_gelu, const Tensor &result) { + auto b_self = expand_size(self, {mat1.sizes()[0], mat2.sizes()[1]}, "addmm_out"); + { + at::NoNamesGuard guard; + addmm_impl_cpu_(const_cast(result), *b_self, mat1, mat2, beta, alpha); + if (use_gelu) { + at::gelu_(const_cast(result)); + } else { + at::relu_(const_cast(result)); + } + } +} + TORCH_IMPL_FUNC(mm_out_cpu)(const Tensor & self, const Tensor & mat2, const Tensor & result) { { at::NoNamesGuard guard; @@ -2399,7 +2433,7 @@ static std::vector make_dim_list(int64_t ndim) { } // Checks for valid arguments to linalg_norm when type(ord) == str -static void check_str_ord_valid(const c10::string_view str_ord, optional opt_dim, int64_t ndim) { +static void check_str_ord_valid(const 
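
The new branch in linear() above folds a contiguous [B, T, C] input into [B*T, C], runs a single addmm (one GEMM with the bias fused in), and views the result back as [B, T, out_features], instead of matmul followed by a separate add_. A standalone sketch showing why the flattened GEMM computes the same thing for contiguous row-major data:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// linear(): output[b, t, o] = bias[o] + sum_c input[b, t, c] * weight[o, c].
// For contiguous input, viewing [B, T, C] as [B*T, C] reinterprets the same
// memory, so one matrix multiply covers every (b, t) row at once.
std::vector<float> linear_3d(const std::vector<float>& input, int64_t B, int64_t T,
                             int64_t C, const std::vector<float>& weight,  // [O, C]
                             const std::vector<float>& bias, int64_t O) {
  const int64_t rows = B * T;                 // the "view" step
  std::vector<float> out(rows * O);
  for (int64_t r = 0; r < rows; ++r) {        // one GEMM over all B*T rows
    for (int64_t o = 0; o < O; ++o) {
      float acc = bias[o];
      for (int64_t c = 0; c < C; ++c) acc += input[r * C + c] * weight[o * C + c];
      out[r * O + o] = acc;
    }
  }
  return out;  // logically reshaped back to [B, T, O]
}

int main() {
  // B=2, T=2, C=2, O=1: weight = [1, 1], bias = [0.5]
  std::vector<float> input{1, 2, 3, 4, 5, 6, 7, 8};
  auto out = linear_3d(input, 2, 2, 2, {1, 1}, {0.5f}, 1);
  assert(out == (std::vector<float>{3.5f, 7.5f, 11.5f, 15.5f}));
  return 0;
}
```
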
c10::string_view str_ord, OptionalIntArrayRef opt_dim, int64_t ndim) { TORCH_CHECK((str_ord == "nuc") || (str_ord == "fro"), "Invalid norm order: ", str_ord); bool dims_valid = (ndim == 2 && !opt_dim.has_value()) || (opt_dim.has_value() && opt_dim.value().size() == 2); TORCH_CHECK(dims_valid, "order \"", str_ord, @@ -2481,7 +2515,7 @@ static Tensor& _linalg_norm_matrix_out(Tensor& result, const Tensor &self, const return result; } -static Tensor& linalg_norm_out_impl(Tensor& result, const Tensor& self, const optional& opt_num_ord, optional opt_str_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +static Tensor& linalg_norm_out_impl(Tensor& result, const Tensor& self, const optional& opt_num_ord, optional opt_str_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { // Callers must give the ord argument as either a number, a string, or neither. // Since the user-facing API has no direct control over how this function is called, this is an internal assert. TORCH_INTERNAL_ASSERT(!(opt_num_ord.has_value() && opt_str_ord.has_value())); @@ -2525,7 +2559,7 @@ static Tensor& linalg_norm_out_impl(Tensor& result, const Tensor& self, const op return result; } -static Tensor& linalg_vector_norm_impl(const Tensor& self, const Scalar& scalar_ord, optional opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { +static Tensor& linalg_vector_norm_impl(const Tensor& self, const Scalar& scalar_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { // Casting a large integer to a double will introduce some error, but for // practical purposes, it won't matter since a large order will usually // give an infinite result @@ -2601,13 +2635,13 @@ static Tensor& linalg_vector_norm_impl(const Tensor& self, const Scalar& scalar_ return result; } -Tensor linalg_vector_norm(const Tensor& self, const Scalar& ord, optional opt_dim, bool keepdim, optional opt_dtype) { +Tensor linalg_vector_norm(const Tensor& self, const Scalar& ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { ScalarType out_dtype = opt_dtype.value_or(toRealValueType(self.scalar_type())); Tensor result = create_reduction_result(self, opt_dim.value_or(IntArrayRef{}), keepdim, out_dtype); return at::native::linalg_vector_norm_impl(self, ord, opt_dim, keepdim, opt_dtype, result); } -Tensor& linalg_vector_norm_out(const Tensor& self, const Scalar& ord, optional opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { +Tensor& linalg_vector_norm_out(const Tensor& self, const Scalar& ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { return at::native::linalg_vector_norm_impl(self, ord, opt_dim, keepdim, opt_dtype, result); } @@ -2676,7 +2710,7 @@ Tensor& linalg_matrix_norm_out( } // Numerical or None norms -Tensor linalg_norm(const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +Tensor linalg_norm(const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { auto options = TensorOptions().dtype(opt_dtype.has_value() ? 
opt_dtype.value() : toRealValueType(self.scalar_type())).device(self.device()); Tensor result = at::empty({0}, options); return at::native::linalg_norm_out( @@ -2684,7 +2718,7 @@ Tensor linalg_norm(const Tensor& self, const optional& opt_ord, optional } // Frobenius and nuclear norms -Tensor linalg_norm(const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype) { +Tensor linalg_norm(const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { auto options = TensorOptions().dtype(opt_dtype.has_value() ? opt_dtype.value() : toRealValueType(self.scalar_type())).device(self.device()); Tensor result = at::empty({0}, options); return at::native::linalg_norm_out( @@ -2692,12 +2726,12 @@ Tensor linalg_norm(const Tensor& self, c10::string_view ord, optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { +Tensor& linalg_norm_out(const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { return linalg_norm_out_impl(result, self, opt_ord, c10::nullopt, opt_dim, keepdim, opt_dtype); } // Frobenius and nuclear norms -Tensor& linalg_norm_out(const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { +Tensor& linalg_norm_out(const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype, Tensor& result) { return linalg_norm_out_impl(result, self, c10::nullopt, ord, opt_dim, keepdim, opt_dtype); } @@ -2876,7 +2910,7 @@ Tensor& linalg_tensorinv_out(const Tensor& self, int64_t ind, Tensor& result) { return result; } -Tensor linalg_tensorsolve(const Tensor& self, const Tensor& other, optional dims) { +Tensor linalg_tensorsolve(const Tensor& self, const Tensor& other, OptionalIntArrayRef dims) { /* The idea is to reduce the problem to 2D matrix solve. Step 1. (optional) `self` is permuted with `dims` such that dimensions from `dims` are moved to the right. @@ -2914,7 +2948,7 @@ Tensor linalg_tensorsolve(const Tensor& self, const Tensor& other, optional dims, Tensor& result) { +Tensor& linalg_tensorsolve_out(const Tensor& self, const Tensor& other, OptionalIntArrayRef dims, Tensor& result) { checkSameDevice("tensorsolve", result, self); checkLinalgCompatibleDtype("tensorsolve", result, self); diff --git a/aten/src/ATen/native/LinearAlgebraUtils.h b/aten/src/ATen/native/LinearAlgebraUtils.h index 2448c8db730cff..2c3dfbbf4f6ba2 100644 --- a/aten/src/ATen/native/LinearAlgebraUtils.h +++ b/aten/src/ATen/native/LinearAlgebraUtils.h @@ -114,7 +114,7 @@ static inline c10::MaybeOwned borrow_else_clone(const bool cond, const T * broadcasted shape. */ static inline Tensor copyBatchedColumnMajor(const Tensor& src, int64_t nrows = -1, - c10::optional desired_batch_sizes = c10::nullopt) { + at::OptionalIntArrayRef desired_batch_sizes = c10::nullopt) { nrows = (nrows == -1) ? src.size(-2) : nrows; auto copy_sizes = desired_batch_sizes.has_value() ? desired_batch_sizes.value().vec() @@ -606,6 +606,41 @@ static inline bool linalg_solve_is_vector_rhs(const Tensor& input, const Tensor& return vector_case; } +/* + Computes linear indices for a tensor with original_shape to access its elements like it was a materialized broadcast tensor. 
+*/ +static inline Tensor get_linear_indices(int64_t numel, IntArrayRef original_shape, IntArrayRef broadcast_shape) { + TensorOptions options = at::TensorOptions().dtype(at::kLong).device(at::kCPU); + return at::arange(numel, options).view(original_shape).broadcast_to(broadcast_shape).contiguous(); +} + +class BroadcastLinearIndices { + private: + Tensor linear_indices_; + bool is_broadcasting_; + + public: + BroadcastLinearIndices( + int64_t numel, + IntArrayRef original_shape, + IntArrayRef broadcast_shape) { + // The assumption is that the broadcast_shape is a materialized broadcast + // shape of the original_shape. We need to compute the linear indices + // compatible with the original_shape to access the elements in the original + // tensor corresponding to the broadcast tensor. + is_broadcasting_ = !original_shape.equals(broadcast_shape); + if (is_broadcasting_) { + linear_indices_ = + get_linear_indices(numel, original_shape, broadcast_shape); + } + } + int64_t operator()(int64_t broadcast_linear_index) { + return is_broadcasting_ + ? linear_indices_.data_ptr()[broadcast_linear_index] + : broadcast_linear_index; + } +}; + static inline bool is_blas_compatible_column_major_order(const Tensor& input) { IntArrayRef input_strides = input.strides(); IntArrayRef input_sizes = input.sizes(); diff --git a/aten/src/ATen/native/LossNLL.cpp b/aten/src/ATen/native/LossNLL.cpp index ed733411ff5376..6a04992e53e6b2 100644 --- a/aten/src/ATen/native/LossNLL.cpp +++ b/aten/src/ATen/native/LossNLL.cpp @@ -491,7 +491,11 @@ Tensor cross_entropy_loss_prob_target( switch (reduction) { case Reduction::Mean: - return -(input * target * weight_).sum() / (input.numel() / input.size(1)); + if (input.numel()==0){ + return -(input * target * weight_).sum().fill_(std::numeric_limits::quiet_NaN()); + } else { + return -(input * target * weight_).sum() / (input.numel() / input.size(1)); + } case Reduction::Sum: return -(input * target * weight_).sum(); case Reduction::None: @@ -502,7 +506,11 @@ Tensor cross_entropy_loss_prob_target( } else { switch (reduction) { case Reduction::Mean: - return -(input * target).sum() / (input.numel() / input.size(1)); + if (input.numel()==0){ + return -(input * target).sum().fill_(std::numeric_limits::quiet_NaN()); + } else { + return -(input * target).sum() / (input.numel()/ input.size(1)); + } case Reduction::Sum: return -(input * target).sum(); case Reduction::None: diff --git a/aten/src/ATen/native/Math.h b/aten/src/ATen/native/Math.h index 09255e065879fb..ee10d00f9b5cd3 100644 --- a/aten/src/ATen/native/Math.h +++ b/aten/src/ATen/native/Math.h @@ -12,6 +12,7 @@ #include #include #include +#include C10_CLANG_DIAGNOSTIC_PUSH() #if C10_CLANG_HAS_WARNING("-Wimplicit-float-conversion") @@ -67,6 +68,83 @@ Output was modified to be inf or -inf when input is 1 or -1. */ POSSIBILITY OF SUCH DAMAGE. */ +namespace { +/* + * This function is derived from the implementation of the i0e function in the + * Cephes Math Library. See note [3-Clause BSD License for the Cephes Math + * Library]. + * + * Computes an approximation of the exponentially scaled zeroth order modified + * Bessel function of the first kind. The approximation is actually two + * (sub)approximations, both using a Chebyshev polynomial expansion. One + * approximates the function over [0, 8], and the other over (8, infinity). This + * function takes the absolute value of all inputs to convert them into the + * domain of the approximation. 
+ */ +jiterator_also_stringify_as(jiterator_code( + template + JITERATOR_HOST_DEVICE T chbevl(T x, const T array[], const int len) { + T b0, b1, b2; + + b0 = array[0]; + b1 = 0; + + for (int i = 1; i < len; ++i) { + b2 = b1; + b1 = b0; + b0 = x * b1 - b2 + array[i]; + } + + return T{0.5} * (b0 - b2); + } + + template + JITERATOR_HOST_DEVICE T calc_i0e(T _x) { + T x = fabs(_x); + + if (x <= T{8.0}) { + static const T coefficients[] = { + -4.41534164647933937950E-18, 3.33079451882223809783E-17, + -2.43127984654795469359E-16, 1.71539128555513303061E-15, + -1.16853328779934516808E-14, 7.67618549860493561688E-14, + -4.85644678311192946090E-13, 2.95505266312963983461E-12, + -1.72682629144155570723E-11, 9.67580903537323691224E-11, + -5.18979560163526290666E-10, 2.65982372468238665035E-9, + -1.30002500998624804212E-8, 6.04699502254191894932E-8, + -2.67079385394061173391E-7, 1.11738753912010371815E-6, + -4.41673835845875056359E-6, 1.64484480707288970893E-5, + -5.75419501008210370398E-5, 1.88502885095841655729E-4, + -5.76375574538582365885E-4, 1.63947561694133579842E-3, + -4.32430999505057594430E-3, 1.05464603945949983183E-2, + -2.37374148058994688156E-2, 4.93052842396707084878E-2, + -9.49010970480476444210E-2, 1.71620901522208775349E-1, + -3.04682672343198398683E-1, 6.76795274409476084995E-1}; + + T y = (x / T{2.0}) - T{2.0}; + return chbevl(y, coefficients, int{30}); + } + + // x > 8 + static const T coefficients[] = { + -7.23318048787475395456E-18, -4.83050448594418207126E-18, + 4.46562142029675999901E-17, 3.46122286769746109310E-17, + -2.82762398051658348494E-16, -3.42548561967721913462E-16, + 1.77256013305652638360E-15, 3.81168066935262242075E-15, + -9.55484669882830764870E-15, -4.15056934728722208663E-14, + 1.54008621752140982691E-14, 3.85277838274214270114E-13, + 7.18012445138366623367E-13, -1.79417853150680611778E-12, + -1.32158118404477131188E-11, -3.14991652796324136454E-11, + 1.18891471078464383424E-11, 4.94060238822496958910E-10, + 3.39623202570838634515E-9, 2.26666899049817806459E-8, + 2.04891858946906374183E-7, 2.89137052083475648297E-6, + 6.88975834691682398426E-5, 3.36911647825569408990E-3, + 8.04490411014108831608E-1}; + + return chbevl(T{32.0} / x - T{2.0}, coefficients, int{25}) / sqrt(x); + }), + i0e_string); // i0e_string +} + #define CENTRAL_RANGE 0.7 template @@ -1385,37 +1463,6 @@ calc_i0(T _x) { // Upcast bfloat16 input to float for numerical accuracy purposes static inline c10::BFloat16 calc_i0(c10::BFloat16 a) { return calc_i0(static_cast(a)); } -/* - * This function is derived from the implementation of the i0e function in the Cephes Math Library. - * See note [3-Clause BSD License for the Cephes Math Library]. - * - * Computes an approximation of the exponentially scaled zeroth order modified Bessel function of the first kind. - * The approximation is actually two (sub)approximations, both using a Chebyshev polynomial expansion. - * One approximates the function over [0, 8], and the other over (8, infinity). This function takes the absolute value - * of all inputs to convert them into the domain of the approximation. 
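For reference, a plain-C++ sketch of the Clenshaw-style recurrence that the `chbevl` helper above (inside the jiterator string) implements. The coefficient table here is made up for illustration; it is not the Cephes i0e table.

```cpp
// Sketch of the Chebyshev series evaluation used by calc_i0e above.
// The recurrence is b0 = x*b1 - b2 + c[i]; the result is 0.5*(b0 - b2).
#include <cstdio>

template <typename T>
T chbevl_sketch(T x, const T coeffs[], int len) {
  T b0 = coeffs[0];
  T b1 = T{0};
  T b2 = T{0};
  for (int i = 1; i < len; ++i) {
    b2 = b1;
    b1 = b0;
    b0 = x * b1 - b2 + coeffs[i];
  }
  return T{0.5} * (b0 - b2);
}

int main() {
  // Hypothetical 3-term coefficient array evaluated at x = 0.25.
  const double coeffs[] = {1.0, 0.5, 0.25};
  std::printf("%f\n", chbevl_sketch(0.25, coeffs, 3));
}
```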
- */ -template -static inline typename std::enable_if::value, T>::type -calc_i0e(T _x) { - T x = std::abs(_x); - - if (x <= T{8.0}) { - auto coeff_pair = chebyshev_coefficients_i0e_A(); - auto A = std::get<0>(coeff_pair); - auto len = std::get<1>(coeff_pair); - T y = (x / T{2.0}) - T{2.0}; - return chbevl(y, A, len); - } - - auto coeff_pair = chebyshev_coefficients_i0e_B(); - auto B = std::get<0>(coeff_pair); - auto len = std::get<1>(coeff_pair); - return chbevl(T{32.0} / x - T{2.0}, B, len) / std::sqrt(x); -} - -// Upcast bfloat16 input to float for numerical accuracy purposes -static inline c10::BFloat16 calc_i0e(c10::BFloat16 a) { return calc_i0e(static_cast(a)); } - /* * This function is derived from the implementation of the i1 function in the Cephes Math Library. * See note [3-Clause BSD License for the Cephes Math Library]. @@ -2113,4 +2160,21 @@ calc_erfcx(T x) } } +/* + * Logarithm of Gaussian cumulative distribution function. + + * This implementation of log_ndtr and its helper functions + * follow SciPy's implementation + * See NOTICE for the licenses. + */ +template +static inline C10_HOST_DEVICE T calc_log_ndtr(T x) { + T t = x * M_SQRT1_2; + if (x < T{-1.0}) { + return std::log(calc_erfcx(-t) / 2) - t * t; + } else { + return std::log1p(-std::erfc(t) / 2); + } +} + C10_CLANG_DIAGNOSTIC_POP() diff --git a/aten/src/ATen/native/Normalization.cpp b/aten/src/ATen/native/Normalization.cpp index 981e568b6b9756..1b6ab5d981f31c 100644 --- a/aten/src/ATen/native/Normalization.cpp +++ b/aten/src/ATen/native/Normalization.cpp @@ -26,7 +26,7 @@ TORCH_META_FUNC(renorm)(const Tensor& self, const Scalar& p, int64_t dim, const TORCH_CHECK(maxnorm.toDouble() >= 0.0, "renorm: expected maxnorm to be >= 0 but got ", maxnorm.toDouble()); const auto ndim = self.dim(); - TORCH_CHECK(ndim > 1, "renorm: input needs at least 2 dimensions, got ", ndim, "dimensions"); + TORCH_CHECK(ndim > 1, "renorm: input needs at least 2 dimensions, got ", ndim, " dimensions"); set_output(self.sizes(), self.options()); } diff --git a/aten/src/ATen/native/PadNd.cpp b/aten/src/ATen/native/PadNd.cpp new file mode 100644 index 00000000000000..bdeb351a80dd04 --- /dev/null +++ b/aten/src/ATen/native/PadNd.cpp @@ -0,0 +1,214 @@ +#include +#include + +#include + +namespace at { namespace native { + +Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value) { + TORCH_CHECK(pad.size() % 2 == 0, "Length of pad must be even but instead it equals ", + pad.size()); + + auto input_sizes = self.sizes(); + auto l_inp = self.dim(); + + auto l_pad = pad.size() / 2; + auto l_diff = l_inp - l_pad; + TORCH_CHECK(l_inp >= (int64_t)l_pad, "Length of pad should be no more than twice the number of " + "dimensions of the input. 
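Stepping back to the `calc_log_ndtr` helper added to Math.h above: a minimal sketch of the identity it evaluates, i.e. log of the standard normal CDF. This illustration uses only `<cmath>` and plain `erfc`, so it shows just the well-conditioned branch; the production code switches to `erfcx` for `x < -1` to avoid cancellation.

```cpp
// Sketch: log(Phi(x)) via log1p(-erfc(x/sqrt(2))/2). Accurate for moderate x only;
// the erfcx-based branch in Math.h handles very negative x.
#include <cmath>
#include <cstdio>

double log_ndtr_sketch(double x) {
  const double t = x * 0.7071067811865476;   // x / sqrt(2), i.e. x * M_SQRT1_2
  return std::log1p(-0.5 * std::erfc(t));    // log(1 - erfc(x/sqrt(2))/2)
}

int main() {
  for (double x : {-3.0, -1.0, 0.0, 1.0, 3.0}) {
    std::printf("log_ndtr(% .1f) = % .6f\n", x, log_ndtr_sketch(x));
  }
}
```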
Pad length is ", pad.size(), "while the input has ", + l_inp, "dimensions."); + + std::vector new_shape; + + bool all_pads_non_positive = true; + + auto c_input = self; + for (const auto i : c10::irange(l_diff, l_inp)) { + auto pad_idx = 2 * (l_inp - i - 1); + if (pad[pad_idx] < 0) { + c_input = c_input.narrow(i, -pad[pad_idx], c_input.size(i) + pad[pad_idx]); + } else if (pad[pad_idx] != 0) { + all_pads_non_positive = false; + } + if (pad[pad_idx + 1] < 0) { + c_input = c_input.narrow(i, 0, c_input.size(i) + pad[pad_idx + 1]); + } else if (pad[pad_idx + 1] != 0) { + all_pads_non_positive = false; + } + } + + // if none of the pads are positive we can optimize and just return the result + // of calling .narrow() on the input + if (all_pads_non_positive) { + return c_input.clone(); + } + + + for (size_t i = 0; i < (size_t)l_diff; i ++) { + new_shape.emplace_back(input_sizes[i]); + } + + for (const auto i : c10::irange((size_t)l_pad)) { + auto pad_idx = pad.size() - ((i + 1) * 2); + auto new_dim = input_sizes[l_diff + i] + pad[pad_idx] + pad[pad_idx + 1]; + TORCH_CHECK(new_dim > 0, "The input size ", input_sizes[l_diff + i], ", plus negative padding ", + pad[pad_idx], " and ", pad[pad_idx + 1], " resulted in a negative output size, " + "which is invalid. Check dimension ", l_diff + i, " of your input."); + new_shape.emplace_back(new_dim); + } + + at::Tensor output; + const auto memory_format = self.suggest_memory_format(); + if (self.is_quantized()) { + const auto qscheme = self.qscheme(); + TORCH_CHECK(qscheme == kPerTensorAffine || qscheme == kPerTensorSymmetric, + "Only per-tensor padding is supported."); + output = at::_empty_affine_quantized( + new_shape, self.options().memory_format(memory_format), + self.q_scale(), self.q_zero_point(), c10::nullopt); + } else { + output = at::empty(new_shape, self.options().memory_format(memory_format)); + } + output.fill_(value); + + auto c_output = output; + for (const auto i : c10::irange(l_diff, l_inp)) { + auto pad_idx = 2 * (l_inp - i - 1); + if (pad[pad_idx] > 0) { + c_output = c_output.narrow(i, pad[pad_idx], c_output.size(i) - pad[pad_idx]); + } + if (pad[pad_idx + 1] > 0) { + c_output = c_output.narrow(i, 0, c_output.size(i) - pad[pad_idx + 1]); + } + } + c_output.copy_(c_input); + return output; +} + +Tensor _pad_circular(const Tensor &self, IntArrayRef padding) { + const auto in_shape = self.sizes(); + const auto ndim = static_cast(in_shape.size()) - 2; + TORCH_CHECK(padding.size() + 4 == in_shape.size() * 2, + "Invalid padding size, expected ", ndim * 2, " but got ", padding.size()); + + DimVector out_shape(in_shape.size()); + out_shape[0] = in_shape[0]; + out_shape[1] = in_shape[1]; + + // Get shape of padded tensor + for (const auto i : c10::irange(ndim)) { + const auto pad_l = padding[2 * (ndim - i - 1) + 0]; + const auto pad_r = padding[2 * (ndim - i - 1) + 1]; + const auto size = in_shape[2 + i]; + out_shape[2 + i] = size + pad_l + pad_r; + + TORCH_CHECK( + pad_l <= size && pad_r <= size, + "Padding value causes wrapping around more than once."); + TORCH_CHECK( + out_shape[2 + i] >= 0, + "Negative padding value is resulting in an empty dimension"); + } + + auto out = self.new_empty(out_shape, self.options()); + + // Put original array into the padded array + Tensor out_slice = out; + Tensor in_slice = self; + constexpr int64_t zero = 0; + for (const auto i : c10::irange(ndim)) { + const auto dim = ndim - i + 1; + const auto pad_l = padding[2*i + 0]; + const auto pad_r = padding[2*i + 1]; + out_slice = out_slice.slice(dim, 
std::max(pad_l, zero), out_shape[dim] - std::max(pad_r, zero)); + in_slice = in_slice.slice(dim, std::max(-pad_l, zero), in_shape[dim] - std::max(-pad_r, zero)); + } + out_slice.copy_(in_slice); + + // The following steps first pad the beginning of the tensor (left side), + // and then pad the end of the tensor (right side). + // Note: Corners will be written more than once when ndim > 1. + // + // Only in cases where padding values are > 0 are when additional copying + // is required. + for (const auto i : c10::irange(ndim)) { + const auto dim = ndim - i + 1; + const auto pad_l = padding[2*i + 0]; + const auto pad_r = padding[2*i + 1]; + + if (pad_l > 0) { + out_slice = out.slice(dim, 0, pad_l); + in_slice = out.slice(dim, + out_shape[dim] - pad_l - std::max(pad_r, zero), + out_shape[dim] - std::max(pad_r, zero)); + out_slice.copy_(in_slice); + } + + if (pad_r > 0) { + out_slice = out.slice(dim, out_shape[dim] - pad_r, out_shape[dim]); + in_slice = out.slice(dim, std::max(pad_l, zero), std::max(pad_l, zero) + pad_r); + out_slice.copy_(in_slice); + } + } + + return out; +} + +Tensor _pad_enum(const Tensor &self, IntArrayRef pad, int64_t mode_int, c10::optional value) { + const auto input_dim = self.dim(); + TORCH_CHECK(pad.size() % 2 == 0, "Padding length must be divisible by 2"); + TORCH_CHECK(static_cast(pad.size()) <= input_dim * 2, "Padding length too large"); + auto mode = static_cast(mode_int); + + if (mode == at::padding_mode::constant) { + return at::constant_pad_nd(self, pad, value.value_or(0.0)); + } + TORCH_CHECK( + !value.has_value(), "Padding mode \"", + padding_mode_string(mode), + "\" doesn't take in value argument"); + + if (pad.size() == 2 && (input_dim == 2 || input_dim == 3)) { + switch (mode) { + case at::padding_mode::reflect: return at::reflection_pad1d(self, pad); + case at::padding_mode::replicate: return at::replication_pad1d(self, pad); + case at::padding_mode::circular: return at::_pad_circular(self, pad); + default: {} + } + } else if(pad.size() == 4 && (input_dim == 3 || input_dim == 4)) { + switch (mode) { + case at::padding_mode::reflect: return at::reflection_pad2d(self, pad); + case at::padding_mode::replicate: return at::replication_pad2d(self, pad); + case at::padding_mode::circular: return at::_pad_circular(self, pad); + default: {} + } + } else if (pad.size() == 6 && (input_dim == 4 || input_dim == 5)) { + switch (mode) { + case at::padding_mode::reflect: return at::reflection_pad3d(self, pad); + case at::padding_mode::replicate: return at::replication_pad3d(self, pad); + case at::padding_mode::circular: return at::_pad_circular(self, pad); + default: {} + } + } + C10_THROW_ERROR(NotImplementedError, + "Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now"); +} + +Tensor pad(const Tensor &self, IntArrayRef pad, c10::string_view mode, c10::optional value) { + const auto mode_enum = [&] { + if (mode == "reflect") { + return at::padding_mode::reflect; + } else if (mode == "constant") { + return at::padding_mode::constant; + } else if (mode == "replicate") { + return at::padding_mode::replicate; + } else if (mode == "circular") { + return at::padding_mode::circular; + } + C10_THROW_ERROR(NotImplementedError, + c10::str("Unrecognised padding mode ", mode)); + }(); + return at::native::_pad_enum(self, pad, static_cast(mode_enum), value); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/PadNd.h b/aten/src/ATen/native/PadNd.h new file mode 100644 index 00000000000000..37f59acb8a4ce0 --- /dev/null +++ 
b/aten/src/ATen/native/PadNd.h @@ -0,0 +1,22 @@ +#pragma once + +namespace at { + +enum class padding_mode { + reflect, + replicate, + circular, + constant, +}; + +static inline c10::string_view padding_mode_string(padding_mode m) { + switch (m) { + case padding_mode::reflect: return "reflect"; + case padding_mode::replicate: return "replicate"; + case padding_mode::circular: return "circular"; + case padding_mode::constant: return "constant"; + } + TORCH_CHECK(false, "Invalid padding mode (", static_cast(m), ")"); +} + +} // namespace at diff --git a/aten/src/ATen/native/QuantizedLinear.cpp b/aten/src/ATen/native/QuantizedLinear.cpp index 88513f34b9fb47..fcd8f6335b581d 100644 --- a/aten/src/ATen/native/QuantizedLinear.cpp +++ b/aten/src/ATen/native/QuantizedLinear.cpp @@ -13,7 +13,7 @@ #include #include #include -#include +#include #include diff --git a/aten/src/ATen/native/RNN.cpp b/aten/src/ATen/native/RNN.cpp index af387e3c43f978..f8db0ba311ad89 100644 --- a/aten/src/ATen/native/RNN.cpp +++ b/aten/src/ATen/native/RNN.cpp @@ -3,7 +3,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/aten/src/ATen/native/ReduceOps.cpp b/aten/src/ATen/native/ReduceOps.cpp index cce0f1a3d3b89d..e5d40fcad40125 100644 --- a/aten/src/ATen/native/ReduceOps.cpp +++ b/aten/src/ATen/native/ReduceOps.cpp @@ -267,17 +267,31 @@ TORCH_META_FUNC(aminmax) } TORCH_META_FUNC(amax) -(const Tensor& self, IntArrayRef dims, bool keepdim) { +(const Tensor& self, IntArrayRef dim, bool keepdim) { auto maybe_result = maybe_get_output(); if (maybe_result.defined()) { TORCH_CHECK(self.scalar_type() == maybe_result.scalar_type(), "Expected the dtype for input and out to match, but got ", self.scalar_type(), " for input's dtype and ", maybe_result.scalar_type(), " for out's dtype."); } if (self.numel() == 0) { - at::native::zero_numel_check_dims(self, dims, "amax()"); + at::native::zero_numel_check_dims(self, dim, "amax()"); } const ScalarType& out_dtype = maybe_result.defined() ? maybe_result.scalar_type() : self.scalar_type(); - resize_reduction(*this, self, dims, keepdim, out_dtype); + resize_reduction(*this, self, dim, keepdim, out_dtype); +} + +TORCH_META_FUNC(amin) +(const Tensor& self, IntArrayRef dim, bool keepdim) { + auto maybe_result = maybe_get_output(); + if (maybe_result.defined()) { + TORCH_CHECK(self.scalar_type() == maybe_result.scalar_type(), "Expected the dtype for input and out to match, but got ", + self.scalar_type(), " for input's dtype and ", maybe_result.scalar_type(), " for out's dtype."); + } + if (self.numel() == 0) { + at::native::zero_numel_check_dims(self, dim, "amin()"); + } + const ScalarType& out_dtype = maybe_result.defined() ? 
maybe_result.scalar_type() : self.scalar_type(); + resize_reduction(*this, self, dim, keepdim, out_dtype); } } // namespace meta @@ -844,7 +858,7 @@ Tensor& diff_out(const Tensor& self, int64_t n, int64_t dim, const c10::optional } } -void pre_check_gradient(const Tensor& self, c10::optional spacing_size, c10::optional dim, int64_t edge_order) { +void pre_check_gradient(const Tensor& self, c10::optional spacing_size, at::OptionalIntArrayRef dim, int64_t edge_order) { // Helper for gradient function to make sure input data satisfies prerequisites TORCH_CHECK(self.scalar_type() != ScalarType::Byte, "torch.gradient does not support uint8 input."); if (spacing_size.has_value() && !dim.has_value()) { @@ -946,7 +960,7 @@ std::vector gradient_dim_preprocess(const Tensor& self, c10::optional gradient(const Tensor& self, TensorList coordinates, IntArrayRef dim, int64_t edge_order) { pre_check_gradient(self, c10::optional(coordinates.size()), - c10::optional(dim), + at::OptionalIntArrayRef(dim), edge_order); return gradient_helper(self, coordinates, dim, edge_order); } @@ -955,7 +969,7 @@ std::vector gradient(const Tensor& self, TensorList coordinates, c10::op const auto processed_dim = gradient_dim_preprocess(self, dim); pre_check_gradient(self, c10::optional(coordinates.size()), - dim.has_value() ? c10::optional(processed_dim) : c10::nullopt, + dim.has_value() ? at::OptionalIntArrayRef(processed_dim) : c10::nullopt, edge_order); return gradient_helper(self, coordinates, processed_dim, edge_order); } @@ -963,7 +977,7 @@ std::vector gradient(const Tensor& self, TensorList coordinates, c10::op std::vector gradient(const Tensor& self, c10::ArrayRef spacing, IntArrayRef dim, int64_t edge_order) { pre_check_gradient(self, c10::optional(spacing.size()), - c10::optional(dim), + at::OptionalIntArrayRef(dim), edge_order); return gradient_helper_float(self, spacing, dim, edge_order); } @@ -972,7 +986,7 @@ std::vector gradient(const Tensor& self, ArrayRef spacing, c10:: const auto processed_dim = gradient_dim_preprocess(self, dim); pre_check_gradient(self, c10::optional(spacing.size()), - dim.has_value() ? c10::optional(processed_dim) : c10::nullopt, + dim.has_value() ? at::OptionalIntArrayRef(processed_dim) : c10::nullopt, edge_order); return gradient_helper_float(self, spacing, processed_dim, edge_order); } @@ -983,7 +997,7 @@ std::vector gradient(const Tensor& self, const Scalar& unit_size, IntArr std::vector spacing(dim.size(), unit_size); pre_check_gradient(self, c10::optional(spacing.size()), - c10::optional(dim), + at::OptionalIntArrayRef(dim), edge_order); return gradient_helper_float(self, spacing, dim, edge_order); } @@ -997,7 +1011,7 @@ std::vector gradient(const Tensor& self, const c10::optional& un unit_size.has_value() ? unit_size.value() : 1.0) ; pre_check_gradient(self, unit_size.has_value() ? c10::optional(spacing.size()) : c10::nullopt, - dim.has_value() ? c10::optional(processed_dim) : c10::nullopt, + dim.has_value() ? 
at::OptionalIntArrayRef(processed_dim) : c10::nullopt, edge_order); return gradient_helper_float(self, spacing, processed_dim, edge_order); } @@ -1006,7 +1020,7 @@ std::vector gradient(const Tensor& self, IntArrayRef dim, int64_t edge_o std::vector spacing(dim.size(), 1.0) ; pre_check_gradient(self, c10::optional(spacing.size()), - c10::optional(dim), + at::OptionalIntArrayRef(dim), edge_order); return gradient_helper_float(self, spacing, dim, edge_order); } @@ -1429,29 +1443,17 @@ TORCH_IMPL_FUNC(any_all_out)(const Tensor& self, const Tensor& result) { allany_impl<0>(self, result, {}, false, or_stub); } -Tensor &amin_out(const Tensor& self, IntArrayRef dim, bool keepdim, Tensor& result) { - TORCH_CHECK(self.scalar_type() == result.scalar_type(), "Expected the dtype for input and out to match, but got ", - self.scalar_type(), " for input's dtype and ", result.scalar_type(), " for out's dtype."); - if (self.numel() == 0) { - zero_numel_check_dims(self, dim, "amin()"); - } - - auto iter = make_reduction("amin", result, self, dim, keepdim, self.scalar_type()); +TORCH_IMPL_FUNC(amin_out) (const Tensor& self, IntArrayRef dim, bool keepdim, const Tensor& result) { + auto iter = + meta::make_reduction(self, result, dim, keepdim, self.scalar_type()); if (iter.numel() != 0) { min_values_stub(iter.device_type(), iter); } - return result; -} - -Tensor amin(const Tensor& self, IntArrayRef dim, bool keepdim) { - Tensor result = at::empty({0}, self.options()); - return at::amin_out(result, self, dim, keepdim); } TORCH_IMPL_FUNC(amax_out) (const Tensor& self, IntArrayRef dim, bool keepdim, const Tensor& result) { - c10::MaybeOwned in = c10::MaybeOwned::borrowed(self); auto iter = - meta::make_reduction(*in, result, dim, keepdim, self.scalar_type()); + meta::make_reduction(self, result, dim, keepdim, self.scalar_type()); if (iter.numel() != 0) { max_values_stub(iter.device_type(), iter); } @@ -1560,7 +1562,7 @@ static double std_var_all_cpu(const Tensor& self, int64_t correction, bool take_ static Tensor& std_var_out( const char* fname, Tensor& result, const Tensor& self, - c10::optional dim, c10::optional correction_opt, + at::OptionalIntArrayRef dim, c10::optional correction_opt, bool keepdim, bool take_sqrt) { TORCH_CHECK(self.device().is_cpu() || self.device().is_cuda(), "std and var only supports tensors on a CPU or CUDA device, but got: ", @@ -1628,7 +1630,7 @@ static Tensor& std_var_out( static std::tuple std_var_mean_out( const char* fname, Tensor& result1, Tensor& result2, const Tensor& self, - c10::optional dim, c10::optional correction_opt, + at::OptionalIntArrayRef dim, c10::optional correction_opt, bool keepdim, bool take_sqrt) { AT_ASSERT(result1.defined() && result2.defined()); TORCH_CHECK(self.device().is_cpu() || self.is_cuda(), @@ -1699,13 +1701,13 @@ static std::tuple std_var_mean_out( std::tuple var_mean( const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var_mean(self, /*dim=*/c10::optional(dim), + return at::var_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); } std::tuple std_mean( const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std_mean(self, /*dim=*/c10::optional(dim), + return at::std_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 
1 : 0}, keepdim); } @@ -1732,7 +1734,7 @@ static TensorOptions options_to_value_type(TensorOptions opts) { } std::tuple var_mean( - const Tensor& self, c10::optional dim, + const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim) { Tensor result1 = at::empty({0}, options_to_value_type(self.options())); Tensor result2 = at::empty({0}, self.options()); @@ -1741,7 +1743,7 @@ std::tuple var_mean( } std::tuple std_mean( - const Tensor& self, c10::optional dim, + const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim) { Tensor result1 = at::empty({0}, options_to_value_type(self.options())); Tensor result2 = at::empty({0}, self.options()); @@ -1755,12 +1757,12 @@ Tensor var(const Tensor& self, bool unbiased) { } Tensor var(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var(self, /*dim=*/c10::optional(dim), + return at::var(self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); } Tensor& var_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::var_out(result, self, /*dim=*/c10::optional(dim), + return at::var_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); } @@ -1770,35 +1772,35 @@ Tensor std(const Tensor& self, bool unbiased) { } Tensor std(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std(self, /*dim=*/c10::optional(dim), + return at::std(self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); } Tensor& std_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::std_out(result, self, /*dim=*/c10::optional(dim), + return at::std_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), /*correction=*/int64_t{unbiased ? 
1 : 0}, keepdim); } -Tensor std(const Tensor& self, c10::optional dim, +Tensor std(const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim) { Tensor result = at::empty({0}, options_to_value_type(self.options())); return std_var_out("std", result, self, dim, correction, keepdim, true); } Tensor& std_out( - const Tensor& self, c10::optional dim, + const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim, Tensor& result) { return std_var_out("std", result, self, dim, correction, keepdim, true); } Tensor& var_out( - const Tensor& self, c10::optional dim, + const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim, Tensor& result) { return std_var_out("var", result, self, dim, correction, keepdim, false); } Tensor var( - const Tensor& self, c10::optional dim, + const Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim) { Tensor result = at::empty({0}, options_to_value_type(self.options())); return std_var_out("var", result, self, dim, correction, keepdim, false); @@ -1983,5 +1985,9 @@ Tensor value_selecting_reduction_backward(const Tensor& grad, int64_t dim, const return at::zeros(sizes, grad.options()).scatter_(dim, indices, grad); } +Tensor sum_csr(const Tensor &self, c10::optional dtype) { + return self.values().sum(dtype); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ReduceOpsUtils.h b/aten/src/ATen/native/ReduceOpsUtils.h index fa93faa782d757..7951d7eda4e178 100644 --- a/aten/src/ATen/native/ReduceOpsUtils.h +++ b/aten/src/ATen/native/ReduceOpsUtils.h @@ -167,7 +167,7 @@ static Tensor review_reduce_result(const Tensor& result, int ndim, DimMask mask, static TensorIterator make_reduction( const char* name, Tensor& result, const Tensor& self, - c10::optional dim_opt, + at::OptionalIntArrayRef dim_opt, bool keepdim, ScalarType in_dtype, ScalarType out_dtype) { // check that result type and dtype match if provided TORCH_CHECK( @@ -192,7 +192,7 @@ static TensorIterator make_reduction( static C10_UNUSED TensorIterator make_reduction( const char* name, Tensor& result, const Tensor& self, - c10::optional dim, bool keepdim, ScalarType out_dtype) { + at::OptionalIntArrayRef dim, bool keepdim, ScalarType out_dtype) { // special case for type promotion in mixed precision, improves computational // efficiency. 
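Regarding the `unbiased ? 1 : 0` forwarding in the var/std wrappers above: the legacy boolean is mapped onto the newer `correction` argument. A small numeric sketch of what that correction means:

```cpp
// Sketch: sample variance with a "correction" term, matching the wrappers above
// where unbiased=true corresponds to correction=1 (Bessel's correction).
#include <cstdint>
#include <cstdio>
#include <vector>

double var_with_correction(const std::vector<double>& xs, int64_t correction) {
  const double n = static_cast<double>(xs.size());
  double mean = 0.0;
  for (double x : xs) mean += x;
  mean /= n;
  double sq = 0.0;
  for (double x : xs) sq += (x - mean) * (x - mean);
  return sq / (n - static_cast<double>(correction));  // correction=0: biased, 1: unbiased
}

int main() {
  std::vector<double> xs = {1.0, 2.0, 3.0, 4.0};
  std::printf("correction=0: %f\n", var_with_correction(xs, 0));  // 1.25
  std::printf("correction=1: %f\n", var_with_correction(xs, 1));  // ~1.666667
}
```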
// not generalize this to common mismatched input/output types to avoid cross @@ -205,7 +205,7 @@ static C10_UNUSED TensorIterator make_reduction( static TensorIterator make_reduction( const char* name, Tensor& result1, Tensor& result2, const Tensor& self, - c10::optional dim_opt, bool keepdim, ScalarType dtype1, + at::OptionalIntArrayRef dim_opt, bool keepdim, ScalarType dtype1, ScalarType dtype2) { // check that result type and dtype match if provided TORCH_CHECK( @@ -242,7 +242,7 @@ static TensorIterator make_reduction( static C10_UNUSED TensorIterator make_reduction( const char* name, Tensor& result1, Tensor& result2, const Tensor& self, - c10::optional dim, bool keepdim, ScalarType dtype) { + at::OptionalIntArrayRef dim, bool keepdim, ScalarType dtype) { return make_reduction(name, result1, result2, self, dim, keepdim, dtype, dtype); } @@ -257,7 +257,11 @@ static void zero_numel_check_dims(const Tensor& self, const int64_t dim, const c } } -static C10_UNUSED void zero_numel_check_dims(const Tensor& self, const IntArrayRef dim, const char *fn_name) { +static void zero_numel_check_dims(const Tensor& self, const IntArrayRef dim, const char *fn_name) { + TORCH_CHECK( + !dim.empty(), + fn_name, ": Expected reduction dim to be specified for input.numel() == 0. ", + "Specify the reduction dim with the 'dim' argument."); for (const int64_t d : dim) { zero_numel_check_dims(self, d, fn_name); } diff --git a/aten/src/ATen/native/ReflectionPad.cpp b/aten/src/ATen/native/ReflectionPad.cpp index f6a1bc43aba76a..fab267ef43a06d 100644 --- a/aten/src/ATen/native/ReflectionPad.cpp +++ b/aten/src/ATen/native/ReflectionPad.cpp @@ -1,6 +1,7 @@ #include #include #include +#include #include namespace at { @@ -266,76 +267,43 @@ inline void reflection_pad1d_out_loop( void reflection_pad1d_out_template( const Tensor& output, const Tensor& input_, IntArrayRef padding) { - int64_t dim_plane = 0; - int64_t dim_w = 1; - int64_t nbatch = 1; - // allow dim=0 only in the batch dimension. - TORCH_CHECK( - (input_.ndimension() == 2 && input_.size(1) != 0) || - (input_.ndimension() == 3 && input_.size(1) != 0 && input_.size(2) != 0), - "2D or 3D (batch mode) tensor expected for input, but got: ", input_); - - if (input_.ndimension() == 3) { - nbatch = input_.size(0); - dim_w++; - dim_plane++; - } - - /* sizes */ - auto pad_l = padding[0]; - auto pad_r = padding[1]; - - int64_t nplane = input_.size(dim_plane); - int64_t input_w = input_.size(dim_w); - int64_t output_w = input_w + pad_l + pad_r; - - TORCH_CHECK(pad_l < input_w && pad_r < input_w, "Argument #4: Padding size " - "should be less than the corresponding input dimension, but got: padding (", - pad_l, ", ", pad_r, ") at dimension ", dim_w, " of input ", input_.sizes()); - - TORCH_CHECK(output_w >= 1 , 2, - "input (W: ", input_w, ")is too small. 
Calculated output W: ", output_w); - /* get contiguous input */ Tensor input = input_.contiguous(); - /* resize output */ if (input.ndimension() == 2) { - output.resize_({nplane, output_w}); if (input.is_quantized()) { AT_DISPATCH_QINT_TYPES(input.scalar_type(), "qreflection_pad1d", [&]() { reflection_pad1d_out_frame( input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(), - nplane, - input_w, output_w, - pad_l); + input.size(0), + input.size(1), output.size(-1), + padding[0]); }); } else { AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(input.scalar_type(), "reflection_pad1d", [&] { reflection_pad1d_out_frame( input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(), - nplane, - input_w, output_w, - pad_l); + input.size(0), + input.size(1), output.size(-1), + padding[0]); }); } } else { - output.resize_({nbatch, nplane, output_w}); if (input.is_quantized()) { AT_DISPATCH_QINT_TYPES(input.scalar_type(), "qreflection_pad1d", [&]() { reflection_pad1d_out_loop( input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(), - nbatch, nplane, - input_w, output_w, - pad_l); + output.size(0), input.size(1), + input.size(2), output.size(-1), + padding[0]); }); } else { AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(input.scalar_type(), "reflection_pad1d", [&] { reflection_pad1d_out_loop( input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(), - nbatch, nplane, - input_w, output_w, - pad_l); + output.size(0), input.size(1), + input.size(2), output.size(-1), + padding[0]); }); } } @@ -854,20 +822,18 @@ static void reflection_pad3d_backward_out_loop( } // namespace +// TODO: I think this function should be removed since we implement it with +// TORCH_IMPL_FUNC below Tensor& reflection_pad1d_out_cpu(const Tensor& input, IntArrayRef padding, Tensor& output) { reflection_pad1d_out_template(output, input, padding); return output; } -// This function is needed because structured_delegate currently does not -// support quantized backends. 
This function may be able to be omitted in the -// future if support for quantized backends is enabled for structured_delegate -Tensor reflection_pad1d_quantized_cpu(const Tensor& input, IntArrayRef padding) { +Tensor& reflection_pad1d_out_quantized_cpu(const Tensor& input, IntArrayRef padding, + Tensor& output) { TORCH_CHECK(input.qscheme() == kPerTensorAffine, "Only per tensor quantization is supported"); - Tensor output = at::_empty_affine_quantized({0}, input.options(), - input.q_scale(), - input.q_zero_point()); + set_quantizer_(output, make_per_tensor_affine_quantizer(input.q_scale(), input.q_zero_point(), input.scalar_type())); reflection_pad1d_out_template(output, input, padding); return output; } diff --git a/aten/src/ATen/native/Repeat.h b/aten/src/ATen/native/Repeat.h index 9751f2ec8be7a4..dadbfb0c2374bb 100644 --- a/aten/src/ATen/native/Repeat.h +++ b/aten/src/ATen/native/Repeat.h @@ -1,6 +1,14 @@ #pragma once -#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/Resize.h b/aten/src/ATen/native/Resize.h index 3540ef8b21ac4d..c6fe2b3d214670 100644 --- a/aten/src/ATen/native/Resize.h +++ b/aten/src/ATen/native/Resize.h @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -30,22 +31,16 @@ TORCH_API bool resize_output_check(const Tensor& output, IntArrayRef shape); TORCH_API void resize_bytes_cpu(StorageImpl* storage, size_t size_bytes); -static inline void maybe_resize_storage_cpu(TensorImpl* self, uint64_t new_size) { +static inline void maybe_resize_storage_cpu(TensorImpl* self, size_t new_size_bytes) { // It does not make sense to try to resize a storage // to hold 0 elements, and this can break // if storage_offset is positive but // new_size is 0, so just bail in that case // (same comment is in cuda/Resize.h) - if (new_size == 0) { + if (self->numel() == 0) { return; } - const auto new_size_bytes_i = - (new_size + self->storage_offset()) * self->dtype().itemsize(); - TORCH_CHECK(!overflows(new_size_bytes_i), "Requested storage size (", - new_size_bytes_i, ") cannot be represented as a size_t"); - const auto new_size_bytes = static_cast(new_size_bytes_i); - const Storage& storage = self->unsafe_storage(); if (!storage) { auto new_storage = c10::make_intrusive( @@ -62,21 +57,25 @@ static inline void maybe_resize_storage_cpu(TensorImpl* self, uint64_t new_size) inline TensorImpl* resize_impl_cpu_( TensorImpl* self, IntArrayRef size, - c10::optional stride, + at::OptionalIntArrayRef stride, bool resize_storage = true) { - if (self->sizes() == size && (!stride || self->strides() == stride)) { + if (self->sizes() == size && (!stride || self->strides() == stride.value())) { return self; } - int64_t storage_size = 1; + const auto itemsize = self->dtype().itemsize(); + const auto storage_offset = self->storage_offset(); + size_t storage_size = 1; if (stride) { self->set_sizes_and_strides(size, *stride); - // NB: storage size can be different from numel. 
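To illustrate the note above (and the switch to `at::detail::computeStorageNbytes`), a standalone sketch of the byte count a strided view actually needs: the largest reachable element decides the storage size, not `numel()`. Assumes non-negative strides, as enforced elsewhere in this file.

```cpp
// Sketch: bytes required to back a tensor given sizes, strides, itemsize and
// storage_offset. The max reachable linear offset is
// storage_offset + sum((size_i - 1) * stride_i), plus one element.
#include <cstdint>
#include <cstdio>
#include <vector>

size_t storage_nbytes(const std::vector<int64_t>& sizes,
                      const std::vector<int64_t>& strides,
                      size_t itemsize,
                      int64_t storage_offset) {
  // An empty tensor is treated as needing no storage here (mirrors the
  // numel()==0 early-return in maybe_resize_storage_cpu).
  for (int64_t s : sizes) {
    if (s == 0) return 0;
  }
  int64_t last = storage_offset;
  for (size_t d = 0; d < sizes.size(); ++d) {
    last += (sizes[d] - 1) * strides[d];
  }
  return static_cast<size_t>(last + 1) * itemsize;
}

int main() {
  // Contiguous 2x3 float tensor: 6 reachable elements -> 24 bytes.
  std::printf("%zu\n", storage_nbytes({2, 3}, {3, 1}, sizeof(float), 0));  // 24
  // A 2x3 view whose first dim is broadcast (stride 0): only 3 elements reachable.
  std::printf("%zu\n", storage_nbytes({2, 3}, {0, 1}, sizeof(float), 0));  // 12
}
```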
- storage_size = storage_size_for(size, *stride); + storage_size = at::detail::computeStorageNbytes( + size, *stride, itemsize, storage_offset); } else { self->set_sizes_contiguous(size); - storage_size = self->numel(); + storage_size = at::detail::computeStorageNbytesContiguous( + size, itemsize, storage_offset); } + if (resize_storage) { maybe_resize_storage_cpu(self, storage_size); } @@ -158,6 +157,12 @@ inline void setStrided( IntArrayRef stride, int64_t storage_offset) { TORCH_CHECK(size.size() == stride.size(), "mismatch in length of strides and shape"); + for (auto val : stride) { + TORCH_CHECK(val >= 0, + "as_strided: Negative strides are not supported at the moment, " + "got strides: ", stride); + } + auto* self_ = self.unsafeGetTensorImpl(); checkInBoundsForStorage( size, stride, storage_offset, self_->dtype(), self_->storage()); @@ -170,11 +175,6 @@ inline void setStrided( if (self_->sizes() == size && self_->strides() == stride) { return; } - for (auto val : stride) { - TORCH_CHECK(val >= 0, - "as_strided: Negative strides are not supported at the moment, " - "got strides: ", stride); - } self_->set_sizes_and_strides(size, stride); } diff --git a/aten/src/ATen/native/Scalar.cpp b/aten/src/ATen/native/Scalar.cpp index aecfffadb02025..7342c4806d44c5 100644 --- a/aten/src/ATen/native/Scalar.cpp +++ b/aten/src/ATen/native/Scalar.cpp @@ -20,8 +20,8 @@ Scalar item(const Tensor& self) { Scalar _local_scalar_dense_cpu(const Tensor& self) { Scalar r; - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, self.scalar_type(), "_local_scalar_dense_cpu", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, kHalf, kBool, kBFloat16, self.scalar_type(), "_local_scalar_dense_cpu", [&] { scalar_t value = *self.data_ptr(); r = Scalar(value); }); diff --git a/aten/src/ATen/native/ScatterGatherChecks.h b/aten/src/ATen/native/ScatterGatherChecks.h index 1b71eb40975db7..92e1edeb5fe029 100644 --- a/aten/src/ATen/native/ScatterGatherChecks.h +++ b/aten/src/ATen/native/ScatterGatherChecks.h @@ -1,7 +1,7 @@ #pragma once #include -#include +#include #include #include diff --git a/aten/src/ATen/native/SegmentReduce.h b/aten/src/ATen/native/SegmentReduce.h index 11a399ae77a1a0..1e5b87eefb6ddb 100644 --- a/aten/src/ATen/native/SegmentReduce.h +++ b/aten/src/ATen/native/SegmentReduce.h @@ -1,10 +1,12 @@ #pragma once -#include #include +#include #include namespace at { +class Tensor; + namespace native { enum SegmentReductionType { MAX, MEAN, MIN, SUM }; diff --git a/aten/src/ATen/native/SoftMax.cpp b/aten/src/ATen/native/SoftMax.cpp index b4635365e43224..0ef278a12a6c4d 100644 --- a/aten/src/ATen/native/SoftMax.cpp +++ b/aten/src/ATen/native/SoftMax.cpp @@ -170,7 +170,7 @@ void host_softmax( } } else { for (const auto d : c10::irange(0, dim_size)) { - if (mask_data[d * dim_stride]) { + if (!mask_data[d * dim_stride]) { max_input = is_meaningful_max ? 
std::max(max_input, input_data[d * dim_stride]) : input_data[d * dim_stride]; @@ -183,7 +183,7 @@ void host_softmax( acc_type tmpsum = 0; for (const auto d : c10::irange(dim_size)) { scalar_t z{}; - if (!MaskedSoftMax || mask_data[d * dim_stride]) { + if (!MaskedSoftMax || !mask_data[d * dim_stride]) { z = std::exp(input_data[d * dim_stride] - max_input); } else { z = 0; diff --git a/aten/src/ATen/native/SpectralOps.cpp b/aten/src/ATen/native/SpectralOps.cpp index 41a182bd29042e..af000cc70d9fe6 100644 --- a/aten/src/ATen/native/SpectralOps.cpp +++ b/aten/src/ATen/native/SpectralOps.cpp @@ -219,7 +219,7 @@ struct ShapeAndDims { // Wraps dimensions and applies defaulting behavior. // Also checks transform dims are unique and transform shape is non-empty. ShapeAndDims canonicalize_fft_shape_and_dim_args( - Tensor input, c10::optional shape, c10::optional dim) { + Tensor input, at::OptionalIntArrayRef shape, at::OptionalIntArrayRef dim) { const int64_t input_dim = input.dim(); const IntArrayRef input_sizes = input.sizes(); ShapeAndDims ret; @@ -372,8 +372,8 @@ Tensor& fft_ihfft_out(const Tensor& self, c10::optional n, return out; } -Tensor fft_fftn(const Tensor& self, c10::optional s, - c10::optional dim, +Tensor fft_fftn(const Tensor& self, at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm) { auto desc = canonicalize_fft_shape_and_dim_args(self, s, dim); // TODO: For real input, perform rfftn then mirror with conjugate symmetry @@ -382,8 +382,8 @@ Tensor fft_fftn(const Tensor& self, c10::optional s, } Tensor& fft_fftn_out(const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm, Tensor& out) { auto desc = canonicalize_fft_shape_and_dim_args(self, s, dim); // TODO: For real input, perform rfftn then mirror with conjugate symmetry @@ -392,8 +392,8 @@ Tensor& fft_fftn_out(const Tensor& self, return out; } -Tensor fft_ifftn(const Tensor& self, c10::optional s, - c10::optional dim, +Tensor fft_ifftn(const Tensor& self, at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm) { auto desc = canonicalize_fft_shape_and_dim_args(self, s, dim); Tensor input = promote_tensor_fft(self, /*require_complex=*/true); @@ -401,8 +401,8 @@ Tensor fft_ifftn(const Tensor& self, c10::optional s, } Tensor& fft_ifftn_out(const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm, Tensor& out) { auto desc = canonicalize_fft_shape_and_dim_args(self, s, dim); Tensor input = promote_tensor_fft(self, /*require_complex=*/true); @@ -411,8 +411,8 @@ Tensor& fft_ifftn_out(const Tensor& self, } static Tensor fft_rfftn_impl(Tensor out, const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, const c10::optional& norm_str) { TORCH_CHECK(!self.is_complex(), "rfftn expects a real-valued input tensor, but got ", self.scalar_type()); auto desc = canonicalize_fft_shape_and_dim_args(self, s, dim); @@ -424,15 +424,15 @@ static Tensor fft_rfftn_impl(Tensor out, const Tensor& self, return fft_r2c_maybe_out(fname, out, x, desc.dim, norm, /*onesided=*/true); } -Tensor fft_rfftn(const Tensor& self, c10::optional s, - c10::optional dim, +Tensor fft_rfftn(const Tensor& self, at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm_str) { return fft_rfftn_impl({}, self, s, dim, norm_str); } Tensor& fft_rfftn_out(const Tensor& self, - c10::optional s, - 
c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm_str, Tensor& out) { fft_rfftn_impl(out, self, s, dim, norm_str); return out; @@ -440,8 +440,8 @@ Tensor& fft_rfftn_out(const Tensor& self, ShapeAndDims canonicalize_fft_c2r_shape_and_dim_args( c10::string_view fname, const Tensor& self, - const c10::optional& s, - const c10::optional& dims, + const at::OptionalIntArrayRef& s, + const at::OptionalIntArrayRef& dims, int64_t& last_dim_size) { auto desc = canonicalize_fft_shape_and_dim_args(self, s, dims); TORCH_CHECK(desc.shape.size() > 0, fname, " must transform at least one axis"); @@ -463,8 +463,8 @@ ShapeAndDims canonicalize_fft_c2r_shape_and_dim_args( } static Tensor fft_irfftn_impl(Tensor out, const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, const c10::optional& norm_str) { int64_t last_dim_size = 0; auto desc = canonicalize_fft_c2r_shape_and_dim_args( @@ -477,15 +477,15 @@ static Tensor fft_irfftn_impl(Tensor out, const Tensor& self, } Tensor fft_irfftn(const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm_str) { return fft_irfftn_impl({}, self, s, dim, norm_str); } Tensor& fft_irfftn_out(const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm_str, Tensor& out) { fft_irfftn_impl(out, self, s, dim, norm_str); return out; @@ -493,8 +493,8 @@ Tensor& fft_irfftn_out(const Tensor& self, static Tensor fft_hfftn_impl( const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm_str, const Tensor& out) { constexpr c10::string_view fname = "hfftn"; @@ -521,16 +521,16 @@ static Tensor fft_hfftn_impl( Tensor fft_hfftn( const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm) { return fft_hfftn_impl(self, s, dim, norm, {}); } const Tensor& fft_hfftn_out( const Tensor& self, - c10::optional s, - c10::optional dim, c10::optional norm, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm, const Tensor& out) { fft_hfftn_impl(self, s, dim, norm, out); return out; @@ -538,8 +538,8 @@ const Tensor& fft_hfftn_out( static Tensor fft_ihfftn_impl( const Tensor& self, - const c10::optional& s, - const c10::optional& dim, + const at::OptionalIntArrayRef& s, + const at::OptionalIntArrayRef& dim, const c10::optional& norm_str, const Tensor& out) { constexpr c10::string_view fname = "ihfftn"; @@ -563,80 +563,80 @@ static Tensor fft_ihfftn_impl( Tensor fft_ihfftn( const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm) { return fft_ihfftn_impl(self, s, dim, norm, {}); } const Tensor& fft_ihfftn_out( const Tensor& self, - c10::optional s, - c10::optional dim, + at::OptionalIntArrayRef s, + at::OptionalIntArrayRef dim, c10::optional norm, const Tensor& out) { fft_ihfftn_impl(self, s, dim, norm, out); return out; } -Tensor fft_fft2(const Tensor& self, c10::optional s, +Tensor fft_fft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_fftn(self, s, dim, std::move(norm)); } -Tensor& fft_fft2_out(const Tensor& self, c10::optional s, +Tensor& fft_fft2_out(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, 
c10::optional norm, Tensor& out) { return native::fft_fftn_out(self, s, dim, std::move(norm), out); } -Tensor fft_ifft2(const Tensor& self, c10::optional s, +Tensor fft_ifft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_ifftn(self, s, dim, std::move(norm)); } -Tensor& fft_ifft2_out(const Tensor& self, c10::optional s, +Tensor& fft_ifft2_out(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm, Tensor& out) { return native::fft_ifftn_out(self, s, dim, std::move(norm), out); } -Tensor fft_rfft2(const Tensor& self, c10::optional s, +Tensor fft_rfft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_rfftn(self, s, dim, std::move(norm)); } -Tensor& fft_rfft2_out(const Tensor& self, c10::optional s, +Tensor& fft_rfft2_out(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm, Tensor& out) { return native::fft_rfftn_out(self, s, dim, std::move(norm), out); } -Tensor fft_irfft2(const Tensor& self, c10::optional s, +Tensor fft_irfft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_irfftn(self, s, dim, std::move(norm)); } -Tensor& fft_irfft2_out(const Tensor& self, c10::optional s, +Tensor& fft_irfft2_out(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm, Tensor& out) { return native::fft_irfftn_out(self, s, dim, std::move(norm), out); } const Tensor& fft_hfft2_out( - const Tensor& self, c10::optional s, IntArrayRef dim, + const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm, const Tensor& out) { return native::fft_hfftn_out(self, s, dim, std::move(norm), out); } -Tensor fft_hfft2(const Tensor& self, c10::optional s, +Tensor fft_hfft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_hfftn(self, s, dim, std::move(norm)); } const Tensor& fft_ihfft2_out( - const Tensor& self, c10::optional s, IntArrayRef dim, + const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm, const Tensor& out) { return native::fft_ihfftn_out(self, s, dim, std::move(norm), out); } -Tensor fft_ihfft2(const Tensor& self, c10::optional s, +Tensor fft_ihfft2(const Tensor& self, at::OptionalIntArrayRef s, IntArrayRef dim, c10::optional norm) { return native::fft_ihfftn(self, s, dim, std::move(norm)); } @@ -687,7 +687,7 @@ Tensor fft_rfftfreq(int64_t n, double d, // If an array dim is specified, wraps them according to self.dim(). // Otherwise returns a vector of all dims. 
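A standalone sketch of that defaulting/wrapping rule (a hypothetical helper, not the ATen one): negative dims are wrapped by the tensor rank, and an absent `dim` argument means every dimension.

```cpp
// Sketch: wrap negative dims and default to "all dims" when none are given,
// mirroring the behaviour described for default_alldims above.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

std::vector<int64_t> default_alldims_sketch(
    int64_t ndim, std::optional<std::vector<int64_t>> dim_opt) {
  std::vector<int64_t> dim;
  if (dim_opt.has_value()) {
    for (int64_t d : *dim_opt) {
      dim.push_back(d < 0 ? d + ndim : d);  // wrap negative dims
    }
  } else {
    for (int64_t d = 0; d < ndim; ++d) {
      dim.push_back(d);                     // no dim given: use every dimension
    }
  }
  return dim;
}

int main() {
  for (int64_t d : default_alldims_sketch(4, std::vector<int64_t>{-1, 1})) {
    std::printf("%lld ", static_cast<long long>(d));  // prints 3 1
  }
  std::printf("\n");
  for (int64_t d : default_alldims_sketch(3, std::nullopt)) {
    std::printf("%lld ", static_cast<long long>(d));  // prints 0 1 2
  }
  std::printf("\n");
}
```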
-DimVector default_alldims(const Tensor& self, c10::optional dim_opt) { +DimVector default_alldims(const Tensor& self, at::OptionalIntArrayRef dim_opt) { DimVector dim; if (dim_opt) { IntArrayRef dim_unwrapped = *dim_opt; @@ -702,7 +702,7 @@ DimVector default_alldims(const Tensor& self, c10::optional dim_opt return dim; } -Tensor fft_fftshift(const Tensor& x, c10::optional dim_opt) { +Tensor fft_fftshift(const Tensor& x, at::OptionalIntArrayRef dim_opt) { auto dim = default_alldims(x, dim_opt); IntArrayRef x_sizes = x.sizes(); @@ -714,7 +714,7 @@ Tensor fft_fftshift(const Tensor& x, c10::optional dim_opt) { return at::roll(x, shift, dim); } -Tensor fft_ifftshift(const Tensor& x, c10::optional dim_opt) { +Tensor fft_ifftshift(const Tensor& x, at::OptionalIntArrayRef dim_opt) { auto dim = default_alldims(x, dim_opt); IntArrayRef x_sizes = x.sizes(); @@ -759,14 +759,11 @@ static Stream& write_opt(Stream& SS, const optional& value) { * * This is modeled after librosa but with support for complex time-domain * signals and complex windows. - * - * NOTE: librosa's center and pad_mode arguments are currently only implemented - * in python because it uses torch.nn.functional.pad which is python-only. */ Tensor stft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, const optional win_lengthOpt, const c10::optional& window_opt, - const bool normalized, const optional onesidedOpt, - const optional return_complexOpt) { + const bool center, c10::string_view mode, const bool normalized, + const optional onesidedOpt, const optional return_complexOpt) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned window_maybe_owned = at::borrow_from_optional_tensor(window_opt); const Tensor& window = *window_maybe_owned; @@ -824,6 +821,19 @@ Tensor stft(const Tensor& self, const int64_t n_fft, const optional hop if (self.dim() == 1) { input = input.unsqueeze(0); } + + if (center) { + const auto input_shape = input.sizes(); + const auto input_dim = input_shape.size(); + const auto extra_dims = std::max(size_t{3}, input_dim) - input_dim; + const auto pad_amount = n_fft / 2; + + DimVector extended_shape(extra_dims, 1); + extended_shape.append(input_shape.begin(), input_shape.end()); + input = at::pad(input.view(extended_shape), {pad_amount, pad_amount}, mode); + input = input.view(IntArrayRef(input.sizes()).slice(extra_dims)); + } + int64_t batch = input.size(0); int64_t len = input.size(1); if (n_fft <= 0 || n_fft > len) { @@ -897,6 +907,17 @@ Tensor stft(const Tensor& self, const int64_t n_fft, const optional hop } } +Tensor stft( + const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, + const optional win_lengthOpt, const c10::optional& window_opt, + const bool normalized, + const optional onesidedOpt, const optional return_complexOpt) { + return at::stft( + self, n_fft, hop_lengthOpt, win_lengthOpt, window_opt, + /*center=*/false, /*mode=*/"constant", normalized, onesidedOpt, + return_complexOpt); +} + // Create complex tensor from the old style of real tensor with size=(..., 2) // This is to support istft in the transition to requiring complex input. 
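Back to the new `center`/`pad_mode` handling added to `stft` above: the signal is padded by `n_fft / 2` on both sides (with the requested mode) before framing, so each frame is centred on its hop position. A rough sketch of how that changes the frame count, assuming the usual `1 + (len - n_fft) / hop` framing rule and hypothetical numbers:

```cpp
// Sketch: how centre-padding by n_fft/2 on each side affects the number of
// STFT frames. Assumes frames = 1 + (padded_len - n_fft) / hop.
#include <cstdint>
#include <cstdio>

int64_t num_frames(int64_t len, int64_t n_fft, int64_t hop, bool center) {
  const int64_t pad = center ? n_fft / 2 : 0;   // padding added on each side
  const int64_t padded_len = len + 2 * pad;
  return 1 + (padded_len - n_fft) / hop;
}

int main() {
  // Hypothetical signal: 16000 samples, n_fft = 400, hop = 160.
  std::printf("center=false: %lld frames\n",
              static_cast<long long>(num_frames(16000, 400, 160, false)));  // 98
  std::printf("center=true:  %lld frames\n",
              static_cast<long long>(num_frames(16000, 400, 160, true)));   // 101
}
```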
// NOTE: This may return a view of the input tensor, or might clone if necessary @@ -1090,14 +1111,6 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho #undef REPR } -Tensor stft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, - const optional win_lengthOpt, const Tensor& window, - const bool normalized, const optional onesidedOpt) { - return at::native::stft( - self, n_fft, hop_lengthOpt, win_lengthOpt, window, normalized, onesidedOpt, - /*return_complex=*/c10::nullopt); -} - Tensor istft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, const optional win_lengthOpt, const Tensor& window, const bool center, const bool normalized, const optional onesidedOpt, diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.cpp b/aten/src/ATen/native/TensorAdvancedIndexing.cpp index 340bc5a822ad0a..9492e2c02b43f8 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.cpp +++ b/aten/src/ATen/native/TensorAdvancedIndexing.cpp @@ -74,13 +74,29 @@ namespace at { namespace meta { -native::SCATTER_GATHER_OP get_operator_enum(const c10::string_view reduce) { - if (reduce == "add") { - return native::SCATTER_GATHER_OP::REDUCE_ADD; - } else if (reduce == "multiply") { - return native::SCATTER_GATHER_OP::REDUCE_MULTIPLY; +native::SCATTER_GATHER_OP get_operator_enum(const c10::string_view reduce, bool use_new_options = false) { + if (use_new_options) { + if (reduce == "sum") { + return native::SCATTER_GATHER_OP::REDUCE_ADD; + } else if (reduce == "prod") { + return native::SCATTER_GATHER_OP::REDUCE_MULTIPLY; + } else if (reduce == "mean") { + return native::SCATTER_GATHER_OP::REDUCE_MEAN; + } else if (reduce == "amax") { + return native::SCATTER_GATHER_OP::REDUCE_MAXIMUM; + } else if (reduce == "amin") { + return native::SCATTER_GATHER_OP::REDUCE_MINIMUM; + } else { + TORCH_CHECK(false, "reduce argument must be either sum, prod, mean, amax or amin."); + } } else { - TORCH_CHECK(false, "reduce argument must be either add or multiply."); + if (reduce == "add") { + return native::SCATTER_GATHER_OP::REDUCE_ADD; + } else if (reduce == "multiply") { + return native::SCATTER_GATHER_OP::REDUCE_MULTIPLY; + } else { + TORCH_CHECK(false, "reduce argument must be either add or multiply.") + } } } @@ -113,7 +129,7 @@ TORCH_META_FUNC(gather) at::native::gather_shape_check(self, wrapped_dim, index); } -template +template void scatter_meta_impl( Meta& meta, const Tensor& self, @@ -137,7 +153,7 @@ void scatter_meta_impl( meta.set_output(self.sizes(), self.options()); if (reduce.has_value()) { // Check if we have a valid reduce operator. 
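For the new reduce vocabulary accepted above ("sum", "prod", "mean", "amax", "amin"), a plain-C++ 1-D sketch of the scatter-reduce semantics, including the mean's count handling that the implementation later performs with `scatter_add_` and `masked_fill_`. This is a standalone illustration with hypothetical inputs, not the ATen kernel, and it only spells out "sum", "amax" and "mean".

```cpp
// Sketch: 1-D scatter_reduce honouring include_self. out[index[k]] is reduced
// with src[k]; with include_self=false the destination's original value does
// not participate, and untouched slots keep their value.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

std::vector<double> scatter_reduce_1d(std::vector<double> out,
                                      const std::vector<int64_t>& index,
                                      const std::vector<double>& src,
                                      const std::string& reduce,
                                      bool include_self) {
  std::vector<int64_t> count(out.size(), include_self ? 1 : 0);
  if (!include_self) {
    // Re-initialize only the slots that will be scattered to.
    for (int64_t i : index) {
      if (reduce == "sum" || reduce == "mean") out[i] = 0.0;
      if (reduce == "amax") out[i] = -1e300;   // stand-in for numeric lowest()
    }
  }
  for (size_t k = 0; k < src.size(); ++k) {
    const int64_t i = index[k];
    if (reduce == "sum" || reduce == "mean") out[i] += src[k];
    if (reduce == "amax") out[i] = std::max(out[i], src[k]);
    ++count[i];
  }
  if (reduce == "mean") {
    for (size_t i = 0; i < out.size(); ++i) {
      out[i] /= std::max<int64_t>(count[i], 1);  // untouched slots divide by 1
    }
  }
  return out;
}

int main() {
  auto r = scatter_reduce_1d({10.0, 10.0, 10.0}, {0, 0, 2}, {1.0, 2.0, 3.0},
                             "mean", /*include_self=*/false);
  std::printf("%g %g %g\n", r[0], r[1], r[2]);  // 1.5 10 3
}
```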
- get_operator_enum(reduce.value()); + get_operator_enum(reduce.value(), use_new_options); } } @@ -174,6 +190,17 @@ TORCH_META_FUNC(scatter_add) scatter_meta_impl(*this, self, dim, index, src, "add"); } +TORCH_META_FUNC2(scatter_reduce, two) +(const Tensor& self, + int64_t dim, + const Tensor& index, + const Tensor& src, + const c10::string_view reduce, + bool include_self) { + (void) include_self; + scatter_meta_impl(*this, self, dim, index, src, reduce); +} + TORCH_PRECOMPUTE_META_FUNC(index_copy) (const Tensor& self, int64_t dim, const Tensor& index, const Tensor& source) { dim = maybe_wrap_dim(dim, self.dim()); @@ -296,6 +323,7 @@ DEFINE_DISPATCH(scatter_fill_stub); DEFINE_DISPATCH(scatter_add_stub); DEFINE_DISPATCH(scatter_reduce_stub); DEFINE_DISPATCH(scatter_scalar_reduce_stub); +DEFINE_DISPATCH(scatter_reduce_two_stub); static bool all_strides_match(TensorList tensors) { TORCH_CHECK(tensors.size() >= 1); @@ -880,9 +908,6 @@ Tensor & index_select_out_cpu_dim1_( for (const auto i : c10::irange(N)) { auto idx = idxs[i]; - if (idx < 0) { - idx = idx + src_indexing_axis_dim; - } dst_floats[i] = src_floats[idx]; } } @@ -892,10 +917,6 @@ Tensor & index_select_out_cpu_dim1_( for (const auto batch : c10::irange(outer_dims_product)) { for (const auto i : c10::irange(N)) { auto idx = idxs[i]; - if (idx < 0) { - idx = idx + src_indexing_axis_dim; - } - auto src = src_base + batch * src_batch_bytesize + idx * block_bytesize; auto dst = out + batch * gathered_batch_bytesize + i * block_bytesize; memcpy(dst, src, block_bytesize); @@ -1176,7 +1197,37 @@ Tensor gather_backward(const Tensor& grad, const Tensor& self, int64_t dim, cons return grad.new_zeros(self.sizes()).scatter_add_(dim, index, grad); } -template +static void scatter_reduce_exclude_self_helper( + const Tensor& self, + int64_t dim, + const Tensor& index, + const SCATTER_GATHER_OP& op) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( + at::ScalarType::Half, at::ScalarType::BFloat16, at::ScalarType::Bool, + self.scalar_type(), "scatter_reduce_exclude_input_init", [&] { + scalar_t init_val; + switch (op) { + case SCATTER_GATHER_OP::REDUCE_ADD: + init_val = (scalar_t)0; + break; + case SCATTER_GATHER_OP::REDUCE_MULTIPLY: + init_val = (scalar_t)1; + break; + case SCATTER_GATHER_OP::REDUCE_MAXIMUM: + init_val = std::numeric_limits::lowest(); + break; + case SCATTER_GATHER_OP::REDUCE_MINIMUM: + init_val = std::numeric_limits::max(); + break; + case SCATTER_GATHER_OP::REDUCE_MEAN: + init_val = (scalar_t)0; + break; + } + self.scatter_(dim, index, init_val); + }); +} + +template void scatter_impl( const Tensor& self, int64_t dim, @@ -1185,7 +1236,8 @@ void scatter_impl( const Tensor& out, ReduceStub& reduce_stub, FillStub& fill_stub, - const c10::optional reduce = nullopt) { + const c10::optional reduce = nullopt, + bool reduce_includes_self = true) { dim = at::maybe_wrap_dim(dim, self.dim()); auto mut_out = const_cast(out); @@ -1197,7 +1249,11 @@ void scatter_impl( if (index.numel() == 0) return; if (reduce.has_value()) { - auto op = meta::get_operator_enum(reduce.value()); + auto op = meta::get_operator_enum(reduce.value(), use_new_options); + if (!reduce_includes_self) { + // scatter inits for reduction to appropriate indices (used by scatter_reduce.two) + scatter_reduce_exclude_self_helper(mut_out, dim, index, op); + } reduce_stub(self.device().type(), mut_out, dim, index, src, op); } else { fill_stub(self.device().type(), mut_out, dim, index, src); @@ -1282,113 +1338,35 @@ TORCH_IMPL_FUNC(scatter_add) } } -Tensor scatter_reduce_two_cpu(const 
Tensor& self, - int64_t dim, - const Tensor& index, - const c10::string_view reduce, - const c10::optional output_size) { - - // TODO: Add documentation. - - - TORCH_CHECK(dim >= -self.dim() && dim < self.dim(), - "Expected `dim` to be in range ", -self.dim(), " to ", self.dim() - 1, " (got ", dim, ")"); - - dim = dim < 0 ? dim + self.dim() : dim; - - auto sizes = self.sizes().vec(); - if (output_size.has_value()) { - sizes[dim] = output_size.value(); - } else { - sizes[dim] = index.numel() > 0 ? index.max().item() + 1: 0; - } - Tensor out = at::empty(sizes, self.options()); - - TORCH_CHECK(self.dim() == index.dim(), - "Shape mismatch between `self` (got ", self.sizes(), ") and `index` (got ", index.sizes(), ")"); - for (const auto i : c10::irange(self.dim())) { - TORCH_CHECK(self.size(i) == index.size(i), - "Shape mismatch between `self` (got ", self.sizes(), ") and `index` (got ", index.sizes(), ")"); - } - - TORCH_CHECK(reduce == "sum" || reduce == "prod" || reduce == "mean" || reduce == "amax" || reduce =="amin", - "`reduce` argument must be one of ('sum', 'prod', 'mean', 'amax', 'amin'"); - - if (self.numel() == 0) { - return out.zero_(); - } - - AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, self.scalar_type(), "scatter_reduce", [&] { - if (reduce == "prod") { - out.fill_((scalar_t)1); - } else if (reduce == "amax") { - out.fill_(std::numeric_limits::lowest()); - } else if (reduce == "amin") { - out.fill_(std::numeric_limits::max()); +TORCH_IMPL_FUNC(scatter_reduce_two) +(const Tensor& self, + int64_t dim, + const Tensor& index, + const Tensor& src, + const c10::string_view reduce, + bool include_self, + const Tensor& out) { + // See issue https://github.com/pytorch/pytorch/issues/74770 + TORCH_WARN_ONCE("scatter_reduce() is in beta and the API may change at any time."); + + scatter_impl(self, dim, index, src, out, + scatter_reduce_two_stub, + scatter_stub, + reduce, + include_self); + + if (meta::get_operator_enum(reduce, true) == SCATTER_GATHER_OP::REDUCE_MEAN) { + auto ones = at::ones_like(src); + auto count = include_self ? 
at::ones_like(out) : at::zeros_like(out); + count.scatter_add_(dim, index, ones); + count.masked_fill_(count == 0, 1); + + if (out.is_floating_point() || out.is_complex()) { + out.div_(count); } else { - out.fill_((scalar_t)0); - } - - - auto self_cont = self.contiguous(); - auto index_cont = index.contiguous(); - auto self_data = self_cont.data_ptr(); - auto index_data = index_cont.data_ptr(); - bool out_is_contiguous = out.is_contiguous(); - auto out_cont = out.contiguous(); - auto out_cont_data = out_cont.data_ptr(); - - auto counts = at::zeros_like(out_cont); - auto counts_data = counts.data_ptr(); - - - int64_t offset1 = 1, offset2 = 1; - for (const auto d : c10::irange(dim)) { - offset1 *= self.size(d); - } - for (int64_t d = dim + 1; d < self.dim(); d++) { - offset2 *= self.size(d); - } - - scalar_t value; - int64_t dim_index; - for (const auto i : c10::irange(offset1)) { - for (const auto j : c10::irange(self.size(dim))) { - for (const auto k : c10::irange(offset2)) { - value = self_data[i * self_cont.stride(dim) * self_cont.size(dim) + j * self_cont.stride(dim) + k]; - dim_index = index_data[i * index_cont.stride(dim) * index_cont.size(dim) + j * index_cont.stride(dim) + k]; - TORCH_CHECK(dim_index >= 0 && dim_index < out.size(dim), - "Expected `index` values to be in range ", 0, " to ", out.size(dim), " (got ", dim_index, ")"); - int64_t ind = i * out_cont.stride(dim) * out_cont.size(dim) + dim_index * out_cont.stride(dim) + k; - if (reduce == "sum") { - out_cont_data[ind] += value; - } else if (reduce == "prod") { - out_cont_data[ind] *= value; - } else if (reduce == "mean") { - auto n = counts_data[ind]; - out_cont_data[ind] = (out_cont_data[ind] * n + value) / (n + 1); - counts_data[ind] += 1; - } else if (reduce == "amax") { - out_cont_data[ind] = std::max(out_cont_data[ind], value); - } else { - out_cont_data[ind] = std::min(out_cont_data[ind], value); - } - } - } - } - - if (reduce == "amin" || reduce == "amax") { - auto val = (reduce == "amin") ? 
std::numeric_limits::max() : std::numeric_limits::lowest(); - out_cont.masked_fill_(out_cont == val, (scalar_t)0); - } - - if (!out_is_contiguous) { - out.copy_(out_cont); + out.div_(count, "floor"); } - - }); - - return out; + } } Tensor masked_scatter(const Tensor & self, const Tensor & mask, const Tensor & source) { diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.h b/aten/src/ATen/native/TensorAdvancedIndexing.h index 689ff5178d550c..a0c282d550e407 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.h +++ b/aten/src/ATen/native/TensorAdvancedIndexing.h @@ -12,7 +12,7 @@ struct TensorIterator; namespace at { namespace native { -enum class SCATTER_GATHER_OP: uint8_t {REDUCE_ADD, REDUCE_MULTIPLY}; +enum class SCATTER_GATHER_OP: uint8_t {REDUCE_ADD, REDUCE_MULTIPLY, REDUCE_MAXIMUM, REDUCE_MINIMUM, REDUCE_MEAN}; using index_put_with_sort_fn = void(*)(Tensor &, const c10::List> &, const Tensor &, bool accumulate, bool unsafe); @@ -24,6 +24,8 @@ using scatter_reduce_fn = void(*)(const Tensor& self, const int64_t dim, const T const Tensor& src, const SCATTER_GATHER_OP& reduce); using scatter_scalar_reduce_fn = void(*)(const Tensor& self, const int64_t dim, const Tensor& index, const Scalar& value, const SCATTER_GATHER_OP& reduce); +using scatter_reduce_two_fn = void(*)(const Tensor& self, const int64_t dim, const Tensor& index, + const Tensor& src, const SCATTER_GATHER_OP& reduce); DECLARE_DISPATCH(index_put_with_sort_fn, index_put_with_sort_stub); @@ -33,6 +35,7 @@ DECLARE_DISPATCH(scatter_fill_fn, scatter_fill_stub); DECLARE_DISPATCH(scatter_add_fn, scatter_add_stub); DECLARE_DISPATCH(scatter_reduce_fn, scatter_reduce_stub); DECLARE_DISPATCH(scatter_scalar_reduce_fn, scatter_scalar_reduce_stub); +DECLARE_DISPATCH(scatter_reduce_two_fn, scatter_reduce_two_stub); TORCH_API Tensor& index_out(Tensor& result, const Tensor & self, const c10::List>& indices); diff --git a/aten/src/ATen/native/TensorCompare.cpp b/aten/src/ATen/native/TensorCompare.cpp index 0114deb943b35f..5054a57ae9a5b8 100644 --- a/aten/src/ATen/native/TensorCompare.cpp +++ b/aten/src/ATen/native/TensorCompare.cpp @@ -323,21 +323,30 @@ static void isin_sorting( } } -Tensor where(const Tensor& condition, const Tensor& self, const Tensor& other) { - TORCH_CHECK(condition.device() == self.device() && self.device() == other.device(), - "Expected condition, x and y to be on the same device, but condition is on ", - condition.device(), " and x and y are on ", self.device(), " and ", other.device(), - " respectively"); +Tensor& where_self_out(const Tensor& condition, const Tensor& self, const Tensor& other, Tensor& out) { + TORCH_CHECK(self.dtype() == other.dtype(), "expected scalar type ", self.dtype(), " but found ", other.dtype()); if (condition.scalar_type() == ScalarType::Byte) { TORCH_WARN_ONCE("where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead."); -} else { + } else { TORCH_CHECK(condition.scalar_type() == ScalarType::Bool, "where expected condition to be a boolean tensor, but got a tensor with dtype ", condition.scalar_type()); + } + Tensor cond_bool = condition.scalar_type() == ScalarType::Byte ? 
condition.to(ScalarType::Bool) : condition; + auto iter = at::TensorIteratorConfig() + .check_all_same_dtype(false) + .add_output(out) + .add_input(cond_bool) + .add_input(self) + .add_input(other) + .build(); + where_kernel(iter.device_type(), iter); + return out; } - c10::MaybeOwned b_condition, b_self, b_other; - std::tie(b_condition, b_self, b_other) = expand_outplace(condition, self, other, "where"); - return at::_s_where(*b_condition, *b_self, *b_other); +Tensor where(const Tensor& condition, const Tensor& self, const Tensor& other) { + Tensor ret = at::empty({0}, self.options()); + at::native::where_self_out(condition, self, other, ret); + return ret; } Tensor where(const Tensor& condition, const Scalar& self, const Tensor& other) { @@ -359,22 +368,6 @@ std::vector where(const Tensor& condition) { return condition.nonzero_numpy(); } -Tensor _s_where(const Tensor& condition, const Tensor& self, const Tensor& other) { - TORCH_CHECK(self.dtype() == other.dtype(), "expected scalar type ", self.dtype(), " but found ", other.dtype()); - Tensor ret = at::empty(self.sizes(), self.options()); - // - Tensor cond_bool = condition.scalar_type() == ScalarType::Byte ? condition.to(ScalarType::Bool) : condition; - auto iter = at::TensorIteratorConfig() - .check_all_same_dtype(false) - .add_output(ret) - .add_input(cond_bool) - .add_input(self) - .add_input(other) - .build(); - where_kernel(iter.device_type(), iter); - return ret; -} - std::tuple mode(const Tensor& self, int64_t dim, bool keepdim) { Tensor values = at::empty({0}, self.options()); Tensor indices = at::empty({0}, self.options().dtype(kLong)); diff --git a/aten/src/ATen/native/TensorConversions.cpp b/aten/src/ATen/native/TensorConversions.cpp index 71690c4bf2d17b..d79b2929e471cf 100644 --- a/aten/src/ATen/native/TensorConversions.cpp +++ b/aten/src/ATen/native/TensorConversions.cpp @@ -240,11 +240,14 @@ Tensor to_dense_backward(const Tensor& grad, const Tensor& input_) { if (input_.layout() == c10::kSparse) { auto input = input_.coalesce(); return grad.sparse_mask(input); - } else if (input_.layout() == c10::kMkldnn) { + } + if (input_.layout() == c10::kMkldnn) { return grad.to_mkldnn(input_.scalar_type()); - } else { - AT_ERROR("Unsupported input layout: ", input_.layout()); } + if (input_.layout() == c10::kStrided) { + return grad.to_dense(); + } + AT_ERROR("Unsupported input layout: ", input_.layout()); } Tensor to_mkldnn_backward(const Tensor& grad, const Tensor& input_) { @@ -252,6 +255,41 @@ Tensor to_mkldnn_backward(const Tensor& grad, const Tensor& input_) { return grad.to_dense(input_.scalar_type()); } +Tensor to_dense(const Tensor& tensor, c10::optional dtype) { + if (tensor.layout() == c10::kSparse) { + return tensor._to_dense(dtype); + } + if (tensor.layout() == c10::kSparseCsr) { + return tensor._to_dense(dtype); + } + if (tensor.layout() == c10::kMkldnn) { + return tensor._to_dense(dtype); + } + TORCH_CHECK(tensor.layout() == c10::kStrided, "to_dense does not support layout ", tensor.layout()); + if (dtype) { + return tensor.to(*dtype); + } + return tensor; +} + +Tensor sparse_to_dense( + const Tensor& self, + c10::optional dtype) { + TORCH_CHECK( + !dtype.has_value(), "dtype argument is not supported by sparse_to_dense"); + Tensor dst = at::zeros(self.sizes(), self.options().layout(kStrided)); + return dst.add_(self); +} + +Tensor sparse_csr_to_dense( + const Tensor& self, + c10::optional dtype) { + TORCH_CHECK( + !dtype.has_value(), "dtype argument is not supported by sparse_csr_to_dense"); + Tensor dst = 
at::zeros(self.sizes(), self.options().layout(kStrided)); + return dst.add_(self); +} + // Computes the strides for view_dtype output when the view dtype is // smaller than the original dtype inline DimVector compute_strides_for_view_dtype_downsize(IntArrayRef old_strides, int64_t size_ratio, ScalarType old_dtype, ScalarType new_dtype) { @@ -371,4 +409,32 @@ Tensor view_dtype(const Tensor& self, ScalarType dtype) { return new_tensor; } +Tensor dense_to_sparse_csr(const Tensor& self) { + return self.to_sparse().to_sparse_csr(); +} + +Tensor csr_to_sparse_csr(const Tensor& self) { + return self; +} + +Tensor coo_to_sparse_csr(const Tensor& self) { + TORCH_CHECK( + self.dim() == 2, + "Only 2D tensors can be converted to the CSR format but got shape: ", + self.sizes()); + auto coalesced_self = self.coalesce(); + auto row_indices = coalesced_self.indices()[0]; + bool out_int32 = (row_indices.scalar_type() == at::kInt); + auto crow_indices = at::_convert_indices_from_coo_to_csr( + row_indices, self.size(0), out_int32); + return at::native::_sparse_csr_tensor_unsafe( + crow_indices, + coalesced_self.indices()[1].contiguous(), + coalesced_self.values(), + coalesced_self.sizes(), + coalesced_self.scalar_type(), + c10::kSparseCsr, + coalesced_self.device()); +} + }} // namespace at::native diff --git a/aten/src/ATen/native/TensorFactories.cpp b/aten/src/ATen/native/TensorFactories.cpp index 458a694411e4bc..5cba59058beb66 100644 --- a/aten/src/ATen/native/TensorFactories.cpp +++ b/aten/src/ATen/native/TensorFactories.cpp @@ -110,9 +110,9 @@ Tensor _dim_arange(const Tensor& like, int64_t dim) { // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ complex / polar ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ void complex_check_floating(const Tensor& a, const Tensor& b) { - TORCH_CHECK((a.scalar_type() == kFloat || a.scalar_type() == kDouble) && - (b.scalar_type() == kFloat || b.scalar_type() == kDouble), - "Expected both inputs to be Float or Double tensors but got ", + TORCH_CHECK((a.scalar_type() == kFloat || a.scalar_type() == kDouble || a.scalar_type() == kHalf) && + (b.scalar_type() == kFloat || b.scalar_type() == kDouble || b.scalar_type() == kHalf), + "Expected both inputs to be Half, Float or Double tensors but got ", a.scalar_type(), " and ", b.scalar_type()); } @@ -1344,6 +1344,11 @@ Tensor kaiser_window( TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("kaiser_window", options, window_length); + // short-circuit for `meta`. + if (device == kMeta) { + return at::empty({window_length}, options); + } + if (window_length == 0) { return at::empty({0}, options); } diff --git a/aten/src/ATen/native/TensorFactories.h b/aten/src/ATen/native/TensorFactories.h index 2d4a306f094875..35e058df4b3ab7 100644 --- a/aten/src/ATen/native/TensorFactories.h +++ b/aten/src/ATen/native/TensorFactories.h @@ -35,6 +35,10 @@ namespace at { namespace native { // In this case, we first calculate the size of top trapezoid, and then // calculate the size of the bottom rectangle. inline int64_t get_tril_size(int64_t row, int64_t col, int64_t offset) { + // If either dimension is 0 then the there is no tril + if (row == 0 || col == 0) { + return 0; + } // number of elements in the first row of the tril auto m_first_row = offset > 0 ? 
std::min(col, 1 + offset) : // upper bounded by col diff --git a/aten/src/ATen/native/TensorProperties.cpp b/aten/src/ATen/native/TensorProperties.cpp index 63d928749e0910..fd72abc580b4ca 100644 --- a/aten/src/ATen/native/TensorProperties.cpp +++ b/aten/src/ATen/native/TensorProperties.cpp @@ -1,6 +1,6 @@ #include #include -#include +#include #include #include @@ -31,7 +31,7 @@ int64_t stride(const Tensor& self, Dimname dim) { return self.strides()[pos_dim]; } -bool cudnn_is_acceptable(const Tensor& self) { +bool cudnn_is_acceptable(const TensorBase& self) { if (!globalContext().userEnabledCuDNN()) return false; if (!self.is_cuda()) return false; auto st = self.scalar_type(); @@ -48,6 +48,10 @@ bool cudnn_is_acceptable(const Tensor& self) { return true; } +bool cudnn_is_acceptable(const Tensor& self) { + return cudnn_is_acceptable(static_cast(self)); +} + Tensor & detach_(Tensor & self) { // this just exists to give us a hook in VariableType and an entry in Declarations.yaml //AT_ERROR("detach_ is not implemented for Tensor"); diff --git a/aten/src/ATen/native/TensorProperties.h b/aten/src/ATen/native/TensorProperties.h new file mode 100644 index 00000000000000..fe6e8395c178e9 --- /dev/null +++ b/aten/src/ATen/native/TensorProperties.h @@ -0,0 +1,12 @@ +#pragma once + +// See NOTE: [Tensor vs. TensorBase] +namespace at { +class TensorBase; +} + +namespace at { namespace native { + +TORCH_API bool cudnn_is_acceptable(const TensorBase& self); + +}} // namespace at::native diff --git a/aten/src/ATen/native/TensorShape.cpp b/aten/src/ATen/native/TensorShape.cpp index 21233a13c3b7a4..28a79421675247 100644 --- a/aten/src/ATen/native/TensorShape.cpp +++ b/aten/src/ATen/native/TensorShape.cpp @@ -59,9 +59,11 @@ Tensor& set_storage_cpu_(Tensor& result, Storage storage, int64_t storage_offset checkSetStorage(result, storage, storage_offset, size, stride); result.unsafeGetTensorImpl()->set_storage_offset(storage_offset); - c10::optional stride_opt = stride.data() != nullptr ? - c10::optional(stride) : c10::nullopt; - at::native::resize_impl_cpu_(result.unsafeGetTensorImpl(), size, stride_opt); + at::OptionalIntArrayRef stride_opt = stride.data() != nullptr ? + at::OptionalIntArrayRef(stride) : c10::nullopt; + // We can re-use this kernel for the meta device. + // We just need to make sure we don't actually try to resize the (null) storage. + at::native::resize_impl_cpu_(result.unsafeGetTensorImpl(), size, stride_opt, /*resize_storage=*/!result.is_meta()); return result; } @@ -87,6 +89,19 @@ Tensor& set_cpu_(Tensor& result) { return result; } +// We can't re-use the cpu kernel here because we don't want to use the cpu allocator. 
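// Editor's note (illustrative sketch, not part of this diff): set_meta_ below resets a
// meta tensor to an empty 1-D view over a zero-byte meta-allocated storage, so a hedged
// usage example (names and expectations assumed, not taken from this PR) would be:
//   at::Tensor t = at::empty({2, 3}, at::TensorOptions().device(at::kMeta));
//   t.set_();                                 // would dispatch to set_meta_ for meta tensors
//   // expected: t.is_meta() && t.dim() == 1 && t.numel() == 0, dtype preserved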
+Tensor& set_meta_(Tensor& result) { + caffe2::TypeMeta dtype = result.dtype(); + Storage storage( + Storage::use_byte_size_t(), + 0, + c10::GetAllocator(kMeta), + true); + result.set_(storage, 0, {0}, {}); + TORCH_INTERNAL_ASSERT(dtype == result.dtype()); + return result; +} + Tensor sparse_broadcast_to(const Tensor& self, IntArrayRef size) { TORCH_CHECK(self.is_sparse(), "input must be sparse tensor"); int64_t sparse_extra_ndim = size.size() - self.dim(); @@ -877,6 +892,19 @@ const Tensor &as_strided_(const Tensor& self, IntArrayRef size, IntArrayRef stri return self; } +Tensor narrow_copy_symint(const Tensor& self, int64_t dim, int64_t start, SymInt sym_length) { + return narrow_copy(self, dim, start, sym_length.expect_int()); +} + +Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) { + return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous); +} + +Tensor narrow_copy_dense_cpu(const Tensor& self, int64_t dim, int64_t start, int64_t length){ + auto output = at::empty_like(self); + return narrow_copy_dense_cpu_out(self, dim, start, length, output); +} + Tensor narrow_copy_sparse(const Tensor& self, int64_t dim, int64_t start, int64_t length) { int64_t allDim = self.dim(); int64_t end = start+length; @@ -914,6 +942,7 @@ Tensor narrow_copy_sparse(const Tensor& self, int64_t dim, int64_t start, int64_ Tensor& narrow_copy_dense_cpu_out( const Tensor& self, int64_t dim, int64_t start, int64_t length, Tensor& output ) { + TORCH_CHECK(self.dim() > 0, "narrow() cannot be applied to a 0-dim tensor."); TORCH_CHECK(self.dtype() == output.dtype()); @@ -991,15 +1020,6 @@ Tensor& narrow_copy_dense_cpu_out( return output; } -Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length){ - return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous); -} - -Tensor narrow_copy_dense_cpu(const Tensor& self, int64_t dim, int64_t start, int64_t length){ - auto output = at::empty_like(self); - return narrow_copy_dense_cpu_out(self, dim, start, length, output); -} - Tensor narrow(const Tensor& self, int64_t dim, int64_t start, int64_t length) { TORCH_CHECK(self.dim() > 0, "narrow() cannot be applied to a 0-dim tensor."); auto cur_size = self.size(dim); @@ -1159,7 +1179,7 @@ Tensor reshape(const Tensor& self, IntArrayRef proposed_shape) { // // We need to do the checks here instead of in `native_functions.yaml` // to preserve backwards compatibility. 
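// Editor's note (illustrative sketch, not part of this diff): on backends that take the
// _reshape_alias path below, reshaping an already-viewable tensor aliases the original
// storage rather than copying, e.g. (hedged example, not from this PR):
//   auto base = at::arange(6);
//   auto r = base.reshape({2, 3});            // contiguous input -> view via _reshape_alias
//   // expected: r.is_alias_of(base)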
- if (!self.is_xla() && !self.is_lazy()) { + if (!self.is_xla() && !self.is_lazy() && !self.is_ipu()) { return self._reshape_alias(shape, stride.value()); } else { return self.view(shape); @@ -1464,6 +1484,10 @@ std::vector split(const Tensor& self, int64_t split_size, int64_t dim) { return splits; } +std::vector split(const Tensor& self, IntArrayRef sizes, int64_t dim) { + return at::split_with_sizes(self, sizes, dim); +} + std::vector unsafe_split(const Tensor& self, int64_t split_size, int64_t dim) { auto result = at::native::split(self, split_size, dim); for (auto& t : result) { @@ -2206,7 +2230,7 @@ Tensor flatten(const Tensor& self, DimnameList dims, Dimname out_dim) { } Tensor ravel(const Tensor& self) { - return self.reshape(-1); + return self.contiguous().view(-1); } static inline void handle_unflatten_exception(const std::runtime_error &e, diff --git a/aten/src/ATen/native/TensorShape.h b/aten/src/ATen/native/TensorShape.h index 69eb749ea48483..c9fd4d8ad61757 100644 --- a/aten/src/ATen/native/TensorShape.h +++ b/aten/src/ATen/native/TensorShape.h @@ -1,4 +1,5 @@ -#include +#pragma once +#include #include namespace at { @@ -47,4 +48,11 @@ inline int64_t get_num_splits(const Tensor& self, int64_t split_size, int64_t di return num_splits; } +/// +/// For more information, see +/// https://pytorch.org/docs/master/generated/torch.Tensor.unfold.html#torch.Tensor.unfold +/// + +Tensor unfold(const Tensor& self, int64_t dimension, int64_t size, int64_t step); + }} // namespace at::native diff --git a/aten/src/ATen/native/TensorTransformations.cpp b/aten/src/ATen/native/TensorTransformations.cpp index 5e5f9c91179e42..e555fc1db3a3b9 100644 --- a/aten/src/ATen/native/TensorTransformations.cpp +++ b/aten/src/ATen/native/TensorTransformations.cpp @@ -1,6 +1,7 @@ #include #include // for flip_stub +#include #include #include #include diff --git a/aten/src/ATen/native/TensorTransformations.h b/aten/src/ATen/native/TensorTransformations.h index 03ee31e696aada..4909ebe84bb03e 100644 --- a/aten/src/ATen/native/TensorTransformations.h +++ b/aten/src/ATen/native/TensorTransformations.h @@ -1,4 +1,10 @@ -#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif #include diff --git a/aten/src/ATen/native/TestOps.cpp b/aten/src/ATen/native/TestOps.cpp index 0658502619209a..9a3a5b10cb2693 100644 --- a/aten/src/ATen/native/TestOps.cpp +++ b/aten/src/ATen/native/TestOps.cpp @@ -13,7 +13,7 @@ namespace native { /// Else, return a new tensor containing the elementwise sums. 
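/// Editor's note (illustrative sketch, not part of this diff): switching the parameter to
/// at::OptionalIntArrayRef is assumed to keep existing call sites working, since it converts
/// implicitly from an int64_t container or from c10::nullopt, e.g.:
///   std::vector<int64_t> addends = {1, 2};
///   auto summed = at::_test_optional_intlist(values, addends);          // elementwise sums, per the doc above
///   auto passthrough = at::_test_optional_intlist(values, c10::nullopt); // returns `values` unchanged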
Tensor _test_optional_intlist( const Tensor& values, - c10::optional addends) { + at::OptionalIntArrayRef addends) { if (!addends) { return values; } diff --git a/aten/src/ATen/native/UnaryOps.cpp b/aten/src/ATen/native/UnaryOps.cpp index 64e17dd9dd0413..8577ca8c1c079a 100644 --- a/aten/src/ATen/native/UnaryOps.cpp +++ b/aten/src/ATen/native/UnaryOps.cpp @@ -67,6 +67,7 @@ CREATE_UNARY_FLOAT_META_FUNC(special_i0e) CREATE_UNARY_FLOAT_META_FUNC(special_i1) CREATE_UNARY_FLOAT_META_FUNC(special_i1e) CREATE_UNARY_FLOAT_META_FUNC(special_ndtri) +CREATE_UNARY_FLOAT_META_FUNC(special_log_ndtr) CREATE_UNARY_FLOAT_META_FUNC(sqrt) CREATE_UNARY_FLOAT_META_FUNC(tan) CREATE_UNARY_FLOAT_META_FUNC(tanh) @@ -184,6 +185,7 @@ CREATE_UNARY_TORCH_IMPL_FUNC(special_i0e_out, special_i0e_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_i1e_out, special_i1e_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_i1_out, special_i1_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_ndtri_out, special_ndtri_stub) +CREATE_UNARY_TORCH_IMPL_FUNC(special_log_ndtr_out, special_log_ndtr_stub) CREATE_UNARY_TORCH_IMPL_FUNC(sqrt_out, sqrt_stub) CREATE_UNARY_TORCH_IMPL_FUNC(tan_out, tan_stub) CREATE_UNARY_TORCH_IMPL_FUNC(tanh_out, tanh_stub) @@ -538,7 +540,7 @@ Tensor special_sinc(const Tensor& self) { return self.sinc(); } namespace { inline Tensor calc_ndtr(const Tensor& self) { - auto x_sqrt_2 = self / std::sqrt(2.); + auto x_sqrt_2 = self * M_SQRT1_2; return (1 + at::erf(x_sqrt_2)) * 0.5; } @@ -841,6 +843,7 @@ DEFINE_DISPATCH(log1p_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global- DEFINE_DISPATCH(log2_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) DEFINE_DISPATCH(logical_not_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) DEFINE_DISPATCH(special_ndtri_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) +DEFINE_DISPATCH(special_log_ndtr_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) DEFINE_DISPATCH(neg_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) DEFINE_DISPATCH(nan_to_num_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) DEFINE_DISPATCH(polygamma_stub); // NOLINT(cppcoreguidelines-avoid-non-const-global-variables) diff --git a/aten/src/ATen/native/UnaryOps.h b/aten/src/ATen/native/UnaryOps.h index 0a9afd9cd4dbd0..c0fb139c0b1594 100644 --- a/aten/src/ATen/native/UnaryOps.h +++ b/aten/src/ATen/native/UnaryOps.h @@ -52,6 +52,7 @@ DECLARE_DISPATCH(unary_fn, log10_stub); DECLARE_DISPATCH(unary_fn, log1p_stub); DECLARE_DISPATCH(unary_fn, log2_stub); DECLARE_DISPATCH(unary_fn, special_ndtri_stub); +DECLARE_DISPATCH(unary_fn, special_log_ndtr_stub); DECLARE_DISPATCH(unary_fn, neg_stub); DECLARE_DISPATCH(unary_fn, reciprocal_stub); diff --git a/aten/src/ATen/native/UpSample.cpp b/aten/src/ATen/native/UpSample.cpp index bcc8891de8dcd7..db75b7e99fdb1a 100644 --- a/aten/src/ATen/native/UpSample.cpp +++ b/aten/src/ATen/native/UpSample.cpp @@ -9,7 +9,7 @@ namespace upsample { TORCH_API c10::SmallVector compute_output_size( c10::IntArrayRef input_size, // Full input tensor size. 
- c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { const auto spatial_dimensions = static_cast(input_size.size()) - 2; if (output_size) { diff --git a/aten/src/ATen/native/UpSample.h b/aten/src/ATen/native/UpSample.h index 743188a623b49f..8cc476ab445cb2 100644 --- a/aten/src/ATen/native/UpSample.h +++ b/aten/src/ATen/native/UpSample.h @@ -51,7 +51,7 @@ namespace upsample { TORCH_API c10::SmallVector compute_output_size( c10::IntArrayRef input_size, // Full input tensor size. - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors); inline c10::optional get_scale_value(c10::optional> scales, int idx) { diff --git a/aten/src/ATen/native/UpSampleBicubic2d.cpp b/aten/src/ATen/native/UpSampleBicubic2d.cpp index 95d9f91bcb8036..a23019ecc0eb60 100644 --- a/aten/src/ATen/native/UpSampleBicubic2d.cpp +++ b/aten/src/ATen/native/UpSampleBicubic2d.cpp @@ -264,7 +264,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_bicubic2d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -275,7 +275,7 @@ Tensor upsample_bicubic2d( Tensor upsample_bicubic2d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { @@ -287,7 +287,7 @@ Tensor upsample_bicubic2d_backward( Tensor _upsample_bicubic2d_aa( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -298,7 +298,7 @@ Tensor _upsample_bicubic2d_aa( Tensor _upsample_bicubic2d_aa_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { diff --git a/aten/src/ATen/native/UpSampleBilinear2d.cpp b/aten/src/ATen/native/UpSampleBilinear2d.cpp index f73bb50c9ff426..2a228a86ac71d7 100644 --- a/aten/src/ATen/native/UpSampleBilinear2d.cpp +++ b/aten/src/ATen/native/UpSampleBilinear2d.cpp @@ -145,7 +145,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_bilinear2d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -156,7 +156,7 @@ Tensor upsample_bilinear2d( Tensor upsample_bilinear2d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { @@ -168,7 +168,7 @@ Tensor upsample_bilinear2d_backward( Tensor _upsample_bilinear2d_aa( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -179,7 +179,7 @@ Tensor _upsample_bilinear2d_aa( Tensor _upsample_bilinear2d_aa_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { diff --git a/aten/src/ATen/native/UpSampleLinear1d.cpp 
b/aten/src/ATen/native/UpSampleLinear1d.cpp index 371a53dc890028..687cad5c879bf8 100644 --- a/aten/src/ATen/native/UpSampleLinear1d.cpp +++ b/aten/src/ATen/native/UpSampleLinear1d.cpp @@ -79,7 +79,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_linear1d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -89,7 +89,7 @@ Tensor upsample_linear1d( Tensor upsample_linear1d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { diff --git a/aten/src/ATen/native/UpSampleNearest1d.cpp b/aten/src/ATen/native/UpSampleNearest1d.cpp index 52fa7bcc5c9a5e..b9bc5b3c5b9682 100644 --- a/aten/src/ATen/native/UpSampleNearest1d.cpp +++ b/aten/src/ATen/native/UpSampleNearest1d.cpp @@ -109,7 +109,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_nearest1d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_w = get_scale_value(scale_factors, 0); @@ -118,7 +118,7 @@ Tensor upsample_nearest1d( Tensor _upsample_nearest_exact1d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_w = get_scale_value(scale_factors, 0); @@ -127,7 +127,7 @@ Tensor _upsample_nearest_exact1d( Tensor upsample_nearest1d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); @@ -137,7 +137,7 @@ Tensor upsample_nearest1d_backward( Tensor _upsample_nearest_exact1d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); diff --git a/aten/src/ATen/native/UpSampleNearest2d.cpp b/aten/src/ATen/native/UpSampleNearest2d.cpp index 864121fb0afa0d..1f9a9eafd4f6db 100644 --- a/aten/src/ATen/native/UpSampleNearest2d.cpp +++ b/aten/src/ATen/native/UpSampleNearest2d.cpp @@ -134,7 +134,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_nearest2d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_h = get_scale_value(scale_factors, 0); @@ -144,7 +144,7 @@ Tensor upsample_nearest2d( Tensor _upsample_nearest_exact2d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_h = get_scale_value(scale_factors, 0); @@ -154,7 +154,7 @@ Tensor _upsample_nearest_exact2d( Tensor upsample_nearest2d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); @@ -165,7 +165,7 @@ Tensor 
upsample_nearest2d_backward( Tensor _upsample_nearest_exact2d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); diff --git a/aten/src/ATen/native/UpSampleNearest3d.cpp b/aten/src/ATen/native/UpSampleNearest3d.cpp index c659a86cd81f39..ff559f3e09c07b 100644 --- a/aten/src/ATen/native/UpSampleNearest3d.cpp +++ b/aten/src/ATen/native/UpSampleNearest3d.cpp @@ -149,7 +149,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_nearest3d_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); @@ -160,7 +160,7 @@ Tensor upsample_nearest3d_cpu( Tensor _upsample_nearest_exact3d_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); @@ -172,7 +172,7 @@ Tensor _upsample_nearest_exact3d_cpu( // when structured kernels can handle QuantizedCPU, update these overloads to be CompositeExplicitAutograd Tensor upsample_nearest3d_backward_cpu( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); @@ -184,7 +184,7 @@ Tensor upsample_nearest3d_backward_cpu( Tensor _upsample_nearest_exact3d_backward_cpu( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); diff --git a/aten/src/ATen/native/UpSampleTrilinear3d.cpp b/aten/src/ATen/native/UpSampleTrilinear3d.cpp index 75a77a76c623d2..256e5e235b461a 100644 --- a/aten/src/ATen/native/UpSampleTrilinear3d.cpp +++ b/aten/src/ATen/native/UpSampleTrilinear3d.cpp @@ -90,7 +90,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_trilinear3d( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); @@ -102,7 +102,7 @@ Tensor upsample_trilinear3d( Tensor upsample_trilinear3d_backward( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, bool align_corners, c10::optional> scale_factors) { diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp index e0fb55427a77f3..187ed4fd1404ab 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp @@ -2,7 +2,6 @@ #include #include -#include #include #include #include diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp index a0a389f818c480..ec6e160b16c3e6 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp @@ -1,7 +1,6 @@ 
#include #include -#include #include #include #include diff --git a/aten/src/ATen/native/cpu/Activation.cpp b/aten/src/ATen/native/cpu/Activation.cpp index 1eebcde30c9edf..637972e5ff6198 100644 --- a/aten/src/ATen/native/cpu/Activation.cpp +++ b/aten/src/ATen/native/cpu/Activation.cpp @@ -24,41 +24,106 @@ namespace { template inline void _vec_log_sigmoid(TensorBase &output, TensorBase &buffer, const TensorBase &input) { - using Vec = Vectorized; - scalar_t* output_data = output.data_ptr(); - scalar_t* buffer_data = buffer.data_ptr(); - scalar_t* input_data = input.data_ptr(); - parallel_for(0, input.numel(), 1, [&] (int64_t begin, int64_t end) { - int64_t size = end - begin; - int64_t d = 0; - for (; d < size - (size % Vec::size()); d += Vec::size()) { - Vec data_vec = Vec::loadu(input_data + begin+ d); - Vec min_vec = vec::minimum(data_vec, Vec(scalar_t(0))); - Vec buffer_vec = data_vec.abs().neg().exp(); - Vec output_vec = min_vec - buffer_vec.log1p(); - buffer_vec.store(buffer_data + begin + d); - output_vec.store(output_data + begin + d); - } - if (size - d > 0) { - Vec data_vec = Vec::loadu(input_data + begin + d, size - d); - Vec min_vec = vec::minimum(data_vec, Vec(scalar_t(0))); - Vec buffer_vec = data_vec.abs().neg().exp(); - Vec output_vec = min_vec - buffer_vec.log1p(); - buffer_vec.store(buffer_data + begin + d, size - d); - output_vec.store(output_data + begin + d, size - d); - } - }); + if (input.scalar_type() == kBFloat16) { + using Vec = Vectorized; + BFloat16* output_data = output.data_ptr(); + BFloat16* buffer_data = buffer.data_ptr(); + BFloat16* input_data = input.data_ptr(); + parallel_for(0, input.numel(), 1, [&] (int64_t begin, int64_t end) { + int64_t size = end - begin; + int64_t d = 0; + for (; d < size - (size % Vec::size()); d += Vec::size()) { + Vec data_vec = Vec::loadu(input_data + begin+ d); + Vectorized data_vec0, data_vec1; + std::tie(data_vec0, data_vec1) = convert_bfloat16_float(data_vec); + Vectorized min_vec = minimum(data_vec0, Vectorized(float(0))); + Vectorized buffer_vec0 = data_vec0.abs().neg().exp(); + Vectorized output_vec0 = min_vec - buffer_vec0.log1p(); + min_vec = minimum(data_vec1, Vectorized(float(0))); + Vectorized buffer_vec1 = data_vec1.abs().neg().exp(); + Vectorized output_vec1 = min_vec - buffer_vec1.log1p(); + convert_float_bfloat16(buffer_vec0, buffer_vec1).store(buffer_data + begin + d); + convert_float_bfloat16(output_vec0, output_vec1).store(output_data + begin + d); + } + if (size - d > 0) { + Vec data_vec = Vec::loadu(input_data + begin + d, size - d); + Vectorized data_vec0, data_vec1; + std::tie(data_vec0, data_vec1) = convert_bfloat16_float(data_vec); + Vectorized min_vec = minimum(data_vec0, Vectorized(float(0))); + Vectorized buffer_vec0 = data_vec0.abs().neg().exp(); + Vectorized output_vec0 = min_vec - buffer_vec0.log1p(); + min_vec = minimum(data_vec1, Vectorized(float(0))); + Vectorized buffer_vec1 = data_vec1.abs().neg().exp(); + Vectorized output_vec1 = min_vec - buffer_vec1.log1p(); + convert_float_bfloat16(buffer_vec0, buffer_vec1).store(buffer_data + begin + d, size - d); + convert_float_bfloat16(output_vec0, output_vec1).store(output_data + begin + d, size - d); + } + }); + } else { + using Vec = Vectorized; + scalar_t* output_data = output.data_ptr(); + scalar_t* buffer_data = buffer.data_ptr(); + scalar_t* input_data = input.data_ptr(); + parallel_for(0, input.numel(), 1, [&] (int64_t begin, int64_t end) { + int64_t size = end - begin; + int64_t d = 0; + for (; d < size - (size % Vec::size()); d += 
Vec::size()) { + Vec data_vec = Vec::loadu(input_data + begin+ d); + Vec min_vec = vec::minimum(data_vec, Vec(scalar_t(0))); + Vec buffer_vec = data_vec.abs().neg().exp(); + Vec output_vec = min_vec - buffer_vec.log1p(); + buffer_vec.store(buffer_data + begin + d); + output_vec.store(output_data + begin + d); + } + if (size - d > 0) { + Vec data_vec = Vec::loadu(input_data + begin + d, size - d); + Vec min_vec = vec::minimum(data_vec, Vec(scalar_t(0))); + Vec buffer_vec = data_vec.abs().neg().exp(); + Vec output_vec = min_vec - buffer_vec.log1p(); + buffer_vec.store(buffer_data + begin + d, size - d); + output_vec.store(output_data + begin + d, size - d); + } + }); + } } -static void log_sigmoid_cpu_kernel( - TensorBase &output, TensorBase &buffer, const TensorBase &input) { - AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&] { +static void log_sigmoid_cpu_kernel(TensorBase &output, TensorBase &buffer, const TensorBase &input) { + AT_DISPATCH_FLOATING_TYPES_AND(kBFloat16, input.scalar_type(), "log_sigmoid_cpu", [&] { _vec_log_sigmoid(output, buffer, input); }); } static void log_sigmoid_backward_cpu_kernel(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "log_sigmoid_backward_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + using Vec = Vectorized; + auto zero_val = float(0); + auto zero_vec = Vectorized(zero_val); + auto one_val = float(1); + auto one_vec = Vectorized(one_val); + cpu_kernel_vec(iter, + [=](BFloat16 a, BFloat16 b, BFloat16 c) -> BFloat16 { + auto in_negative = float(a) < float(0); + auto max_deriv = in_negative ? float(1) : float(0); + auto sign = in_negative ? float(1) : -float(1); + return (max_deriv - sign * (float(b) / (float(1) + b))) * float(c); + }, + [=](Vec a, Vec b, Vec c) -> Vec { + Vectorized a0, a1, b0, b1, c0, c1; + std::tie(a0, a1) = convert_bfloat16_float(a); + std::tie(b0, b1) = convert_bfloat16_float(b); + std::tie(c0, c1) = convert_bfloat16_float(c); + auto mask = a0 < zero_vec; + auto max_deriv_vec = Vectorized::blendv(zero_vec, one_vec, mask); + auto sign_vec = Vectorized::blendv(one_vec.neg(), one_vec, mask); + a0 = (max_deriv_vec - sign_vec * (b0 / (one_vec + b0))) * c0; + mask = a1 < zero_vec; + max_deriv_vec = Vectorized::blendv(zero_vec, one_vec, mask); + sign_vec = Vectorized::blendv(one_vec.neg(), one_vec, mask); + a1 = (max_deriv_vec - sign_vec * (b1 / (one_vec + b1))) * c1; + return convert_float_bfloat16(a0, a1); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "log_sigmoid_backward_cpu", [&]() { using Vec = Vectorized; auto zero_val = scalar_t(0); auto zero_vec = Vec(zero_val); @@ -78,6 +143,7 @@ static void log_sigmoid_backward_cpu_kernel(TensorIterator& iter) { return (max_deriv_vec - sign_vec * (b / (one_vec + b))) * c; }); }); + } } static void threshold_kernel( @@ -318,7 +384,34 @@ void GeluBackwardKernelImpl(TensorIteratorBase& it, GeluType approximate) { } void hardsigmoid_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardsigmoid_cpu", [&] { + if (iter.dtype() == kBFloat16) { + const float zero(0.0f); + const float three(3.0f); + const float six(6.0f); + using Vec = vec::Vectorized; + const Vec kZeroVec(zero); + const Vec kThreeVec(three); + const Vec kSixVec(six); + cpu_kernel_vec( + iter, + [&](BFloat16 self_val) -> BFloat16 { + return std::min(std::max(float(self_val) + three, zero), six) / six; + }, + [&](vec::Vectorized self_val) -> vec::Vectorized { + Vectorized self_val0, self_val1; + std::tie(self_val0, self_val1) = 
convert_bfloat16_float(self_val); + self_val0 = minimum( + maximum(self_val0 + kThreeVec, kZeroVec), + kSixVec + ) / kSixVec; + self_val1 = minimum( + maximum(self_val1 + kThreeVec, kZeroVec), + kSixVec + ) / kSixVec; + return convert_float_bfloat16(self_val0, self_val1); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardsigmoid_cpu", [&] { const scalar_t zero(0.0f); const scalar_t three(3.0f); const scalar_t six(6.0f); @@ -338,10 +431,37 @@ void hardsigmoid_kernel(TensorIteratorBase& iter) { ) / kSixVec; }); }); + } } void hardsigmoid_backward_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardsigmoid_backward", [&] { + if (iter.dtype() == kBFloat16) { + const float zero(0.0f); + const float three(3.0f); + const float neg_three(-3.0f); + const float one_sixth(1.0f / 6.0f); + using Vec = Vectorized; + Vec kZeroVec(0.0f); + Vec kOneSixthVec(1.0f / 6.0f); + cpu_kernel_vec( + iter, + [=](BFloat16 grad_val, BFloat16 self_val) -> BFloat16 { + return (float(self_val) > neg_three && float(self_val) < three) + ? float(grad_val) * one_sixth + : zero; + }, + [=](Vectorized grad_val, Vectorized self_val) -> Vectorized { + Vec self_val0, self_val1, grad_val0, grad_val1; + std::tie(self_val0, self_val1) = convert_bfloat16_float(self_val); + std::tie(grad_val0, grad_val1) = convert_bfloat16_float(grad_val); + Vec gradNonZeroMask = (self_val0 > neg_three) & (self_val0 < three); + self_val0 = Vec::blendv(kZeroVec, grad_val0 * kOneSixthVec, gradNonZeroMask); + gradNonZeroMask = (self_val1 > neg_three) & (self_val1 < three); + self_val1 = Vec::blendv(kZeroVec, grad_val1 * kOneSixthVec, gradNonZeroMask); + return convert_float_bfloat16(self_val0, self_val1); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardsigmoid_backward", [&] { const scalar_t zero(0.0f); const scalar_t three(3.0f); const scalar_t neg_three(-3.0f); @@ -361,10 +481,11 @@ void hardsigmoid_backward_kernel(TensorIteratorBase& iter) { return Vec::blendv(kZeroVec, grad_val * kOneSixthVec, gradNonZeroMask); }); }); + } } void hardshrink_kernel(TensorIteratorBase& iter, const Scalar& lambd) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardshrink_cpu", [&] { + AT_DISPATCH_FLOATING_TYPES_AND(kBFloat16, iter.dtype(), "hardshrink_cpu", [&] { auto lambd_val = lambd.to(); cpu_kernel_vec( iter, @@ -379,16 +500,43 @@ void hardshrink_kernel(TensorIteratorBase& iter, const Scalar& lambd) { } void softshrink_kernel(TensorIteratorBase& iter, const Scalar& lambd) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softshrink_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + auto lambd_val = lambd.to(); + auto lambdVec = Vectorized(lambd_val); + cpu_kernel_vec( + iter, + [=](BFloat16 a) -> BFloat16 { + return float(a) > lambd_val ? a - lambd_val : (float(a) < -lambd_val ? a + lambd_val : float(0)); + }, + [=](Vectorized self_val) { + Vectorized self_val0, self_val1; + Vectorized self_val_t0, self_val_t1; + std::tie(self_val0, self_val1) = convert_bfloat16_float(self_val); + self_val_t0 = convert_float_bfloat16((self_val0 > lambdVec) & (self_val0 - lambdVec), (self_val1 > lambdVec) & (self_val1 - lambdVec)); + self_val_t1 = convert_float_bfloat16((self_val0 < -lambd_val) & (self_val0 + lambdVec), (self_val1 < -lambd_val) & (self_val1 + lambdVec)); + return (self_val_t0 | self_val_t1); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softshrink_cpu", [&]() { auto lambd_val = lambd.to(); - cpu_kernel(iter, [=](scalar_t a) -> scalar_t { - return a > lambd_val ? a - lambd_val : (a < -lambd_val ? 
a + lambd_val : scalar_t(0)); - }); + auto lambdVec = Vectorized(lambd_val); + cpu_kernel_vec( + iter, + [=](scalar_t a) -> scalar_t { + return a > lambd_val ? a - lambd_val : (a < -lambd_val ? a + lambd_val : scalar_t(0)); + }, + [=](Vectorized self_val) { + Vectorized self_val_t0, self_val_t1; + self_val_t0 = (self_val > lambdVec) & (self_val - lambdVec); + self_val_t1 = (self_val < -lambd_val) & (self_val + lambdVec); + return (self_val_t0 | self_val_t1); + }); }); + } } void shrink_backward_kernel(TensorIteratorBase& iter, const Scalar& lambd) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "shrink_backward_cpu", [&] { + AT_DISPATCH_FLOATING_TYPES_AND(kBFloat16, iter.dtype(), "shrink_backward_cpu", [&] { auto lambd_val = lambd.to(); cpu_kernel_vec( iter, @@ -418,7 +566,35 @@ void hardtanh_backward_kernel(TensorIterator& iter, const Scalar& min, const Sca } void hardswish_kernel(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardswish_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + const float zero(0.0f); + const float three(3.0f); + const float six(6.0f); + using Vec = vec::Vectorized; + const Vec kZeroVec(zero); + const Vec kThreeVec(three); + const Vec kSixVec(six); + cpu_kernel_vec( + iter, + [&](BFloat16 x) -> BFloat16 { + return float(x) * std::min(std::max(float(x) + three, zero), six) / six; + }, + [&](vec::Vectorized x_vec) { + Vectorized x_vec0, x_vec1; + std::tie(x_vec0, x_vec1) = convert_bfloat16_float(x_vec); + x_vec0 = x_vec0 * minimum( + maximum(x_vec0 + kThreeVec, kZeroVec), + kSixVec + ) / kSixVec; + x_vec1 = x_vec1 * minimum( + maximum(x_vec1 + kThreeVec, kZeroVec), + kSixVec + ) / kSixVec; + return convert_float_bfloat16(x_vec0, x_vec1); + } + ); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardswish_cpu", [&]() { const scalar_t zero(0.0f); const scalar_t three(3.0f); const scalar_t six(6.0f); @@ -439,10 +615,58 @@ void hardswish_kernel(TensorIterator& iter) { } ); }); + } } void hardswish_backward_kernel(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardswish_backward_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + const float zero(0.0f); + const float three(3.0f); + const float neg_three(-3.0f); + const float one_half(0.5f); + using Vec = vec::Vectorized; + const Vec kZeroVec(zero); + const Vec kThreeVec(three); + const Vec kNegThreeVec(neg_three); + const Vec kOneHalfVec(one_half); + cpu_kernel_vec( + iter, + [&](BFloat16 grad_val, BFloat16 self_val) -> BFloat16 { + if (float(self_val) < neg_three) { + return zero; + } else if (float(self_val) <= three) { + return float(grad_val) * ((float(self_val) / three) + one_half); + } else { + return grad_val; + } + }, + [&](vec::Vectorized grad_val, vec::Vectorized self_val) { + Vectorized self_val0, self_val1, grad_val0, grad_val1; + std::tie(self_val0, self_val1) = convert_bfloat16_float(self_val); + std::tie(grad_val0, grad_val1) = convert_bfloat16_float(grad_val); + self_val0 = Vec::blendv( + Vec::blendv( + grad_val0 * ((self_val0 / kThreeVec) + kOneHalfVec), + grad_val0, + self_val0 >= kThreeVec + ), + kZeroVec, + self_val0 < kNegThreeVec + ); + self_val1 = Vec::blendv( + Vec::blendv( + grad_val1 * ((self_val1 / kThreeVec) + kOneHalfVec), + grad_val1, + self_val1 >= kThreeVec + ), + kZeroVec, + self_val1 < kNegThreeVec + ); + return convert_float_bfloat16(self_val0, self_val1); + } + ); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardswish_backward_cpu", [&]() { const scalar_t zero(0.0f); const scalar_t three(3.0f); const scalar_t neg_three(-3.0f); @@ 
-476,6 +700,7 @@ void hardswish_backward_kernel(TensorIterator& iter) { } ); }); + } } static void leaky_relu_kernel(TensorIteratorBase& iter, const Scalar& negval_) { @@ -556,7 +781,28 @@ static void leaky_relu_backward_kernel(TensorIteratorBase& iter, const Scalar& n } void softplus_kernel(TensorIteratorBase& iter, const Scalar& beta_, const Scalar& threshold_) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softplus_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + using Vec = Vectorized; + auto beta = beta_.to(); + auto threshold = threshold_.to(); + const Vec beta_vec(beta); + const Vec threshold_vec(threshold); + cpu_kernel_vec( + iter, + [beta, threshold](BFloat16 a) -> BFloat16 { + return (float(a) * beta) > threshold ? a + : static_cast((std::log1p(std::exp(float(a) * beta))) / beta); + }, + [beta_vec, threshold_vec](Vectorized a) -> Vectorized { + Vectorized a0, a1; + std::tie(a0, a1) = convert_bfloat16_float(a); + a0 = Vec::blendv((a0 * beta_vec).exp().log1p() / beta_vec, a0, (a0 * beta_vec) > threshold_vec); + a1 = Vec::blendv((a1 * beta_vec).exp().log1p() / beta_vec, a1, (a1 * beta_vec) > threshold_vec); + return convert_float_bfloat16(a0, a1); + } + ); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softplus_cpu", [&]() { using Vec = Vectorized; auto beta = beta_.to(); auto threshold = threshold_.to(); @@ -573,10 +819,36 @@ void softplus_kernel(TensorIteratorBase& iter, const Scalar& beta_, const Scalar } ); }); + } } void softplus_backward_kernel(TensorIteratorBase& iter, const Scalar& beta_, const Scalar& threshold_) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softplus_backward_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + using Vec = Vectorized; + auto beta = beta_.to(); + auto threshold = threshold_.to(); + const Vec beta_vec(beta); + const Vec threshold_vec(threshold); + const Vec one_vec(static_cast(1.0)); + cpu_kernel_vec( + iter, + [beta, threshold](BFloat16 a, BFloat16 b) -> BFloat16 { + float z = std::exp(float(b) * beta); + return (float(b) * beta) > threshold ? 
a : static_cast(float(a) * z / (z + float(1.))); + }, + [beta_vec, one_vec, threshold_vec](Vectorized a, Vectorized b) -> Vectorized { + Vectorized a0, a1, b0, b1; + std::tie(a0, a1) = convert_bfloat16_float(a); + std::tie(b0, b1) = convert_bfloat16_float(b); + Vec z = (b0 * beta_vec).exp(); + a0 = Vec::blendv(a0 * z / (z + one_vec), a0, (b0 * beta_vec) > threshold_vec); + z = (b1 * beta_vec).exp(); + a1 = Vec::blendv(a1 * z / (z + one_vec), a1, (b1 * beta_vec) > threshold_vec); + return convert_float_bfloat16(a0, a1); + } + ); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "softplus_backward_cpu", [&]() { using Vec = Vectorized; auto beta = beta_.to(); auto threshold = threshold_.to(); @@ -595,6 +867,7 @@ void softplus_backward_kernel(TensorIteratorBase& iter, const Scalar& beta_, con } ); }); + } } void glu_kernel(TensorIteratorBase& iter) { diff --git a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp index 0e5db26b069dce..1f39aeb3256c90 100644 --- a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp @@ -625,8 +625,33 @@ void fmin_kernel(TensorIteratorBase& iter) { } void smooth_l1_kernel(TensorIteratorBase& iter, double beta) { - AT_DISPATCH_FLOATING_TYPES_AND2( - kBFloat16, kHalf, iter.dtype(), "smooth_l1_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + const float beta_val(beta); + const Vectorized beta_val_vec(beta_val); + const Vectorized point_five_vec(static_cast(0.5)); + cpu_kernel_vec( + iter, + [&beta_val](BFloat16 a, BFloat16 b) -> BFloat16 { + auto z = std::abs(float(a) - float(b)); + return z < beta_val + ? static_cast(0.5) * z * z / beta_val + : z - static_cast(0.5) * beta_val; + }, + [&beta_val_vec, &point_five_vec](Vectorized a, Vectorized b) { + Vectorized a0, a1, b0, b1; + std::tie(a0, a1) = convert_bfloat16_float(a); + std::tie(b0, b1) = convert_bfloat16_float(b); + auto z = (a0 - b0).abs(); + a0 = Vectorized::blendv( + point_five_vec * z * z / beta_val_vec, z - point_five_vec * beta_val_vec, z >= beta_val_vec); + z = (a1 - b1).abs(); + a1 = Vectorized::blendv( + point_five_vec * z * z / beta_val_vec, z - point_five_vec * beta_val_vec, z >= beta_val_vec); + return convert_float_bfloat16(a0, a1); + }); + } else { + AT_DISPATCH_FLOATING_TYPES_AND( + kHalf, iter.dtype(), "smooth_l1_cpu", [&]() { using Vec = Vectorized; const scalar_t beta_val(beta); const Vec beta_val_vec(beta_val); @@ -645,6 +670,7 @@ void smooth_l1_kernel(TensorIteratorBase& iter, double beta) { point_five_vec * z * z / beta_val_vec, z - point_five_vec * beta_val_vec, z >= beta_val_vec); }); }); + } } void huber_kernel(TensorIterator& iter, double delta) { diff --git a/aten/src/ATen/native/cpu/BlasKernel.cpp b/aten/src/ATen/native/cpu/BlasKernel.cpp index 68bb78c0003bbd..7b60e9a45cbac4 100644 --- a/aten/src/ATen/native/cpu/BlasKernel.cpp +++ b/aten/src/ATen/native/cpu/BlasKernel.cpp @@ -191,19 +191,28 @@ void cpublas_gemm_impl( } void cpublas_axpy_impl(at::ScalarType type, int64_t n, const Scalar& _a, const void *_x, int64_t incx, void *_y, int64_t incy){ - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(at::kHalf, at::kBFloat16, type, "cpublas_axpy_impl", - [&] { - auto a = _a.to(); - auto x = static_cast(_x); - auto y = static_cast(_y); + if (type == at::kBool) { + auto a = _a.to(); + auto x = static_cast(_x); + auto y = static_cast(_y); int64_t i; for(i = 0; i < n; i++) - y[i*incy] += a*x[i*incx]; - }); + y[i*incy] |= a & x[i*incx]; + } else { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(at::kHalf, at::kBFloat16, 
type, "cpublas_axpy_impl", + [&] { + auto a = _a.to(); + auto x = static_cast(_x); + auto y = static_cast(_y); + int64_t i; + for(i = 0; i < n; i++) + y[i*incy] += a*x[i*incx]; + }); + } } void cpublas_copy_impl(at::ScalarType type, int64_t n, const void *_x, int64_t incx, void *_y, int64_t incy){ - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(at::kHalf, at::kBFloat16, type, "cpublas_copy_impl", + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(at::kHalf, at::kBFloat16, at::kBool, type, "cpublas_copy_impl", [&] { auto x = static_cast(_x); auto y = static_cast(_y); diff --git a/aten/src/ATen/native/cpu/ComplexKernel.cpp b/aten/src/ATen/native/cpu/ComplexKernel.cpp index 56d8fc80ae00e9..99dc6134537ea3 100644 --- a/aten/src/ATen/native/cpu/ComplexKernel.cpp +++ b/aten/src/ATen/native/cpu/ComplexKernel.cpp @@ -9,7 +9,7 @@ namespace native { namespace { void complex_kernel(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.input_dtype(), "complex_cpu", [&]() { + AT_DISPATCH_FLOATING_TYPES_AND(kHalf, iter.input_dtype(), "complex_cpu", [&]() { cpu_kernel(iter, [=](scalar_t a, scalar_t b) -> c10::complex { return c10::complex(a, b); }); diff --git a/aten/src/ATen/native/cpu/CopyKernel.cpp b/aten/src/ATen/native/cpu/CopyKernel.cpp index 6e1d134c3e47bd..40a0c20b5ca8de 100644 --- a/aten/src/ATen/native/cpu/CopyKernel.cpp +++ b/aten/src/ATen/native/cpu/CopyKernel.cpp @@ -81,9 +81,9 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { if (dtype == iter.dtype(1)) { copy_same_dtype(iter, requires_conj, requires_neg); } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, dtype, "copy_", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, dtype, "copy_", [&] { using dest_t = scalar_t; - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, iter.dtype(1), "copy_", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, iter.dtype(1), "copy_", [&] { // Note (@zasdfgbnm): // // The code below can not be simplified as diff --git a/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp b/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp index 549384055f2058..d3be310e280244 100644 --- a/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp @@ -92,7 +92,55 @@ static void addcdiv_cpu_kernel(TensorIteratorBase& iter, const Scalar& value) { static void smooth_l1_backward_cpu_kernel(TensorIterator& iter, const Scalar& norm, double beta) { ScalarType dtype = iter.dtype(0); - AT_DISPATCH_ALL_TYPES(dtype, "smooth_l1_backward_cpu_out", [&] { + if (dtype == kBFloat16) { + auto norm_val = norm.to(); + float beta_val(beta); + auto norm_val_vec = Vectorized(norm_val); + auto beta_val_vec = Vectorized(beta_val); + const auto neg_1_vec = Vectorized(-1); + const auto zero_vec = Vectorized(0); + const auto pos_1_vec = Vectorized(1); + cpu_kernel_vec(iter, + [=](BFloat16 input, BFloat16 target, BFloat16 grad_output) -> BFloat16 { + const auto x = float(input) - float(target); + if (x <= -beta){ + return -norm_val * float(grad_output); + }else if (x >= beta){ + return norm_val * float(grad_output); + }else{ + return norm_val * x * float(grad_output) / beta; + } + }, + [norm_val_vec, beta_val_vec, neg_1_vec, zero_vec, pos_1_vec]( + Vectorized input, Vectorized target, Vectorized grad_output) -> Vectorized { + // using two blendv calls to 
simulate the 3 cases + // 1 if x >= beta + // -1 if x <= -beta + // x / beta if |x| < beta + Vectorized input0, input1, target0, target1, grad_output0, grad_output1; + std::tie(input0, input1) = convert_bfloat16_float(input); + std::tie(target0, target1) = convert_bfloat16_float(target); + std::tie(grad_output0, grad_output1) = convert_bfloat16_float(grad_output); + auto x = input0 - target0; + auto pos_or_neg_1_vec = Vectorized::blendv( + neg_1_vec, pos_1_vec, x > zero_vec); + auto x_abs = x.abs(); + auto output = Vectorized::blendv( + x / beta_val_vec, pos_or_neg_1_vec, x_abs >= beta_val_vec); + input0 = norm_val_vec * output * grad_output0; + + x = input1 - target1; + pos_or_neg_1_vec = Vectorized::blendv( + neg_1_vec, pos_1_vec, x > zero_vec); + x_abs = x.abs(); + output = Vectorized::blendv( + x / beta_val_vec, pos_or_neg_1_vec, x_abs >= beta_val_vec); + input1 = norm_val_vec * output * grad_output1; + return convert_float_bfloat16(input0, input1); + } + ); + } else { + AT_DISPATCH_ALL_TYPES(dtype, "smooth_l1_backward_cpu_out", [&] { auto norm_val = norm.to(); scalar_t beta_val(beta); auto norm_val_vec = Vectorized(norm_val); @@ -126,6 +174,7 @@ static void smooth_l1_backward_cpu_kernel(TensorIterator& iter, const Scalar& no } ); }); + } } static void huber_backward_cpu_kernel(TensorIterator& iter, const Scalar& norm, double delta) { diff --git a/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp b/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp index ee0d457ed2c951..c3ad085e03f3d4 100644 --- a/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp +++ b/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp @@ -35,6 +35,33 @@ class ReduceAdd { }; static ReduceAdd reduce_add; +class ReduceMean { +public: + template + constexpr void operator() (scalar_t * self_data, scalar_t * src_data) const { + *self_data += *src_data; + } +}; +static ReduceMean reduce_mean; + +class ReduceMaximum { +public: + template + constexpr void operator() (scalar_t * self_data, scalar_t * src_data) const { + *self_data = std::max(*self_data, *src_data); + } +}; +static ReduceMaximum reduce_maximum; + +class ReduceMinimum { +public: + template + constexpr void operator() (scalar_t * self_data, scalar_t * src_data) const { + *self_data = std::min(*self_data, *src_data); + } +}; +static ReduceMinimum reduce_minimum; + class TensorAssign { public: template @@ -283,6 +310,273 @@ struct cpu_scatter_gather_base_kernel { } ); } + + void operator()(const Tensor& self, int64_t dim, + const Tensor& index, const Tensor& src, + const std::string& method_name, ReduceMean& kernel_func) { + + auto iter = TensorIteratorConfig() + .check_all_same_dtype(false) + .resize_outputs(false) + // NOLINTNEXTLINE(bugprone-argument-comment) + .declare_static_shape(index.sizes(), /*squash_dim=*/dim) + .add_output(self) + .add_input(src) + .add_input(index) + .build(); + + auto self_dim_stride = ensure_nonempty_stride(self, dim); + auto self_dim_size = ensure_nonempty_size(self, dim); + + auto index_dim_stride = ensure_nonempty_stride(index, dim); + auto index_dim_size = ensure_nonempty_size(index, dim); + + auto src_dim_stride = ensure_nonempty_stride(src, dim); + auto src_dim_size = ensure_nonempty_size(src, dim); + + auto index_upper_bound = is_scatter_like ? 
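// Scalar reference for the three-case smooth L1 gradient that the blendv pairs
// in smooth_l1_backward_cpu_kernel above emulate branchlessly: pick +/-1 from
// the sign of x, then override with x/beta when |x| < beta.  Illustrative
// sketch only; the name is made up.
float smooth_l1_grad_reference(float input, float target, float grad_output,
                               float norm, float beta) {
  float x = input - target;
  if (x <= -beta) return -norm * grad_output;
  if (x >=  beta) return  norm * grad_output;
  return norm * (x / beta) * grad_output;   // |x| < beta
}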
self_dim_size : src_dim_size; + + int64_t grain_size = std::max((int64_t) 1, at::internal::GRAIN_SIZE / index_dim_size); + + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, iter.dtype(), + "scatter_gather_tensor_cpu_reduce_mean", [&] { + constexpr auto SELF_ITER_STRIDE_IDX = 0; + constexpr auto INDEX_ITER_STRIDE_IDX = 2; + constexpr auto SRC_ITER_STRIDE_IDX = 1; + auto loop = [&](char** data, const int64_t* strides, int64_t n) { + auto* self_data_bytes = data[SELF_ITER_STRIDE_IDX]; + auto* index_data_bytes = data[INDEX_ITER_STRIDE_IDX]; + auto* src_data_bytes = data[SRC_ITER_STRIDE_IDX]; + // we change the order of TensorIterator-dim loop + // vs dim-TensorIterator loop order depending on + // whether dim is the last dimension + if (dim== self.dim() - 1) { + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + // dim loop is a separate code block + // for better performance + _cpu_scatter_gather_dim_loop()( + (scalar_t*)self_data_bytes, self_dim_stride, + (int64_t*)index_data_bytes, index_dim_stride, + (scalar_t*)src_data_bytes, src_dim_stride, + dim, index_dim_size, index_upper_bound, + kernel_func + ); + + self_data_bytes += strides[SELF_ITER_STRIDE_IDX]; + index_data_bytes += strides[INDEX_ITER_STRIDE_IDX]; + src_data_bytes += strides[SRC_ITER_STRIDE_IDX]; + } + } + else { + for (const auto i : c10::irange(index_dim_size)) { + auto* self_data = self_data_bytes; + auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); + auto* src_data = src_data_bytes; + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + int64_t idx_dim = *(int64_t*)index_data; + // we are not putting idx_dim in the error message because it disables + // loop optimization in clang-7 + TORCH_CHECK(idx_dim >= 0 && idx_dim < index_upper_bound, + "index ", *(int64_t*)index_data, + " is out of bounds for dimension ", dim, + " with size ", index_upper_bound); + + kernel_func( + (scalar_t*)self_data + (is_scatter_like ? idx_dim : i) * self_dim_stride, + (scalar_t*)src_data + (is_scatter_like ? i : idx_dim) * src_dim_stride); + + self_data += strides[SELF_ITER_STRIDE_IDX]; + index_data += strides[INDEX_ITER_STRIDE_IDX]; + src_data += strides[SRC_ITER_STRIDE_IDX]; + } + } + } + }; + iter.for_each(loop, grain_size); + } + ); + } + + void operator()(const Tensor& self, int64_t dim, + const Tensor& index, const Tensor& src, + const std::string& method_name, ReduceMaximum& kernel_func) { + + auto iter = TensorIteratorConfig() + .check_all_same_dtype(false) + .resize_outputs(false) + // NOLINTNEXTLINE(bugprone-argument-comment) + .declare_static_shape(index.sizes(), /*squash_dim=*/dim) + .add_output(self) + .add_input(src) + .add_input(index) + .build(); + + auto self_dim_stride = ensure_nonempty_stride(self, dim); + auto self_dim_size = ensure_nonempty_size(self, dim); + + auto index_dim_stride = ensure_nonempty_stride(index, dim); + auto index_dim_size = ensure_nonempty_size(index, dim); + + auto src_dim_stride = ensure_nonempty_stride(src, dim); + auto src_dim_size = ensure_nonempty_size(src, dim); + + auto index_upper_bound = is_scatter_like ? 
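// One-dimensional illustration of the indexing used by the scatter/gather
// loops above: for a scatter-style reduce, `index` selects the destination row
// in self while the source walks linearly, and every index is bounds-checked
// before use.  The helper and its signature are hypothetical; the reduce
// functor matches the ReduceAdd/ReduceMean/ReduceMaximum/ReduceMinimum shape.
#include <cstdint>
#include <stdexcept>

template <typename scalar_t, typename ReduceOp>
void scatter_reduce_1d(scalar_t* self, int64_t self_size,
                       const int64_t* index, const scalar_t* src,
                       int64_t n, ReduceOp reduce) {
  for (int64_t i = 0; i < n; ++i) {
    int64_t idx = index[i];
    if (idx < 0 || idx >= self_size) {      // mirrors the TORCH_CHECK above
      throw std::out_of_range("index out of bounds");
    }
    reduce(self + idx, src + i);            // e.g. *self += *src for "sum"
  }
}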
self_dim_size : src_dim_size; + + int64_t grain_size = std::max((int64_t) 1, at::internal::GRAIN_SIZE / index_dim_size); + + AT_DISPATCH_ALL_TYPES_AND3( + ScalarType::Bool, ScalarType::Half, ScalarType::BFloat16, iter.dtype(), + "scatter_gather_tensor_cpu_reduce_amax", [&] { + constexpr auto SELF_ITER_STRIDE_IDX = 0; + constexpr auto INDEX_ITER_STRIDE_IDX = 2; + constexpr auto SRC_ITER_STRIDE_IDX = 1; + auto loop = [&](char** data, const int64_t* strides, int64_t n) { + auto* self_data_bytes = data[SELF_ITER_STRIDE_IDX]; + auto* index_data_bytes = data[INDEX_ITER_STRIDE_IDX]; + auto* src_data_bytes = data[SRC_ITER_STRIDE_IDX]; + // we change the order of TensorIterator-dim loop + // vs dim-TensorIterator loop order depending on + // whether dim is the last dimension + if (dim== self.dim() - 1) { + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + // dim loop is a separate code block + // for better performance + _cpu_scatter_gather_dim_loop()( + (scalar_t*)self_data_bytes, self_dim_stride, + (int64_t*)index_data_bytes, index_dim_stride, + (scalar_t*)src_data_bytes, src_dim_stride, + dim, index_dim_size, index_upper_bound, + kernel_func + ); + + self_data_bytes += strides[SELF_ITER_STRIDE_IDX]; + index_data_bytes += strides[INDEX_ITER_STRIDE_IDX]; + src_data_bytes += strides[SRC_ITER_STRIDE_IDX]; + } + } + else { + for (const auto i : c10::irange(index_dim_size)) { + auto* self_data = self_data_bytes; + auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); + auto* src_data = src_data_bytes; + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + int64_t idx_dim = *(int64_t*)index_data; + // we are not putting idx_dim in the error message because it disables + // loop optimization in clang-7 + TORCH_CHECK(idx_dim >= 0 && idx_dim < index_upper_bound, + "index ", *(int64_t*)index_data, + " is out of bounds for dimension ", dim, + " with size ", index_upper_bound); + + kernel_func( + (scalar_t*)self_data + (is_scatter_like ? idx_dim : i) * self_dim_stride, + (scalar_t*)src_data + (is_scatter_like ? i : idx_dim) * src_dim_stride); + + self_data += strides[SELF_ITER_STRIDE_IDX]; + index_data += strides[INDEX_ITER_STRIDE_IDX]; + src_data += strides[SRC_ITER_STRIDE_IDX]; + } + } + } + }; + iter.for_each(loop, grain_size); + } + ); + } + + void operator()(const Tensor& self, int64_t dim, + const Tensor& index, const Tensor& src, + const std::string& method_name, ReduceMinimum& kernel_func) { + + auto iter = TensorIteratorConfig() + .check_all_same_dtype(false) + .resize_outputs(false) + // NOLINTNEXTLINE(bugprone-argument-comment) + .declare_static_shape(index.sizes(), /*squash_dim=*/dim) + .add_output(self) + .add_input(src) + .add_input(index) + .build(); + + auto self_dim_stride = ensure_nonempty_stride(self, dim); + auto self_dim_size = ensure_nonempty_size(self, dim); + + auto index_dim_stride = ensure_nonempty_stride(index, dim); + auto index_dim_size = ensure_nonempty_size(index, dim); + + auto src_dim_stride = ensure_nonempty_stride(src, dim); + auto src_dim_size = ensure_nonempty_size(src, dim); + + auto index_upper_bound = is_scatter_like ? 
self_dim_size : src_dim_size; + + int64_t grain_size = std::max((int64_t) 1, at::internal::GRAIN_SIZE / index_dim_size); + + AT_DISPATCH_ALL_TYPES_AND3( + ScalarType::Bool, ScalarType::Half, ScalarType::BFloat16, iter.dtype(), + "scatter_gather_tensor_cpu_reduce_amin", [&] { + constexpr auto SELF_ITER_STRIDE_IDX = 0; + constexpr auto INDEX_ITER_STRIDE_IDX = 2; + constexpr auto SRC_ITER_STRIDE_IDX = 1; + auto loop = [&](char** data, const int64_t* strides, int64_t n) { + auto* self_data_bytes = data[SELF_ITER_STRIDE_IDX]; + auto* index_data_bytes = data[INDEX_ITER_STRIDE_IDX]; + auto* src_data_bytes = data[SRC_ITER_STRIDE_IDX]; + // we change the order of TensorIterator-dim loop + // vs dim-TensorIterator loop order depending on + // whether dim is the last dimension + if (dim== self.dim() - 1) { + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + // dim loop is a separate code block + // for better performance + _cpu_scatter_gather_dim_loop()( + (scalar_t*)self_data_bytes, self_dim_stride, + (int64_t*)index_data_bytes, index_dim_stride, + (scalar_t*)src_data_bytes, src_dim_stride, + dim, index_dim_size, index_upper_bound, + kernel_func + ); + + self_data_bytes += strides[SELF_ITER_STRIDE_IDX]; + index_data_bytes += strides[INDEX_ITER_STRIDE_IDX]; + src_data_bytes += strides[SRC_ITER_STRIDE_IDX]; + } + } + else { + for (const auto i : c10::irange(index_dim_size)) { + auto* self_data = self_data_bytes; + auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); + auto* src_data = src_data_bytes; + for (const auto nelem : c10::irange(n)) { + (void)nelem; //Suppress unused variable warning + int64_t idx_dim = *(int64_t*)index_data; + // we are not putting idx_dim in the error message because it disables + // loop optimization in clang-7 + TORCH_CHECK(idx_dim >= 0 && idx_dim < index_upper_bound, + "index ", *(int64_t*)index_data, + " is out of bounds for dimension ", dim, + " with size ", index_upper_bound); + + kernel_func( + (scalar_t*)self_data + (is_scatter_like ? idx_dim : i) * self_dim_stride, + (scalar_t*)src_data + (is_scatter_like ? 
i : idx_dim) * src_dim_stride); + + self_data += strides[SELF_ITER_STRIDE_IDX]; + index_data += strides[INDEX_ITER_STRIDE_IDX]; + src_data += strides[SRC_ITER_STRIDE_IDX]; + } + } + } + }; + iter.for_each(loop, grain_size); + } + ); + } }; void gather_cpu_kernel(const Tensor& result, const Tensor& self, int64_t dim, const Tensor& index) { @@ -319,6 +613,34 @@ void scatter_reduce_cpu_kernel(const Tensor& self, const int64_t dim, const Tens cpu_scatter_gather_base_kernel<>()(self, dim, index, src, "scatter_reduce_multiply_", reduce_multiply); break; + default : + break; + } +} + +void scatter_reduce_two_cpu_kernel(const Tensor& self, const int64_t dim, const Tensor& index, + const Tensor& src, const SCATTER_GATHER_OP& reduce) { + switch (reduce) { + case SCATTER_GATHER_OP::REDUCE_ADD : + cpu_scatter_gather_base_kernel<>()(self, dim, index, src, + "scatter_reduce_sum_", reduce_add); + break; + case SCATTER_GATHER_OP::REDUCE_MULTIPLY : + cpu_scatter_gather_base_kernel<>()(self, dim, index, src, + "scatter_reduce_prod_", reduce_multiply); + break; + case SCATTER_GATHER_OP::REDUCE_MAXIMUM : + cpu_scatter_gather_base_kernel<>()(self, dim, index, src, + "scatter_reduce_amax_", reduce_maximum); + break; + case SCATTER_GATHER_OP::REDUCE_MINIMUM : + cpu_scatter_gather_base_kernel<>()(self, dim, index, src, + "scatter_reduce_amin_", reduce_minimum); + break; + case SCATTER_GATHER_OP::REDUCE_MEAN : + cpu_scatter_gather_base_kernel<>()(self, dim, index, src, + "scatter_reduce_mean_", reduce_mean); + break; } } @@ -333,6 +655,8 @@ void scatter_scalar_reduce_cpu_kernel(const Tensor& self, const int64_t dim, con cpu_scatter_gather_base_kernel<>()(self, dim, index, value, "scatter_scalar_reduce_multiply_", reduce_multiply); break; + default: + break; } } @@ -344,5 +668,6 @@ REGISTER_DISPATCH(scatter_fill_stub, &scatter_fill_cpu_kernel); REGISTER_DISPATCH(scatter_add_stub, &scatter_add_cpu_kernel); REGISTER_DISPATCH(scatter_reduce_stub, &scatter_reduce_cpu_kernel); REGISTER_DISPATCH(scatter_scalar_reduce_stub, &scatter_scalar_reduce_cpu_kernel); +REGISTER_DISPATCH(scatter_reduce_two_stub, &scatter_reduce_two_cpu_kernel); }} // namespace at::native diff --git a/aten/src/ATen/native/cpu/SortingKernel.cpp b/aten/src/ATen/native/cpu/SortingKernel.cpp index 829cfd87acfc97..715e7f1d605cd5 100644 --- a/aten/src/ATen/native/cpu/SortingKernel.cpp +++ b/aten/src/ATen/native/cpu/SortingKernel.cpp @@ -47,6 +47,10 @@ void _dim_apply( auto* values_data_bytes = data[0]; auto* indices_data_bytes = data[1]; + if(values_data_bytes==nullptr || indices_data_bytes==nullptr){ + return; + } + for (const auto i : c10::irange(n)) { (void)i; //Suppress unused variable warning f( diff --git a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp index 8d862615cc5d1c..11661982d279d2 100644 --- a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp @@ -504,6 +504,13 @@ static void ndtri_kernel(TensorIteratorBase& iter) { }); } +static void log_ndtr_kernel(TensorIteratorBase& iter) { + TORCH_INTERNAL_ASSERT(iter.ntensors() == 2); + AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "log_ndtr_cpu", [&]() { + cpu_kernel(iter, [](scalar_t x) { return calc_log_ndtr(x); }); + }); +} + static void i0e_kernel(TensorIteratorBase& iter) { TORCH_INTERNAL_ASSERT(iter.ntensors() == 2); AT_DISPATCH_FLOATING_TYPES_AND( @@ -641,6 +648,7 @@ REGISTER_DISPATCH(special_entr_stub, &CPU_CAPABILITY::entr_kernel); REGISTER_DISPATCH(frexp_stub, &CPU_CAPABILITY::frexp_kernel); 
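// Reference for the math behind log_ndtr_kernel above: the log of the standard
// normal CDF, using Phi(x) = erfc(-x/sqrt(2)) / 2 and a log1p form on the
// right so values near 1 stay accurate.  Double-precision sketch; the CUDA
// string further below additionally uses erfcx to keep the far left tail
// representable, which plain erfc cannot do.
#include <cmath>

double log_ndtr_reference(double x) {
  const double t = x * 0.7071067811865476;  // x / sqrt(2)
  return x < -1.0 ? std::log(0.5 * std::erfc(-t))
                  : std::log1p(-0.5 * std::erfc(t));
}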
REGISTER_DISPATCH(special_i0e_stub, &CPU_CAPABILITY::i0e_kernel); REGISTER_DISPATCH(special_ndtri_stub, &CPU_CAPABILITY::ndtri_kernel); +REGISTER_DISPATCH(special_log_ndtr_stub, &CPU_CAPABILITY::log_ndtr_kernel); REGISTER_DISPATCH(special_i1_stub, &CPU_CAPABILITY::i1_kernel); REGISTER_DISPATCH(special_i1e_stub, &CPU_CAPABILITY::i1e_kernel); REGISTER_DISPATCH(special_erfcx_stub, &CPU_CAPABILITY::erfcx_kernel); diff --git a/aten/src/ATen/native/cpu/layer_norm_kernel.cpp b/aten/src/ATen/native/cpu/layer_norm_kernel.cpp index bd3cfc564c531a..e1af1658d1a345 100644 --- a/aten/src/ATen/native/cpu/layer_norm_kernel.cpp +++ b/aten/src/ATen/native/cpu/layer_norm_kernel.cpp @@ -42,10 +42,13 @@ void LayerNormKernelImplInternal( const T* gamma_data = gamma.defined() ? gamma.data_ptr() : nullptr; const T* beta_data = beta.defined() ? beta.data_ptr() : nullptr; T* Y_data = Y->data_ptr(); - T* mean_data = mean->data_ptr(); - T* rstd_data = rstd->data_ptr(); + T* mean_data = mean ? mean->data_ptr() : nullptr; + T* rstd_data = rstd ? rstd->data_ptr() : nullptr; + const bool gamma_null = gamma_data == nullptr; const bool beta_null = beta_data == nullptr; + const bool mean_null = mean_data == nullptr; + const bool rstd_null = rstd_data == nullptr; at::parallel_for(0, M, 1, [&](int64_t start, int64_t end) { for (const auto i : c10::irange(start, end)) { const T* X_ptr = X_data + i * N; @@ -73,8 +76,12 @@ void LayerNormKernelImplInternal( beta_data, N); } - mean_data[i] = mean_val; - rstd_data[i] = rstd_val; + if (!mean_null) { + mean_data[i] = mean_val; + } + if (!rstd_null) { + rstd_data[i] = rstd_val; + } } }); } diff --git a/aten/src/ATen/native/cuda/AbsKernel.cu b/aten/src/ATen/native/cuda/AbsKernel.cu index 3bfc2621d9305f..ad9b0380f6f26b 100644 --- a/aten/src/ATen/native/cuda/AbsKernel.cu +++ b/aten/src/ATen/native/cuda/AbsKernel.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_NO_OPERATORS #include #include +#include #include #include #include @@ -14,12 +15,36 @@ struct AbsFunctor { } }; +const char abs_name[] = "abs_kernel"; void abs_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(ScalarType::Half, ScalarType::BFloat16, ScalarType::Bool, iter.dtype(), "abs_cuda", [&]() { - gpu_kernel(iter, AbsFunctor()); - }); + auto dtype = iter.dtype(); + if (at::isComplexType(dtype)) { +#if AT_USE_JITERATOR() + static const auto abs_string = jiterator_stringify( + template T abs_kernel(T x) { return std::abs(x); }); + AT_DISPATCH_COMPLEX_TYPES(dtype, "abs_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/abs_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, abs_string); + }); +#else + AT_DISPATCH_COMPLEX_TYPES(dtype, "abs_cuda", [&]() { + gpu_kernel(iter, AbsFunctor()); + }); +#endif + } else { + AT_DISPATCH_ALL_TYPES_AND3( + ScalarType::Half, + ScalarType::BFloat16, + ScalarType::Bool, + iter.dtype(), + "abs_cuda", + [&]() { gpu_kernel(iter, AbsFunctor()); }); + } } -REGISTER_DISPATCH(abs_stub, &abs_kernel_cuda); + REGISTER_DISPATCH(abs_stub, &abs_kernel_cuda); }} // namespace at::native diff --git a/aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu b/aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu index 80c74e4e8d9f0c..4ff6d882c85692 100644 --- a/aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu +++ b/aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu @@ -8,6 +8,7 @@ #include #include #include +#include // NOTE: CUDA on Windows requires that the enclosing function // of a __device__ lambda not have internal linkage. 
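// Shape of the change in LayerNormKernelImplInternal above: the per-row mean
// and rstd are always computed for the normalization itself, but are only
// stored when the caller actually passed destinations.  Plain-float sketch for
// a single row; the function name is made up.
#include <cmath>
#include <cstdint>

void layer_norm_row(const float* x, int64_t n, float eps,
                    const float* gamma, const float* beta,   // may be nullptr
                    float* y, float* mean_out, float* rstd_out) {
  float mean = 0.f, var = 0.f;
  for (int64_t i = 0; i < n; ++i) mean += x[i];
  mean /= n;
  for (int64_t i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
  const float rstd = 1.f / std::sqrt(var / n + eps);
  for (int64_t i = 0; i < n; ++i) {
    const float g = gamma ? gamma[i] : 1.f;
    const float b = beta ? beta[i] : 0.f;
    y[i] = (x[i] - mean) * rstd * g + b;
  }
  if (mean_out) *mean_out = mean;   // written only when requested
  if (rstd_out) *rstd_out = rstd;
}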
@@ -15,15 +16,33 @@ namespace at { namespace native { +const char sigmoid_backward_name[] = "sigmoid_backward"; void sigmoid_backward_kernel_cuda(TensorIteratorBase& iter) { - if(isComplexType(iter.dtype())) { - AT_DISPATCH_COMPLEX_TYPES(iter.dtype(), "sigmoid_backward_cuda", [&]() { + auto dtype = iter.dtype(); + if(isComplexType(dtype)) { +#if AT_USE_JITERATOR() + static const auto sigmoid_backward_string = jiterator_stringify( + template + T sigmoid_backward(T a, T b) { + return a * std::conj((T{1.} - b) * b); + } + ); // sigmoid_backward_string + AT_DISPATCH_COMPLEX_TYPES(dtype, "sigmoid_backward_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/ sigmoid_backward_name, + /*return_dtype=*/ scalar_t, + /*common_dtype=*/ scalar_t, + /*arity=*/ 2>(iter, sigmoid_backward_string); + }); +#else + AT_DISPATCH_COMPLEX_TYPES(dtype, "sigmoid_backward_cuda", [&]() { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a, scalar_t b) -> scalar_t { return a * std::conj((scalar_t{1.} - b) * b); }); }); +#endif } else { - AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, iter.dtype(), "sigmoid_backward_cuda", [&]() { + AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, dtype, "sigmoid_backward_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a, scalar_t b) -> scalar_t { return a * (scalar_t(1.) - b) * b; }); diff --git a/aten/src/ATen/native/cuda/Blas.cpp b/aten/src/ATen/native/cuda/Blas.cpp index ec50994fb12809..07ce6dca45e7de 100644 --- a/aten/src/ATen/native/cuda/Blas.cpp +++ b/aten/src/ATen/native/cuda/Blas.cpp @@ -13,6 +13,7 @@ #include #include #else +#include #include #include #include @@ -21,8 +22,10 @@ #include #include #include +#include #include #include +#include #include #include #endif @@ -113,7 +116,29 @@ c10::MaybeOwned prepare_batch_matrix_for_cublas(const Tensor& tensor, bo namespace { -Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha) { +enum class Activation { + None, + RELU, + GELU, +}; + +#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 && !defined(_MSC_VER) +cuda::blas::GEMMAndBiasActivationEpilogue activation_to_gemm_and_blas_arg(Activation a) { + switch (a) { + case Activation::None: + return cuda::blas::GEMMAndBiasActivationEpilogue::None; + case Activation::RELU: + return cuda::blas::GEMMAndBiasActivationEpilogue::RELU; + case Activation::GELU: + return cuda::blas::GEMMAndBiasActivationEpilogue::GELU; + default: + TORCH_CHECK(false); + return cuda::blas::GEMMAndBiasActivationEpilogue::None; + } +} +#endif + +Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, Activation activation=Activation::None) { // Make sure to keep addmm_cuda below in sync with this code; it // preflights a check to try to avoid actually needing to call // expand(). @@ -129,7 +154,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma at::ScalarType scalar_type = self.scalar_type(); c10::MaybeOwned self_; if (&result != &self) { -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 && !defined(_MSC_VER) +#if defined(CUDA_VERSION) && CUDA_VERSION >= 11040 && !defined(_MSC_VER) // Strangely, if mat2 has only 1 row or column, we get // CUBLAS_STATUS_INVALID_VALUE error from cublasLtMatmulAlgoGetHeuristic. 
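// Host-side reference for the expression the sigmoid_backward jiterator string
// above encodes for complex dtypes: grad_input = grad_output * conj((1 - b) * b),
// where b is the saved sigmoid output.  Illustrative only.
#include <complex>

std::complex<float> sigmoid_backward_reference(std::complex<float> grad_output,
                                               std::complex<float> b) {
  return grad_output * std::conj((std::complex<float>(1.f) - b) * b);
}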
// self.dim() == 1 && result.dim() == 2 && self.sizes()[0] == mat2_sizes[1] @@ -142,12 +167,6 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma scalar_type == at::ScalarType::Half || scalar_type == at::ScalarType::BFloat16) && mat2_sizes[0] > 1 && mat2_sizes[1] > 1; - - // https://docs.nvidia.com/cuda/cublas/index.html#cublasLt-general-description - // Batch size > 65535 does not work in most cases. - if (mat1_sizes[0] > 65535) { - useLtInterface = false; - } #endif if (!useLtInterface) { self_ = expand_size(self, {mat1_sizes[0], mat2_sizes[1]}, "addmm"); @@ -237,7 +256,19 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma mat2_ld, self.data_ptr(), result_->data_ptr(), - result_ld); + result_ld, +#if 0 + activation_to_gemm_and_blas_arg(activation) +#else + // GELU is not supported (and does not compile!) prior + // to CUDA 11.4. Have observed accuracy issues with + // GELU epilogue in 11.4; disabling the GELU epilogue + // path until we confirm which version it's working in. + activation != Activation::GELU + ? activation_to_gemm_and_blas_arg(activation) + : cuda::blas::GEMMAndBiasActivationEpilogue::None +#endif + ); }); } else #endif @@ -269,8 +300,27 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma result_ptr, result_ld); }); + switch (activation) { + case Activation::RELU: + at::relu_(const_cast(*result_)); + break; + case Activation::GELU: + at::gelu_(const_cast(*result_)); + break; + default: break; + } } +// Preprocessor gate here needs to match the inverse of the check +// gating activation_to_gemm_and_blas_arg above; here we are manually +// performing a post-GELU because we weren't able to use the GELU +// epilogue above. +#if !0 + if (useLtInterface && activation == Activation::GELU) { + at::gelu_(const_cast(*result_)); + } +#endif + if (!result.is_same(*result_)) { result.copy_(*result_); } @@ -354,6 +404,10 @@ TORCH_IMPL_FUNC(addmm_out_cuda)(const Tensor& self, const Tensor& mat1, const Te addmm_out_cuda_impl(const_cast(result), self, mat1, mat2, beta, alpha); } +TORCH_IMPL_FUNC(addmm_activation_out_cuda)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, bool use_gelu, const Tensor& result) { + addmm_out_cuda_impl(const_cast(result), self, mat1, mat2, beta, alpha, use_gelu ? 
Activation::GELU : Activation::RELU); +} + TORCH_IMPL_FUNC(mm_out_cuda)(const Tensor& self, const Tensor& mat2, const Tensor& result) { addmm_out_cuda_impl(const_cast(result), result, self, mat2, 0, 1); } diff --git a/aten/src/ATen/native/cuda/CUDAJitLoops.cuh b/aten/src/ATen/native/cuda/CUDAJitLoops.cuh index b5b1cd5c63bcf9..52274e043038e3 100644 --- a/aten/src/ATen/native/cuda/CUDAJitLoops.cuh +++ b/aten/src/ATen/native/cuda/CUDAJitLoops.cuh @@ -71,7 +71,8 @@ static inline void launch_jitted_unrolled_kernel( std::tuple extra_args) { TORCH_INTERNAL_ASSERT(N > 0 && N <= std::numeric_limits::max()); - const int64_t grid = (N + block_work_size() - 1) / block_work_size(); + //casting result to int is always safe, intermediate is int64 and won't overflow + const uint32_t grid = (N + block_work_size() - 1) / block_work_size(); static std::mutex _jiterator_mutex; static std::vector fns(c10::cuda::device_count()); @@ -114,9 +115,8 @@ static inline void launch_jitted_unrolled_kernel( // since 7 slots are already filled in `args` args[i + 7] = extra_args_array[i]; } - - at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, grid, num_threads()); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, {grid, 1u, 1u}, + {num_threads(), 1u, 1u}); } template< @@ -129,7 +129,8 @@ template< static inline void launch_jitted_vectorized_kernel(DeviceIndex dev_idx, int64_t N, const std::string& f, array_t data, at::opmath_type scalar_val, std::tuple extra_args) { TORCH_INTERNAL_ASSERT(N > 0 && N <= std::numeric_limits::max()); - const int64_t grid = (N + block_work_size() - 1) / block_work_size(); + // N is still int64_t for the computation, but it's always safe to cast result to int + const uint32_t grid = (N + block_work_size() - 1) / block_work_size(); const int vec_size = memory::jitted_can_vectorize_up_to(data); // Different kernels are compiled depending on what we're vectorizing up to (1, 2 or 4 elements) @@ -195,9 +196,7 @@ at::opmath_type scalar_val, std::tuple extra_args) { // since 3 slots are already filled in `args` args[i + 3] = extra_args_array[i]; } - - at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, grid, num_threads()); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, {grid, 1u, 1u}, {num_threads(), 1u, 1u}); } else { auto ic = TrivialOffsetCalculator(); auto oc = TrivialOffsetCalculator<1>(); @@ -219,8 +218,8 @@ at::opmath_type scalar_val, std::tuple extra_args) { // since 7 slots are already filled in `args` args[i + 7] = extra_args_array[i]; } - at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, grid, num_threads()); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + + at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, {grid, 1u, 1u}, {num_threads(), 1u, 1u}); } } diff --git a/aten/src/ATen/native/cuda/CUDAScalar.cu b/aten/src/ATen/native/cuda/CUDAScalar.cu index 637dd6514f409f..4f2b092573e3fa 100644 --- a/aten/src/ATen/native/cuda/CUDAScalar.cu +++ b/aten/src/ATen/native/cuda/CUDAScalar.cu @@ -15,8 +15,8 @@ namespace native { Scalar _local_scalar_dense_cuda(const Tensor& self) { Scalar r; - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, self.scalar_type(), "_local_scalar_dense_cuda", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, kHalf, kBool, kBFloat16, self.scalar_type(), "_local_scalar_dense_cuda", [&] { scalar_t value; cudaStream_t stream = at::cuda::getCurrentCUDAStream(); 
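// The grid computation changed in CUDAJitLoops.cuh above: a ceil-division done
// in 64-bit arithmetic and then held in a 32-bit value, which is safe because
// N is asserted to fit a signed 32-bit range before the division.  Standalone
// sketch of the same arithmetic with an assumed block work size.
#include <cstdint>

uint32_t launch_grid_size(int64_t n, int64_t block_work_size) {
  // n > 0 and bounded, so the rounded-up quotient fits in uint32_t
  return static_cast<uint32_t>((n + block_work_size - 1) / block_work_size);
}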
at::cuda::memcpy_and_sync(&value, self.data_ptr(), sizeof(scalar_t), cudaMemcpyDeviceToHost, stream); diff --git a/aten/src/ATen/native/cuda/ComplexKernel.cu b/aten/src/ATen/native/cuda/ComplexKernel.cu index 6420279704e027..8738c0ab4c8ec4 100644 --- a/aten/src/ATen/native/cuda/ComplexKernel.cu +++ b/aten/src/ATen/native/cuda/ComplexKernel.cu @@ -12,7 +12,7 @@ namespace native { namespace { void complex_kernel_cuda(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.input_dtype(0), "complex_cuda", [&]() { + AT_DISPATCH_FLOATING_TYPES_AND(kHalf, iter.input_dtype(0), "complex_cuda", [&]() { gpu_kernel( iter, [] GPU_LAMBDA(scalar_t a, scalar_t b) -> c10::complex { return c10::complex(a, b); diff --git a/aten/src/ATen/native/cuda/Copy.cu b/aten/src/ATen/native/cuda/Copy.cu index a42a90cbe29306..57f04d481fc5c2 100644 --- a/aten/src/ATen/native/cuda/Copy.cu +++ b/aten/src/ATen/native/cuda/Copy.cu @@ -31,8 +31,8 @@ void direct_copy_kernel_cuda(TensorIteratorBase &iter) { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) { return x; }); }); } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kHalf, kBool, kBFloat16, dtype, "copy_", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kHalf, kBool, kBFloat16, kComplexHalf, dtype, "copy_", [&] { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) { return x; }); }); } diff --git a/aten/src/ATen/native/cuda/Dropout.cu b/aten/src/ATen/native/cuda/Dropout.cu index 528a43646b9b15..6ec054aa60504f 100644 --- a/aten/src/ATen/native/cuda/Dropout.cu +++ b/aten/src/ATen/native/cuda/Dropout.cu @@ -1,6 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#include +#include #include #include #include @@ -11,6 +14,17 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at{ namespace native{ diff --git a/aten/src/ATen/native/cuda/Embedding.cu b/aten/src/ATen/native/cuda/Embedding.cu index f4b5f160b5256d..8a241cabcd2d36 100644 --- a/aten/src/ATen/native/cuda/Embedding.cu +++ b/aten/src/ATen/native/cuda/Embedding.cu @@ -1,5 +1,7 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include @@ -17,6 +19,18 @@ #include #endif +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu b/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu index ef7eb942f26e05..1a2c7627fc730b 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu +++ b/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu @@ -1,15 +1,26 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include +#include +#include +#include #include -#include #include -#include - #include +#if CUB_SUPPORTS_UNIQUE_BY_KEY() +#include +#endif + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace at { namespace native { @@ -35,7 +46,8 @@ int64_t ceil_div(int64_t x, int64_t y) { template __global__ void krn_partials_per_segment(index_t *ret, const index_t *segment_offsets, - int64_t num_of_segments, int64_t numel) { + int64_t *num_of_segments_ptr, int64_t numel) { + int64_t num_of_segments = *num_of_segments_ptr; const int id = blockIdx.x * blockDim.x + threadIdx.x; if(id < num_of_segments) { const int64_t idx_start = segment_offsets[id]; @@ -52,7 +64,8 @@ void krn_partial_segment_offset( 
const index_t *partials_per_segment, const index_t *partials_per_segment_offset, const index_t *segment_offsets, - int64_t num_of_segments) { + int64_t *num_of_segments_ptr) { + int64_t num_of_segments = *num_of_segments_ptr; const int id = blockIdx.x * blockDim.x + threadIdx.x; if(id < num_of_segments) { index_t idx = partials_per_segment_offset[id]; @@ -71,10 +84,11 @@ __global__ void compute_grad_weight_bags( index_t *offset2bag, index_t *count, ptrdiff_t numel, int64_t stride, int mode_mean, const index_t *bag_size, scalar_t* per_sample_weights, int64_t per_sample_weights_stride, - index_t* segment_offsets, int64_t num_of_segments, + index_t* segment_offsets, int64_t *num_of_segments_ptr, acc_type *grad_weight_per_segment, const int64_t stride_warped) { + int64_t num_of_segments = *num_of_segments_ptr; const int gid = blockIdx.x * blockDim.x + threadIdx.x; const int id = gid / stride_warped; const int startFeature = gid % stride_warped; @@ -115,10 +129,11 @@ __global__ void compute_grad_weight( ptrdiff_t numel, int64_t stride, index_t* segment_offsets, - int64_t num_of_segments, + int64_t *num_of_segments_ptr, acc_type *grad_weight_per_segment, const int64_t stride_warped) { + int64_t num_of_segments = *num_of_segments_ptr; using accscalar_t = acc_type; const int gid = blockIdx.x * blockDim.x + threadIdx.x; const int id = gid / stride_warped; @@ -145,12 +160,14 @@ __global__ void compute_grad_weight( template __global__ void sum_and_scatter( index_t *input, scalar_t *gradWeight, int64_t stride, - index_t* segment_offsets, int64_t num_of_segments, + index_t* segment_offsets, int64_t *num_of_segments_ptr, const acc_type *grad_weight_per_segment, - const index_t *segment_sizes_offsets, int64_t num_of_partial_segments, + const index_t *segment_sizes_offsets, int64_t *num_of_partial_segments_ptr, const int64_t padding_idx, const int64_t stride_warped) { + int64_t num_of_segments = *num_of_segments_ptr; + int64_t num_of_partial_segments = *num_of_partial_segments_ptr; const int gid = blockIdx.x * blockDim.x + threadIdx.x; const int id = gid / stride_warped; const int startFeature = gid % stride_warped; @@ -173,10 +190,23 @@ __global__ void sum_and_scatter( } } +template +__global__ void compute_num_of_partial_segments(index_t *partials_per_segment, index_t *partials_per_segment_offset, int64_t *num_of_segments_ptr, int64_t *output) { + int64_t num_of_segments = *num_of_segments_ptr; + *output = partials_per_segment[num_of_segments-1] + + partials_per_segment_offset[num_of_segments-1]; +} + +__global__ void write_num_of_segments_for_legacy_thrust_path(int64_t *num_of_segments_ptr, int64_t num_of_segments) { + *num_of_segments_ptr = num_of_segments; +} + } // anon namespace +#if !CUB_SUPPORTS_UNIQUE_BY_KEY() template int64_t embedding_backward_cuda_kernel_unique_by_key(const Tensor &sorted_indices, Tensor &segment_offsets); +#endif Tensor embedding_backward_cuda_kernel( const Tensor &grad, @@ -200,19 +230,35 @@ Tensor embedding_backward_cuda_kernel( // spawn a warp per index. In this context, a segment is a number of rows that should // be summarized. 
// Unit: index in `sorted_indices` and `orig_indices` + auto segment_offsets = at::empty({numel}, orig_indices.options()); + auto num_of_segments_tensor = at::empty({}, grad.options().dtype(kLong)); + int64_t *num_of_segments_ptr = num_of_segments_tensor.data_ptr(); +#if !CUB_SUPPORTS_UNIQUE_BY_KEY() AT_DISPATCH_INDEX_TYPES(orig_indices.scalar_type(), "embedding_backward_cuda_kernel", [&] () { - auto segment_offsets = at::empty({numel}, orig_indices.options()); int64_t num_of_segments = embedding_backward_cuda_kernel_unique_by_key(sorted_indices, segment_offsets); + write_num_of_segments_for_legacy_thrust_path<<<1, 1, 0, c10::cuda::getCurrentCUDAStream()>>>(num_of_segments_ptr, num_of_segments); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + }); +#else + AT_DISPATCH_INDEX_TYPES(orig_indices.scalar_type(), "embedding_backward_cuda_kernel", [&] () { + auto num_of_segments_tensor = at::empty({}, grad.options().dtype(kLong)); + cuda::cub::unique_by_key( + sorted_indices.data_ptr(), thrust::make_counting_iterator(0), + nullptr, segment_offsets.data_ptr(), + num_of_segments_ptr, sorted_indices.numel()); + }); +#endif + AT_DISPATCH_INDEX_TYPES(orig_indices.scalar_type(), "embedding_backward_cuda_kernel", [&] () { // We split the segments up into sizes of `NROWS_PER_THREAD` // Compute the number partial-segments per segment (some partial-segments // may not be the full `NROWS_PER_THREAD` number of rows) - auto partials_per_segment = at::empty({num_of_segments}, orig_indices.options()); + auto partials_per_segment = at::empty({numel}, orig_indices.options()); { - krn_partials_per_segment<<>> ( + krn_partials_per_segment<<>> ( partials_per_segment.data_ptr(), segment_offsets.data_ptr(), - num_of_segments, + num_of_segments_ptr, numel); C10_CUDA_KERNEL_LAUNCH_CHECK(); } @@ -221,33 +267,38 @@ Tensor embedding_backward_cuda_kernel( // of each partial-segment in `sorted_indices`, we need to compute the // start position of each _segment_ in `partial_segment_offset`. 
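// What the segment_offsets / num_of_segments step above produces, written out
// in plain C++: given the sorted indices, record the first position of each
// run of equal values and count the runs.  Illustrative reference for the
// unique-by-key result, not the CUB or thrust call itself.
#include <cstdint>
#include <vector>

int64_t build_segment_offsets(const std::vector<int64_t>& sorted_indices,
                              std::vector<int64_t>& segment_offsets) {
  segment_offsets.clear();
  for (size_t i = 0; i < sorted_indices.size(); ++i) {
    if (i == 0 || sorted_indices[i] != sorted_indices[i - 1]) {
      segment_offsets.push_back(static_cast<int64_t>(i));  // a new segment starts here
    }
  }
  return static_cast<int64_t>(segment_offsets.size());     // num_of_segments
}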
// Unit: index in `partial_segment_offset` - auto partials_per_segment_offset = at::empty({num_of_segments}, orig_indices.options()); + auto partials_per_segment_offset = at::empty({numel}, orig_indices.options()); cuda::cub::exclusive_sum( partials_per_segment.data_ptr(), partials_per_segment_offset.data_ptr(), - num_of_segments); + numel); // The total number of partial-segments is the sum of `partials_per_segment_offset` - const int num_of_partial_segments = partials_per_segment[num_of_segments-1].item() + - partials_per_segment_offset[num_of_segments-1].item(); + auto num_of_partial_segments_tensor = at::empty({}, grad.options().dtype(kLong)); + int64_t *num_of_partial_segments_ptr = num_of_partial_segments_tensor.data_ptr(); + compute_num_of_partial_segments<<<1, 1, 0, c10::cuda::getCurrentCUDAStream()>>>( + partials_per_segment.data_ptr(), + partials_per_segment_offset.data_ptr(), + num_of_segments_ptr, num_of_partial_segments_ptr); + C10_CUDA_KERNEL_LAUNCH_CHECK(); // Now we can compute the start position of each partial-segment // Unit: index in `sorted_indices` and `orig_indices` - auto partial_segment_offset = at::empty({num_of_partial_segments}, orig_indices.options()); + auto partial_segment_offset = at::empty({numel}, orig_indices.options()); { - krn_partial_segment_offset<<>> ( + krn_partial_segment_offset<<>> ( partial_segment_offset.data_ptr(), partials_per_segment.data_ptr(), partials_per_segment_offset.data_ptr(), segment_offsets.data_ptr(), - num_of_segments); + num_of_segments_ptr); C10_CUDA_KERNEL_LAUNCH_CHECK(); } const int warp_size = at::cuda::warp_size(); const int stride_warped = ceil_div(stride, warp_size)*warp_size; const int block = std::min(stride_warped, MAX_BLOCK_SIZE); - const int grid = ceil_div(num_of_partial_segments*stride_warped, block); + const int grid = ceil_div(numel*stride_warped, block); AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, grad.scalar_type(), "embedding_bag_backward_cuda_compute_grad_weight", [&] { @@ -260,7 +311,7 @@ Tensor embedding_backward_cuda_kernel( } else { op = grad.options(); } - auto grad_weight_per_segment = at::empty({num_of_partial_segments, stride}, op); + auto grad_weight_per_segment = at::empty({numel, stride}, op); // Compute the sum of each partial-segment and handle bags if (offset2bag.defined()) { compute_grad_weight_bags<<>>( @@ -272,7 +323,7 @@ Tensor embedding_backward_cuda_kernel( per_sample_weights.defined() ? per_sample_weights.data_ptr() : NULL, per_sample_weights.defined() ? per_sample_weights.stride(0) : 0, partial_segment_offset.data_ptr(), - num_of_partial_segments, grad_weight_per_segment.data_ptr(), + num_of_partial_segments_ptr, grad_weight_per_segment.data_ptr(), stride_warped); C10_CUDA_KERNEL_LAUNCH_CHECK(); } else { @@ -282,7 +333,7 @@ Tensor embedding_backward_cuda_kernel( count.defined() ? count.data_ptr() : nullptr, numel, stride, partial_segment_offset.data_ptr(), - num_of_partial_segments, + num_of_partial_segments_ptr, grad_weight_per_segment.data_ptr(), stride_warped); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -290,15 +341,15 @@ Tensor embedding_backward_cuda_kernel( // Finally, we sum all the partial-sums and scatter them // into `grad_weight`. 
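// Arithmetic behind compute_num_of_partial_segments above: after the exclusive
// prefix sum of partials_per_segment, the total number of partial segments is
// the last count plus the last exclusive offset, i.e. the full sum.  Plain C++
// reference with a made-up function name.
#include <cstdint>
#include <vector>

int64_t total_partial_segments(const std::vector<int64_t>& partials_per_segment) {
  std::vector<int64_t> offsets(partials_per_segment.size());
  int64_t running = 0;
  for (size_t i = 0; i < partials_per_segment.size(); ++i) {
    offsets[i] = running;                  // exclusive prefix sum
    running += partials_per_segment[i];
  }
  return partials_per_segment.empty()
      ? 0
      : partials_per_segment.back() + offsets.back();   // equals `running`
}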
- const int grid2 = ceil_div(num_of_segments*stride_warped, block); + const int grid2 = ceil_div(numel*stride_warped, block); sum_and_scatter<<>>( sorted_indices.data_ptr(), grad_weight.data_ptr(), stride, segment_offsets.data_ptr(), - num_of_segments, grad_weight_per_segment.data_ptr(), + num_of_segments_ptr, grad_weight_per_segment.data_ptr(), partials_per_segment_offset.data_ptr(), - num_of_partial_segments, + num_of_partial_segments_ptr, padding_idx, stride_warped); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cuh b/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cuh index 7b8fc9576e2178..0d8d45c1defb90 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cuh +++ b/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cuh @@ -1,10 +1,8 @@ -#include +#pragma once +#include #include #include #include -#include - -#pragma once namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/EmbeddingBag.cu b/aten/src/ATen/native/cuda/EmbeddingBag.cu index c6701aba07b5c8..7ac3a7151b79c5 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBag.cu +++ b/aten/src/ATen/native/cuda/EmbeddingBag.cu @@ -1,12 +1,26 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include #include #include #include -#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif #include #include @@ -53,7 +67,7 @@ __global__ void EmbeddingBag_updateOutputKernel_max( index_t *offset2bag, int64_t numIndices, int64_t numBags, int64_t featureSize, int64_t weight_stride0, int64_t weight_stride1, index_t *bag_size, index_t *max_indices, - index_t padding_idx, int64_t vocab_size) { + index_t padding_idx) { // the strategy here is that each bag x feature is handled by a single thread @@ -74,7 +88,6 @@ __global__ void EmbeddingBag_updateOutputKernel_max( int64_t bag_size_ = 0; int64_t maxWord = -1; for (int64_t emb = begin; emb < end; emb++) { - CUDA_KERNEL_ASSERT(input[emb] >= 0 && input[emb] < vocab_size); bool pad = (input[emb] == padding_idx); const int64_t weightRow = input[emb] * weight_stride0; scalar_t weightValue = weightFeat[weightRow]; @@ -104,7 +117,7 @@ __global__ void EmbeddingBag_updateOutputKernel_sum_mean( int64_t featureSize, int64_t weight_stride0, int64_t weight_stride1, int mode, index_t *bag_size, scalar_t* per_sample_weights, int64_t per_sample_weights_stride, - index_t padding_idx, int64_t vocab_size) { + index_t padding_idx) { // the strategy here is that each bag x feature is handled by a single thread @@ -125,7 +138,6 @@ __global__ void EmbeddingBag_updateOutputKernel_sum_mean( accscalar_t weightFeatSum = 0; int64_t bag_size_ = 0; for (int64_t emb = begin; emb < end; emb++) { - CUDA_KERNEL_ASSERT(input[emb] >= 0 && input[emb] < vocab_size); bool pad = (input[emb] == padding_idx); const int64_t weightRow = input[emb] * weight_stride0; scalar_t weightValue = weightFeat[weightRow]; @@ -350,7 +362,6 @@ _embedding_bag_cuda(const Tensor &weight, const Tensor &indices_, numBags -= 1; } int64_t featureSize = weight.size(1); - int64_t vocabSize = weight.size(0); auto bag_size = at::empty(offsets.sizes(), indices.options()); auto offset2bag = @@ -384,7 +395,7 @@ _embedding_bag_cuda(const Tensor &weight, const Tensor &indices_, offset2bag.data_ptr(), numIndices, numBags, featureSize, weight.stride(0), weight.stride(1), bag_size.data_ptr(), max_indices.data_ptr(), - padding_idx, vocabSize); + 
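// CPU reference for the per-bag reduction done by the EmbeddingBag
// updateOutputKernel variants above: one (bag, feature) pair accumulates over
// the embedding rows selected by the indices in that bag, skipping
// padding_idx.  Sum mode only, assuming the feature dimension is contiguous;
// the helper name is made up.
#include <cstdint>

float embedding_bag_sum_feature(const float* weight, int64_t weight_stride0,
                                const int64_t* indices, int64_t begin, int64_t end,
                                int64_t feature, int64_t padding_idx) {
  float acc = 0.f;
  for (int64_t emb = begin; emb < end; ++emb) {
    if (indices[emb] == padding_idx) continue;    // padded entries contribute nothing
    acc += weight[indices[emb] * weight_stride0 + feature];
  }
  return acc;
}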
padding_idx); C10_CUDA_KERNEL_LAUNCH_CHECK(); } else { EmbeddingBag_updateOutputKernel_sum_mean<<>>( @@ -394,7 +405,7 @@ _embedding_bag_cuda(const Tensor &weight, const Tensor &indices_, weight.stride(0), weight.stride(1), mode, bag_size.data_ptr(), per_sample_weights.defined() ? per_sample_weights.data_ptr() : NULL, per_sample_weights.defined() ? per_sample_weights.stride(0) : 0, - padding_idx, vocabSize); + padding_idx); C10_CUDA_KERNEL_LAUNCH_CHECK(); } }); diff --git a/aten/src/ATen/native/cuda/Equal.cpp b/aten/src/ATen/native/cuda/Equal.cpp index 401571b2f1f26c..ab8c9adef4e436 100644 --- a/aten/src/ATen/native/cuda/Equal.cpp +++ b/aten/src/ATen/native/cuda/Equal.cpp @@ -1,6 +1,14 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS #include #include -#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/FillKernel.cu b/aten/src/ATen/native/cuda/FillKernel.cu index 82813338946285..facceccf8028fc 100644 --- a/aten/src/ATen/native/cuda/FillKernel.cu +++ b/aten/src/ATen/native/cuda/FillKernel.cu @@ -19,7 +19,7 @@ struct FillFunctor { }; void fill_kernel_cuda(TensorIterator& iter, const Scalar& value) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, iter.dtype(), "fill_cuda", [&]() { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(kComplexHalf, kBool, kHalf, kBFloat16, iter.dtype(), "fill_cuda", [&]() { gpu_kernel(iter, FillFunctor(value.to())); }); } diff --git a/aten/src/ATen/native/cuda/ForeachReduceOp.cu b/aten/src/ATen/native/cuda/ForeachReduceOp.cu index 0d6848324252d8..05fb1f6a087d14 100644 --- a/aten/src/ATen/native/cuda/ForeachReduceOp.cu +++ b/aten/src/ATen/native/cuda/ForeachReduceOp.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include +#include #include #include #include @@ -24,13 +25,13 @@ namespace native { template struct LpNormFunctor { static_assert(NormType == 1 || NormType == 2, "foreach_norm supports only L1 and L2 norm"); + using opmath_t = typename at::opmath_type; __device__ __forceinline__ void operator() ( int chunk_size, TensorListMetadata& tl, - T* output_per_tensor, + opmath_t* output_per_tensor, const int max_chunks_per_tensor ) { - using opmath_t = typename at::opmath_type; int tensor_loc = tl.block_to_tensor[blockIdx.x]; int chunk_idx = tl.block_to_chunk[blockIdx.x]; int n = tl.numel_for_tensor[tensor_loc]; @@ -82,16 +83,15 @@ struct LpNormFunctor { } }; -template +template> __global__ void lpnorm_cleanup( - T* output_per_tensor, + opmath_t* output_per_tensor, T* ret_per_tensor, int max_chunks_per_tensor) { - using opmath_t = typename at::opmath_type; __shared__ opmath_t vals[512]; - T* output_this_tensor = output_per_tensor + blockIdx.x*max_chunks_per_tensor; - T val = 0; + opmath_t* output_this_tensor = output_per_tensor + blockIdx.x*max_chunks_per_tensor; + opmath_t val = 0; for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) { val += output_this_tensor[i]; } @@ -134,7 +134,7 @@ std::vector foreach_tensor_norm_cuda(TensorList tensors, const Scalar& o } } const auto options = tensors[0].options(); - auto output_per_tensor = at::zeros({ntensors*max_chunks_per_tensor}, options); + auto output_per_tensor = at::zeros({ntensors*max_chunks_per_tensor}, options.dtype(toOpMathType(tensors[0].scalar_type()))); auto ret_per_tensor = at::empty({ntensors}, options); auto tensor_lists = std::vector>{tensors.vec()}; @@ -145,13 +145,13 @@ std::vector 
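// Why output_per_tensor above moves to the opmath dtype: partial sums for a
// norm of Half/BFloat16 tensors are accumulated in float so precision is not
// lost chunk by chunk, and only the final per-tensor result is cast back.
// Scalar sketch of the same idea for an L2 norm; scalar_t stands for a 16-bit
// float type.
#include <cmath>
#include <cstdint>

template <typename scalar_t>
float l2_norm_accumulated_in_float(const scalar_t* data, int64_t n) {
  float acc = 0.f;                          // "opmath" accumulator
  for (int64_t i = 0; i < n; ++i) {
    const float v = static_cast<float>(data[i]);
    acc += v * v;
  }
  return std::sqrt(acc);                    // caller may cast back to scalar_t
}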
foreach_tensor_norm_cuda(TensorList tensors, const Scalar& o multi_tensor_apply<1>( tensor_lists, LpNormFunctor(), - output_per_tensor.data_ptr(), + output_per_tensor.data_ptr(), max_chunks_per_tensor); C10_CUDA_KERNEL_LAUNCH_CHECK(); const at::cuda::OptionalCUDAGuard device_guard(device_of(output_per_tensor)); auto stream = at::cuda::getCurrentCUDAStream(); lpnorm_cleanup<<>>( - output_per_tensor.data_ptr(), + output_per_tensor.data_ptr(), ret_per_tensor.data_ptr(), max_chunks_per_tensor); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -163,13 +163,13 @@ std::vector foreach_tensor_norm_cuda(TensorList tensors, const Scalar& o multi_tensor_apply<1>( tensor_lists, LpNormFunctor(), - output_per_tensor.data_ptr(), + output_per_tensor.data_ptr(), max_chunks_per_tensor); C10_CUDA_KERNEL_LAUNCH_CHECK(); const at::cuda::OptionalCUDAGuard device_guard(device_of(output_per_tensor)); auto stream = at::cuda::getCurrentCUDAStream(); lpnorm_cleanup<<>>( - output_per_tensor.data_ptr(), + output_per_tensor.data_ptr(), ret_per_tensor.data_ptr(), max_chunks_per_tensor); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu index aa898d50a2ce06..46ea4eadf1febe 100644 --- a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu +++ b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu @@ -1,16 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/FractionalMaxPool3d.cu b/aten/src/ATen/native/cuda/FractionalMaxPool3d.cu index 34b238410bb5f2..92a77dc00af539 100644 --- a/aten/src/ATen/native/cuda/FractionalMaxPool3d.cu +++ b/aten/src/ATen/native/cuda/FractionalMaxPool3d.cu @@ -1,17 +1,27 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include #include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/FunctionOfAMatrixUtilsKernel.cu b/aten/src/ATen/native/cuda/FunctionOfAMatrixUtilsKernel.cu index e2f51503133e52..7c04ce4da351d4 100644 --- a/aten/src/ATen/native/cuda/FunctionOfAMatrixUtilsKernel.cu +++ b/aten/src/ATen/native/cuda/FunctionOfAMatrixUtilsKernel.cu @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_NO_OPERATORS #include #include diff --git a/aten/src/ATen/native/cuda/GridSampler.cu b/aten/src/ATen/native/cuda/GridSampler.cu index 153b779bdf37c6..bfc3d86b8ab9ed 100644 --- a/aten/src/ATen/native/cuda/GridSampler.cu +++ b/aten/src/ATen/native/cuda/GridSampler.cu @@ -1,5 +1,6 @@ #define TORCH_ASSERT_NO_OPERATORS #include +#include #include #include #include @@ -739,10 +740,14 @@ namespace { } } // namespace -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. void launch_grid_sampler_2d_forward_kernel( const TensorBase &output, const TensorBase &input, const TensorBase &grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. 
+ check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + auto N = input.size(0); auto H = grid.size(1); auto W = grid.size(2); @@ -777,10 +782,14 @@ void launch_grid_sampler_2d_forward_kernel( } } -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. void launch_grid_sampler_3d_forward_kernel( const TensorBase &output, const TensorBase &input, const TensorBase &grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_3d(input, grid, interpolation_mode); + auto N = input.size(0); auto D = grid.size(1); auto H = grid.size(2); @@ -816,12 +825,16 @@ void launch_grid_sampler_3d_forward_kernel( } } -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. void launch_grid_sampler_2d_backward_kernel( const TensorBase &grad_input, const TensorBase &grad_grid, const TensorBase &grad_output, const TensorBase &input, const TensorBase &grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners, std::array output_mask) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input, grid); + check_grid_sampler_2d(input, grid); + // See Note [Writing Nondeterministic Operations] // Nondeterministic because of atomicAdd usage globalContext().alertNotDeterministic("grid_sampler_2d_backward_cuda"); @@ -873,12 +886,16 @@ void launch_grid_sampler_2d_backward_kernel( } } -// No shape checking needed here. See # NOTE [ grid_sampler Native Functions ]. void launch_grid_sampler_3d_backward_kernel( const TensorBase &grad_input, const TensorBase &grad_grid, const TensorBase& grad_output, const TensorBase& input, const TensorBase& grid, int64_t interpolation_mode, int64_t padding_mode, bool align_corners, std::array output_mask) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. 
+ check_grid_sampler_common(input, grid); + check_grid_sampler_3d(input, grid, interpolation_mode); + // See Note [Writing Nondeterministic Operations] // Nondeterministic because of atomicAdd usage globalContext().alertNotDeterministic("grid_sampler_3d_backward_cuda"); diff --git a/aten/src/ATen/native/cuda/GridSampler.cuh b/aten/src/ATen/native/cuda/GridSampler.cuh index abc86f21749745..a0e3b16c3a43ac 100644 --- a/aten/src/ATen/native/cuda/GridSampler.cuh +++ b/aten/src/ATen/native/cuda/GridSampler.cuh @@ -1,14 +1,9 @@ +#pragma once #include +#include namespace at { namespace native { -namespace detail { - - enum class GridSamplerInterpolation {Bilinear, Nearest, Bicubic}; - enum class GridSamplerPadding {Zeros, Border, Reflection}; - -} // namespace detail - using detail::GridSamplerInterpolation; using detail::GridSamplerPadding; diff --git a/aten/src/ATen/native/cuda/Im2Col.cu b/aten/src/ATen/native/cuda/Im2Col.cu index 053418423adfcc..89b2a1879b4b71 100644 --- a/aten/src/ATen/native/cuda/Im2Col.cu +++ b/aten/src/ATen/native/cuda/Im2Col.cu @@ -1,6 +1,7 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include #include #include #include @@ -10,6 +11,16 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/IndexKernel.cpp b/aten/src/ATen/native/cuda/IndexKernel.cpp index b85baf097559d8..478c96fa6084c3 100644 --- a/aten/src/ATen/native/cuda/IndexKernel.cpp +++ b/aten/src/ATen/native/cuda/IndexKernel.cpp @@ -1,10 +1,21 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include // For at::native::index_out +#include +#include #include -#include #include #include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Indexing.cu b/aten/src/ATen/native/cuda/Indexing.cu index b215968fea5035..85183274ebfcc3 100644 --- a/aten/src/ATen/native/cuda/Indexing.cu +++ b/aten/src/ATen/native/cuda/Indexing.cu @@ -1,11 +1,13 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include +#include #include -#include +#include #include #include +#include #include #include #include @@ -14,6 +16,18 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + #include #include #include @@ -268,10 +282,11 @@ void index_put_with_sort_kernel(Tensor & self, const c10::List(at::cuda::getCurrentDeviceProperties()->maxGridSize[1], ceil_div(sliceSize, (int64_t) (C10_WARP_SIZE*UNROLL))), + std::min(at::cuda::getCurrentDeviceProperties()->maxGridSize[1], ceil_div(sliceSize, (int64_t) (warp_size*UNROLL))), std::min(std::max(1,nElemBefore), at::cuda::getCurrentDeviceProperties()->maxGridSize[2])); - dim3 block(C10_WARP_SIZE, indices_per_block); + dim3 block(warp_size, indices_per_block); AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, expandedValue.scalar_type(), "indexing_backward", [&] { diff --git a/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu b/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu index f8ac9d3ed8f695..b080a6e5eac2ce 100644 --- a/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu +++ b/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu @@ -1,7 +1,14 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include 
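// The Indexing.cu launch configuration above switches from the compile-time
// C10_WARP_SIZE to a warp size queried from the device, so the same sizing
// works on 32- and 64-lane hardware.  Sketch of the sizing arithmetic only;
// the struct and function are hypothetical and the device query itself is
// omitted.
#include <algorithm>
#include <cstdint>

struct LaunchShape { int64_t grid_y; int64_t block_x; };

LaunchShape index_put_launch_shape(int64_t slice_size, int64_t unroll,
                                   int64_t warp_size, int64_t max_grid_y) {
  const int64_t per_block = warp_size * unroll;
  const int64_t grid_y = std::min(max_grid_y,
                                  (slice_size + per_block - 1) / per_block);
  return {std::max<int64_t>(1, grid_y), warp_size};
}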
#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/Loss.cu b/aten/src/ATen/native/cuda/Loss.cu index 6afc89592799bd..1f885ff6fe0b5b 100644 --- a/aten/src/ATen/native/cuda/Loss.cu +++ b/aten/src/ATen/native/cuda/Loss.cu @@ -1,14 +1,28 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include #include -#include +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + constexpr float EPSILON = 1e-12; namespace { diff --git a/aten/src/ATen/native/cuda/LossCTC.cu b/aten/src/ATen/native/cuda/LossCTC.cu index 65508b1a956b0f..4e406f7cd4de9b 100644 --- a/aten/src/ATen/native/cuda/LossCTC.cu +++ b/aten/src/ATen/native/cuda/LossCTC.cu @@ -7,15 +7,32 @@ // Graves et al call the probabilities y, we use log_probs (also calling them inputs) // A few optimizations (similar to those here, but also some I didn't take) are described in // 2. Minmin Sun: http://on-demand.gputechconf.com/gtc/2016/presentation/s6383-minmin-sun-speech-recognition.pdf - +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include -#include +#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/cuda/Math.cuh b/aten/src/ATen/native/cuda/Math.cuh index e063ec7f42fbe3..cbd562f542c546 100644 --- a/aten/src/ATen/native/cuda/Math.cuh +++ b/aten/src/ATen/native/cuda/Math.cuh @@ -7,108 +7,6 @@ namespace at { namespace native { - -// TODO: these functions are unconditionally available because kaiser window depends on them -// TODO: jiterate kaiser window and make them only available when not jiterating -// NOTE: jiterating kaiser window requires extending the jiterator's scalar support -/* - * For licensing information and documentation, please refer to the the cpu implementation located in "ATen/native/Math.h". - */ -template -static inline C10_HOST_DEVICE scalar_t -chbevl(scalar_t _x, const scalar_t array[], size_t len) { - static_assert(!std::is_same() && !std::is_same(), "don't instantiate with low precision type"); - - scalar_t b0, b1, b2; - - b0 = array[0]; - b1 = 0; - - for (size_t i = 1; i < len; ++i) { - b2 = b1; - b1 = b0; - b0 = _x * b1 - b2 + array[i]; - } - - return (0.5 * (b0 - b2)); -} - -/* - * For licensing information and documentation, please refer to the the cpu implementation located in "ATen/native/Math.h". - */ -template -C10_HOST_DEVICE inline std::tuple chebyshev_coefficients_i0e_A() { - /* Chebyshev coefficients for exp(-x) I0(x) - * in the interval [0,8]. - * - * lim(x->0){ exp(-x) I0(x) } = 1. 
- */ - static const T coefficients[] = { - -4.41534164647933937950E-18, 3.33079451882223809783E-17, - -2.43127984654795469359E-16, 1.71539128555513303061E-15, - -1.16853328779934516808E-14, 7.67618549860493561688E-14, - -4.85644678311192946090E-13, 2.95505266312963983461E-12, - -1.72682629144155570723E-11, 9.67580903537323691224E-11, - -5.18979560163526290666E-10, 2.65982372468238665035E-9, - -1.30002500998624804212E-8, 6.04699502254191894932E-8, - -2.67079385394061173391E-7, 1.11738753912010371815E-6, - -4.41673835845875056359E-6, 1.64484480707288970893E-5, - -5.75419501008210370398E-5, 1.88502885095841655729E-4, - -5.76375574538582365885E-4, 1.63947561694133579842E-3, - -4.32430999505057594430E-3, 1.05464603945949983183E-2, - -2.37374148058994688156E-2, 4.93052842396707084878E-2, - -9.49010970480476444210E-2, 1.71620901522208775349E-1, - -3.04682672343198398683E-1, 6.76795274409476084995E-1}; - - return std::make_tuple(coefficients, 30); -} - -template -C10_HOST_DEVICE inline std::tuple chebyshev_coefficients_i0e_B() { - /* Chebyshev coefficients for exp(-x) sqrt(x) I0(x) - * in the inverted interval [8,infinity]. - * - * lim(x->inf){ exp(-x) sqrt(x) I0(x) } = 1/sqrt(2pi). - */ - static const T coefficients[] = { - -7.23318048787475395456E-18, -4.83050448594418207126E-18, - 4.46562142029675999901E-17, 3.46122286769746109310E-17, - -2.82762398051658348494E-16, -3.42548561967721913462E-16, - 1.77256013305652638360E-15, 3.81168066935262242075E-15, - -9.55484669882830764870E-15, -4.15056934728722208663E-14, - 1.54008621752140982691E-14, 3.85277838274214270114E-13, - 7.18012445138366623367E-13, -1.79417853150680611778E-12, - -1.32158118404477131188E-11, -3.14991652796324136454E-11, - 1.18891471078464383424E-11, 4.94060238822496958910E-10, - 3.39623202570838634515E-9, 2.26666899049817806459E-8, - 2.04891858946906374183E-7, 2.89137052083475648297E-6, - 6.88975834691682398426E-5, 3.36911647825569408990E-3, - 8.04490411014108831608E-1}; - - return std::make_tuple(coefficients, 25); -} - -template -static inline C10_HOST_DEVICE scalar_t calc_i0(scalar_t _x) { - static_assert(!std::is_same() && !std::is_same(), "don't instantiate with low precision type"); - // Upcast input for numerical accuracy purposes - // Needed for accurate results if input is bfloat16 or float16 - scalar_t x = ::abs(_x); - - if (x <= scalar_t{8.0}) { - auto coeff_pair = chebyshev_coefficients_i0e_A(); - auto A = std::get<0>(coeff_pair); - auto len = std::get<1>(coeff_pair); - scalar_t y = (x / scalar_t{2.0}) - scalar_t{2.0}; - return (::exp(x) * chbevl(y, A, len)); - } - - auto coeff_pair = chebyshev_coefficients_i0e_B(); - auto B = std::get<0>(coeff_pair); - auto len = std::get<1>(coeff_pair); - return (::exp(x) * chbevl(scalar_t{32.0} / x - scalar_t{2.0}, B, len) / ::sqrt(x)); -} - // See note [Jiterator] // TODO: elaborate in this comment on the structure of math.cuh #if AT_USE_JITERATOR() @@ -276,6 +174,19 @@ const auto ndtri_string = jiterator_stringify( } ); // ndtri_string +const auto log_ndtr_string = jiterator_stringify( + template + T log_ndtr(T x) { + constexpr T SQRT1_2{0.707106781186547524400844362104849039}; // 1/sqrt(2) + T t = x * SQRT1_2; + if (x < T{-1.0}) { + return log(erfcx(-t) / 2) - t * t; + } else { + return log1p(-erfc(t) / 2); + } + } +); // log_ndtr_string + const auto gcd_string = jiterator_stringify( template T gcd(const T a_in, const T b_in) { @@ -555,6 +466,8 @@ const auto entr_string = jiterator_stringify( } ); // entr_string +// NOTE: `kaiser_window_string` depends on `i0_string` +// for its 
implementation. const auto i0_string = jiterator_stringify( template T chbevl(T x, const T array[], const int len) { @@ -629,69 +542,6 @@ const auto i0_string = jiterator_stringify( } ); // i0_string -const auto i0e_string = jiterator_stringify( - template - T chbevl(T x, const T array[], const int len) { - T b0, b1, b2; - - b0 = array[0]; - b1 = 0; - - for (int i = 1; i < len; ++i) { - b2 = b1; - b1 = b0; - b0 = x * b1 - b2 + array[i]; - } - - return T{0.5} * (b0 - b2); - } - - template - T i0e(T _x) { - T x = fabs(_x); - - if (x <= T{8.0}) { - T coefficients[] = { - -4.41534164647933937950E-18, 3.33079451882223809783E-17, - -2.43127984654795469359E-16, 1.71539128555513303061E-15, - -1.16853328779934516808E-14, 7.67618549860493561688E-14, - -4.85644678311192946090E-13, 2.95505266312963983461E-12, - -1.72682629144155570723E-11, 9.67580903537323691224E-11, - -5.18979560163526290666E-10, 2.65982372468238665035E-9, - -1.30002500998624804212E-8, 6.04699502254191894932E-8, - -2.67079385394061173391E-7, 1.11738753912010371815E-6, - -4.41673835845875056359E-6, 1.64484480707288970893E-5, - -5.75419501008210370398E-5, 1.88502885095841655729E-4, - -5.76375574538582365885E-4, 1.63947561694133579842E-3, - -4.32430999505057594430E-3, 1.05464603945949983183E-2, - -2.37374148058994688156E-2, 4.93052842396707084878E-2, - -9.49010970480476444210E-2, 1.71620901522208775349E-1, - -3.04682672343198398683E-1, 6.76795274409476084995E-1}; - - T y = (x / T{2.0}) - T{2.0}; - return chbevl(y, coefficients, int{30}); - } - - // x > 8 - T coefficients[] = { - -7.23318048787475395456E-18, -4.83050448594418207126E-18, - 4.46562142029675999901E-17, 3.46122286769746109310E-17, - -2.82762398051658348494E-16, -3.42548561967721913462E-16, - 1.77256013305652638360E-15, 3.81168066935262242075E-15, - -9.55484669882830764870E-15, -4.15056934728722208663E-14, - 1.54008621752140982691E-14, 3.85277838274214270114E-13, - 7.18012445138366623367E-13, -1.79417853150680611778E-12, - -1.32158118404477131188E-11, -3.14991652796324136454E-11, - 1.18891471078464383424E-11, 4.94060238822496958910E-10, - 3.39623202570838634515E-9, 2.26666899049817806459E-8, - 2.04891858946906374183E-7, 2.89137052083475648297E-6, - 6.88975834691682398426E-5, 3.36911647825569408990E-3, - 8.04490411014108831608E-1}; - - return chbevl(T{32.0} / x - T{2.0}, coefficients, int{25}) / sqrt(x); - } -); // i0e_string - const auto i1_string = jiterator_stringify( template T chbevl(const T x, const T array[], const int len) { @@ -881,6 +731,15 @@ const auto i1e_string = jiterator_stringify( } ); // i1e_string +const auto kaiser_window_string = i0_string + jiterator_stringify( + template + T kaiser_window(T a, T inv_alpha, T beta, T inv_i0_beta) { + T x = a * inv_alpha - T{1}; + T y = max(T{0}, T{1} - x * x); + return i0(beta * sqrt(y)) * inv_i0_beta; + } +); // kaiser_window_string + const auto sinc_string = jiterator_stringify( template T sinc(T a) { @@ -1509,22 +1368,102 @@ static inline C10_HOST_DEVICE scalar_t calc_trigamma(scalar_t in) { return static_cast(sign * result); } +/* + * For licensing information and documentation, please refer to the the cpu implementation located in "ATen/native/Math.h". 
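// Illustrative sketch, not part of the patch: kaiser_window_string above is built by
// concatenating i0_string because the Kaiser window is defined through the modified
// Bessel function I0; in the usual symmetric form
//   w[n] = I0(beta * sqrt(1 - ((n - alpha)/alpha)^2)) / I0(beta),  alpha = (N-1)/2,
// which matches the jitted kernel's x = a*inv_alpha - 1 with precomputed 1/alpha and
// 1/I0(beta). Host-side version, assuming a C++17 standard library that provides
// std::cyl_bessel_i (libstdc++ does, libc++ currently does not):
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static std::vector<double> kaiser_window(int n, double beta) {
  std::vector<double> w(n);
  const double alpha = 0.5 * (n - 1);
  const double inv_i0_beta = 1.0 / std::cyl_bessel_i(0.0, beta);  // precomputed once
  for (int i = 0; i < n; ++i) {
    const double x = (n == 1) ? 0.0 : i / alpha - 1.0;  // maps i into [-1, 1]
    const double y = std::max(0.0, 1.0 - x * x);
    w[i] = std::cyl_bessel_i(0.0, beta * std::sqrt(y)) * inv_i0_beta;
  }
  return w;
}

int main() {
  for (double v : kaiser_window(9, /*beta=*/12.0)) std::printf("%.6f ", v);
  std::printf("\n");
  return 0;
}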
+ */ template -static inline C10_HOST_DEVICE scalar_t calc_i0e(scalar_t _x) { +static inline C10_HOST_DEVICE scalar_t +chbevl(scalar_t _x, const scalar_t array[], size_t len) { static_assert(!std::is_same() && !std::is_same(), "don't instantiate with low precision type"); + + scalar_t b0, b1, b2; + + b0 = array[0]; + b1 = 0; + + for (size_t i = 1; i < len; ++i) { + b2 = b1; + b1 = b0; + b0 = _x * b1 - b2 + array[i]; + } + + return (0.5 * (b0 - b2)); +} + +/* + * For licensing information and documentation, please refer to the the cpu implementation located in "ATen/native/Math.h". + */ +template +C10_HOST_DEVICE inline std::tuple chebyshev_coefficients_i0e_A() { + /* Chebyshev coefficients for exp(-x) I0(x) + * in the interval [0,8]. + * + * lim(x->0){ exp(-x) I0(x) } = 1. + */ + static const T coefficients[] = { + -4.41534164647933937950E-18, 3.33079451882223809783E-17, + -2.43127984654795469359E-16, 1.71539128555513303061E-15, + -1.16853328779934516808E-14, 7.67618549860493561688E-14, + -4.85644678311192946090E-13, 2.95505266312963983461E-12, + -1.72682629144155570723E-11, 9.67580903537323691224E-11, + -5.18979560163526290666E-10, 2.65982372468238665035E-9, + -1.30002500998624804212E-8, 6.04699502254191894932E-8, + -2.67079385394061173391E-7, 1.11738753912010371815E-6, + -4.41673835845875056359E-6, 1.64484480707288970893E-5, + -5.75419501008210370398E-5, 1.88502885095841655729E-4, + -5.76375574538582365885E-4, 1.63947561694133579842E-3, + -4.32430999505057594430E-3, 1.05464603945949983183E-2, + -2.37374148058994688156E-2, 4.93052842396707084878E-2, + -9.49010970480476444210E-2, 1.71620901522208775349E-1, + -3.04682672343198398683E-1, 6.76795274409476084995E-1}; + + return std::make_tuple(coefficients, 30); +} + +template +C10_HOST_DEVICE inline std::tuple chebyshev_coefficients_i0e_B() { + /* Chebyshev coefficients for exp(-x) sqrt(x) I0(x) + * in the inverted interval [8,infinity]. + * + * lim(x->inf){ exp(-x) sqrt(x) I0(x) } = 1/sqrt(2pi). 
+ */ + static const T coefficients[] = { + -7.23318048787475395456E-18, -4.83050448594418207126E-18, + 4.46562142029675999901E-17, 3.46122286769746109310E-17, + -2.82762398051658348494E-16, -3.42548561967721913462E-16, + 1.77256013305652638360E-15, 3.81168066935262242075E-15, + -9.55484669882830764870E-15, -4.15056934728722208663E-14, + 1.54008621752140982691E-14, 3.85277838274214270114E-13, + 7.18012445138366623367E-13, -1.79417853150680611778E-12, + -1.32158118404477131188E-11, -3.14991652796324136454E-11, + 1.18891471078464383424E-11, 4.94060238822496958910E-10, + 3.39623202570838634515E-9, 2.26666899049817806459E-8, + 2.04891858946906374183E-7, 2.89137052083475648297E-6, + 6.88975834691682398426E-5, 3.36911647825569408990E-3, + 8.04490411014108831608E-1}; + + return std::make_tuple(coefficients, 25); +} + +template +static inline C10_HOST_DEVICE scalar_t calc_i0(scalar_t _x) { + static_assert(!std::is_same() && !std::is_same(), "don't instantiate with low precision type"); + // Upcast input for numerical accuracy purposes + // Needed for accurate results if input is bfloat16 or float16 scalar_t x = ::abs(_x); + if (x <= scalar_t{8.0}) { auto coeff_pair = chebyshev_coefficients_i0e_A(); auto A = std::get<0>(coeff_pair); auto len = std::get<1>(coeff_pair); scalar_t y = (x / scalar_t{2.0}) - scalar_t{2.0}; - return (chbevl(y, A, len)); + return (::exp(x) * chbevl(y, A, len)); } auto coeff_pair = chebyshev_coefficients_i0e_B(); auto B = std::get<0>(coeff_pair); auto len = std::get<1>(coeff_pair); - return (chbevl(scalar_t{32.0} / x - scalar_t{2.0}, B, len) / ::sqrt(x)); + return (::exp(x) * chbevl(scalar_t{32.0} / x - scalar_t{2.0}, B, len) / ::sqrt(x)); } template diff --git a/aten/src/ATen/native/cuda/MaxUnpooling.cu b/aten/src/ATen/native/cuda/MaxUnpooling.cu index 73db29deb4aa09..085f0d9f37b37f 100644 --- a/aten/src/ATen/native/cuda/MaxUnpooling.cu +++ b/aten/src/ATen/native/cuda/MaxUnpooling.cu @@ -1,11 +1,25 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include + +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/MultiLabelMarginCriterion.cu b/aten/src/ATen/native/cuda/MultiLabelMarginCriterion.cu index 88c88ce0ad8076..7f61d9a0b5b03f 100644 --- a/aten/src/ATen/native/cuda/MultiLabelMarginCriterion.cu +++ b/aten/src/ATen/native/cuda/MultiLabelMarginCriterion.cu @@ -1,12 +1,22 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/MultiMarginLoss.cu b/aten/src/ATen/native/cuda/MultiMarginLoss.cu index fcf0a6a2356a3e..15e6d1e9dc0c33 100644 --- a/aten/src/ATen/native/cuda/MultiMarginLoss.cu +++ b/aten/src/ATen/native/cuda/MultiMarginLoss.cu @@ -1,9 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { @@ -114,7 +126,7 @@ __global__ void MultiMarginLoss_backward_kernel( } } -void multi_margin_loss_shape_check( +void multi_margin_loss_shape_check(int &nframe, const Tensor &input, const 
Tensor &target) { auto in_sizes = input.sizes(); auto dims = in_sizes.size(); @@ -124,7 +136,7 @@ void multi_margin_loss_shape_check( "Expected non-empty vector or matrix with optional 0-dim batch size, but got: ", in_sizes); - int64_t nframe = dims <= 1 ? 1 : in_sizes[0]; + nframe = dims <= 1 ? 1 : in_sizes[0]; TORCH_CHECK( target.dim() <= 1 && target.numel() == nframe, "inconsistent target size, expected ", nframe, " but got ", @@ -138,16 +150,16 @@ Tensor& multi_margin_loss_cuda_out( const c10::optional &weights_, int64_t reduction, Tensor& out_) { auto p = p_.toLong(); TORCH_CHECK(p == 1 || p == 2, "multi_margin_loss: Invalid p, expected 1 or 2 but got ", p); - multi_margin_loss_shape_check(input_, target_); - if (reduction == at::Reduction::None) { - resize_output(out_, target_.sizes()); - } else if (input_.dim() == 2) { - resize_output(out_, {input_.sizes()[0]}); + int nframe; + multi_margin_loss_shape_check(nframe, input_, target_); + + // produce a scalar output for 1d input + if (reduction == Reduction::None && target_.dim() > 0) { + resize_output(out_, {nframe}); } else { resize_output(out_, {}); } - if (input_.numel() == 0) { return out_; } @@ -166,7 +178,6 @@ Tensor& multi_margin_loss_cuda_out( AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "multi_margin_loss_cuda", [&] { const scalar_t margin = margin_.to(); if (input.dim() <= 1) { - int nframe = 1; TORCH_CHECK(target.dim() <= 1 && target.numel() == nframe, "inconsistent target size"); dim3 blocks(1); dim3 threads(MULTIMARGIN_THREADS); @@ -196,7 +207,6 @@ Tensor& multi_margin_loss_cuda_out( } else { auto in_sizes = input.sizes(); TORCH_INTERNAL_ASSERT(in_sizes.size() == 2); - int nframe = in_sizes[0]; // allow zero-dim target for 2D input. TORCH_CHECK(in_sizes[1] != 0 && target.dim() <= 1 && target.numel() == nframe, "inconsistent target size"); @@ -248,7 +258,7 @@ Tensor& multi_margin_loss_cuda_out( margin); C10_CUDA_KERNEL_LAUNCH_CHECK(); } - at::sum_out(out, tmp_output, /*dims=*/IntArrayRef{}); + at::sum_out(out, tmp_output, IntArrayRef{}); } } }); @@ -262,7 +272,7 @@ Tensor& multi_margin_loss_cuda_out( Tensor multi_margin_loss_cuda( const Tensor &input, const Tensor &target, const Scalar &p, const Scalar &margin, const c10::optional &weights, int64_t reduction) { - auto out = at::empty({}, input.options()); + auto out = at::empty({0}, input.options()); multi_margin_loss_cuda_out(input, target, p, margin, weights, reduction, out); return out; } @@ -274,7 +284,8 @@ Tensor& multi_margin_loss_cuda_backward_out( auto p = p_.toLong(); TORCH_CHECK(p == 1 || p == 2, "multi_margin_loss_backward: Invalid p, expected 1 or 2 but got ", p); - multi_margin_loss_shape_check(input_, target_); + int nframe; + multi_margin_loss_shape_check(nframe, input_, target_); resize_output(grad_input_, input_.sizes()); if (input_.numel() == 0) { @@ -331,7 +342,6 @@ Tensor& multi_margin_loss_cuda_backward_out( } else { auto in_sizes = input.sizes(); TORCH_INTERNAL_ASSERT(in_sizes.size() == 2); - int nframe = in_sizes[0]; TORCH_CHECK((in_sizes[1] != 0) && (target.dim() <= 1) && (target.numel() == nframe), "inconsistent target size"); dim3 blocks(in_sizes[0]); diff --git a/aten/src/ATen/native/cuda/MultinomialKernel.cu b/aten/src/ATen/native/cuda/MultinomialKernel.cu index f9404fab0193fc..de8e8404ac2ddc 100644 --- a/aten/src/ATen/native/cuda/MultinomialKernel.cu +++ b/aten/src/ATen/native/cuda/MultinomialKernel.cu @@ -1,8 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include -#include 
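// Illustrative sketch, not part of the patch: the MultiMarginLoss hunk above changes
// how the output is shaped (and switches the empty allocation from at::empty({}) to
// at::empty({0})). The post-patch shape rule, written as a tiny pure function; the
// helper name and Reduction enum here are stand-ins for illustration only:
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Reduction { None, Mean, Sum };

// in_sizes: sizes of `input` (1-D [C] or 2-D [N, C]); target_dim: target.dim().
static std::vector<int64_t> multi_margin_loss_out_shape(
    const std::vector<int64_t>& in_sizes, int64_t target_dim, Reduction r) {
  const int64_t nframe = in_sizes.size() <= 1 ? 1 : in_sizes[0];
  if (r == Reduction::None && target_dim > 0) {
    return {nframe};  // one loss value per row
  }
  return {};          // 0-dim (scalar) output, including the 1-D input case
}

int main() {
  auto s1 = multi_margin_loss_out_shape({4, 10}, 1, Reduction::None);  // -> [4]
  auto s2 = multi_margin_loss_out_shape({4, 10}, 1, Reduction::Mean);  // -> []
  auto s3 = multi_margin_loss_out_shape({10}, 0, Reduction::None);     // -> []
  std::printf("%zu %zu %zu\n", s1.size(), s2.size(), s3.size());
  return 0;
}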
+#include +#include #include #include #include @@ -11,6 +12,16 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + #include #include #include @@ -74,12 +85,13 @@ void renormRows(Tensor& t) { const int64_t maxThreads = std::min( props->maxThreadsPerBlock, cuda_utils::kCUDABlockReduceMaxThreads); + int warp_size = at::cuda::warp_size(); dim3 grid(rows < numSM * 4 ? rows : numSM * 4); - dim3 block(std::min(maxThreads, C10_WARP_SIZE * ceil_div(cols, int64_t{C10_WARP_SIZE}))); + dim3 block(std::min(maxThreads, warp_size * ceil_div(cols, int64_t{warp_size}))); AT_DISPATCH_FLOATING_TYPES_AND_HALF(t.scalar_type(), "renormRows_cuda", [&] { renormRowsL1 - <<>>(t.data_ptr(), rows, cols); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -335,8 +347,9 @@ void multinomial_with_replacement_kernel_impl( int maxThreads = props->maxThreadsPerBlock; int maxShared = props->sharedMemPerBlock; - int requiredWarps = at::ceil_div(numCategories, C10_WARP_SIZE); - int requiredThreads = std::min(maxThreads, requiredWarps * C10_WARP_SIZE); + int warp_size = at::cuda::warp_size(); + int requiredWarps = at::ceil_div(numCategories, warp_size); + int requiredThreads = std::min(maxThreads, requiredWarps * warp_size); int requiredShared = requiredThreads * sizeof(accscalar_t); if (n_sample == 1 && maxShared >= requiredShared) { diff --git a/aten/src/ATen/native/cuda/NLLLoss2d.cu b/aten/src/ATen/native/cuda/NLLLoss2d.cu index 79cec9f8da3ed0..2246c836f3dcad 100644 --- a/aten/src/ATen/native/cuda/NLLLoss2d.cu +++ b/aten/src/ATen/native/cuda/NLLLoss2d.cu @@ -1,7 +1,7 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include #include #include #include @@ -12,6 +12,16 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu index a04d118b750247..75b4e335754053 100644 --- a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu +++ b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu @@ -1,6 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#include #include -#include +#include #include #include #include @@ -9,7 +12,16 @@ #include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu index 1198555d144ec4..d34de0f156bd67 100644 --- a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu +++ b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu @@ -1,6 +1,7 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include #include #include @@ -10,6 +11,17 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/NaiveDilatedConvolution.cu b/aten/src/ATen/native/cuda/NaiveDilatedConvolution.cu index 2c2c11f2246720..6c2942b05de39f 100644 --- a/aten/src/ATen/native/cuda/NaiveDilatedConvolution.cu +++ b/aten/src/ATen/native/cuda/NaiveDilatedConvolution.cu @@ -1,12 +1,25 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include 
+#include +#include #include #include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/cuda/Nonzero.cu b/aten/src/ATen/native/cuda/Nonzero.cu index dcacf98a80070b..0e524b7b81fd72 100644 --- a/aten/src/ATen/native/cuda/Nonzero.cu +++ b/aten/src/ATen/native/cuda/Nonzero.cu @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -6,6 +8,13 @@ #include //for MAX_DIMS #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Normalization.cu b/aten/src/ATen/native/cuda/Normalization.cu index 2f9484770ad44d..e7b2372a18dad2 100644 --- a/aten/src/ATen/native/cuda/Normalization.cu +++ b/aten/src/ATen/native/cuda/Normalization.cu @@ -1,3 +1,4 @@ +// #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include @@ -7,6 +8,30 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + +// TODO: Doesn't exist in this branch +#if 0 +#include +#else +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/Normalization.cuh b/aten/src/ATen/native/cuda/Normalization.cuh index 6d2c806ea3771b..a9b11e76db680b 100644 --- a/aten/src/ATen/native/cuda/Normalization.cuh +++ b/aten/src/ATen/native/cuda/Normalization.cuh @@ -1,6 +1,7 @@ #pragma once -#include +#include +#include #include #include #include @@ -9,6 +10,14 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { // The maximum number of threads in a block diff --git a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh index 6fbbe1f3be472e..4ad544aeb47df8 100644 --- a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh +++ b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh @@ -126,7 +126,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc if (!is_transformer_mask) { idx += i*element_count; } - if (mask[idx]) { + if (!mask[idx]) { max_value[i] = (is_meaningful_max && max_value[i] > elements[i][it]) ? max_value[i] : elements[i][it]; is_meaningful_max = true; } @@ -160,7 +160,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc idx += i*element_count; } - if (mask[idx]) { + if (!mask[idx]) { if (is_log_softmax) { sum[i] += std::exp(elements[i][it] - max_value[i]); } else { @@ -188,7 +188,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc if (!is_transformer_mask) { idx += i*element_count; } - if (!mask[idx]) { + if (mask[idx]) { dst[i*element_count+it*WARP_SIZE] = 0; continue; } @@ -297,7 +297,8 @@ void dispatch_softmax_forward(output_t *dst, const input_t *src, int softmax_ele const int next_power_of_two = 1 << log2_elements; // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_forward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; + int warp_size = at::cuda::warp_size(); + warp_size = (next_power_of_two < warp_size) ? 
next_power_of_two : warp_size; // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_forward. int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; @@ -346,7 +347,8 @@ void dispatch_softmax_backward(output_t *grad_input, const input_t *grad, const const int next_power_of_two = 1 << log2_elements; // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_backward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; + int warp_size = at::cuda::warp_size(); + warp_size = (next_power_of_two < warp_size) ? next_power_of_two : warp_size; // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_backward. int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; diff --git a/aten/src/ATen/native/cuda/PointwiseOpsKernel.cu b/aten/src/ATen/native/cuda/PointwiseOpsKernel.cu index 5e42326056c194..b1c4a2ae4b411b 100644 --- a/aten/src/ATen/native/cuda/PointwiseOpsKernel.cu +++ b/aten/src/ATen/native/cuda/PointwiseOpsKernel.cu @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -10,28 +11,88 @@ namespace at { namespace native { +const char addcmul_name[] = "addcmul"; void addcmul_cuda_kernel(TensorIteratorBase& iter, const Scalar& value) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kHalf, kBFloat16, iter.dtype(), "addcmul_cuda", [&]() { - // note(mkozuki): If scalar_t is fp16 or bfloat16, cast scalar to float - // and do math in fp32 for better accuracy. - using accscalar_t = at::acc_type; - auto alpha = value.to(); - gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { - return a + alpha * (static_cast(b) * static_cast(c)); + auto dtype = iter.dtype(); + if (at::isComplexType(dtype)) { + #if AT_USE_JITERATOR() + AT_DISPATCH_COMPLEX_TYPES(dtype, "addcmul_cuda", [&]() { + auto alpha = value.to(); + static const auto addcmul_string = jiterator_stringify( + template T addcmul(T a, T b, T c, T alpha) { return a + alpha * (b * c); }); + jitted_gpu_kernel< + /*name=*/addcmul_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/3>( + iter, + addcmul_string, + /*scalar_pos=*/at::cuda::jit::BinaryFuncVariant::NoScalar, + /*scalar_val=*/0, + /*extra_args=*/std::make_tuple(alpha)); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(dtype, "addcmul_cuda", [&]() { + auto alpha = value.to(); + gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { + return a + alpha * b * c; + }); + }); + #endif + } else { + AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, dtype, "addcmul_cuda", [&]() { + // note(mkozuki): If scalar_t is fp16 or bfloat16, cast scalar to float + // and do math in fp32 for better accuracy. + using accscalar_t = at::acc_type; + auto alpha = value.to(); + gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { + return a + alpha * (static_cast(b) * static_cast(c)); + }); }); - }); + } } +// return a + alpha * (b / static_cast(c)); +const char addcdiv_name[] = "addcdiv"; void addcdiv_cuda_kernel(TensorIteratorBase& iter, const Scalar& value) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kHalf, kBFloat16, iter.dtype(), "addcdiv_cuda", [&]() { - // note(mkozuki): If scalar_t is fp16 or bfloat16, cast scalar to float - // and do math in fp32 for better accuracy. 
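// Illustrative sketch, not part of the patch: the note(mkozuki) comment above describes
// the usual mixed-precision trick kept by this hunk: store in the narrow type but do the
// arithmetic in a wider accumulation type, then cast back once. Generic host version of
// the addcmul/addcdiv lambdas, with float standing in for the narrow type and double for
// the accumulation type (on the GPU the roles are fp16/bf16 and float):
#include <cstdio>

template <typename scalar_t, typename acc_t>
scalar_t addcmul_one(scalar_t a, scalar_t b, scalar_t c, acc_t alpha) {
  // promote before the multiply so the intermediate product cannot overflow or
  // round in the narrow type; narrow only the final result
  return static_cast<scalar_t>(static_cast<acc_t>(a) +
                               alpha * (static_cast<acc_t>(b) * static_cast<acc_t>(c)));
}

template <typename scalar_t, typename acc_t>
scalar_t addcdiv_one(scalar_t a, scalar_t b, scalar_t c, acc_t alpha) {
  return static_cast<scalar_t>(static_cast<acc_t>(a) +
                               alpha * (static_cast<acc_t>(b) / static_cast<acc_t>(c)));
}

int main() {
  // In fp16 the intermediate 300*300 = 90000 would already be inf (max finite fp16 is
  // 65504), while alpha*(b*c) = 90 is representable; promoting the operands avoids that.
  std::printf("%g\n", addcmul_one<float, double>(1.0f, 300.0f, 300.0f, 1e-3));
  std::printf("%g\n", addcdiv_one<float, double>(1.0f, 1.0f, 3.0f, 0.5));
  return 0;
}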
- using accscalar_t = at::acc_type; - auto alpha = value.to(); - gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { - return a + alpha * (b / static_cast(c)); + auto dtype = iter.dtype(); + if (at::isComplexType(dtype)) { + #if AT_USE_JITERATOR() + AT_DISPATCH_COMPLEX_TYPES(dtype, "addcdiv_cuda", [&]() { + auto alpha = value.to(); + static const auto addcdiv_string = + jiterator_stringify(template T addcdiv( + T a, T b, T c, T alpha) { return a + alpha * (b / c); }); + jitted_gpu_kernel< + /*name=*/addcdiv_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/3>( + iter, + addcdiv_string, + /*scalar_pos=*/at::cuda::jit::BinaryFuncVariant::NoScalar, + /*scalar_val=*/0, + /*extra_args=*/std::make_tuple(alpha)); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(dtype, "addcdiv_cuda", [&]() { + auto alpha = value.to(); + gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { + return a + alpha * (b / c); + }); + }); + #endif + } else { + AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, dtype, "addcdiv_cuda", [&]() { + // note(mkozuki): If scalar_t is fp16 or bfloat16, cast scalar to float + // and do math in fp32 for better accuracy. + using accscalar_t = at::acc_type; + auto alpha = value.to(); + gpu_kernel(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b, scalar_t c) -> scalar_t { + return a + alpha * (b / static_cast(c)); + }); }); - }); + } } void smooth_l1_backward_cuda_kernel(TensorIterator& iter, const Scalar& norm, double beta) { diff --git a/aten/src/ATen/native/cuda/RNN.cu b/aten/src/ATen/native/cuda/RNN.cu index 659ddc28c4979d..046bbe4a5c0421 100644 --- a/aten/src/ATen/native/cuda/RNN.cu +++ b/aten/src/ATen/native/cuda/RNN.cu @@ -1,11 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/Randperm.cu b/aten/src/ATen/native/cuda/Randperm.cu index f0c41f5be444fb..b3c679f7772449 100644 --- a/aten/src/ATen/native/cuda/Randperm.cu +++ b/aten/src/ATen/native/cuda/Randperm.cu @@ -1,9 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/cuda/RangeFactories.cu b/aten/src/ATen/native/cuda/RangeFactories.cu index 027806ed421617..55981ac1ad8e36 100644 --- a/aten/src/ATen/native/cuda/RangeFactories.cu +++ b/aten/src/ATen/native/cuda/RangeFactories.cu @@ -1,6 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include #include @@ -8,20 +8,39 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #define GPU_LAMBDA __device__ __host__ namespace { -constexpr int num_threads = C10_WARP_SIZE * 2; +#if defined(USE_ROCM) +constexpr int num_threads() { + return 128; +} +#else +constexpr int num_threads() { + return C10_WARP_SIZE * 2; +} +#endif constexpr int thread_work_size = 1; -constexpr int block_work_size = thread_work_size * num_threads; +constexpr int block_work_size = thread_work_size * num_threads(); template -C10_LAUNCH_BOUNDS_1(num_threads) 
+C10_LAUNCH_BOUNDS_1(num_threads()) __global__ void elementwise_kernel_with_index(index_t N, func_t f, typename function_traits::result_type *data) { #pragma unroll for (int i = 0; i < thread_work_size; i++) { - index_t idx = block_work_size * blockIdx.x + num_threads * i + threadIdx.x; + index_t idx = block_work_size * blockIdx.x + num_threads() * i + threadIdx.x; if (idx < N) { data[idx] = f(idx); } @@ -38,10 +57,10 @@ void gpu_kernel_with_index(at::Tensor &output, func_t f) { auto stream = at::cuda::getCurrentCUDAStream(); using scalar_t = typename function_traits::result_type; if (N <= std::numeric_limits::max()) { - elementwise_kernel_with_index<<>>(N, f, output.data_ptr()); + elementwise_kernel_with_index<<>>(N, f, output.data_ptr()); C10_CUDA_KERNEL_LAUNCH_CHECK(); } else { - elementwise_kernel_with_index<<>>(N, f, output.data_ptr()); + elementwise_kernel_with_index<<>>(N, f, output.data_ptr()); C10_CUDA_KERNEL_LAUNCH_CHECK(); } } diff --git a/aten/src/ATen/native/cuda/RecordStream.cu b/aten/src/ATen/native/cuda/RecordStream.cu index d48561df00e5c5..c4cb74bdc68ffd 100644 --- a/aten/src/ATen/native/cuda/RecordStream.cu +++ b/aten/src/ATen/native/cuda/RecordStream.cu @@ -1,5 +1,13 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { void record_stream_cuda(Tensor& self, c10::Stream stream) { c10::cuda::CUDACachingAllocator::recordStream(self.storage().data_ptr(), at::cuda::CUDAStream::unpack(stream.pack())); diff --git a/aten/src/ATen/native/cuda/Reduce.cu b/aten/src/ATen/native/cuda/Reduce.cu index 103a386ff0c99c..2de32f6d4a35e0 100644 --- a/aten/src/ATen/native/cuda/Reduce.cu +++ b/aten/src/ATen/native/cuda/Reduce.cu @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_NO_OPERATORS #include #include diff --git a/aten/src/ATen/native/cuda/Reduce.cuh b/aten/src/ATen/native/cuda/Reduce.cuh index 5ee3757d5937ca..57fa55fbec7d5c 100644 --- a/aten/src/ATen/native/cuda/Reduce.cuh +++ b/aten/src/ATen/native/cuda/Reduce.cuh @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -17,6 +18,9 @@ #include #include +#include +#include + namespace at { namespace native { using at::detail::Array; @@ -272,6 +276,65 @@ func_wrapper_t func_wrapper(const func_t& op) { return func_wrapper_t { op }; } +template +struct ReduceJitOp { +//ReduceJitOp is almost like ReduceOp, but it doesn't have ops functor that specifies reduction operations +//Maybe we can find a way to unify ReduceOp and ReduceJitOp + using InputCalculator = OffsetCalculator<1, uint32_t>; + using OutputCalculator = OffsetCalculator<2, uint32_t>; + //TODO for now arg_t is always opmath_t of the input, later we'll need to change it + using arg_t = at::opmath_type; + + static constexpr int input_vec_size = ReduceConfig::input_vec_size; + //TODO - ReduceJitOp will probably need to be changed for reductions that need full functor, + //not just wrapper + arg_t ident; + ReduceConfig config; + InputCalculator input_calc; + OutputCalculator output_calc; + const void* src; + const char* dst[2]; //it accepts at most two destinations + // acc_buf used for accumulation among sub Tensor Iterator when accumulation on + // output is not permissible + void* acc_buf; + // cta_buf used for accumulation between blocks during global reduction + void* cta_buf; + int* semaphores; + int64_t base_idx; + bool accumulate; + bool final_output; + int noutputs; + + ReduceJitOp( + ReduceConfig config, + InputCalculator input_calc, 
+ OutputCalculator output_calc, + const void* src, + char* dst0, + optional dst1, + void* acc_buf, + void* cta_buf, + int* semaphores, + arg_t ident, + int noutputs, + int64_t base_idx) + : ident(ident), + config(config), + input_calc(input_calc), + output_calc(output_calc), + src(src), + acc_buf(acc_buf), + cta_buf(cta_buf), + semaphores(semaphores), + base_idx(base_idx), + noutputs(noutputs) { + dst[0] = dst0; + if (dst1.has_value()) { + dst[1] = dst1.value(); + } + } +}; + template struct ReduceOp { using traits = function_traits; @@ -284,8 +347,6 @@ struct ReduceOp { std::is_convertible::value && std::is_convertible::value; - static constexpr float acc_buffer_multiplier = (float)sizeof(arg_t) / sizeof(out_scalar_t); - static constexpr int input_vec_size = ReduceConfig::input_vec_size; ops_t ops; @@ -837,6 +898,47 @@ static void launch_reduce_kernel(const ReduceConfig& config, const R& reduction) } } +template +static void launch_jitted_reduce_kernel(DeviceIndex idx, const ReduceConfig& config, +R& reduction, const std::string& func) { + constexpr int max_threads = mnt_wrapper::MAX_NUM_THREADS; + dim3 block = config.block(); + dim3 grid = config.grid(); + + static std::mutex _jiterator_mutex; + static std::vector> fns(c10::cuda::device_count()); + int shared_memory = config.shared_memory_size(); + at::cuda::jit::NvrtcFunction* fn_ptr; + switch(config.output_vec_size) { + case 4: + fn_ptr = &fns[idx][0]; + break; + case 2: + fn_ptr = &fns[idx][1]; + break; + default: + fn_ptr = &fns[idx][2]; + } + if (!fn_ptr->function) { + std::string f_inputs_type_str = at::cuda::jit::typeName(); + std::string accum_type_str = at::cuda::jit::typeName>(); + std::string result_type_str = at::cuda::jit::typeName(); + int max_threads_codegen = max_threads/config.output_vec_size; + auto code = at::cuda::jit::generate_reduction_code(1, func, name, vt0, + f_inputs_type_str, accum_type_str, result_type_str, + true, false, config.output_vec_size, max_threads_codegen); + + *fn_ptr = at::cuda::jit::jit_pwise_function(code, "reduction_"+std::string(name)); + + } + constexpr int kernel_args = 1; + void* args[kernel_args]; + args[0] = static_cast(&reduction); + at::cuda::jit::launch_jitted_pwise_function(*fn_ptr, args, grid, block, shared_memory); +} + + class AccumulationBuffer { public: AccumulationBuffer() {} @@ -874,7 +976,7 @@ class AccumulationBuffer { }; template -int get_output_vec_size(TensorIterator &iter) { +int get_output_vec_size(const TensorIterator &iter) { int vec_size = 4; auto update_vec_size = [&vec_size](uint64_t n) { while(n % vec_size != 0) { @@ -898,61 +1000,8 @@ int get_output_vec_size(TensorIterator &iter) { return vec_size; } -template -inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t ident=0, - AccumulationBuffer* acc_buf_ptr=nullptr, int64_t base_idx=0) { - AT_ASSERT(iter.numel() > 0 && iter.ntensors() - iter.noutputs() == 1 && iter.noutputs() >= 1); - - using traits = function_traits; - using arg_t = typename traits::template arg<0>::type; - static constexpr bool can_accumulate_in_output = - std::is_convertible::value; - - bool can_use_32bit_indexing = iter.can_use_32bit_indexing(); - std::unique_ptr owned_buf_ptr; - - // The acc_buf_ptr is a shared pointer. It is create at the first entrance and - // reused by all recursive function calls. - if (acc_buf_ptr == NULL) { - // acc_buf_ptr holds buffer used for accumulation among multiple sub_iter - // when accumulation in output is not possible. 
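// Illustrative sketch, not part of the patch: launch_jitted_reduce_kernel above keeps one
// lazily-compiled kernel per (device, output vector width) in a static table, so the NVRTC
// compilation cost is paid at most once per variant. The same memoization pattern with a
// stand-in Compiled type and compile() function (both hypothetical; the real code stores
// at::cuda::jit::NvrtcFunction and maps output_vec_size 4/2/other to slots 0/1/2):
#include <array>
#include <mutex>
#include <string>
#include <vector>

struct Compiled { void* function = nullptr; };

Compiled compile(const std::string& /*src*/) {  // stand-in for the real JIT step
  static int token;
  return Compiled{&token};
}

Compiled& get_or_compile(int device, int slot, const std::string& src) {
  static std::mutex mtx;
  static std::vector<std::array<Compiled, 3>> cache;  // [device][vec-size slot]
  std::lock_guard<std::mutex> guard(mtx);
  if (cache.size() <= static_cast<size_t>(device)) cache.resize(device + 1);
  Compiled& entry = cache[static_cast<size_t>(device)][slot];
  if (entry.function == nullptr) entry = compile(src);  // compile only on first use
  return entry;
}

int main() {
  get_or_compile(0, 2, "/* kernel source */");
  get_or_compile(0, 2, "/* kernel source */");  // second call hits the cache
  return 0;
}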
- if (!can_accumulate_in_output && !can_use_32bit_indexing) { - int64_t output_memory_size = iter.element_size(0); - for (int dim = 0; dim < iter.ndim(); dim++) { - output_memory_size = std::max(output_memory_size, iter.shape()[dim] * iter.strides(0)[dim]); - } - output_memory_size /= iter.element_size(0); //iter.strides is in bytes - owned_buf_ptr.reset(new AccumulationBuffer(sizeof(arg_t), - sizeof(out_scalar_t), - (char*) iter.data_ptr(0), - output_memory_size * sizeof(arg_t))); - } else { - owned_buf_ptr.reset(new AccumulationBuffer()); - } - acc_buf_ptr = owned_buf_ptr.get(); - } - - if (!can_use_32bit_indexing) { - for (auto& sub_iter : iter.with_32bit_indexing()) { - int64_t sub_iter_base_idx = sub_iter.view_offsets()[0]; - - gpu_reduce_kernel(sub_iter, ops, ident, - acc_buf_ptr, sub_iter_base_idx); - } - return; - } - - const char* in_data = (char*)iter.data_ptr(iter.ntensors() - 1); - char* out_data = (char*)iter.data_ptr(0); - const auto noutputs = iter.noutputs(); - optional out_data_extra; - if (noutputs > 1) { - out_data_extra = (char*)iter.data_ptr(1); - } else { - out_data_extra = nullopt; - } - char* acc_data = acc_buf_ptr->get_acc_slice(out_data); - +template +ReduceConfig setReduceConfig(const TensorIterator& iter){ // Start by assuming that each thread handles a single output and all // the inputs for that output. int64_t num_outputs = iter.num_output_elements(); @@ -1080,7 +1129,64 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id config.input_mult[2] = config.split_input(config.ctas_per_output); } } + return config; +}; + +template +inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t ident=0, + AccumulationBuffer* acc_buf_ptr=nullptr, int64_t base_idx=0) { + AT_ASSERT(iter.numel() > 0 && iter.ntensors() - iter.noutputs() == 1 && iter.noutputs() >= 1); + + using traits = function_traits; + using arg_t = typename traits::template arg<0>::type; + static constexpr bool can_accumulate_in_output = + std::is_convertible::value; + + bool can_use_32bit_indexing = iter.can_use_32bit_indexing(); + std::unique_ptr owned_buf_ptr; + // The acc_buf_ptr is a shared pointer. It is create at the first entrance and + // reused by all recursive function calls. + if (acc_buf_ptr == NULL) { + // acc_buf_ptr holds buffer used for accumulation among multiple sub_iter + // when accumulation in output is not possible. 
+ if (!can_accumulate_in_output && !can_use_32bit_indexing) { + int64_t output_memory_size = iter.element_size(0); + for (int dim = 0; dim < iter.ndim(); dim++) { + output_memory_size = std::max(output_memory_size, iter.shape()[dim] * iter.strides(0)[dim]); + } + output_memory_size /= iter.element_size(0); //iter.strides is in bytes + owned_buf_ptr.reset(new AccumulationBuffer(sizeof(arg_t), + sizeof(out_scalar_t), + (char*) iter.data_ptr(0), + output_memory_size * sizeof(arg_t))); + } else { + owned_buf_ptr.reset(new AccumulationBuffer()); + } + acc_buf_ptr = owned_buf_ptr.get(); + } + + if (!can_use_32bit_indexing) { + for (auto& sub_iter : iter.with_32bit_indexing()) { + int64_t sub_iter_base_idx = sub_iter.view_offsets()[0]; + + gpu_reduce_kernel(sub_iter, ops, ident, + acc_buf_ptr, sub_iter_base_idx); + } + return; + } + + const char* in_data = (char*)iter.data_ptr(iter.ntensors() - 1); + char* out_data = (char*)iter.data_ptr(0); + const auto noutputs = iter.noutputs(); + optional out_data_extra; + if (noutputs > 1) { + out_data_extra = (char*)iter.data_ptr(1); + } else { + out_data_extra = nullopt; + } + char* acc_data = acc_buf_ptr->get_acc_slice(out_data); + ReduceConfig config = setReduceConfig(iter); at::DataPtr buffer; at::DataPtr semaphores; if (config.should_global_reduce()) { @@ -1115,4 +1221,101 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id launch_reduce_kernel::MAX_NUM_THREADS>(config, reduce); } +//TODO this is 100 lines of almost-copy-paste, because we have to have different template args for this function +//try unifying with gpu_reduce_kernel +template +inline void jitted_gpu_reduce_kernel(TensorIterator& iter, const std::string& func, ident_t ident=0, + AccumulationBuffer* acc_buf_ptr=nullptr, int64_t base_idx=0) { + AT_ASSERT(iter.numel() > 0 && iter.ntensors() - iter.noutputs() == 1 && iter.noutputs() >= 1); + + //TODO - this will be different for more complicated reductions, but for now reductions using + //func_wrapper all have arg_t = opmath + using arg_t = at::opmath_type; + static constexpr bool can_accumulate_in_output = + std::is_convertible::value; + static_assert(can_accumulate_in_output == true, "unsupported arg_t for jitted reduction"); + + bool can_use_32bit_indexing = iter.can_use_32bit_indexing(); + std::unique_ptr owned_buf_ptr; + + // The acc_buf_ptr is a shared pointer. It is create at the first entrance and + // reused by all recursive function calls. + if (acc_buf_ptr == NULL) { + // acc_buf_ptr holds buffer used for accumulation among multiple sub_iter + // when accumulation in output is not possible. 
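// Illustrative sketch, not part of the patch: the buffer sizing computed by the lines just
// below, pulled out as a standalone helper. When partial results cannot be accumulated in
// the output tensor itself and the iterator has to be split for 32-bit indexing, the
// temporary buffer must span the widest byte extent of the (possibly strided) output in any
// dimension, re-expressed as a count of the wider arg_t elements.
#include <algorithm>
#include <cstdint>
#include <vector>

int64_t acc_buffer_bytes(const std::vector<int64_t>& shape,
                         const std::vector<int64_t>& out_strides_bytes,
                         int64_t out_elem_size,    // sizeof(out_scalar_t)
                         int64_t acc_elem_size) {  // sizeof(arg_t)
  int64_t extent_bytes = out_elem_size;            // at least one output element
  for (size_t d = 0; d < shape.size(); ++d) {
    extent_bytes = std::max(extent_bytes, shape[d] * out_strides_bytes[d]);
  }
  return (extent_bytes / out_elem_size) * acc_elem_size;  // strides were in bytes
}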
+ if (!can_accumulate_in_output && !can_use_32bit_indexing) { + int64_t output_memory_size = iter.element_size(0); + for (int dim = 0; dim < iter.ndim(); dim++) { + output_memory_size = std::max(output_memory_size, iter.shape()[dim] * iter.strides(0)[dim]); + } + output_memory_size /= iter.element_size(0); //iter.strides is in bytes + owned_buf_ptr.reset(new AccumulationBuffer(sizeof(out_scalar_t), //TODO + sizeof(out_scalar_t), + (char*) iter.data_ptr(0), + output_memory_size * sizeof(out_scalar_t))); //TODO + } else { + owned_buf_ptr.reset(new AccumulationBuffer()); + } + acc_buf_ptr = owned_buf_ptr.get(); + } + + if (!can_use_32bit_indexing) { + for (auto& sub_iter : iter.with_32bit_indexing()) { + int64_t sub_iter_base_idx = sub_iter.view_offsets()[0]; + + jitted_gpu_reduce_kernel(sub_iter, func, ident, + acc_buf_ptr, sub_iter_base_idx); + } + return; + } + + //TODO - for now we support a single input, we may be able to relax this constraint + const char* in_data = (char*)iter.data_ptr(iter.ntensors() - 1); + char* out_data = (char*)iter.data_ptr(0); + const auto noutputs = iter.noutputs(); + optional out_data_extra; + if (noutputs > 1) { + out_data_extra = (char*)iter.data_ptr(1); + } else { + out_data_extra = nullopt; + } + char* acc_data = acc_buf_ptr->get_acc_slice(out_data); + + ReduceConfig config = setReduceConfig(iter); + + at::DataPtr buffer; + at::DataPtr semaphores; + if (config.should_global_reduce()) { + auto& allocator = *c10::cuda::CUDACachingAllocator::get(); + buffer = allocator.allocate(config.global_memory_size()); + semaphores = allocator.allocate(config.semaphore_size()); + + auto stream = at::cuda::getCurrentCUDAStream(); + AT_CUDA_CHECK(cudaMemsetAsync(semaphores.get(), 0, config.semaphore_size(), stream)); + } + + AT_ASSERT(can_use_32bit_indexing); + auto output_calc = make_output_calculator(iter); + auto input_calc = make_input_calculator(iter); + auto reduce = ReduceJitOp( + config, + input_calc, + output_calc, + in_data, + out_data, + out_data_extra, + acc_data, + buffer.get(), + (int*)semaphores.get(), + ident, + noutputs, + base_idx); + reduce.accumulate = iter.should_accumulate(); + reduce.final_output = iter.is_final_output(); + + launch_jitted_reduce_kernel(iter.device().index(), + config, reduce, func); +} + }} // namespace at::native diff --git a/aten/src/ATen/native/cuda/ReduceOps.cpp b/aten/src/ATen/native/cuda/ReduceOps.cpp index 472b26cbd872c7..52bc562f0612c5 100644 --- a/aten/src/ATen/native/cuda/ReduceOps.cpp +++ b/aten/src/ATen/native/cuda/ReduceOps.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include @@ -6,9 +7,24 @@ #include #include -#include +#include +#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/ReduceSumProdKernel.cu b/aten/src/ATen/native/cuda/ReduceSumProdKernel.cu index bf81ed5b794026..9faeae965feac9 100644 --- a/aten/src/ATen/native/cuda/ReduceSumProdKernel.cu +++ b/aten/src/ATen/native/cuda/ReduceSumProdKernel.cu @@ -5,6 +5,7 @@ #include #include #include +#include namespace at { namespace native { @@ -26,14 +27,28 @@ struct nansum_functor { } }; +const char op_name[] = "prod"; + template struct prod_functor { + #if AT_USE_JITERATOR() + void operator()(TensorIterator& iter) { + std::string func = jiterator_stringify( + arg_t combine(arg_t a, arg_t b) { + return a * b; + } + ); + 
jitted_gpu_reduce_kernel( + iter, func, 1.); + } + #else void operator()(TensorIterator& iter) { gpu_reduce_kernel( iter, func_wrapper([] GPU_LAMBDA(acc_t a, acc_t b) -> acc_t { return a * b; - }), 1); + }), 1.); } + #endif }; // Workaround for the error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context] diff --git a/aten/src/ATen/native/cuda/ReflectionPad.cu b/aten/src/ATen/native/cuda/ReflectionPad.cu index e497bae885f0f2..33f71368ca10bc 100644 --- a/aten/src/ATen/native/cuda/ReflectionPad.cu +++ b/aten/src/ATen/native/cuda/ReflectionPad.cu @@ -1,12 +1,27 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/cuda/Repeat.cu b/aten/src/ATen/native/cuda/Repeat.cu index 43d6602ea8e2ae..1b29dac6690f39 100644 --- a/aten/src/ATen/native/cuda/Repeat.cu +++ b/aten/src/ATen/native/cuda/Repeat.cu @@ -1,7 +1,15 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + template __global__ static void compute_cuda_kernel( index_t* repeat_ptr, @@ -33,7 +41,7 @@ static void compute_cuda( int64_t size, int64_t result_size) { int64_t block = 512; - int64_t warps_per_block = block / C10_WARP_SIZE; + int64_t warps_per_block = block / at::cuda::warp_size(); int64_t grid = std::min((size + warps_per_block - 1) / warps_per_block, 2048L); diff --git a/aten/src/ATen/native/cuda/ReplicationPadding.cu b/aten/src/ATen/native/cuda/ReplicationPadding.cu index 754161c62097cd..d967ffd0354df6 100644 --- a/aten/src/ATen/native/cuda/ReplicationPadding.cu +++ b/aten/src/ATen/native/cuda/ReplicationPadding.cu @@ -1,13 +1,26 @@ -#include +#include #include +#include #include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/Resize.cpp b/aten/src/ATen/native/cuda/Resize.cpp index c4167ec56e67a1..43e1cb95157402 100644 --- a/aten/src/ATen/native/cuda/Resize.cpp +++ b/aten/src/ATen/native/cuda/Resize.cpp @@ -1,10 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include -#include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Resize.h b/aten/src/ATen/native/cuda/Resize.h index 33ab263693dc5f..569b145fa61d99 100644 --- a/aten/src/ATen/native/cuda/Resize.h +++ b/aten/src/ATen/native/cuda/Resize.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #include @@ -9,19 +9,15 @@ namespace at { namespace native { TORCH_CUDA_CPP_API void resize_bytes_cuda(StorageImpl* storage, size_t size_bytes); -static inline void maybe_resize_storage_cuda(TensorImpl* self, uint64_t new_size) { +static inline void maybe_resize_storage_cuda(TensorImpl* self, size_t new_size_bytes) { // It does not make sense to try to resize a storage // to hold 0 elements, and this can break // if storage_offset is positive but // new_size is 0, so just bail in that case // (same comment is in Resize.h) - if (new_size == 0) { + if (self->numel() == 0) { return; } - auto 
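// Illustrative sketch, not part of the patch: the Resize.h hunk above stops reasoning in
// element counts and instead computes how many bytes the storage must hold so that every
// addressable element of a possibly strided, possibly offset view fits; that is what the
// at::detail::computeStorageNbytes call is used for. The underlying arithmetic is just
// "largest reachable offset plus one element" (assuming non-negative strides):
#include <cstdint>
#include <cstdio>
#include <vector>

int64_t storage_bytes_needed(const std::vector<int64_t>& sizes,
                             const std::vector<int64_t>& strides,  // in elements
                             int64_t itemsize,
                             int64_t storage_offset) {
  for (int64_t s : sizes) {
    if (s == 0) return 0;                      // empty tensor needs no storage
  }
  int64_t max_index = storage_offset;          // offset of element (0, ..., 0)
  for (size_t d = 0; d < sizes.size(); ++d) {
    max_index += (sizes[d] - 1) * strides[d];  // walk to the last element
  }
  return (max_index + 1) * itemsize;
}

int main() {
  // A 2x3 float view with strides {3, 1} starting at offset 4:
  // last element sits at index 4 + 1*3 + 2*1 = 9, so 10 floats = 40 bytes.
  std::printf("%lld\n",
              (long long)storage_bytes_needed({2, 3}, {3, 1}, sizeof(float), 4));
  return 0;
}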
new_size_bytes_i = (new_size + self->storage_offset()) * self->dtype().itemsize(); - TORCH_CHECK(!overflows(new_size_bytes_i), "Requested storage size (", - new_size_bytes_i, ") cannot be represented as a size_t"); - const auto new_size_bytes = static_cast(new_size_bytes_i); const Storage &storage = self->unsafe_storage(); TORCH_CHECK(storage, "Tensor: invalid null storage"); @@ -33,7 +29,7 @@ static inline void maybe_resize_storage_cuda(TensorImpl* self, uint64_t new_size inline TensorImpl* resize_impl_cuda_( TensorImpl* self, IntArrayRef size, - c10::optional stride, + at::OptionalIntArrayRef stride, bool device_guard = true) { if (self->sizes() == size && (!stride || self->strides() == stride)) { return self; @@ -45,14 +41,17 @@ inline TensorImpl* resize_impl_cuda_( guard.set_index(self->storage().device().index()); } - int64_t storage_size = 1; + const auto itemsize = self->dtype().itemsize(); + const auto storage_offset = self->storage_offset(); + size_t storage_size = 1; if (stride) { self->set_sizes_and_strides(size, *stride); - // NB: storage size can be different from numel. - storage_size = storage_size_for(size, *stride); + storage_size = at::detail::computeStorageNbytes( + size, *stride, itemsize, storage_offset); } else { self->set_sizes_contiguous(size); - storage_size = self->numel(); + storage_size = at::detail::computeStorageNbytesContiguous( + size, itemsize, storage_offset); } maybe_resize_storage_cuda(self, storage_size); diff --git a/aten/src/ATen/native/cuda/RreluWithNoise.cu b/aten/src/ATen/native/cuda/RreluWithNoise.cu index b73097758fd75f..3b2435d3dae420 100644 --- a/aten/src/ATen/native/cuda/RreluWithNoise.cu +++ b/aten/src/ATen/native/cuda/RreluWithNoise.cu @@ -1,8 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + + namespace at { namespace native { template diff --git a/aten/src/ATen/native/cuda/ScanKernels.cpp b/aten/src/ATen/native/cuda/ScanKernels.cpp index f88faa1fcac9e3..8ba8b742af7714 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cpp +++ b/aten/src/ATen/native/cuda/ScanKernels.cpp @@ -1,10 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { static c10::MaybeOwned contiguous_out_arg(const Tensor &tensor) { diff --git a/aten/src/ATen/native/cuda/ScanKernels.h b/aten/src/ATen/native/cuda/ScanKernels.h index a502847f63075c..28e65372511bc7 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.h +++ b/aten/src/ATen/native/cuda/ScanKernels.h @@ -1,3 +1,4 @@ +#pragma once #include namespace at { diff --git a/aten/src/ATen/native/cuda/ScatterGatherKernel.cu b/aten/src/ATen/native/cuda/ScatterGatherKernel.cu index 4ec12e166634a3..e80ec7def9611f 100644 --- a/aten/src/ATen/native/cuda/ScatterGatherKernel.cu +++ b/aten/src/ATen/native/cuda/ScatterGatherKernel.cu @@ -1,6 +1,7 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include #include @@ -34,6 +35,33 @@ public: }; static ReduceAdd reduce_add; +class ReduceMean { +public: + template + constexpr C10_DEVICE void operator() (scalar_t * self_data, const scalar_t * src_data) const { + gpuAtomicAddNoReturn(self_data, *src_data); + } +}; +static ReduceMean reduce_mean; + +class ReduceMinimum { +public: + template + constexpr C10_DEVICE 
void operator() (scalar_t * self_data, const scalar_t * src_data) const { + gpuAtomicMin(self_data, *src_data); + } +}; +static ReduceMinimum reduce_minimum; + +class ReduceMaximum { +public: + template + constexpr C10_DEVICE void operator() (scalar_t * self_data, const scalar_t * src_data) const { + gpuAtomicMax(self_data, *src_data); + } +}; +static ReduceMaximum reduce_maximum; + class TensorAssign { public: template @@ -126,12 +154,11 @@ struct _cuda_scatter_gather_internal_kernel { template struct cuda_scatter_gather_base_kernel { - template void operator()( const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src, const std::string& method_name, - const func_t& f + const ReduceAdd& f ) { at::assert_no_internal_overlap(self); @@ -189,7 +216,66 @@ struct cuda_scatter_gather_base_kernel { const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src, const std::string& method_name, - const ReduceMultiply& f + const TensorAssign& f + ) { + at::assert_no_internal_overlap(self); + + auto index_sizes = ensure_nonempty_vec(index.sizes().vec()); + auto self_strides = ensure_nonempty_vec(self.strides().vec()); + auto src_strides = ensure_nonempty_vec(src.strides().vec()); + + // restride self and src such that + // self.shape = src.shape = index.shape + // + // restride stride[dim] such that + // if (is_scatter_like) self.stride[dim] = 0 + // else src.stride[dim] = 0 + auto self_restrided = is_scatter_like ? + restride_dim(self, dim, index_sizes) + : self.as_strided(index_sizes, self_strides); + auto src_restrided = is_scatter_like ? + src.as_strided(index_sizes, src_strides) + : restride_dim(src, dim, index_sizes); + + auto iter = TensorIteratorConfig() + .set_check_mem_overlap(false) + .check_all_same_dtype(false) + .resize_outputs(false) + .add_output(self_restrided) + .add_input(src_restrided) + .add_input(index) + .build(); + + auto self_dim_stride = ensure_nonempty_stride(self, dim); + auto self_dim_size = ensure_nonempty_size(self, dim); + + auto src_dim_stride = ensure_nonempty_stride(src, dim); + auto src_dim_size = ensure_nonempty_size(src, dim); + + auto index_size = is_scatter_like ? self_dim_size : src_dim_size; + auto index_stride = is_scatter_like ? 
self_dim_stride : src_dim_stride; + + + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( + at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, + iter.dtype(), + "cuda_scatter_gather_base_kernel_func", [&] { + using dtype = typename std::conditional, scalar_t>::type; + + _cuda_scatter_gather_internal_kernel()( + iter, index_size, index_stride, f + ); + } + ); + } + + template + void operator()( + const Tensor& self, int64_t dim, + const Tensor& index, const Tensor& src, + const std::string& method_name, + const func_t& f ) { at::assert_no_internal_overlap(self); @@ -232,7 +318,7 @@ struct cuda_scatter_gather_base_kernel { AT_DISPATCH_FLOATING_TYPES_AND2( at::ScalarType::Half, at::ScalarType::BFloat16, iter.dtype(), - "cuda_scatter_gather_base_kernel_reduce_multiply", [&] { + "cuda_scatter_gather_base_kernel_func", [&] { using dtype = typename std::conditional, scalar_t>::type; @@ -416,6 +502,34 @@ void scatter_reduce_cuda_kernel(const Tensor& self, const int64_t dim, const Ten cuda_scatter_gather_base_kernel()(self, dim, index, src, "scatter_reduce_cuda_multiply_", reduce_multiply); break; + default : + break; + } +} + +void scatter_reduce_two_cuda_kernel(const Tensor& self, const int64_t dim, const Tensor& index, + const Tensor& src, const SCATTER_GATHER_OP& reduce) { + switch (reduce) { + case SCATTER_GATHER_OP::REDUCE_ADD : + cuda_scatter_gather_base_kernel()(self, dim, index, src, + "scatter_reduce_cuda_sum_", reduce_add); + break; + case SCATTER_GATHER_OP::REDUCE_MULTIPLY : + cuda_scatter_gather_base_kernel()(self, dim, index, src, + "scatter_reduce_cuda_prod_", reduce_multiply); + break; + case SCATTER_GATHER_OP::REDUCE_MAXIMUM : + cuda_scatter_gather_base_kernel()(self, dim, index, src, + "scatter_reduce_cuda_amax_", reduce_maximum); + break; + case SCATTER_GATHER_OP::REDUCE_MINIMUM : + cuda_scatter_gather_base_kernel()(self, dim, index, src, + "scatter_reduce_cuda_amin_", reduce_minimum); + break; + case SCATTER_GATHER_OP::REDUCE_MEAN : + cuda_scatter_gather_base_kernel()(self, dim, index, src, + "scatter_reduce_cuda_mean_", reduce_mean); + break; } } @@ -430,6 +544,8 @@ void scatter_scalar_reduce_cuda_kernel(const Tensor& self, const int64_t dim, co cuda_scatter_fill_base_kernel()(self, dim, index, value, "scatter_fill_cuda_multiply_", reduce_multiply); break; + default : + break; } } @@ -440,5 +556,6 @@ REGISTER_DISPATCH(scatter_fill_stub, &scatter_fill_cuda_kernel); REGISTER_DISPATCH(scatter_add_stub, &scatter_add_cuda_kernel); REGISTER_DISPATCH(scatter_reduce_stub, &scatter_reduce_cuda_kernel); REGISTER_DISPATCH(scatter_scalar_reduce_stub, &scatter_scalar_reduce_cuda_kernel); +REGISTER_DISPATCH(scatter_reduce_two_stub, &scatter_reduce_two_cuda_kernel); }} // namespace at::native diff --git a/aten/src/ATen/native/cuda/SegmentReduce.cu b/aten/src/ATen/native/cuda/SegmentReduce.cu index 6a5a768ae0d89d..862de29c76cbbd 100644 --- a/aten/src/ATen/native/cuda/SegmentReduce.cu +++ b/aten/src/ATen/native/cuda/SegmentReduce.cu @@ -1,12 +1,20 @@ - +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Shape.cu b/aten/src/ATen/native/cuda/Shape.cu index 17eb9197307595..590761ad690483 100644 --- a/aten/src/ATen/native/cuda/Shape.cu +++ b/aten/src/ATen/native/cuda/Shape.cu @@ -1,4 +1,5 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include 
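// Illustrative sketch, not part of the patch: a serial CPU reference for what the new
// scatter_reduce_two_cuda_kernel dispatch above does per element, restricted to 1-D
// tensors and dim 0. Each src value is folded into self[index[i]] with the chosen
// reduction; the CUDA path uses atomics (gpuAtomicAdd/Min/Max) instead of this loop, and
// "mean" only accumulates sums here (the divide-by-count step presumably happens outside
// this kernel).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Reduce { Sum, Prod, Amax, Amin, Mean };

void scatter_reduce_1d(std::vector<float>& self,
                       const std::vector<int64_t>& index,
                       const std::vector<float>& src,
                       Reduce op) {
  for (size_t i = 0; i < src.size(); ++i) {
    float& slot = self[index[i]];
    switch (op) {
      case Reduce::Sum:
      case Reduce::Mean: slot += src[i]; break;               // mean = sum now, divide later
      case Reduce::Prod: slot *= src[i]; break;
      case Reduce::Amax: slot = std::max(slot, src[i]); break;
      case Reduce::Amin: slot = std::min(slot, src[i]); break;
    }
  }
}

int main() {
  std::vector<float> self = {0, 0, 0};
  scatter_reduce_1d(self, {0, 0, 2, 1}, {1, 2, 3, 4}, Reduce::Sum);
  std::printf("%g %g %g\n", self[0], self[1], self[2]);  // 3 4 3
  return 0;
}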
#include @@ -9,14 +10,22 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { -#if defined(USE_ROCM) -constexpr int CAT_ARRAY_BATCH_SIZE = 1024; -#else constexpr int CAT_ARRAY_BATCH_SIZE = 128; -#endif constexpr int CAT_ARRAY_MAX_INPUT_DIMS = 4; namespace { @@ -83,45 +92,6 @@ struct TensorSizeStride { */ -// Use pinned memory and and pass the struct by pointer on ROCm -template -struct CatArrInputTensor { - T* input; - IndexType offset; - IndexType dimSize; - IndexType nElements; -}; - -template -C10_LAUNCH_BOUNDS_1(512) -__global__ void HIP_CatArrayBatchedCopy( - T* output, - CatArrInputTensor* inputs, - TensorSizeStride os, - const int concatDim, - IndexType dimStride) { - - IndexType tid = blockIdx.x * blockDim.x + threadIdx.x; - IndexType nElements = inputs[blockIdx.y].nElements; - - if(tid >= nElements) return; - - T* data = inputs[blockIdx.y].input; - IndexType offset = inputs[blockIdx.y].offset; - IndexType dimSize = inputs[blockIdx.y].dimSize; - IndexType dataOffset = offset * dimStride; - - IndexType stride = gridDim.x * blockDim.x; - - while( tid < nElements){ - IndexType elementOffset = CatArrIndexToOffset::compute( - os.tensorSize, os.tensorStride, dimSize, concatDim, tid); - output[dataOffset + elementOffset] = data[tid]; - - tid += stride; - } -} - // pass meta data directly through kernel argument instead of pin memory // In contiguous case, we will not need stride_size, setting it as 1 as placeholder // to pass compile. @@ -171,127 +141,6 @@ __global__ void CatArrayBatchedCopy( } } -template -void hip_parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension, - int nDims, c10::MemoryFormat memory_format) { - // First, let's set up our kernel parameters. We start with a raw pointer to - // the storage for the output Tensor. - scalar_t *data = out.data_ptr(); - - // Kernel Parameter - long tensorMetadataSize = - sizeof(CatArrInputTensor) * CAT_ARRAY_BATCH_SIZE; - auto d_inputs_storage = at::empty( - {tensorMetadataSize}, out.options().dtype(at::kByte)); - auto d_inputs = static_cast *>( - d_inputs_storage.data_ptr()); - - TensorSizeStride outputParam; - - // Next, let's initialize the size, stride arrays for the output Tensor. 
- if (memory_format == c10::MemoryFormat::Contiguous) { - for (int i = 0; i < nDims; ++i) { - outputParam.tensorSize[i] = at::native::size(out, i); - outputParam.tensorStride[i] = out.stride(i); - } - } else if (memory_format == c10::MemoryFormat::ChannelsLast || memory_format == c10::MemoryFormat::ChannelsLast3d) { - // permute the semantics of dims from NCHW to NHWC so that the input - // tensor is now contiguous - outputParam.tensorSize[0] = at::native::size(out, 0); - outputParam.tensorStride[0] = out.stride(0); - for (int i = 1; i < nDims - 1; ++i) { - outputParam.tensorSize[i] = at::native::size(out, i + 1); - outputParam.tensorStride[i] = out.stride(i + 1); - } - outputParam.tensorSize[nDims - 1] = at::native::size(out, 1); - outputParam.tensorStride[nDims - 1] = out.stride(1); - } else { - TORCH_CHECK(false, "unsupported memory format"); - } - - at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); - - // Now we loop - int batchCounter = 0; - int64_t offset = 0; - for (int i = 0; i < inputs.size() ; i += CAT_ARRAY_BATCH_SIZE) { - // Re-allocate stackInputs every iteration to avoid read-after-write hazard - { - auto stackInputs_storage = at::empty({tensorMetadataSize}, - out.options().dtype(at::kByte).device(at::kCPU).pinned_memory(true)); - auto stackInputs = - static_cast *>( - stackInputs_storage.data_ptr()); - for (batchCounter = 0; - batchCounter < CAT_ARRAY_BATCH_SIZE && - (i+batchCounter) < inputs.size(); - ++batchCounter) { - int64_t dimSize = 0; - // There is a legacy case where a 1-D empty tensor can be concat with - // high-dimensional tensor - if (inputs[i+batchCounter].numel() > 0) { - dimSize = at::native::size(inputs[i+batchCounter], dimension); - } - - stackInputs[batchCounter].input = - inputs[i+batchCounter].data_ptr(); - stackInputs[batchCounter].offset = offset; - stackInputs[batchCounter].dimSize = dimSize; - stackInputs[batchCounter].nElements = inputs[i+batchCounter].numel(); - - // update offset - offset += dimSize; - } - at::native::copy_(d_inputs_storage, stackInputs_storage, - /* non_blocking= */ true); - } - - // Next, let's consider how we set our kernel launch parameters. - // We borrow from THCApply, which the kernel's internal indexing - // is based on. - dim3 applyBlock = dim3(32*16); - - //Get grid where x dim fills half gpu and y dim is number of tensors. - //This will have cating two tensors fill the entire grid, but prevent - //many threads from needlessly load meta data if their sizes is small. - dim3 catGrid; - getCatGrid(batchCounter, catGrid); - - if (memory_format != c10::MemoryFormat::Contiguous) { - switch (dimension) { - case 0: - break; - case 1: - dimension = nDims - dimension; - break; - default: - dimension--; - } - } - // Template Declarations for dim = 1, 2, 3, 4 -#define HANDLE_CASE(DIMS) \ - HIP_CatArrayBatchedCopy<<<\ - catGrid, applyBlock, 0, stream.stream()>>>(\ - data, d_inputs, outputParam, dimension, outputParam.tensorStride[dimension]); \ - C10_CUDA_KERNEL_LAUNCH_CHECK(); - switch (nDims) { - case 1: - HANDLE_CASE(1); - break; - case 2: - HANDLE_CASE(2); - break; - case 3: - HANDLE_CASE(3); - break; - case 4: - HANDLE_CASE(4); - break; - } -#undef HANDLE_CASE - } -} - template void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension, int nDims, c10::MemoryFormat memory_format) { @@ -304,19 +153,19 @@ void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension, // Next, let's initialize the size, stride arrays for the output Tensor. 
if (memory_format == c10::MemoryFormat::Contiguous) { for (int i = 0; i < nDims; ++i) { - outputParam.tensorSize[i] = at::native::size(out, i); + outputParam.tensorSize[i] = out.size(i); outputParam.tensorStride[i] = out.stride(i); } } else if (memory_format == c10::MemoryFormat::ChannelsLast || memory_format == c10::MemoryFormat::ChannelsLast3d) { // permute the semantics of dims from NCHW to NHWC so that the input // tensor is now contiguous - outputParam.tensorSize[0] = at::native::size(out, 0); + outputParam.tensorSize[0] = out.size(0); outputParam.tensorStride[0] = out.stride(0); for (int i = 1; i < nDims - 1; ++i) { - outputParam.tensorSize[i] = at::native::size(out, i + 1); + outputParam.tensorSize[i] = out.size(i + 1); outputParam.tensorStride[i] = out.stride(i + 1); } - outputParam.tensorSize[nDims - 1] = at::native::size(out, 1); + outputParam.tensorSize[nDims - 1] = out.size(1); outputParam.tensorStride[nDims - 1] = out.stride(1); } else { TORCH_CHECK(false, "unsupported memory format"); @@ -336,7 +185,7 @@ void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension, // There is a legacy case where a 1-D empty tensor can be concat with // high-dimensional tensor if (inputs[i+batchCounter].numel() > 0) { - dimSize = at::native::size(inputs[i+batchCounter], dimension); + dimSize = inputs[i+batchCounter].size(dimension); } catMetaData.input[batchCounter] = inputs[i+batchCounter].data_ptr(); catMetaData.offset[batchCounter] = offset; @@ -440,7 +289,7 @@ Tensor& cat_out_cuda(TensorList inputs, int64_t dimension, Tensor& out) { // (i.e. other empty sizes are not skipped). // FIXME: warn if this is the case auto should_skip = [](const Tensor &t) { - return t.dim() == 1 && at::native::size(t, 0) == 0; + return t.dim() == 1 && t.size(0) == 0; }; const Tensor *notSkippedTensor = NULL; // non-owning reference @@ -502,7 +351,7 @@ Tensor& cat_out_cuda(TensorList inputs, int64_t dimension, Tensor& out) { continue; } check_cat_shape_except_dim(*notSkippedTensor, tensor, dimension, i); - cat_dim_size += at::native::size(tensor, dimension); + cat_dim_size += tensor.size(dimension); } // Compute the size of the result @@ -546,19 +395,6 @@ Tensor& cat_out_cuda(TensorList inputs, int64_t dimension, Tensor& out) { }); allSameType = allSameType && (out.scalar_type() == firstType); -#if defined(USE_ROCM) - if (inputs.size() > 1 && - out.dim() <= CAT_ARRAY_MAX_INPUT_DIMS && - at::cuda::detail::canUse32BitIndexMath(out) && - allContiguous && - all32BitIndexable && - allSameType) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, - out.scalar_type(), "cat_cuda", [&]() { - hip_parallel_cat(out, inputs, dimension, nDims, memory_format); - }); -#else // We support the contiguous inputs and non-contiguous input (<=4 dims) in different ways // For contiguous input, we don't need to pass stride meta data to cuda kernel through constant // memory. Therefore, we could pass more inputs to cuda threads. 
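The comment above summarizes the strategy of the batched copy path kept in this file: per-input pointers and offsets are packed into a plain struct and passed by value as a kernel argument, so contiguous inputs need no pinned-memory staging and no constant-memory stride tables. A minimal sketch of that idea follows; InputMeta, BATCH, and copy_batch are illustrative names, not the kernel in this patch.

// Illustrative sketch only: metadata for a batch of inputs travels in the
// kernel-argument buffer rather than via pinned host memory.
constexpr int BATCH = 128;  // mirrors CAT_ARRAY_BATCH_SIZE above

template <typename T, typename IndexType>
struct InputMeta {
  const T* input[BATCH];
  IndexType offset[BATCH];     // start offset along the cat dimension
  IndexType nElements[BATCH];  // numel of each input
};

template <typename T, typename IndexType>
__global__ void copy_batch(T* out, InputMeta<T, IndexType> meta, IndexType dimStride) {
  const T* in = meta.input[blockIdx.y];
  IndexType n = meta.nElements[blockIdx.y];
  IndexType base = meta.offset[blockIdx.y] * dimStride;
  IndexType stride = gridDim.x * blockDim.x;
  for (IndexType i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    out[base + i] = in[i];  // contiguous case: plain grid-stride copy at an offset
  }
}

Because the struct is a kernel argument, its size is bounded by the kernel-parameter limit, which is one reason the inputs are processed in fixed-size chunks of BATCH rather than all at once.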
@@ -570,8 +406,8 @@ Tensor& cat_out_cuda(TensorList inputs, int64_t dimension, Tensor& out) { allContiguous && all32BitIndexable && allSameType) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, kHalf, kBool, kBFloat16, out.scalar_type(), "cat_cuda", [&]() { parallel_cat(out, inputs, dimension, nDims, memory_format); }); @@ -582,18 +418,17 @@ Tensor& cat_out_cuda(TensorList inputs, int64_t dimension, Tensor& out) { all32BitIndexable && allSameType && memory_format == c10::MemoryFormat::Contiguous) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, kHalf, kBool, kBFloat16, out.scalar_type(), "cat_cuda", [&]() { parallel_cat(out, inputs, dimension, nDims, memory_format); }); -#endif } else { int64_t offset = 0; for (int j = 0; j < inputs.size(); j++) { if (should_skip(inputs[j])) continue; - int64_t dimSize = at::native::size(inputs[j], dimension); + int64_t dimSize = inputs[j].size(dimension); Tensor nt = at::narrow(out, dimension, offset, dimSize); copy_(nt, inputs[j]); offset += dimSize; diff --git a/aten/src/ATen/native/cuda/SoftMax.cu b/aten/src/ATen/native/cuda/SoftMax.cu index 181fbb994c3fda..8c12e034ba48e6 100644 --- a/aten/src/ATen/native/cuda/SoftMax.cu +++ b/aten/src/ATen/native/cuda/SoftMax.cu @@ -1,7 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include -#include +#include #include #include @@ -13,6 +15,18 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -153,7 +167,7 @@ inline dim3 SoftMax_getBlockSize(int ILP, uint64_t dim_size) { while (block_size < (max_block_size)) block_size *= 2; // Launch at least a single warp - the kernel assumes that. 
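Several hunks in this patch make the same substitution as the two lines below: the compile-time C10_WARP_SIZE macro is replaced by a runtime at::cuda::warp_size() query, because the warp/wavefront width depends on the device (32 on NVIDIA GPUs, 64 on many AMD GPUs). A small helper in that style might look like the following sketch, assuming warp_size() is available from ATen/cuda/CUDAContext.h:

#include <ATen/cuda/CUDAContext.h>

// Round a thread count up to a whole number of warps, using the runtime
// warp size of the current device instead of a compile-time constant.
inline int64_t round_up_to_warp(int64_t threads) {
  const int64_t warp = at::cuda::warp_size();  // queried per device
  return ((threads + warp - 1) / warp) * warp;
}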
- block_size = std::max(block_size, static_cast(C10_WARP_SIZE)); + block_size = std::max(block_size, static_cast(at::cuda::warp_size())); return dim3(block_size); } @@ -959,8 +973,7 @@ Tensor masked_softmax_cuda(const Tensor& input, const Tensor& mask) { input.scalar_type(), "masked_softmax", [&] { - Tensor mask_not = mask.logical_not(); - output = at::softmax(input.masked_fill(mask_not, -std::numeric_limits::infinity()), -1); + output = at::softmax(input.masked_fill(mask, -std::numeric_limits::infinity()), -1); }); return output; } diff --git a/aten/src/ATen/native/cuda/Sort.cpp b/aten/src/ATen/native/cuda/Sort.cpp index 8bb7d93bfdb551..21f77f7050649b 100644 --- a/aten/src/ATen/native/cuda/Sort.cpp +++ b/aten/src/ATen/native/cuda/Sort.cpp @@ -1,11 +1,23 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #include -#include -#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/SortImpl.cu b/aten/src/ATen/native/cuda/SortImpl.cu index a806c4a138746d..c6e29262046e8e 100644 --- a/aten/src/ATen/native/cuda/SortImpl.cu +++ b/aten/src/ATen/native/cuda/SortImpl.cu @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Sorting.cpp b/aten/src/ATen/native/cuda/Sorting.cpp index f92c4778051837..97b8df55416e23 100644 --- a/aten/src/ATen/native/cuda/Sorting.cpp +++ b/aten/src/ATen/native/cuda/Sorting.cpp @@ -1,13 +1,27 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include +#include +#include +#include +#include #include +#include #include #include + #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/Sorting.cu b/aten/src/ATen/native/cuda/Sorting.cu index d72788c1b97c79..52fa2710596d4b 100644 --- a/aten/src/ATen/native/cuda/Sorting.cu +++ b/aten/src/ATen/native/cuda/Sorting.cu @@ -5,6 +5,7 @@ #include #include #include +#include #include #include #include @@ -189,7 +190,7 @@ struct KthValueLauncher { } dim3 block(std::min( - round_up(slice_size, (int64_t)C10_WARP_SIZE), (int64_t)1024)); + round_up(slice_size, (int64_t)at::cuda::warp_size()), (int64_t)1024)); auto stream = at::cuda::getCurrentCUDAStream(); gatherKthValue<<>>( self_info, @@ -228,7 +229,7 @@ struct MedianLauncher { } dim3 block(std::min( - round_up(slice_size, (int64_t)C10_WARP_SIZE), (int64_t)1024)); + round_up(slice_size, (int64_t)at::cuda::warp_size()), (int64_t)1024)); auto stream = at::cuda::getCurrentCUDAStream(); gatherMedian<<>>( values_info, diff --git a/aten/src/ATen/native/cuda/SparseMM.cu b/aten/src/ATen/native/cuda/SparseMM.cu index 0cc3fe3806a072..922efa5f4fcb5d 100644 --- a/aten/src/ATen/native/cuda/SparseMM.cu +++ b/aten/src/ATen/native/cuda/SparseMM.cu @@ -1,7 +1,13 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { // sparse, sparse, sparse, dense, real, real -> sparse Tensor& _sspaddmm_out_only_sparse_cuda(const Tensor& self, diff --git a/aten/src/ATen/native/cuda/SpectralOps.cpp b/aten/src/ATen/native/cuda/SpectralOps.cpp index f431e1e31cb47a..b418e8ffc8abb2 100644 --- 
a/aten/src/ATen/native/cuda/SpectralOps.cpp +++ b/aten/src/ATen/native/cuda/SpectralOps.cpp @@ -1,19 +1,28 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include -#include -#include -#include -#include +#include +#include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/cuda/SpectralOps.cu b/aten/src/ATen/native/cuda/SpectralOps.cu index 4a91f58e61ec43..df51fe46afea68 100644 --- a/aten/src/ATen/native/cuda/SpectralOps.cu +++ b/aten/src/ATen/native/cuda/SpectralOps.cu @@ -1,19 +1,11 @@ -#include +#define TORCH_ASSERT_NO_OPERATORS #include #include #include -#include -#include #include #include #include -#include -#include #include -#include -#include -#include - #include #include @@ -21,8 +13,6 @@ namespace at { namespace native { -using namespace at::native::detail; - // Offset calculator for indexing in Hermitian mirrored order. // In mirrored dims, maps linear index i to (n - i) % n template diff --git a/aten/src/ATen/native/cuda/SummaryOps.cu b/aten/src/ATen/native/cuda/SummaryOps.cu index 4b47d0c9cd90a3..9877e8cf7c3c7e 100644 --- a/aten/src/ATen/native/cuda/SummaryOps.cu +++ b/aten/src/ATen/native/cuda/SummaryOps.cu @@ -1,10 +1,22 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace cuda { #define THRESH_NUMBER_BINS_FOR_MULTI_BLOCK_MEM 100 diff --git a/aten/src/ATen/native/cuda/TensorCompare.cpp b/aten/src/ATen/native/cuda/TensorCompare.cpp index 5d2c84fdaca5a9..b99df69f3b2aa2 100644 --- a/aten/src/ATen/native/cuda/TensorCompare.cpp +++ b/aten/src/ATen/native/cuda/TensorCompare.cpp @@ -1,4 +1,5 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/TensorFactories.cu b/aten/src/ATen/native/cuda/TensorFactories.cu index 29bd7adce5a0f0..f442c9c9f4e16f 100644 --- a/aten/src/ATen/native/cuda/TensorFactories.cu +++ b/aten/src/ATen/native/cuda/TensorFactories.cu @@ -1,14 +1,29 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include +#include #include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/TensorModeKernel.cpp b/aten/src/ATen/native/cuda/TensorModeKernel.cpp index 73ae5f3199b9ab..c04693bb72e215 100644 --- a/aten/src/ATen/native/cuda/TensorModeKernel.cpp +++ b/aten/src/ATen/native/cuda/TensorModeKernel.cpp @@ -1,5 +1,5 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include #include #include #include diff --git a/aten/src/ATen/native/cuda/TensorModeKernel.cu b/aten/src/ATen/native/cuda/TensorModeKernel.cu index 40a8e19eb44502..c62e68a9041675 100644 --- a/aten/src/ATen/native/cuda/TensorModeKernel.cu +++ b/aten/src/ATen/native/cuda/TensorModeKernel.cu @@ -142,7 +142,8 @@ void handle_fused_mode( int64_t slice_size, int64_t slices) { constexpr int num_threads = size / 2; - static_assert(num_threads % C10_WARP_SIZE == 0 && + int warp_size = at::cuda::warp_size(); + TORCH_INTERNAL_ASSERT(num_threads % 
warp_size == 0 && num_threads <= cuda_utils::kCUDABlockReduceMaxThreads, ""); const auto memsize = (sizeof(scalar_t) * size) + (2 * size * sizeof(unsigned int)); @@ -191,15 +192,9 @@ void fused_mode( case 16: case 8: case 4: - case 2: { - if (ceilPowerOf2 > 2 * C10_WARP_SIZE) { - handle_fused_mode<128, scalar_t>( - grid, self, ti_values, ti_indices, slice_size, slices); - } else { - handle_fused_mode<2 * C10_WARP_SIZE, scalar_t>( - grid, self, ti_values, ti_indices, slice_size, slices); - } - } + case 2: + handle_fused_mode<128, scalar_t>( + grid, self, ti_values, ti_indices, slice_size, slices); break; case 1: default: diff --git a/aten/src/ATen/native/cuda/TensorShapeCUDA.cpp b/aten/src/ATen/native/cuda/TensorShapeCUDA.cpp index cc1c523dc1a341..0bb7eb410acf3a 100644 --- a/aten/src/ATen/native/cuda/TensorShapeCUDA.cpp +++ b/aten/src/ATen/native/cuda/TensorShapeCUDA.cpp @@ -1,9 +1,15 @@ - -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { @@ -27,8 +33,8 @@ Tensor& set_storage_cuda_(Tensor& result, Storage storage, int64_t storage_offse checkSetStorage(result, storage, storage_offset, size, stride); result.unsafeGetTensorImpl()->set_storage_offset(storage_offset); - c10::optional stride_opt = stride.data() != nullptr ? - c10::optional(stride) : c10::nullopt; + at::OptionalIntArrayRef stride_opt = stride.data() != nullptr ? + at::OptionalIntArrayRef(stride) : c10::nullopt; at::native::resize_impl_cuda_(result.unsafeGetTensorImpl(), size, stride_opt); return result; } diff --git a/aten/src/ATen/native/cuda/TensorTopK.cpp b/aten/src/ATen/native/cuda/TensorTopK.cpp index 392b3ce25ce2d5..fcd155c2c7fe3e 100644 --- a/aten/src/ATen/native/cuda/TensorTopK.cpp +++ b/aten/src/ATen/native/cuda/TensorTopK.cpp @@ -1,9 +1,21 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include + +#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/TensorTopK.cu b/aten/src/ATen/native/cuda/TensorTopK.cu index 7980619a786471..9e1e717903dac4 100644 --- a/aten/src/ATen/native/cuda/TensorTopK.cu +++ b/aten/src/ATen/native/cuda/TensorTopK.cu @@ -189,7 +189,8 @@ void launch( dim3 grid; TORCH_INTERNAL_ASSERT(getGridFromTiles(numInputSlices, grid), "Too many slices for topk"); - dim3 block(std::min(at::ceil_div((int64_t)inputSliceSize, (int64_t)C10_WARP_SIZE) * (int64_t)C10_WARP_SIZE, (int64_t)1024)); + int warp_size = at::cuda::warp_size(); + dim3 block(std::min(at::ceil_div((int64_t)inputSliceSize, (int64_t)warp_size) * (int64_t)warp_size, (int64_t)1024)); gatherTopK<<>>( input, inputSliceSize, @@ -472,7 +473,8 @@ void launch( { dim3 grid; TORCH_INTERNAL_ASSERT(getGridFromTiles(numInputSlices, grid), "Too many slices for topk"); - dim3 block(std::min(at::ceil_div((int64_t)inputSliceSize, (int64_t)C10_WARP_SIZE) * (int64_t)C10_WARP_SIZE, (int64_t)1024)); + int warp_size = at::cuda::warp_size(); + dim3 block(std::min(at::ceil_div((int64_t)inputSliceSize, (int64_t)warp_size) * (int64_t)warp_size, (int64_t)1024)); sbtopk::gatherTopK<<>>( input, inputSliceSize, diff --git a/aten/src/ATen/native/cuda/TensorTransformations.cu b/aten/src/ATen/native/cuda/TensorTransformations.cu index d46a5613df78cb..335d746294d0df 100644 --- a/aten/src/ATen/native/cuda/TensorTransformations.cu +++ 
b/aten/src/ATen/native/cuda/TensorTransformations.cu @@ -1,11 +1,20 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/cuda/TriangularOps.cu b/aten/src/ATen/native/cuda/TriangularOps.cu index 1e264a0890787e..2d7bf30309dc86 100644 --- a/aten/src/ATen/native/cuda/TriangularOps.cu +++ b/aten/src/ATen/native/cuda/TriangularOps.cu @@ -11,6 +11,7 @@ #include #else #include +#include #include #include #include diff --git a/aten/src/ATen/native/cuda/UnaryLogKernels.cu b/aten/src/ATen/native/cuda/UnaryLogKernels.cu index 47f88383de428a..0f9eb26aba2d16 100644 --- a/aten/src/ATen/native/cuda/UnaryLogKernels.cu +++ b/aten/src/ATen/native/cuda/UnaryLogKernels.cu @@ -4,26 +4,70 @@ #include #include #include +#include +#include #include #include #include namespace at { namespace native { +const char log_name[] = "log_kernel"; void log_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "log_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return ::log(a); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { +#if AT_USE_JITERATOR() + static const auto log_string = jiterator_stringify( + template T log_kernel(T x) { return std::log(x); }); + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "log_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/log_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, log_string); }); - }); +#else + AT_DISPATCH_COMPLEX_TYPES(iter.common_dtype(), "log_cuda", [&]() { + gpu_kernel( + iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { return ::log(a); }); + }); +#endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "log_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return ::log(a); + }); + }); + } } +const char log10_name[] = "log10_kernel"; void log10_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "log10_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return ::log10(a); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { +#if AT_USE_JITERATOR() + static const auto log10_string = jiterator_stringify( + template T log10_kernel(T x) { return std::log10(x); }); + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "log10_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/log10_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, log10_string); }); - }); +#else + AT_DISPATCH_COMPLEX_TYPES(iter.common_dtype(), "log10_cuda", [&]() { + gpu_kernel( + iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { return ::log10(a); }); + }); +#endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "log10_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return ::log10(a); + }); + }); + } } void log1p_kernel_cuda(TensorIteratorBase& iter) { @@ -34,12 +78,33 @@ void log1p_kernel_cuda(TensorIteratorBase& iter) { }); } +const char log2_name[] = "log2_kernel"; void log2_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, 
iter.common_dtype(), "log2_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return ::log2(a); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { +#if AT_USE_JITERATOR() + static const auto log2_string = jiterator_stringify( + template T log2_kernel(T x) { return std::log2(x); }); + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "log2_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/log2_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, log2_string); }); - }); +#else + AT_DISPATCH_COMPLEX_TYPES(iter.common_dtype(), "log2_cuda", [&]() { + gpu_kernel( + iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { return ::log2(a); }); + }); +#endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "log2_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return ::log2(a); + }); + }); + } } REGISTER_DISPATCH(log_stub, &log_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/UnaryOpsKernel.cu b/aten/src/ATen/native/cuda/UnaryOpsKernel.cu index 671ce1d6cbcdfc..303170690b423c 100644 --- a/aten/src/ATen/native/cuda/UnaryOpsKernel.cu +++ b/aten/src/ATen/native/cuda/UnaryOpsKernel.cu @@ -8,6 +8,8 @@ #include #include #include +#include +#include #include #include #include @@ -32,12 +34,37 @@ void bitwise_not_kernel_cuda(TensorIteratorBase& iter) { } } +const char exp_name[] = "exp_kernel"; void exp_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, iter.common_dtype(), "exp_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return std::exp(a); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { + #if AT_USE_JITERATOR() + static const auto exp_string = jiterator_stringify( + template + T exp_kernel(T x) { + return std::exp(x); + }); // exp_string + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "exp_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/exp_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, exp_string); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "exp_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::exp(a); + }); + }); + #endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, common_dtype, "exp_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::exp(a); + }); }); - }); + } } void expm1_kernel_cuda(TensorIteratorBase& iter) { @@ -53,19 +80,45 @@ void expm1_kernel_cuda(TensorIteratorBase& iter) { // We manually overload rsqrt because std::rsqrt does not work with complex types. 
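The wrapper below expresses rsqrt for complex inputs as 1/sqrt(z), relying on the complex sqrt overloads from c10/util/complex_math.h. A self-contained illustration of the same identity, written against std::complex rather than c10::complex, could look like this (complex_rsqrt is an illustrative name):

#include <complex>

// There is no std::rsqrt overload for complex numbers, so express it via sqrt:
// rsqrt(z) = 1 / sqrt(z).
template <typename T>
std::complex<T> complex_rsqrt(const std::complex<T>& z) {
  return std::complex<T>(T(1), T(0)) / std::sqrt(z);
}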
template -__host__ __device__ static inline scalar_t rsqrt_wrapper(scalar_t v) { +C10_HOST_DEVICE static inline scalar_t rsqrt_wrapper(scalar_t v) { return ::rsqrt(v); } template -__host__ __device__ static inline c10::complex rsqrt_wrapper(c10::complex v) { +C10_HOST_DEVICE static inline c10::complex rsqrt_wrapper(c10::complex v) { const c10::complex one = c10::complex(1.0, 0); // std::sqrt for c10::complex is overloaded in c10/util/complex_math.h return one / ::sqrt(v); } +const char rsqrt_name[] = "rsqrt_kernel"; void rsqrt_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { + #if AT_USE_JITERATOR() + static const auto rsqrt_string = jiterator_stringify( + template + T rsqrt_kernel(T x) { + const T one = T{1}; + return one / std::sqrt(x); + }); // rsqrt_string + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "rsqrt_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/rsqrt_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, rsqrt_string); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "rsqrt_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + // In CUDA, ::rsqrt is overloaded for float and at::Half here is implicitly cast to float. + return rsqrt_wrapper(a); + }); + }); + #endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2( ScalarType::BFloat16, ScalarType::Half, iter.common_dtype(), "rsqrt_cuda", [&]() { @@ -74,14 +127,40 @@ void rsqrt_kernel_cuda(TensorIteratorBase& iter) { return rsqrt_wrapper(a); }); }); + } } +const char sqrt_name[] = "sqrt_kernel"; void sqrt_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "sqrt_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return ::sqrt(a); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { + #if AT_USE_JITERATOR() + static const auto sqrt_string = jiterator_stringify( + template + T sqrt_kernel(T x) { + return std::sqrt(x); + }); // sqrt_string + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "sqrt_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/sqrt_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, sqrt_string); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "sqrt_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::sqrt(a); + }); + }); + #endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, common_dtype, "sqrt_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::sqrt(a); + }); }); - }); + } } void clamp_kernel_cuda(TensorIteratorBase& iter, const Scalar& min_value, const Scalar& max_value) { diff --git a/aten/src/ATen/native/cuda/UnarySignKernels.cu b/aten/src/ATen/native/cuda/UnarySignKernels.cu index b88dc6597bdd3d..a41a59f4e95a07 100644 --- a/aten/src/ATen/native/cuda/UnarySignKernels.cu +++ b/aten/src/ATen/native/cuda/UnarySignKernels.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_NO_OPERATORS #include #include +#include #include #include #include @@ -23,12 +24,38 @@ void logical_not_kernel_cuda(TensorIteratorBase& iter) { } // NB: Ignores the negative bit on tensors +const char neg_name[] = "neg_kernel"; void neg_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(ScalarType::Half, at::ScalarType::BFloat16, iter.dtype(), "neg_cuda", [&]() { + auto dtype 
= iter.dtype(); + if (at::isComplexType(dtype)) { +#if AT_USE_JITERATOR() + static const auto neg_string = jiterator_stringify( + template + T neg_kernel(T a) { + return -a; + } + ); // neg_string + AT_DISPATCH_COMPLEX_TYPES(dtype, "neg_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/ neg_name, + /*return_dtype=*/ scalar_t, + /*common_dtype=*/ scalar_t, + /*arity=*/ 1>(iter, neg_string); + }); +#else + AT_DISPATCH_COMPLEX_TYPES(dtype, "neg_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return -a; + }); + }); +#endif + } else { + AT_DISPATCH_ALL_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, dtype, "neg_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { return -a; }); }); + } } void sign_kernel_cuda(TensorIteratorBase& iter){ @@ -52,7 +79,7 @@ void signbit_kernel_cuda(TensorIteratorBase& iter){ } template -__host__ __device__ static inline c10::complex sgn_wrapper(c10::complex z) { +C10_HOST_DEVICE static inline c10::complex sgn_wrapper(c10::complex z) { if (z == c10::complex(0, 0)) { return c10::complex(0, 0); } else { @@ -60,13 +87,37 @@ __host__ __device__ static inline c10::complex sgn_wrapper(c10::complex z) } } +const char sgn_name[] = "sgn_kernel"; void sgn_kernel_cuda(TensorIteratorBase& iter){ - AT_DISPATCH_COMPLEX_TYPES(iter.dtype(), "sgn_cuda", [&]() { + auto dtype = iter.dtype(); + #if AT_USE_JITERATOR() + static const auto sgn_string = jiterator_stringify( + template + T sgn_kernel(T z) { + const T zero = T(0); + if (z == zero) { + return zero; + } else { + return z / std::abs(z); + } + } + ); // sgn_string + AT_DISPATCH_COMPLEX_TYPES(dtype, "sgn_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/ sgn_name, + /*return_dtype=*/ scalar_t, + /*common_dtype=*/ scalar_t, + /*arity=*/ 1>(iter, sgn_string); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(dtype, "sgn_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { return sgn_wrapper(a); }); }); + #endif } + REGISTER_DISPATCH(logical_not_stub, &logical_not_kernel_cuda); REGISTER_DISPATCH(neg_stub, &neg_kernel_cuda); REGISTER_DISPATCH(sign_stub, &sign_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu index 71a35534702252..84a45a9ec78151 100644 --- a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu +++ b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu @@ -63,7 +63,7 @@ void i0_kernel_cuda(TensorIteratorBase& iter) { } // See note [Jiterator] -const char i0e_name[] = "i0e"; +const char i0e_name[] = "calc_i0e"; void i0e_kernel_cuda(TensorIteratorBase& iter) { #if AT_USE_JITERATOR() AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.common_dtype(), "i0e_cuda", [&]() { @@ -120,12 +120,39 @@ void i1e_kernel_cuda(TensorIteratorBase& iter) { #endif } +const char sigmoid_name[] = "sigmoid"; void sigmoid_kernel_cuda(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, iter.common_dtype(), "sigmoid_cuda", [&]() { - gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return static_cast(1) / (static_cast(1) + std::exp(-a)); + auto common_dtype = iter.common_dtype(); + if (at::isComplexType(common_dtype)) { + // only jiterate for complex-dtype + #if AT_USE_JITERATOR() + static const auto sigmoid_string = jiterator_stringify( + template + T sigmoid(T x) { + return T{1} / (T{1} + std::exp(-x)); + } + ); // sigmoid_string + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "sigmoid_cuda", [&]() { + jitted_gpu_kernel< 
+ /*name=*/sigmoid_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>(iter, sigmoid_string); + }); + #else + AT_DISPATCH_COMPLEX_TYPES(common_dtype, "sigmoid_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return scalar_t{1} / (scalar_t{1} + std::exp(-a)); + }); + }); + #endif + } else { + AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, common_dtype, "sigmoid_cuda", [&]() { + gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { + return scalar_t{1} / (scalar_t{1} + std::exp(-a)); + }); }); - }); + } } const char sinc_name[] = "sinc"; @@ -202,6 +229,23 @@ void ndtri_kernel_cuda(TensorIteratorBase& iter) { #endif } +const char log_ndtr_name[] = "log_ndtr"; +void log_ndtr_kernel_cuda(TensorIteratorBase& iter) { + #if AT_USE_JITERATOR() + AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "log_ndtr_cuda", [&]() { + jitted_gpu_kernel(iter, log_ndtr_string); + }); + #else + AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "log_ndtr_cuda", [&]() { + gpu_kernel( + iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { return calc_log_ndtr(a); }); + }); + #endif +} + void erf_kernel_cuda(TensorIteratorBase& iter) { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, iter.common_dtype(), "erf_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { @@ -264,18 +308,38 @@ void erfcx_kernel_cuda(TensorIteratorBase& iter) { #endif } +const char kaiser_window_name[] = "kaiser_window"; void kaiser_window_kernel_cuda(TensorIteratorBase& iter, int64_t window_length, double beta_){ - AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.dtype(), "kaiser_window_cuda", [&](){ - using opmath_t = at::opmath_type; - const opmath_t inv_alpha = static_cast(2.0 / (window_length - 1)); - const opmath_t beta = static_cast(beta_); - const opmath_t inv_i0_beta = 1.0 / calc_i0(beta); - gpu_kernel(iter, [=]GPU_LAMBDA(scalar_t a) -> scalar_t { - opmath_t x = static_cast(a) * inv_alpha - 1; - opmath_t y = std::max(0, 1 - x * x); - return calc_i0(beta * ::sqrt(y)) * inv_i0_beta; + #if AT_USE_JITERATOR() + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.dtype(), "kaiser_window_cuda", [&](){ + using opmath_t = at::opmath_type; + const opmath_t inv_alpha = static_cast(2.0 / (window_length - 1)); + const opmath_t beta = static_cast(beta_); + const opmath_t inv_i0_beta = 1.0 / calc_i0(beta); + jitted_gpu_kernel< + /*name=*/kaiser_window_name, + /*return_dtype=*/scalar_t, + /*common_dtype=*/scalar_t, + /*arity=*/1>( + iter, + kaiser_window_string, + /*scalar_pos=*/at::cuda::jit::BinaryFuncVariant::NoScalar, + /*scalar_val=*/0, + /*extra_args=*/std::make_tuple(inv_alpha, beta, inv_i0_beta)); }); - }); + #else + AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::BFloat16, iter.dtype(), "kaiser_window_cuda", [&](){ + using opmath_t = at::opmath_type; + const opmath_t inv_alpha = static_cast(2.0 / (window_length - 1)); + const opmath_t beta = static_cast(beta_); + const opmath_t inv_i0_beta = 1.0 / calc_i0(beta); + gpu_kernel(iter, [=]GPU_LAMBDA(scalar_t a) -> scalar_t { + opmath_t x = static_cast(a) * inv_alpha - 1; + opmath_t y = std::max(0, 1 - x * x); + return calc_i0(beta * ::sqrt(y)) * inv_i0_beta; + }); + }); + #endif } const char entr_name[] = "entr"; @@ -322,6 +386,7 @@ REGISTER_DISPATCH(erfinv_stub, &erfinv_kernel_cuda); REGISTER_DISPATCH(kaiser_window_stub, &kaiser_window_kernel_cuda); REGISTER_DISPATCH(special_entr_stub, &entr_kernel_cuda); 
REGISTER_DISPATCH(special_ndtri_stub, &ndtri_kernel_cuda); +REGISTER_DISPATCH(special_log_ndtr_stub, &log_ndtr_kernel_cuda); REGISTER_DISPATCH(special_erfcx_stub, &erfcx_kernel_cuda); } // namespace native diff --git a/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu b/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu index 8b43900e92716c..90f5238d0180da 100644 --- a/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu +++ b/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include diff --git a/aten/src/ATen/native/cuda/Unique.cu b/aten/src/ATen/native/cuda/Unique.cu index d268ca1c490389..e25acb8e06efa0 100644 --- a/aten/src/ATen/native/cuda/Unique.cu +++ b/aten/src/ATen/native/cuda/Unique.cu @@ -1,8 +1,22 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/cuda/UniqueCub.cu b/aten/src/ATen/native/cuda/UniqueCub.cu index bda84bdda4e12d..cc19b96a779714 100644 --- a/aten/src/ATen/native/cuda/UniqueCub.cu +++ b/aten/src/ATen/native/cuda/UniqueCub.cu @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include @@ -5,6 +6,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace internal { diff --git a/aten/src/ATen/native/cuda/UniqueCub.cuh b/aten/src/ATen/native/cuda/UniqueCub.cuh index 1bb96e3f5ebdf9..6e1cccc2e175cb 100644 --- a/aten/src/ATen/native/cuda/UniqueCub.cuh +++ b/aten/src/ATen/native/cuda/UniqueCub.cuh @@ -1,4 +1,4 @@ -#include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/UpSample.cuh b/aten/src/ATen/native/cuda/UpSample.cuh index f4d85512ba7242..09e460640df8de 100644 --- a/aten/src/ATen/native/cuda/UpSample.cuh +++ b/aten/src/ATen/native/cuda/UpSample.cuh @@ -1,9 +1,11 @@ +#pragma once #include #include #include #include #include +#include #include @@ -14,7 +16,7 @@ namespace upsample { // TODO: Remove duplicate declaration. TORCH_API c10::SmallVector compute_output_size( c10::IntArrayRef input_size, // Full input tensor size. 
- c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors); } // namespace upsample diff --git a/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu b/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu index 29dec1735f2383..1214955b06d441 100644 --- a/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu +++ b/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu @@ -1,12 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu b/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu index 09ec9528ead149..d76e2783207f19 100644 --- a/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu +++ b/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu @@ -1,9 +1,10 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include +#include #include #include #include @@ -12,6 +13,20 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/UpSampleLinear1d.cu b/aten/src/ATen/native/cuda/UpSampleLinear1d.cu index c23887cb79a6b7..af9edca2280e6f 100644 --- a/aten/src/ATen/native/cuda/UpSampleLinear1d.cu +++ b/aten/src/ATen/native/cuda/UpSampleLinear1d.cu @@ -1,15 +1,24 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include +#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/UpSampleNearest1d.cu b/aten/src/ATen/native/cuda/UpSampleNearest1d.cu index 52b7b1d70947b1..decdfca30d7838 100644 --- a/aten/src/ATen/native/cuda/UpSampleNearest1d.cu +++ b/aten/src/ATen/native/cuda/UpSampleNearest1d.cu @@ -1,12 +1,23 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/UpSampleNearest2d.cu b/aten/src/ATen/native/cuda/UpSampleNearest2d.cu index 7b2a58c764bb46..8aa4f68aeda64c 100644 --- a/aten/src/ATen/native/cuda/UpSampleNearest2d.cu +++ b/aten/src/ATen/native/cuda/UpSampleNearest2d.cu @@ -1,7 +1,8 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include +#include #include #include #include @@ -10,6 +11,17 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/UpSampleNearest3d.cu b/aten/src/ATen/native/cuda/UpSampleNearest3d.cu index 3b12614c10d5e4..1a4afa012d780e 100644 --- a/aten/src/ATen/native/cuda/UpSampleNearest3d.cu +++ b/aten/src/ATen/native/cuda/UpSampleNearest3d.cu @@ -1,11 +1,28 @@ -#include 
+#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#include #include #include -#include +#include #include #include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace native { @@ -322,7 +339,7 @@ using at::native::upsample_cuda::get_scale_value; Tensor upsample_nearest3d_cuda( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); @@ -333,7 +350,7 @@ Tensor upsample_nearest3d_cuda( Tensor _upsample_nearest_exact3d_cuda( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); @@ -345,7 +362,7 @@ Tensor _upsample_nearest_exact3d_cuda( // when structured kernels can handle QuantizedCPU, update these overloads to be CompositeExplicitAutograd Tensor upsample_nearest3d_backward_cuda( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); @@ -357,7 +374,7 @@ Tensor upsample_nearest3d_backward_cuda( Tensor _upsample_nearest_exact3d_backward_cuda( const Tensor& grad_output, - c10::optional output_size, + at::OptionalIntArrayRef output_size, IntArrayRef input_size, c10::optional> scale_factors) { auto osize = compute_output_size(input_size, output_size, scale_factors); diff --git a/aten/src/ATen/native/cuda/UpSampleTrilinear3d.cu b/aten/src/ATen/native/cuda/UpSampleTrilinear3d.cu index a3623d2eb0f8b2..b19bf4858ac629 100644 --- a/aten/src/ATen/native/cuda/UpSampleTrilinear3d.cu +++ b/aten/src/ATen/native/cuda/UpSampleTrilinear3d.cu @@ -1,9 +1,10 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include +#include #include #include #include @@ -12,6 +13,14 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/WeightNorm.cu b/aten/src/ATen/native/cuda/WeightNorm.cu index e9136ca61388bd..c451bc55349a8e 100644 --- a/aten/src/ATen/native/cuda/WeightNorm.cu +++ b/aten/src/ATen/native/cuda/WeightNorm.cu @@ -1,11 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cuda/group_norm_kernel.cu b/aten/src/ATen/native/cuda/group_norm_kernel.cu index f05f6e390edab5..53ce77fa37b113 100644 --- a/aten/src/ATen/native/cuda/group_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/group_norm_kernel.cu @@ -1,13 +1,13 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include -#include +#include #include #include -#include #include #include #include @@ -15,6 +15,12 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + 
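Most files touched by this patch receive the same include treatment shown above: TORCH_ASSERT_ONLY_METHOD_OPERATORS is defined at the top of the translation unit and operator declarations are pulled in through an AT_PER_OPERATOR_HEADERS guard instead of the umbrella ATen header. The usual shape of that change, with the per-op header names standing in for whatever operators the file actually calls, is roughly:

#define TORCH_ASSERT_ONLY_METHOD_OPERATORS  // forbid at::op(...) free functions in this TU
#include <ATen/core/Tensor.h>               // Tensor type only, not the full ATen umbrella

#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>          // monolithic build: all operator declarations
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/empty.h>          // per-operator build: only what this file uses
#include <ATen/ops/zeros.h>          // (placeholder examples)
#endif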
namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/jit_utils.cpp b/aten/src/ATen/native/cuda/jit_utils.cpp index c8010a6e9b0afa..e7798d69fafb0d 100644 --- a/aten/src/ATen/native/cuda/jit_utils.cpp +++ b/aten/src/ATen/native/cuda/jit_utils.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_NO_OPERATORS #include #include #include @@ -10,6 +11,7 @@ #include #include #include +#include #include #include @@ -118,6 +120,11 @@ const std::string jit_common_types = R"ESCAPE( Array() = default; Array(const Array&) = default; Array& operator=(const Array&) = default; + __device__ Array(T x) { + for (int i = 0; i < size; i++) { + data[i] = x; + } + } }; ${half_string} @@ -322,10 +329,7 @@ const std::string no_dynamic_cast_support_literal = R"ESCAPE( )ESCAPE"; -const std::string jit_code_template = R"ESCAPE( - - ${dynamic_casting_string} - +const std::string offset_calc_template = R"ESCAPE( template struct DivMod { T div; @@ -409,6 +413,14 @@ const std::string jit_code_template = R"ESCAPE( ${index_type} strides_[25][NARGS]; }; + +)ESCAPE"; + +const std::string jit_code_template = R"ESCAPE( + + ${dynamic_casting_string} + + ${functor} // TODO: setup grid-stride loop @@ -769,7 +781,7 @@ std::string generate_code( << ">(out[j], data[0], output_offsets[0]);\n"; env.s("store_outputs", store_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + jit_code_template); + static auto cuda_template = at::jit::CodeTemplate(jit_common_types + offset_calc_template + jit_code_template); const auto code = cuda_template.format(env); return code; } @@ -808,6 +820,126 @@ std::string generate_code( return code; } +// Creates directories recursively +bool _r_mkdir(const std::string& dir) { + // Check if current dir exists + const char* p_dir = dir.c_str(); + const bool dir_exists = (access(p_dir, F_OK) == 0); + if (dir_exists) { + return true; + } + + // Try to create current directory +#ifdef _WIN32 + int ret = _mkdir(dir.c_str()); +#else + int ret = mkdir(dir.c_str(), S_IRWXU | S_IRWXG | S_IRWXO); +#endif + // Success + if (ret == 0) { + return true; + } + + // Find folder separator and check if we are at the top + auto pos = dir.find_last_of("/\\"); + if (pos == std::string::npos) { + return false; + } + + // Try to create parent directory + if (!(_r_mkdir(dir.substr(0, pos)))) { + return false; + } + + // Try to create complete path again +#ifdef _WIN32 + ret = _mkdir(dir.c_str()); +#else + ret = mkdir(dir.c_str(), S_IRWXU | S_IRWXG | S_IRWXO); +#endif + return ret == 0; +} + +// Creates directories recursively assuming that base exists +bool r_mkdir_with_base(std::string& base, std::string& dir){ + const char* p_base = base.c_str(); + const bool base_exists = (access(p_base, F_OK) == 0); + if (!base_exists) { + return false; + } + + // remove trailing '/' or '\\' + if ((base[base.size()-1]=='/') || base[base.size()-1]=='\\') { + base.pop_back(); + } + if ((dir[dir.size()-1]=='/') || dir[dir.size()-1]=='\\') { + dir.pop_back(); + } + + return _r_mkdir(base+dir); + +} + +std::string load_code_template(const std::string& path) { + std::ifstream ifs{path}; + std::string s{ + std::istreambuf_iterator(ifs), + std::istreambuf_iterator()}; + return s; +} + +std::string generate_reduction_code( + int nOutputs, + const std::string& func, + const std::string& name, + const int vt0, + const std::string& f_inputs_type, + const std::string& reduction_accum_type, + const std::string& result_type, + bool contiguous, + bool vectorized, + int vec_size, + int max_threads_codegen) { + 
at::jit::TemplateEnv env; + env.s("index_type", "unsigned int"); + env.s("scalar_type", f_inputs_type); + env.s("result_type", result_type); + env.s("reduction_accum_type", reduction_accum_type); + env.s("vt0", std::to_string(vt0)); + env.s("name", name); + env.s("max_threads_lb", std::to_string(max_threads_codegen)); + // reductions don't support dynamic casting, so the only way to get nonstandard types + // is through input + if (f_inputs_type == "at::Half") { + env.s("half_string", jiterator_half_support_literal); + } else { + env.s("half_string", ""); + } + if (f_inputs_type == "at::BFloat16") { + env.s("bfloat16_string", jiterator_bfloat16_support_literal); + } else { + env.s("bfloat16_string", ""); + } + if (f_inputs_type == "std::complex" || + f_inputs_type == "std::complex" ) { + env.s("traits_string", get_traits_string()); + env.s("complex_body_string", get_complex_body_string()); + env.s("complex_math_string", get_complex_math_string()); + env.s("complex", std::to_string(1)); + } else { + env.s("traits_string", ""); + env.s("complex_body_string", ""); + env.s("complex_math_string", ""); + env.s("complex", std::to_string(0)); + } + env.s("cmath_string", get_cmath_string()); + env.s("functor", func); + env.s("output_vec_size", std::to_string(vec_size)); + static auto cuda_template = at::jit::CodeTemplate( + jit_common_types + offset_calc_template + get_reduction_template()); + const auto code = cuda_template.format(env); + return code; +} // Acquires (possibly creating) the kernel cache directory c10::optional get_cache_dir() { @@ -822,6 +954,8 @@ c10::optional get_cache_dir() { // Cache path comes from PYTORCH_KERNEL_CACHE_PATH, then TEMP (Windows) or XDG_CACHE_HOME (Linux), then HOME environment variables std::string cache_dir; char* ptkcp = std::getenv("PYTORCH_KERNEL_CACHE_PATH"); + // Create kernel_cache_dir if needed as we do not want to create the base directory passed by the user + std::string kernels_cache_dir = ""; if (ptkcp != nullptr) { cache_dir = std::string(ptkcp); } else { @@ -832,7 +966,8 @@ c10::optional get_cache_dir() { ptkcp = std::getenv("XDG_CACHE_HOME"); #endif if (ptkcp != nullptr) { - cache_dir = std::string(ptkcp) + "/torch/kernels"; + kernels_cache_dir = "/torch/kernels"; + cache_dir = std::string(ptkcp) + kernels_cache_dir; } else { // Falls back to HOME/.cache ptkcp = std::getenv("HOME"); @@ -841,7 +976,8 @@ c10::optional get_cache_dir() { " This disables kernel caching."); return {}; } else { - cache_dir = std::string(ptkcp) + "/.cache/torch/kernels"; + kernels_cache_dir = "/.cache/torch/kernels"; + cache_dir = std::string(ptkcp) + kernels_cache_dir; } } } @@ -850,11 +986,8 @@ c10::optional get_cache_dir() { const char* p_cache_dir = cache_dir.c_str(); const bool cache_dir_exists = (access(p_cache_dir, F_OK) == 0); if (!cache_dir_exists) { -#ifdef _WIN32 - if (_mkdir(p_cache_dir) != 0) { -#else - if (mkdir(p_cache_dir, S_IRWXU | S_IRWXG | S_IRWXO) != 0) { -#endif + std::string s_ptkcp = std::string(ptkcp); + if (!r_mkdir_with_base(s_ptkcp, kernels_cache_dir)) { TORCH_WARN_ONCE("Specified kernel cache directory could not be created! 
This disables kernel caching.", " Specified directory is ", cache_dir, ".", " This warning will appear only once per process."); @@ -886,9 +1019,7 @@ c10::optional get_cache_dir() { NvrtcFunction jit_pwise_function( const std::string& code, const std::string& kernel_name) { - initializeCudaContext(); - // Acquires CUDA and nvrtc versions and whether we're compiling to ptx or SASS const cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties(); int cuda_major = 0, cuda_minor = 0, nvrtc_major = 0, nvrtc_minor = 0; @@ -983,7 +1114,7 @@ NvrtcFunction jit_pwise_function( AT_CUDA_NVRTC_CHECK(nvrtc.nvrtcGetProgramLog(program, log.data())); std::stringstream cu; cu << log.data(); - throw std::runtime_error(cu.str() + code); + throw std::runtime_error(code + cu.str()); } size_t ptx_size = 0; @@ -1049,24 +1180,26 @@ NvrtcFunction jit_pwise_function( void launch_jitted_pwise_function( NvrtcFunction function, void* args[], - const int nBlocks, - const int kBlockSize) { + const dim3 nBlocks, + const dim3 kBlockSize, + const int smem) { initializeCudaContext(); const auto& nvrtc = at::globalContext().getNVRTC(); // Launches kernel on current stream auto stream = at::cuda::getCurrentCUDAStream(); AT_CUDA_DRIVER_CHECK(nvrtc.cuLaunchKernel( function.function, - nBlocks, - 1, - 1, - kBlockSize, - 1, - 1, - 0, + nBlocks.x, + nBlocks.y, + nBlocks.z, + kBlockSize.x, + kBlockSize.y, + kBlockSize.z, + smem, stream, args, nullptr)); } + }}} // at::cuda::jit diff --git a/aten/src/ATen/native/cuda/jit_utils.h b/aten/src/ATen/native/cuda/jit_utils.h index 1f0f9c491b17a8..1ff6de701fc34c 100644 --- a/aten/src/ATen/native/cuda/jit_utils.h +++ b/aten/src/ATen/native/cuda/jit_utils.h @@ -32,6 +32,19 @@ std::string generate_code( bool vectorized=false, int vec_size=0); +std::string generate_reduction_code( + int nOutputs, + const std::string& func, + const std::string& name, + const int vt0, + const std::string& f_inputs_type, + const std::string& reduction_accum_type, + const std::string& result_type, + bool contiguous, + bool vectorized, + int vec_size, + int max_threads_codegen); + NvrtcFunction jit_pwise_function( const std::string& code, const std::string& kernel_name); @@ -39,8 +52,9 @@ NvrtcFunction jit_pwise_function( void launch_jitted_pwise_function( NvrtcFunction function, void* args[], - const int nBlocks, - const int kBlockSize); + const dim3 nBlocks, + const dim3 kBlockSize, + const int smem=0); template struct delayed_false : std::false_type { diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 9fc2d02067092f..faa0fd2d4b9811 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -1,18 +1,29 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include -#include +#include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include namespace at { @@ -934,6 +945,7 @@ std::tuple layer_norm_backward_cuda( return std::make_tuple(std::move(dX), std::move(dgamma), std::move(dbeta)); } +REGISTER_DISPATCH(LayerNormKernel, &LayerNormKernelImpl); } // namespace native } // namespace at diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp index 1099ba88cb4897..de4f222b362604 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp +++ 
b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp @@ -2882,19 +2882,27 @@ static void apply_lu_solve_looped_magma(const Tensor& b, const Tensor& lu, const auto pivots_data = pivots_cpu.data_ptr(); auto b_stride = matrixStride(b); - auto lu_stride = matrixStride(lu); - auto pivots_stride = pivots_cpu.size(-1); + auto lu_stride = lu.dim() > 2 ? lu.stride(-3) : 0; + auto pivots_stride = pivots_cpu.dim() > 1 ? pivots_cpu.stride(-2) : 0; auto batch_size = batchCount(b); magma_int_t n = magma_int_cast(lu.size(-2), "n"); magma_int_t nrhs = magma_int_cast(b.size(-1), "nrhs"); auto leading_dimension = std::max(1, n); + // lu and pivots tensors can be broadcast to b + // here we construct a helper indexing tensor to linearly index into lu and pivots + IntArrayRef lu_batch_shape(lu.sizes().data(), lu.dim() - 2); + IntArrayRef b_batch_shape(b.sizes().data(), b.dim() - 2); + BroadcastLinearIndices lu_index( + batchCount(lu), lu_batch_shape, b_batch_shape); + int info = 0; for (decltype(batch_size) i = 0; i < batch_size; i++) { + int64_t lu_index_i = lu_index(i); scalar_t* b_working_ptr = &b_data[i * b_stride]; - scalar_t* lu_working_ptr = &lu_data[i * lu_stride]; - int* pivots_working_ptr = &pivots_data[i * pivots_stride]; + scalar_t* lu_working_ptr = &lu_data[lu_index_i * lu_stride]; + int* pivots_working_ptr = &pivots_data[lu_index_i * pivots_stride]; magmaLuSolve(n, nrhs, lu_working_ptr, leading_dimension, pivots_working_ptr, b_working_ptr, leading_dimension, &info, trans); @@ -2927,6 +2935,8 @@ static void apply_lu_solve_batched_magma(const Tensor& b, const Tensor& lu, cons "Calling torch.lu_solve on a CUDA tensor requires compiling ", "PyTorch with MAGMA. Please rebuild with MAGMA."); #else + TORCH_INTERNAL_ASSERT(batchCount(b) == batchCount(lu), "batch_size of b and lu must be the same"); + TORCH_INTERNAL_ASSERT(batchCount(lu) == batchCount(pivots.unsqueeze(-1)), "batch_size of lu and pivots must be the same"); auto trans = to_magma(transpose); auto b_data = b.data_ptr(); auto lu_data = lu.data_ptr(); @@ -2993,9 +3003,36 @@ static void lu_solve_looped_magma(const Tensor& b, const Tensor& lu, const Tenso }); } +namespace { + +c10::MaybeOwned maybe_expand_lu(const Tensor& b, const Tensor& lu) { + if (batchCount(b) != batchCount(lu)) { + IntArrayRef b_batch_size(b.sizes().data(), b.dim() - 2); + DimVector expand_size(b_batch_size); + expand_size.insert(expand_size.end(), {lu.size(-2), lu.size(-1)}); + return c10::MaybeOwned::owned( + cloneBatchedColumnMajor(lu.expand(expand_size))); + } else { + return c10::MaybeOwned::borrowed(lu); + } +} + +c10::MaybeOwned maybe_expand_pivots(const Tensor& b,const Tensor& pivots) { + if (batchCount(b) != batchCount(pivots.unsqueeze(-1))) { + IntArrayRef b_batch_size(b.sizes().data(), b.dim() - 2); + DimVector expand_size(b_batch_size); + expand_size.insert(expand_size.end(), {pivots.size(-1)}); + return c10::MaybeOwned::owned( + pivots.expand(expand_size).clone(at::MemoryFormat::Contiguous)); + } else { + return c10::MaybeOwned::borrowed(pivots); + } +} + +} // anonymous namespace static void lu_solve_trans_dispatch(const Tensor& b, const Tensor& lu, const Tensor& pivots, TransposeType trans) { - auto batch_size = batchCount(lu); + auto batch_size = batchCount(b); auto m = lu.size(-2); auto b2 = b.size(-1); bool over_magma_dim_limit = b2 > 1024; // magma implementation of LU solve cannot handle a b tensor with last dim > 1024 (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for) @@ -3011,11 +3048,15 @@ static void 
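// [Illustrative sketch, not part of the diff] the BroadcastLinearIndices helper
// used above maps a linear batch index over b onto the corresponding linear
// batch index over lu, collapsing the dimensions in which lu was broadcast.
// A standalone version of that mapping (the helper name is hypothetical and it
// assumes lu's batch shape has already been padded to b's rank):
int64_t broadcast_linear_index(
    int64_t i, c10::IntArrayRef b_batch, c10::IntArrayRef lu_batch) {
  int64_t out = 0;
  int64_t stride = 1;
  for (int64_t d = static_cast<int64_t>(b_batch.size()) - 1; d >= 0; --d) {
    const int64_t coord = i % b_batch[d];
    i /= b_batch[d];
    out += (lu_batch[d] == 1 ? 0 : coord) * stride;  // broadcast dims always read index 0
    stride *= lu_batch[d];
  }
  return out;
}
// e.g. b batch shape {2, 3} and lu batch shape {1, 3}: batch indices 0..5 of b
// read lu batches 0, 1, 2, 0, 1, 2.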
lu_solve_trans_dispatch(const Tensor& b, const Tensor& lu, const Ten #endif // ifdef USE_CUSOLVER #ifdef CUDART_VERSION else if ((batch_size > 2 && m <= 128) || (batch_size > 8 && over_magma_dim_limit)) { - lu_solve_batched_cublas(b, lu, pivots, trans); + c10::MaybeOwned lu_ = maybe_expand_lu(b, lu); + c10::MaybeOwned pivots_ = maybe_expand_pivots(b, pivots); + lu_solve_batched_cublas(b, *lu_, *pivots_, trans); } #endif // ifdef CUDART_VERSION else { - lu_solve_batched_magma(b, lu, pivots, trans); + c10::MaybeOwned lu_ = maybe_expand_lu(b, lu); + c10::MaybeOwned pivots_ = maybe_expand_pivots(b, pivots); + lu_solve_batched_magma(b, *lu_, *pivots_, trans); } } @@ -3190,27 +3231,20 @@ void lstsq_kernel(const Tensor& a, Tensor& b, Tensor& /*rank*/, Tensor& /*singul "Please rebuild with cuSOLVER."); #endif } else { // m >= n -#if !AT_MAGMA_ENABLED() - // MAGMA is not available we can either use cuBLAS or cuSOLVER here +#if !AT_ROCM_ENABLED() + // On CUDA platform we use either cuBLAS or cuSOLVER here // the batched vs looped dispatch is implemented based on the following performance results // https://github.com/pytorch/pytorch/pull/54725#issuecomment-832234456 if (m <= 256 && batchCount(b) >= std::max(2, m / 16)) { - // if CUDART_VERSION is defined then cuBLAS is available - #ifdef CUDART_VERSION gels_batched_cublas(a, b, infos); - #else - // this would either call cuSOLVER or MAGMA, - // if MAGMA is called a runtime error is thrown about not finding MAGMA in compilation - gels_looped(a, b, infos); - #endif // CUDART_VERSION } else { gels_looped(a, b, infos); } #else - // if both MAGMA and cuSOLVER are available this would call cuSOLVER - // MAGMA is called if cuSOLVER is not available - gels_looped(a, b, infos); -#endif // AT_MAGMA_ENABLED() + // On ROCm platform we can only use MAGMA here + // If MAGMA is not available, an error will be thrown + gels_magma(a, b, infos); +#endif // !AT_ROCM_ENABLED() } } diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp index 279c289e9e54ba..5b582a2fd2fb16 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp @@ -96,6 +96,8 @@ static void apply_lu_solve_batched_cublas(const Tensor& b, const Tensor& lu, con #ifndef CUDART_VERSION TORCH_CHECK(false, "lu_solve: cuBLAS backend for lu_solve is not available.") #else + TORCH_INTERNAL_ASSERT(batchCount(b) == batchCount(lu), "batch_size of b and lu must be the same"); + TORCH_INTERNAL_ASSERT(batchCount(lu) == batchCount(pivots.unsqueeze(-1)), "batch_size of lu and pivots must be the same"); const auto trans = to_cublas(transpose); auto pivots_data = pivots.data_ptr(); @@ -1446,26 +1448,34 @@ void lu_solve_looped_cusolver(const Tensor& b, const Tensor& lu, const Tensor& p const auto trans = to_cublas(transpose); int n = cuda_int_cast(lu.size(-2), "n"); int nrhs = cuda_int_cast(b.size(-1), "nrhs"); - auto batch_size = batchCount(lu); + auto batch_size = batchCount(b); auto info = at::zeros({1}, lu.options().dtype(kInt)); auto info_data = info.data_ptr(); auto b_data = b.data_ptr(); auto lu_data = lu.data_ptr(); auto pivots_data = pivots.data_ptr(); - auto pivots_stride = pivots.size(-1); - auto lu_stride = matrixStride(lu); + auto pivots_stride = pivots.dim() > 1 ? pivots.stride(-2) : 0; + auto lu_stride = lu.dim() > 2 ? 
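// [Illustrative note, not part of the diff] maybe_expand_lu/maybe_expand_pivots
// return c10::MaybeOwned<Tensor> so the common non-broadcast case stays
// copy-free: borrowed() simply aliases the argument, owned() holds the
// materialized broadcast copy, and the caller dereferences both the same way.
void dispatch_example(const Tensor& b, const Tensor& lu, const Tensor& pivots,
                      TransposeType trans) {
  c10::MaybeOwned<Tensor> lu_ = maybe_expand_lu(b, lu);
  c10::MaybeOwned<Tensor> pivots_ = maybe_expand_pivots(b, pivots);
  lu_solve_batched_magma(b, *lu_, *pivots_, trans);  // batch counts now match b
}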
lu.stride(-3) : 0; auto b_stride = matrixStride(b); int leading_dimension = cuda_int_cast(std::max(1, n), "leading_dimension"); + // lu and pivots tensors can be broadcast to b + // here we construct a helper indexing tensor to linearly index into lu and pivots + IntArrayRef lu_batch_shape(lu.sizes().data(), lu.dim() - 2); + IntArrayRef b_batch_shape(b.sizes().data(), b.dim() - 2); + BroadcastLinearIndices lu_index( + batchCount(lu), lu_batch_shape, b_batch_shape); + auto handle = at::cuda::getCurrentCUDASolverDnHandle(); for (auto batch = decltype(batch_size){0}; batch < batch_size; ++batch) { + int64_t lu_index_i = lu_index(batch); at::cuda::solver::getrs( handle, n, nrhs, - lu_data + batch * lu_stride, + lu_data + lu_index_i * lu_stride, leading_dimension, - pivots_data + batch * pivots_stride, + pivots_data + lu_index_i * pivots_stride, b_data + batch * b_stride, leading_dimension, info_data, diff --git a/aten/src/ATen/native/cuda/reduction_template.cuh b/aten/src/ATen/native/cuda/reduction_template.cuh new file mode 100644 index 00000000000000..4d9d559d8ec8a6 --- /dev/null +++ b/aten/src/ATen/native/cuda/reduction_template.cuh @@ -0,0 +1,664 @@ +namespace at { +namespace cuda { +//windows doesn't like large string literals, so split in two +const std::string reduction_template_0 = R"ESCAPE( + #define C10_HOST_DEVICE __host__ __device__ + #define C10_DEVICE __device__ + + template + __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) + { + return __shfl_down_sync(mask, value, delta, width); + } + + + #if ${complex} + template + __device__ __forceinline__ std::complex WARP_SHFL_DOWN(std::complex value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) + { + return std::complex( + __shfl_down_sync(mask, value.real(), delta, width), + __shfl_down_sync(mask, value.imag(), delta, width)); + } + #endif + + // aligned vector generates vectorized load/store on CUDA + template + struct alignas(sizeof(scalar_t) * vec_size) aligned_vector { + scalar_t val[vec_size]; + }; + + + C10_HOST_DEVICE static void reduce_fraction(size_t &numerator, size_t &denominator) { + // get GCD of num and denom using Euclid's algorithm. + // Can replace this with std::gcd if we ever support c++17. 
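// [Illustrative worked example, not part of the diff] ReduceJitOp::run() calls
// reduce_fraction with numerator = sizeof(arg_t) and denominator =
// sizeof(out_scalar_t) to turn a byte offset into dst into the matching offset
// into acc_buf. For example, with arg_t = double (8 bytes) and
// out_scalar_t = float (4 bytes):
//   a = 4, b = 8  ->  a %= b gives 4, swap -> a = 8, b = 4
//   a %= b gives 0, swap -> a = 4, b = 0        (gcd == 4)
//   numerator   = 8 / 4 = 2
//   denominator = 4 / 4 = 1
// so an output byte offset is scaled by 2/1 when indexing the accumulation buffer.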
+ size_t a = denominator; + size_t b = numerator; + while (b != 0) { + a %= b; + // swap(a,b) + size_t tmp = a; + a = b; + b = tmp; + } + + // a is now the GCD + numerator /= a; + denominator /= a; + } + + + + + struct ReduceConfig { + //has to match host-side ReduceConfig in the eager code + static constexpr int BLOCK_X = 0; + static constexpr int BLOCK_Y = 1; + static constexpr int CTA = 2; + + static constexpr int input_vec_size = 4; + int element_size_bytes; + int num_inputs; + int num_outputs; + int step_input = 1; + int step_output = 1; + int ctas_per_output = 1; + int input_mult[3] = {0, 0, 0}; + int output_mult[2] = {0, 0}; + + int block_width; + int block_height; + int num_threads; + + bool vectorize_input = false; + int output_vec_size = 1; + + C10_HOST_DEVICE bool should_block_x_reduce() const { + return input_mult[BLOCK_X] != 0; + } + + C10_HOST_DEVICE bool should_block_y_reduce() const { + return input_mult[BLOCK_Y] != 0; + } + + C10_HOST_DEVICE bool should_global_reduce() const { + return input_mult[CTA] != 0; + } + + C10_DEVICE bool should_store(int output_idx) const { + return output_idx < num_outputs && + (!should_block_x_reduce() || threadIdx.x == 0) && + (!should_block_y_reduce() || threadIdx.y == 0); + } + + C10_DEVICE bool should_reduce_tail() const { + return (!should_block_y_reduce() || threadIdx.y == 0) && + (!should_global_reduce() || blockIdx.y == 0); + } + + C10_HOST_DEVICE int input_idx() const { + int lane = threadIdx.x; + int warp = threadIdx.y; + int cta2 = blockIdx.y; + return (lane * input_mult[BLOCK_X] + + warp * input_mult[BLOCK_Y] + + cta2 * input_mult[CTA]); + } + + template + C10_HOST_DEVICE int output_idx() const { + int lane = threadIdx.x; + int warp = threadIdx.y; + int cta1 = blockIdx.x; + return (lane * output_mult[BLOCK_X] + + warp * output_mult[BLOCK_Y] + + cta1 * step_output) * output_vec_size; + } + + C10_DEVICE int shared_memory_offset(int offset) const { + return threadIdx.x + (threadIdx.y + offset) * blockDim.x; + } + + C10_DEVICE int staging_memory_offset(int cta2) const { + int offset = cta2 + blockIdx.x * gridDim.y; + if (!should_block_x_reduce()) { + offset = threadIdx.x + offset * blockDim.x; + } + return offset; + } + + + }; + + +//TODO this will need to be different for more generic reduction functions +namespace reducer { + + using scalar_t = ${scalar_type}; + using arg_t = ${reduction_accum_type}; + using out_scalar_t = ${result_type}; + + + inline __device__ ${functor} + + inline __device__ out_scalar_t project(arg_t arg) { + return (out_scalar_t) arg; + } + + inline __device__ arg_t warp_shfl_down(arg_t arg, int offset) { + return WARP_SHFL_DOWN(arg, offset); + } + + inline __device__ arg_t translate_idx(arg_t acc, int64_t /*idx*/) { + return acc; + } + + // wrap a normal reduction that ignores the index + inline __device__ arg_t reduce(arg_t acc, arg_t val, int64_t idx) { + return combine(acc, val); + } +} + + +struct ReduceJitOp { + using scalar_t = ${scalar_type}; + using arg_t = ${reduction_accum_type}; + using out_scalar_t = ${result_type}; + + using InputCalculator = OffsetCalculator<1>; + using OutputCalculator = OffsetCalculator<2>; + +// static constexpr bool can_accumulate_in_output = +// std::is_convertible::value +// && std::is_convertible::value; + + static constexpr int input_vec_size = ReduceConfig::input_vec_size; + + arg_t ident; + ReduceConfig config; + InputCalculator input_calc; + OutputCalculator output_calc; + const void* src; + const char* dst[2]; //it accepts at most two destinations + // acc_buf used for 
accumulation among sub Tensor Iterator when accumulation on + // output is not permissible + void* acc_buf; + // cta_buf used for accumulation between blocks during global reduction + void* cta_buf; + int* semaphores; + int64_t base_idx; + bool accumulate; + bool final_output; + int noutputs; + + + C10_DEVICE void run() const { + extern __shared__ char shared_memory[]; + uint32_t output_idx = config.output_idx<${output_vec_size}>(); + uint32_t input_idx = config.input_idx(); + auto base_offsets1 = output_calc.get(output_idx)[1]; + + using arg_vec_t = Array; + arg_vec_t value; + + if (output_idx < config.num_outputs && input_idx < config.num_inputs) { + const scalar_t* input_slice = (const scalar_t*)((const char*)src + base_offsets1); + + value = thread_reduce<${output_vec_size}>(input_slice); + } + + if (config.should_block_y_reduce()) { + value = block_y_reduce<${output_vec_size}>(value, shared_memory); + } + if (config.should_block_x_reduce()) { + value = block_x_reduce<${output_vec_size}>(value, shared_memory); + } + + using out_ptr_vec_t = Array; + using offset_vec_t = Array; + offset_vec_t base_offsets; + out_ptr_vec_t out; + + #pragma unroll + for (int i = 0; i < ${output_vec_size}; i++) { + base_offsets[i] = output_calc.get(output_idx + i)[0]; + out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]); + } + + arg_vec_t* acc = nullptr; + if (acc_buf != nullptr) { + size_t numerator = sizeof(arg_t); + size_t denominator = sizeof(out_scalar_t); + reduce_fraction(numerator, denominator); + acc = (arg_vec_t*)((char*)acc_buf + (base_offsets[0] * numerator / denominator)); + } + + if (config.should_global_reduce()) { + value = global_reduce<${output_vec_size}>(value, acc, shared_memory); + } else if (config.should_store(output_idx)) { + if (accumulate) { + #pragma unroll + for (int i = 0; i < ${output_vec_size}; i++) { + value[i] = reducer::translate_idx(value[i], base_idx); + } + } + + if (acc == nullptr) { + if (accumulate) { + value = accumulate_in_output<${output_vec_size}>(out, value); + } + if (final_output) { + set_results_to_output<${output_vec_size}>(value, base_offsets); + } else { + #pragma unroll + for (int i = 0; i < ${output_vec_size}; i++) { + *(out[i]) = get_accumulated_output(out[i], value[i]); + } + } + } else { + if (accumulate) { + #pragma unroll + for (int i = 0; i < ${output_vec_size}; i++) { + value[i] = reducer::combine((*acc)[i], value[i]); + } + } + if (final_output) { + set_results_to_output<${output_vec_size}>(value, base_offsets); + } else { + *acc = value; + } + } + } + } + + template + C10_DEVICE Array thread_reduce(const scalar_t* data) const { + if (config.vectorize_input) { + assert(output_vec_size == 1); + // reduce at the header of input_slice where memory is not aligned, + // so that thread_reduce will have an aligned memory to work on. 
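// [Illustrative worked example, not part of the diff] for scalar_t = float and
// input_vec_size = 4 the vector type is 16 bytes wide, so align_bytes = 16 and
// align_elements = 4. If the input slice starts 8 bytes past a 16-byte
// boundary, shift == 2 in input_vectorized_thread_reduce_impl below: the data
// pointer is moved back 2 elements onto the boundary, threads with
// threadIdx.x in [2, 4) reduce the 2 unaligned head elements, the main loop
// then streams aligned 4-element vectors, and any remaining (< 4) tail
// elements are picked up by the scalar tail loop at the end.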
+ return {input_vectorized_thread_reduce_impl(data)}; + } else { + uint32_t element_stride = input_calc.strides_[0][0] / sizeof(scalar_t); + bool is_contiguous = (input_calc.dims == 1 && element_stride == 1); + if (is_contiguous) { + return thread_reduce_impl(data, [](uint32_t idx) { return idx; }); + } else if (input_calc.dims == 1) { + return thread_reduce_impl(data, [&](uint32_t idx) { return idx * element_stride; }); + } else { + return thread_reduce_impl(data, [&](uint32_t idx) { return input_calc.get(idx)[0] / sizeof(scalar_t); }); + } + } + } + + C10_DEVICE arg_t input_vectorized_thread_reduce_impl(const scalar_t* data) const { + uint32_t end = config.num_inputs; + + // Handle the head of input slice where data is not aligned + arg_t value = ident; + constexpr int align_bytes = alignof(aligned_vector); + constexpr int align_elements = align_bytes / sizeof(scalar_t); + int shift = ((int64_t)data) % align_bytes / sizeof(scalar_t); + if (shift > 0) { + data -= shift; + end += shift; + if(threadIdx.x >= shift && threadIdx.x < align_elements && config.should_reduce_tail()){ + value = reducer::reduce(value, data[threadIdx.x], threadIdx.x - shift); + } + end -= align_elements; + data += align_elements; + shift = align_elements - shift; + } + + // Do the vectorized reduction + using load_t = aligned_vector; + + uint32_t idx = config.input_idx(); + const uint32_t stride = config.step_input; + + // Multiple accumulators to remove dependency between unrolled loops. + arg_t value_list[input_vec_size]; + value_list[0] = value; + + #pragma unroll + for (int i = 1; i < input_vec_size; i++) { + value_list[i] = ident; + } + + scalar_t values[input_vec_size]; + + load_t *values_vector = reinterpret_cast(&values[0]); + + while (idx * input_vec_size + input_vec_size - 1 < end) { + *values_vector = reinterpret_cast(data)[idx]; + #pragma unroll + for (uint32_t i = 0; i < input_vec_size; i++) { + value_list[i] = reducer::reduce(value_list[i], values[i], shift + idx * input_vec_size + i); + } + idx += stride; + } + + // tail + uint32_t tail_start = end - end % input_vec_size; + if (config.should_reduce_tail()) { + int idx = tail_start + threadIdx.x; + if (idx < end) { + value_list[0] = reducer::reduce(value_list[0], data[idx], idx + shift); + } + } + + // combine accumulators + #pragma unroll + for (int i = 1; i < input_vec_size; i++) { + value_list[0] = reducer::combine(value_list[0], value_list[i]); + } + return value_list[0]; + } + + template + C10_DEVICE Array thread_reduce_impl(const scalar_t* data_, offset_calc_t calc) const { + uint32_t idx = config.input_idx(); + const uint32_t end = config.num_inputs; + const uint32_t stride = config.step_input; + const int vt0=${vt0}; + + using arg_vec_t = Array; + using load_t = aligned_vector; + const load_t* data = reinterpret_cast(data_); + + // Multiple accumulators to remove dependency between unrolled loops. 
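// [Illustrative note, not part of the diff] keeping vt0 independent partial
// accumulators breaks the serial dependence of a single
//   acc = combine(acc, x)
// chain: the vt0 loads and combines issued per main-loop iteration do not have
// to wait on one another, and the partials are only merged once in the
// "combine accumulators" epilogue below. With vt0 = 4 and stride = step_input,
// accumulator i sees elements idx + i*stride, idx + i*stride + 4*stride, ...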
+ arg_vec_t value_list[vt0]; + + #pragma unroll + for (int i = 0; i < vt0; i++) { + #pragma unroll + for (int j = 0; j < output_vec_size; j++) { + value_list[i][j] = ident; + } + } + + load_t values[vt0]; + + while (idx + (vt0 - 1) * stride < end) { + #pragma unroll + for (uint32_t i = 0; i < vt0; i++) { + values[i] = data[calc(idx + i * stride) / output_vec_size]; + } + #pragma unroll + for (uint32_t i = 0; i < vt0; i++) { + #pragma unroll + for (uint32_t j = 0; j < output_vec_size; j++) { + value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx + i * stride); + } + } + idx += stride * vt0; + } + + // tail + int idx_ = idx; + #pragma unroll + for (uint32_t i = 0; i < vt0; i++) { + if (idx >= end) { + break; + } + values[i] = data[calc(idx) / output_vec_size]; + idx += stride; + } + idx = idx_; + #pragma unroll + for (uint32_t i = 0; i < vt0; i++) { + if (idx >= end) { + break; + } + #pragma unroll + for (uint32_t j = 0; j < output_vec_size; j++) { + value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx); + } + idx += stride; + } + + // combine accumulators + #pragma unroll + for (int i = 1; i < vt0; i++) { + #pragma unroll + for (uint32_t j = 0; j < output_vec_size; j++) { + value_list[0][j] = reducer::combine(value_list[0][j], value_list[i][j]); + } + } + return value_list[0]; + } + template + C10_DEVICE Array block_x_reduce(Array value, char* shared_memory) const { + using args_vec_t = Array; + int dim_x = blockDim.x; + args_vec_t* shared = (args_vec_t*)shared_memory; + if (dim_x > warpSize) { + int address_base = threadIdx.x + threadIdx.y*blockDim.x; + shared[address_base] = value; + for (int offset = dim_x/2; offset >= warpSize; offset >>= 1) { + __syncthreads(); + if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x) { + args_vec_t other = shared[address_base + offset]; + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::combine(value[i], other[i]); + } + shared[address_base] = value; + } + } + dim_x = warpSize; + } + + __syncthreads(); + + for (int offset = 1; offset < dim_x; offset <<= 1) { + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + arg_t other = reducer::warp_shfl_down(value[i], offset); + value[i] = reducer::combine(value[i], other); + } + } + return value; + } + + template + C10_DEVICE Array block_y_reduce(Array value, char* shared_memory) const { + using args_vec_t = Array; + args_vec_t* shared = (args_vec_t*)shared_memory; + shared[config.shared_memory_offset(0)] = value; + for (int offset = blockDim.y / 2; offset > 0; offset >>= 1) { + __syncthreads(); + if (threadIdx.y < offset && threadIdx.y + offset < blockDim.y) { + args_vec_t other = shared[config.shared_memory_offset(offset)]; + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::combine(value[i], other[i]); + } + shared[config.shared_memory_offset(0)] = value; + } + } + return value; + } + )ESCAPE"; + + const std::string reduction_template_1 = R"ESCAPE( + + C10_DEVICE bool mark_block_finished() const { + __shared__ bool is_last_block_done_shared; + + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0) { + int prev_blocks_finished = atomicAdd(&semaphores[blockIdx.x], 1); + is_last_block_done_shared = (prev_blocks_finished == gridDim.y - 1); + } + + __syncthreads(); + + return is_last_block_done_shared; + } + + template + C10_DEVICE Array accumulate_in_output( + Array out, + Array value + ) const { + Array ret; + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + 
ret[i] = reducer::combine(*(out[i]), value[i]); + } + return ret; + } + + + C10_DEVICE out_scalar_t get_accumulated_output( + out_scalar_t* out, arg_t value + ) const { + assert(!final_output); + return (out_scalar_t)value; + } + + template + C10_DEVICE void set_results(const T x, const uint32_t base_offset) const { + assert(noutputs == 1); + auto res = (out_scalar_t*)((char*)dst[0] + base_offset); + *res = x; + } + +//TODO - multi-output reduction - we won't be able to use thrust::pair +//just explicitly specify typed output reads/writes +//Currently implemented for max of two outputs +// template +// C10_DEVICE void set_results(const thrust::pair x, const index_t base_offset) const { +// if (noutputs >= 1) { +// auto res0 = (T1*)((char*)dst[0] + base_offset); +// *res0 = x.first; +// } +// if (noutputs >= 2) { +// // base offset is computed assuming element size being sizeof(T1), so we need to make a +// // correction to obtain the correct base offset +// auto res1 = (T2*) ((char *) dst[1] + base_offset / sizeof(T1) * sizeof(T2)); +// *res1 = x.second; +// } +// } + + template + C10_DEVICE void set_results_to_output(Array value, Array base_offset) const { + assert(final_output); + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + set_results(reducer::project(value[i]), base_offset[i]); + } + } + + template + C10_DEVICE Array global_reduce(Array value, Array *acc, char* shared_memory) const { + using arg_vec_t = Array; + using out_ptr_vec_t = Array; + using offset_vec_t = Array; + + arg_vec_t* reduce_buffer = (arg_vec_t*)cta_buf; + uint32_t output_idx = config.output_idx(); + offset_vec_t base_offsets; + out_ptr_vec_t out; + + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + base_offsets[i] = output_calc.get(output_idx + i)[0]; + out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]); + } + + bool should_store = config.should_store(output_idx); + if (should_store) { + uint32_t offset = config.staging_memory_offset(blockIdx.y); + reduce_buffer[offset] = value; + } + + __threadfence(); // make sure writes are globally visible + __syncthreads(); // if multiple warps in this block wrote to staging, make sure they're all done + bool is_last_block_done = mark_block_finished(); + + if (is_last_block_done) { + value = ident; + if (config.should_block_x_reduce()) { + uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x; + uint32_t step = blockDim.x * blockDim.y; + for (; input_offset < config.ctas_per_output; input_offset += step) { + uint32_t idx = config.staging_memory_offset(input_offset); + arg_vec_t next = reduce_buffer[idx]; + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::combine(value[i], next[i]); + } + } + } else { + uint32_t input_offset = threadIdx.y; + uint32_t step = blockDim.y; + for (; input_offset < config.ctas_per_output; input_offset += step) { + uint32_t idx = config.staging_memory_offset(input_offset); + arg_vec_t next = reduce_buffer[idx]; + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::combine(value[i], next[i]); + } + } + } + value = block_y_reduce(value, shared_memory); + if (config.should_block_x_reduce()) { + value = block_x_reduce(value, shared_memory); + } + if (should_store) { + if (accumulate) { + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::translate_idx(value[i], base_idx); + } + } + + if (acc == nullptr) { + if (accumulate) { + value = accumulate_in_output(out, value); + } + if (final_output) { + 
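// [Illustrative note, not part of the diff] global_reduce below implements a
// two-phase scheme for reductions split across gridDim.y blocks: every block
// first writes its partial result into cta_buf at
// staging_memory_offset(blockIdx.y) and bumps the per-output semaphore; the
// block that observes prev_blocks_finished == gridDim.y - 1 in
// mark_block_finished() is the last one, re-reduces all staged partials
// through block_y_reduce/block_x_reduce, and is the only block that writes the
// final (or accumulated) output.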
set_results_to_output(value, base_offsets); + } else { + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + *(out[i]) = get_accumulated_output(out[i], value[i]); + } + } + } else { + if (accumulate) { + #pragma unroll + for (int i = 0; i < output_vec_size; i++) { + value[i] = reducer::combine((*acc)[i], value[i]); + } + } + if (final_output) { + set_results_to_output(value, base_offsets); + } else { + *acc = value; + } + } + } + } + + return value; + } +}; + +extern "C" +__launch_bounds__(${max_threads_lb}, 4) +__global__ void reduction_${name}_kernel(ReduceJitOp r){ + r.run(); +} +)ESCAPE"; + +const std::string reduction_template = reduction_template_0 + reduction_template_1; + + +const std::string &get_reduction_template() { + return reduction_template; +} + +}} diff --git a/aten/src/ATen/native/cuda/thread_constants.h b/aten/src/ATen/native/cuda/thread_constants.h index 464c6fe9fe2e1d..651053d663e4c2 100644 --- a/aten/src/ATen/native/cuda/thread_constants.h +++ b/aten/src/ATen/native/cuda/thread_constants.h @@ -13,7 +13,7 @@ constexpr int num_threads() { return 256; } #else -constexpr int num_threads() { +constexpr uint32_t num_threads() { return C10_WARP_SIZE * 4; } #endif diff --git a/aten/src/ATen/native/cuda/vol2col.cuh b/aten/src/ATen/native/cuda/vol2col.cuh index 17459f382816c6..7ab719bc819ebf 100644 --- a/aten/src/ATen/native/cuda/vol2col.cuh +++ b/aten/src/ATen/native/cuda/vol2col.cuh @@ -1,9 +1,5 @@ #pragma once -#include -#include -#include - #include #include #include diff --git a/aten/src/ATen/native/cudnn/GridSampler.cpp b/aten/src/ATen/native/cudnn/GridSampler.cpp index 38bde06aa6cc0c..b22d25cbff977a 100644 --- a/aten/src/ATen/native/cudnn/GridSampler.cpp +++ b/aten/src/ATen/native/cudnn/GridSampler.cpp @@ -2,6 +2,7 @@ #include #include #include +#include #if !AT_CUDNN_ENABLED() @@ -67,6 +68,13 @@ void checkGridSize(CheckedFrom c, TensorArg grid, TensorArg input) Tensor cudnn_grid_sampler_forward( const Tensor& input_t, const Tensor& grid_t) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. + check_grid_sampler_common(input_t, grid_t); + TORCH_CHECK( + cond_cudnn_grid_sampler(input_t, grid_t), + "Invalid arguments to cudnn_grid_sampler_forward"); + auto input_contig = contiguousIfZeroInStrides(input_t); auto grid_contig = grid_t.contiguous(); TensorArg input{ input_contig, "input", 1 }, @@ -106,6 +114,13 @@ std::tuple cudnn_grid_sampler_backward( const Tensor& input_t, const Tensor& grid_t, const Tensor& grad_output_t) { + // See NOTE [ grid_sampler Native Functions ]. + // Add checks here in case this is called instead of grid_sampler. 
+ check_grid_sampler_common(input_t, grid_t); + TORCH_CHECK( + cond_cudnn_grid_sampler(input_t, grid_t), + "Invalid arguments to cudnn_grid_sampler_backward"); + auto input_contig = contiguousIfZeroInStrides(input_t); auto grid_contig = grid_t.contiguous(); auto grad_output_contig = contiguousIfZeroInStrides(grad_output_t); diff --git a/aten/src/ATen/native/cudnn/RNN.cpp b/aten/src/ATen/native/cudnn/RNN.cpp index a80fc4fe033595..29430b38e74ea4 100644 --- a/aten/src/ATen/native/cudnn/RNN.cpp +++ b/aten/src/ATen/native/cudnn/RNN.cpp @@ -753,19 +753,61 @@ namespace { } } - cudnnRNNAlgo_t get_algo(const RNNDescriptorParams& rnn, const TensorDescriptorListParams& tensors, const Tensor input) { + inline bool use_rnn_persist_small_h(const RNNDescriptorParams& rnn, + const TensorDescriptorListParams& tensors, + bool forward) { +#if CUDNN_VERSION >= 8201 // 8.2.1 + cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties(); + if (prop->major < 6) return false; + + if (forward) { + if (rnn.mode == CUDNN_RNN_RELU || rnn.mode == CUDNN_RNN_TANH) { + return rnn.hidden_size <= 384; + } + if (rnn.mode == CUDNN_LSTM || rnn.mode == CUDNN_GRU) { + return rnn.hidden_size <= 192; + } + } else /* backward */ { + if (rnn.mode == CUDNN_RNN_RELU || rnn.mode == CUDNN_RNN_TANH) { + return rnn.hidden_size <= 256; + } + if (rnn.mode == CUDNN_LSTM || rnn.mode == CUDNN_GRU) { + return rnn.hidden_size <= 128; + } + } + + return false; +#else + return false; +#endif + } + + cudnnRNNAlgo_t get_algo(const RNNDescriptorParams& rnn, const TensorDescriptorListParams& tensors, const Tensor input, bool forward) { // LSTM with projections only works with standard algorithm if (rnn.proj_size != 0) { return CUDNN_RNN_ALGO_STANDARD; } - if (getCudnnDataType(input) == CUDNN_DATA_HALF && - !tensors.is_input_packed()) { - if (use_persist_common_heuristics(rnn, tensors) && - use_persist_device_heuristics(rnn, tensors)) { - return CUDNN_RNN_ALGO_PERSIST_STATIC; + // Persistent algos typically don't work for packed inputs with sequence lengths that vary + // across batch elements, and will return CUDNN_STATUS_NOT_SUPPORTED if attempted. 
See + // https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions + if (!tensors.is_input_packed()) { + auto cudnnDataType = getCudnnDataType(input); +#if CUDNN_VERSION >= 8201 // 8.2.1 + if (cudnnDataType != CUDNN_DATA_DOUBLE) { + if (use_rnn_persist_small_h(rnn, tensors, forward)) { + return CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H; + } + } +#endif + if (cudnnDataType == CUDNN_DATA_HALF) { + if (use_persist_common_heuristics(rnn, tensors) && + use_persist_device_heuristics(rnn, tensors)) { + return CUDNN_RNN_ALGO_PERSIST_STATIC; + } } } + return CUDNN_RNN_ALGO_STANDARD; } @@ -970,7 +1012,7 @@ std::tuple _cudnn_rnn( auto y = output; auto handle = getCudnnHandle(); - cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input); + cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input, true); fn.rnn.set_algo(algo); RNNDescriptors descs(fn, handle, x, y, hx, cx); @@ -1131,7 +1173,7 @@ std::tuple _cudnn_rnn_backward_input( TORCH_CHECK(dhy.is_cuda() && dy.is_cuda() && (!dcy.defined() || dcy.is_cuda()), "Gradients aren't CUDA tensors"); - cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input); + cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input, false); fn.rnn.set_algo(algo); RNNDescriptors descs(fn, handle, x, y, hx, cx); @@ -1234,7 +1276,7 @@ std::vector _cudnn_rnn_backward_weight( const auto& y = output; auto dw = at::zeros(weight_buf.sizes(), weight_buf.options()); - cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input); + cudnnRNNAlgo_t algo = get_algo(fn.rnn, fn.tensors, input, false); fn.rnn.set_algo(algo); RNNDescriptors descs(fn, handle, x, y, hx, cx); diff --git a/aten/src/ATen/native/group_norm.cpp b/aten/src/ATen/native/group_norm.cpp index 5533780a4547e1..db1d82f84fef03 100644 --- a/aten/src/ATen/native/group_norm.cpp +++ b/aten/src/ATen/native/group_norm.cpp @@ -16,6 +16,39 @@ namespace at { namespace native { +void check_group_norm_inputs( + const Tensor& input, + const Tensor& weight, + const Tensor& bias, + int64_t C, + int64_t num_groups) { + TORCH_CHECK( + num_groups > 0, + "Expected num groups to be greater than 0, got ", num_groups); + TORCH_CHECK( + C % num_groups == 0, + "Expected number of channels in input to be divisible by ", + "num_groups, but got input of shape ", + input.sizes(), + " and " + "num_groups=", + num_groups); + TORCH_CHECK( + !weight.defined() || (weight.dim() == 1 && weight.numel() == C), + "Expected weight to be a vector of size equal to the number of ", + "channels in input, but got weight of shape ", + weight.sizes(), + " and input of shape ", + input.sizes()); + TORCH_CHECK( + !bias.defined() || (bias.dim() == 1 && bias.numel() == C), + "Expected bias to be a vector of size equal to the number of ", + "channels in input, but got bias of shape ", + weight.sizes(), + " and input of shape ", + input.sizes()); +} + std::tuple native_group_norm( const Tensor& X, const c10::optional& gamma_opt /* optional */, @@ -31,6 +64,9 @@ std::tuple native_group_norm( const Tensor& gamma = *gamma_maybe_owned; const Tensor& beta = c10::value_or_else(beta_opt, [] { return Tensor(); }); + // repeated check so expanded weights can call native_group_norm directly but + // save mean and variance from forward + check_group_norm_inputs(X, gamma, beta, C, group); auto memory_format = X.device().is_cpu() ? 
X.suggest_memory_format() : at::MemoryFormat::Contiguous; @@ -128,28 +164,7 @@ Tensor group_norm( const int64_t N = input.size(0); const int64_t C = input.size(1); - TORCH_CHECK( - C % num_groups == 0, - "Expected number of channels in input to be divisible by ", - "num_groups, but got input of shape ", - input.sizes(), - " and " - "num_groups=", - num_groups); - TORCH_CHECK( - !weight.defined() || (weight.dim() == 1 && weight.numel() == C), - "Expected weight to be a vector of size equal to the number of ", - "channels in input, but got weight of shape ", - weight.sizes(), - " and input of shape ", - input.sizes()); - TORCH_CHECK( - !bias.defined() || (bias.dim() == 1 && bias.numel() == C), - "Expected bias to be a vector of size equal to the number of ", - "channels in input, but got bias of shape ", - weight.sizes(), - " and input of shape ", - input.sizes()); + check_group_norm_inputs(input, weight, bias, C, num_groups); const auto input_shape = input.sizes(); const int64_t HxW = diff --git a/aten/src/ATen/native/layer_norm.cpp b/aten/src/ATen/native/layer_norm.cpp index c6b9b6d5c26ab1..fc5a37bc03ae0b 100644 --- a/aten/src/ATen/native/layer_norm.cpp +++ b/aten/src/ATen/native/layer_norm.cpp @@ -18,7 +18,7 @@ namespace at { namespace native { -void layer_norm_cpu_out( +void layer_norm_with_mean_rstd_out( at::Tensor& out, at::Tensor& mean, at::Tensor& rstd, @@ -50,6 +50,20 @@ void layer_norm_cpu_out( rstd = rstd.view(stat_shape); } +void layer_norm_cpu_out( + at::Tensor& out, + const at::Tensor& input, + const Tensor& gamma, + const Tensor& beta, + double eps, + int64_t M, + int64_t N) { + if (M <= 0) { + return; + } + LayerNormKernel(kCPU, input, gamma, beta, M, N, eps, &out, /*mean=*/nullptr, /*rstd=*/nullptr); +} + std::tuple layer_norm_cpu( const Tensor& input, IntArrayRef normalized_shape, const c10::optional& weight_opt /* optional */, const c10::optional& bias_opt /* optional */, @@ -78,7 +92,7 @@ std::tuple layer_norm_cpu( Tensor mean = at::empty({M}, X->options()); Tensor rstd = at::empty({M}, X->options()); - layer_norm_cpu_out(Y, mean, rstd, *X, normalized_shape, *gamma, *beta, eps, M, N); + layer_norm_with_mean_rstd_out(Y, mean, rstd, *X, normalized_shape, *gamma, *beta, eps, M, N); return std::make_tuple(std::move(Y), std::move(mean), std::move(rstd)); } diff --git a/aten/src/ATen/native/layer_norm.h b/aten/src/ATen/native/layer_norm.h index e1bf789dcd81d5..629bc9ab3906b9 100644 --- a/aten/src/ATen/native/layer_norm.h +++ b/aten/src/ATen/native/layer_norm.h @@ -65,10 +65,7 @@ C10_ALWAYS_INLINE std::pair _check_layer_norm_inputs( void layer_norm_cpu_out( at::Tensor& out, - at::Tensor& mean, - at::Tensor& rstd, const at::Tensor& input, - IntArrayRef normalized_shape, const Tensor& gamma, const Tensor& beta, double eps, diff --git a/aten/src/ATen/native/metal/ops/MetalUpsamplingNearest.mm b/aten/src/ATen/native/metal/ops/MetalUpsamplingNearest.mm index 300cddba006a40..39524569bae5fa 100644 --- a/aten/src/ATen/native/metal/ops/MetalUpsamplingNearest.mm +++ b/aten/src/ATen/native/metal/ops/MetalUpsamplingNearest.mm @@ -17,7 +17,7 @@ Tensor upsample_nearest2d_vec( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { TORCH_CHECK(input.is_metal()); auto osize = diff --git a/aten/src/ATen/native/mkl/SparseBlasImpl.cpp b/aten/src/ATen/native/mkl/SparseBlasImpl.cpp index 3485dc1c5fb21d..3d49554ce29a63 100644 --- a/aten/src/ATen/native/mkl/SparseBlasImpl.cpp +++ b/aten/src/ATen/native/mkl/SparseBlasImpl.cpp @@ -340,18 
+340,21 @@ void addmm_out_sparse_csr( const Scalar& alpha, const Tensor& result) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(mat1.dim() == 2 && mat2.dim() == 2 && result.dim() == 2); - if (mat2.layout() == kStrided && result.layout() == kStrided) { + if (mat1.is_sparse_csr() && mat2.layout() == kStrided && result.layout() == kStrided) { return addmm_dense_result(mat1, mat2, beta, alpha, result); - } else if ( - mat1.is_sparse_csr() && mat2.is_sparse_csr() && - result.layout() == kStrided) { + } + if (mat1.layout() == kStrided && mat2.is_sparse_csr() && result.layout() == kStrided) { + // TODO: We can use MKL's transposition flags once we have CSC support. + return addmm_dense_result(mat2.transpose(0, 1), mat1.transpose(0, 1), beta, alpha, result.transpose(0, 1)); + } + if (mat1.is_sparse_csr() && mat2.is_sparse_csr() && result.layout() == kStrided) { return addmm_sparse_input_dense_result(mat1, mat2, beta, alpha, result); - } else if (mat2.is_sparse_csr() && result.is_sparse_csr()) { + } + if (mat1.is_sparse_csr() && mat2.is_sparse_csr() && result.is_sparse_csr()) { return addmm_sparse_result(mat1, mat2, beta, alpha, result); - } else { - TORCH_CHECK(false, "addmm: computation on CPU is not implemented for ", - result.layout(), " + ", mat1.layout(), " @ ", mat2.layout()); } + TORCH_CHECK(false, "addmm: computation on CPU is not implemented for ", + result.layout(), " + ", mat1.layout(), " @ ", mat2.layout()); } /* diff --git a/aten/src/ATen/native/mkldnn/Conv.cpp b/aten/src/ATen/native/mkldnn/Conv.cpp index fb41dcdd6215dc..50b366e6ee51bd 100644 --- a/aten/src/ATen/native/mkldnn/Conv.cpp +++ b/aten/src/ATen/native/mkldnn/Conv.cpp @@ -199,9 +199,9 @@ std::tuple mkldnn_convolution_backward_weights( mkldnn_to_dense(new_with_itensor_mkldnn(std::move(mkldnn_grad_weight), optTypeMetaToScalarType(grad_output.options().dtype_opt()), grad_output.options().device_opt())), - mkldnn_to_dense(new_with_itensor_mkldnn(std::move(mkldnn_grad_bias), + bias_defined ? mkldnn_to_dense(new_with_itensor_mkldnn(std::move(mkldnn_grad_bias), optTypeMetaToScalarType(grad_output.options().dtype_opt()), - grad_output.options().device_opt()))); + grad_output.options().device_opt())) : Tensor()); } std::tuple mkldnn_convolution_backward( diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index baad10f9d26e62..3ef3291274a405 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -1878,6 +1878,7 @@ MkldnnCPU: empty_mkldnn SparseCPU, SparseCUDA: empty_sparse SparseCsrCPU, SparseCsrCUDA: empty_sparse_csr + QuantizedCPU, QuantizedCUDA: empty_unknown_quantized # We do not make new_empty a composite that calls into new_empty_strided, as the strided version # is significantly more difficult to implement by different backends @@ -1949,6 +1950,7 @@ CPU: empty_strided_cpu CUDA: empty_strided_cuda Meta: empty_strided_meta + QuantizedCPU, QuantizedCUDA: empty_strided_unknown_quantized - func: erf(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -2223,10 +2225,12 @@ variants: function, method # NOTE [ grid_sampler Native Functions ] -# `grid_sampler` does all the shape checking and then dispatches to one of -# `cudnn_grid_sampler`, `grid_sampler_2d`, or `grid_sampler_3d`, each of which -# has the corresponding backward defined as native functions as well. Therefore, -# in these functions and their backwards, no more shape checking is done. 
+# `grid_sampler` is _supposed to_ do all the shape checking and then dispatch to +# one of `cudnn_grid_sampler`, `grid_sampler_2d`, or `grid_sampler_3d`, each of +# which has the corresponding backward defined as native functions as well. +# However, we do shape checking everywhere for now since each of the mentioned +# functions can be called directly, which will lead to crashes otherwise. +# See https://github.com/pytorch/pytorch/issues/73187 for more information. # # There is also _grid_sampler_2d_backward_cpu_fallback which is an # implementation detail of grid_sampler_2d and is only exposed here for testing @@ -3086,10 +3090,10 @@ - func: amin(Tensor self, int[1] dim=[], bool keepdim=False) -> Tensor variants: function, method - dispatch: - CompositeExplicitAutograd: amin + structured_delegate: amin.out - func: amin.out(Tensor self, int[1] dim=[], bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) + structured: True dispatch: CPU, CUDA: amin_out @@ -3173,6 +3177,7 @@ variants: function, method dispatch: SparseCPU, SparseCUDA: mul_sparse + SparseCsrCPU, SparseCsrCUDA: mul_sparse_csr MkldnnCPU: mkldnn_mul ZeroTensor: mul_zerotensor @@ -3182,6 +3187,7 @@ variants: method dispatch: SparseCPU, SparseCUDA: mul_sparse_ + SparseCsrCPU, SparseCsrCUDA: mul_sparse_csr_ MkldnnCPU: mkldnn_mul_ - func: mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) @@ -3192,6 +3198,7 @@ CPU, CUDA: mul_out SparseCPU: mul_out_sparse_cpu SparseCUDA: mul_out_sparse_cuda + SparseCsrCPU, SparseCsrCUDA: mul_out_sparse_csr MkldnnCPU: mkldnn_mul_out # For C++ only, until we have conversion from C++ numbers to Tensor @@ -3206,6 +3213,7 @@ variants: method dispatch: CompositeExplicitAutograd: mul_ + SparseCsrCPU, SparseCsrCUDA: mul__scalar_sparse_csr # multiply, alias for mul - func: multiply.Tensor(Tensor self, Tensor other) -> Tensor @@ -3255,6 +3263,11 @@ SparseCPU, SparseCUDA: narrow_copy_sparse CompositeExplicitAutograd: narrow_copy_dense +- func: narrow_copy.SymInt(Tensor self, int dim, int start, SymInt length) -> Tensor + variants: function, method + dispatch: + CompositeExplicitAutograd: narrow_copy_symint + - func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!) dispatch: CPU: narrow_copy_dense_cpu_out @@ -3710,6 +3723,7 @@ CPU, CUDA: relu MkldnnCPU: mkldnn_relu QuantizedCPU: relu_quantized_cpu + NestedTensor: NestedTensor_relu - func: relu_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -3718,6 +3732,7 @@ CPU, CUDA: relu_ MkldnnCPU: mkldnn_relu_ QuantizedCPU: relu_quantized_cpu_ + NestedTensor: NestedTensor_relu_ - func: relu6(Tensor self) -> Tensor python_module: nn @@ -3746,6 +3761,13 @@ CPU: gelu_out_cpu CUDA: gelu_out_cuda +- func: gelu_(Tensor(a!) self, *, str approximate='none') -> Tensor(a!) + structured_delegate: gelu.out + device_check: NoCheck # TensorIterator + python_module: nn + dispatch: + NestedTensor: NestedTensor_gelu_ + - func: gelu(Tensor self, *, str approximate='none') -> Tensor structured_delegate: gelu.out device_check: NoCheck # TensorIterator @@ -3753,6 +3775,7 @@ dispatch: MkldnnCPU: mkldnn_gelu QuantizedCPU: gelu_quantized_cpu + NestedTensor: NestedTensor_gelu - func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!) 
structured: True @@ -4125,6 +4148,10 @@ dispatch: CompositeExplicitAutograd: split +- func: split.sizes(Tensor(a -> *) self, int[] split_size, int dim=0) -> Tensor(a)[] + variants: function, method + device_guard: False + - func: unsafe_split_with_sizes(Tensor self, int[] split_sizes, int dim=0) -> Tensor[] variants: function, method device_check: NoCheck @@ -4162,7 +4189,7 @@ device_check: NoCheck device_guard: False dispatch: - CPU, CUDA: squeeze + CompositeExplicitAutograd: squeeze QuantizedCPU, QuantizedCUDA: squeeze_quantized - func: squeeze.dim(Tensor(a) self, int dim) -> Tensor(a) @@ -4170,7 +4197,7 @@ device_check: NoCheck device_guard: False dispatch: - CPU, CUDA: squeeze + CompositeExplicitAutograd: squeeze QuantizedCPU, QuantizedCUDA: squeeze_quantized - func: squeeze.dimname(Tensor(a) self, Dimname dim) -> Tensor(a) @@ -4240,12 +4267,13 @@ - func: dstack.out(Tensor[] tensors, *, Tensor(a!) out) -> Tensor(a!) -# The signature is designed to be consistent with librosa except that it is -# missing the `pad_mode` and `center` arguments, which are taken care of at -# `torch.functional.py`. They shall be moved here once we have mapping between -# Python strings and C++ Enum in codegen. +# Overload without center & pad mode, needed for forward-compatibility - func: stft(Tensor self, int n_fft, int? hop_length=None, int? win_length=None, Tensor? window=None, bool normalized=False, bool? onesided=None, bool? return_complex=None) -> Tensor variants: function, method + cpp_no_default_args: ['hop_length', 'win_length', 'window', 'normalized'] + +- func: stft.center(Tensor self, int n_fft, int? hop_length=None, int? win_length=None, Tensor? window=None, bool center=True, str pad_mode="reflect", bool normalized=False, bool? onesided=None, bool? return_complex=None) -> Tensor + variants: function, method - func: istft(Tensor self, int n_fft, int? hop_length=None, int? win_length=None, Tensor? window=None, bool center=True, bool normalized=False, bool? onesided=None, int? length=None, bool return_complex=False) -> Tensor variants: function, method @@ -4266,6 +4294,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: sum + SparseCsrCPU, SparseCsrCUDA: sum_csr - func: sum.dim_IntList(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: sum.IntList_out @@ -4694,7 +4723,7 @@ device_check: NoCheck device_guard: False dispatch: - CPU, CUDA: unsqueeze + CompositeExplicitAutograd: unsqueeze SparseCPU, SparseCUDA: unsqueeze_sparse QuantizedCPU, QuantizedCUDA: unsqueeze_quantized @@ -4772,12 +4801,16 @@ device_check: NoCheck device_guard: False -# we define both of these because 'where' does the broadcast and '_s_where' doesn't; -# this allows us to implicitly calculate the broadcast derivative, while only dealing with the -# _s_where derivative. - func: where.self(Tensor condition, Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator variants: function, method + dispatch: + CPU, CUDA: where + +- func: where.self_out(Tensor condition, Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) 
+ device_check: NoCheck # TensorIterator + dispatch: + CPU, CUDA: where_self_out - func: where.ScalarSelf(Tensor condition, Scalar self, Tensor other) -> Tensor variants: function @@ -4792,11 +4825,6 @@ device_check: NoCheck # TensorIterator variants: function -- func: _s_where(Tensor condition, Tensor self, Tensor other) -> Tensor - variants: function - dispatch: - CPU, CUDA: _s_where - - func: norm_except_dim(Tensor v, int pow=2, int dim=0) -> Tensor variants: function @@ -4895,6 +4923,11 @@ SparseCPU: _sparse_sum_backward_cpu SparseCUDA: _sparse_sum_backward_cuda +- func: _sparse_csr_sum.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor + dispatch: + SparseCsrCPU: _sparse_csr_sum_cpu + SparseCsrCUDA: _sparse_csr_sum_cuda + - func: _sparse_softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor python_module: sparse variants: function @@ -5036,7 +5069,7 @@ - func: resize_as_sparse_(Tensor(a!) self, Tensor the_template) -> Tensor(a!) use_const_ref_for_mutable_tensors: True - variants: function + variants: function, method dispatch: SparseCPU, SparseCUDA: resize_as_sparse_ SparseCsrCPU, SparseCsrCUDA: resize_as_sparse_csr_ @@ -5176,6 +5209,16 @@ SparseCPU: s_addmm_sparse_dense_cpu_ SparseCUDA: s_addmm_sparse_dense_cuda_ +- func: _addmm_activation.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False, Tensor(a!) out) -> Tensor(a!) + structured: True + dispatch: + CPU: addmm_activation_out_cpu + CUDA: addmm_activation_out_cuda + +- func: _addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor + structured_delegate: _addmm_activation.out + variants: function, method + # NOTE [ Sparse: autograd and API ] # # @@ -5336,8 +5379,13 @@ - func: to_dense(Tensor self, ScalarType? dtype=None) -> Tensor variants: method + +# Special case of to_dense with custom derivative +- func: _to_dense(Tensor self, ScalarType? dtype=None) -> Tensor + variants: method dispatch: - SparseCPU, SparseCUDA, SparseCsrCPU, SparseCsrCUDA: sparse_to_dense + SparseCPU, SparseCUDA: sparse_to_dense + SparseCsrCPU, SparseCsrCUDA: sparse_csr_to_dense MkldnnCPU: mkldnn_to_dense - func: to_dense_backward(Tensor grad, Tensor input) -> Tensor @@ -5490,6 +5538,13 @@ CPU, CUDA: dense_to_sparse SparseCsrCPU, SparseCsrCUDA: sparse_csr_to_sparse +- func: to_sparse_csr(Tensor self) -> Tensor + variants: method + dispatch: + CPU, CUDA: dense_to_sparse_csr + SparseCPU, SparseCUDA: coo_to_sparse_csr + SparseCsrCPU, SparseCsrCUDA: csr_to_sparse_csr + - func: to_mkldnn(Tensor self, ScalarType? dtype=None) -> Tensor variants: method dispatch: @@ -5824,14 +5879,14 @@ device_check: NoCheck device_guard: False dispatch: - CPU, CUDA: set_ + CPU, CUDA, Meta: set_ - func: set_.source_Storage_storage_offset(Tensor(a!) self, Storage source, int storage_offset, int[] size, int[] stride=[]) -> Tensor(a!) variants: method device_check: NoCheck device_guard: False dispatch: - CPU: set_storage_cpu_ + CPU, Meta: set_storage_cpu_ CUDA: set_storage_cuda_ QuantizedCPU, QuantizedCUDA: set_storage_quantized_ @@ -5840,13 +5895,14 @@ device_check: NoCheck device_guard: False dispatch: - CPU, CUDA: set_tensor_ + CPU, CUDA, Meta: set_tensor_ - func: set_(Tensor(a!) self) -> Tensor(a!) 
variants: method dispatch: CPU: set_cpu_ CUDA: set_cuda_ + Meta: set_meta_ - func: is_set_to(Tensor self, Tensor tensor) -> bool variants: method @@ -6066,10 +6122,19 @@ - func: scatter_add.dimname(Tensor self, Dimname dim, Tensor index, Tensor src) -> Tensor variants: function, method -- func: scatter_reduce.two(Tensor self, int dim, Tensor index, str reduce, *, int? output_size=None) -> Tensor +- func: scatter_reduce.two(Tensor self, int dim, Tensor index, Tensor src, str reduce, *, bool include_self=True) -> Tensor + structured_delegate: scatter_reduce.two_out variants: function, method + +- func: scatter_reduce_.two(Tensor(a!) self, int dim, Tensor index, Tensor src, str reduce, *, bool include_self=True) -> Tensor(a!) + structured_delegate: scatter_reduce.two_out + variants: method + +- func: scatter_reduce.two_out(Tensor self, int dim, Tensor index, Tensor src, str reduce, *, bool include_self=True, Tensor(a!) out) -> Tensor(a!) + structured: True + variants: function dispatch: - CPU: scatter_reduce_two_cpu + CPU, CUDA: scatter_reduce_two - func: eq_.Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) structured_delegate: eq.Scalar_out @@ -6276,25 +6341,25 @@ device_check: NoCheck # TensorIterator variants: method, function dispatch: - CPU, CUDA: bitwise_left_shift + CompositeExplicitAutograd: bitwise_left_shift - func: bitwise_left_shift_.Tensor_Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method dispatch: - CPU, CUDA: bitwise_left_shift_ + CompositeExplicitAutograd: bitwise_left_shift_ - func: bitwise_left_shift.Tensor_Scalar_out(Tensor self, Scalar other, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: function dispatch: - CPU, CUDA: bitwise_left_shift_out + CompositeExplicitAutograd: bitwise_left_shift_out - func: bitwise_left_shift.Scalar_Tensor(Scalar self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator variants: function dispatch: - CPU, CUDA: bitwise_left_shift + CompositeExplicitAutograd: bitwise_left_shift - func: __rshift__.Scalar(Tensor self, Scalar other) -> Tensor device_check: NoCheck # TensorIterator @@ -6341,25 +6406,25 @@ device_check: NoCheck # TensorIterator variants: method, function dispatch: - CPU, CUDA: bitwise_right_shift + CompositeExplicitAutograd: bitwise_right_shift - func: bitwise_right_shift_.Tensor_Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method dispatch: - CPU, CUDA: bitwise_right_shift_ + CompositeExplicitAutograd: bitwise_right_shift_ - func: bitwise_right_shift.Tensor_Scalar_out(Tensor self, Scalar other, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: function dispatch: - CPU, CUDA: bitwise_right_shift_out + CompositeExplicitAutograd: bitwise_right_shift_out - func: bitwise_right_shift.Scalar_Tensor(Scalar self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator variants: function dispatch: - CPU, CUDA: bitwise_right_shift + CompositeExplicitAutograd: bitwise_right_shift - func: tril_(Tensor(a!) self, int diagonal=0) -> Tensor(a!) structured_delegate: tril.out @@ -7011,7 +7076,7 @@ - func: linalg_solve_triangular(Tensor self, Tensor B, *, bool upper, bool left=True, bool unitriangular=False) -> Tensor python_module: linalg - variants: method, function + variants: function dispatch: CPU, CUDA: linalg_solve_triangular @@ -7404,6 +7469,12 @@ dispatch: CPU: histogramdd_cpu +- func: histogramdd(Tensor self, int[] bins, float[]? 
range=None, Tensor? weight=None, bool density=False) -> (Tensor hist, Tensor[] bin_edges) + +- func: histogramdd.int_bins(Tensor self, int bins, float[]? range=None, Tensor? weight=None, bool density=False) -> (Tensor hist, Tensor[] bin_edges) + +- func: histogramdd.TensorList_bins(Tensor self, Tensor[] bins, float[]? range=None, Tensor? weight=None, bool density=False) -> (Tensor hist, Tensor[] bin_edges) + - func: fmod.Scalar_out(Tensor self, Scalar other, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator dispatch: @@ -8594,6 +8665,9 @@ CPU: _convert_indices_from_csr_to_coo_structured_cpu CUDA: _convert_indices_from_csr_to_coo_structured_cuda +- func: _csr_to_block_csr(Tensor self, int[2] block_size) -> Tensor + python_module: sparse + ## NN wrappers - func: mse_loss.out(Tensor self, Tensor target, int reduction=Mean, *, Tensor(a!) out) -> Tensor(a!) @@ -9421,14 +9495,13 @@ python_module: nn structured: True dispatch: - CPU, QuantizedCPU: reflection_pad1d_out_cpu + CPU: reflection_pad1d_out_cpu + QuantizedCPU: reflection_pad1d_out_quantized_cpu CUDA: reflection_pad1d_out_cuda - func: reflection_pad1d(Tensor self, int[2] padding) -> Tensor python_module: nn structured_delegate: reflection_pad1d.out - dispatch: - QuantizedCPU: reflection_pad1d_quantized_cpu - func: reflection_pad1d_backward.grad_input(Tensor grad_output, Tensor self, int[2] padding, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -9556,6 +9629,15 @@ CPU: replication_pad3d_backward_cpu CUDA: replication_pad3d_backward_cuda +- func: _pad_circular(Tensor self, int[] pad) -> Tensor + python_module: nn + +- func: _pad_enum(Tensor self, int[] pad, int mode, float? value=None) -> Tensor + python_module: nn + +- func: pad(Tensor self, int[] pad, str mode="constant", float? value=None) -> Tensor + python_module: nn + - func: upsample_linear1d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: @@ -10250,6 +10332,19 @@ dispatch: CPU, CUDA: special_ndtri_out +- func: special_log_ndtr(Tensor self) -> Tensor + structured_delegate: special_log_ndtr.out + python_module: special + variants: function + +- func: special_log_ndtr.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) + structured: True + structured_inherits: TensorIteratorBase + python_module: special + variants: function + dispatch: + CPU, CUDA: special_log_ndtr_out + - func: special_expm1(Tensor self) -> Tensor python_module: special variants: function @@ -10503,7 +10598,7 @@ - func: special_polygamma(int n, Tensor self) -> Tensor python_module: special - variants: function, method + variants: function - func: special_polygamma.out(int n, Tensor self, *, Tensor(a!) out) -> Tensor(a!) python_module: special @@ -11252,5 +11347,5 @@ variants: function python_module: nn -- func: _nested_tensor(Tensor[] list, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: nested_tensor(Tensor[] list, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor variants: function diff --git a/aten/src/ATen/native/nested/NestedTensorMath.cpp b/aten/src/ATen/native/nested/NestedTensorMath.cpp index d41243503275f8..83e2e5428b1517 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.cpp +++ b/aten/src/ATen/native/nested/NestedTensorMath.cpp @@ -8,6 +8,14 @@ namespace at { namespace native { +namespace { +template +Tensor map_nt(const Tensor& nt, Func f) { + auto* nt_impl = get_nested_tensor_impl(nt); + const auto& sizes = nt_impl->get_nested_size_tensor(); + return at::detail::make_tensor(f(nt_impl->get_buffer()), sizes); +} +} // namespace at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) { TORCH_CHECK(buffer.is_contiguous(), "Given buffer must be contiguous."); @@ -15,20 +23,6 @@ at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) { std::move(buffer), std::move(nested_size_tensor)); } -bool is_nested_tensor_impl(const at::Tensor& tensor) { - return tensor.unsafeGetTensorImpl()->key_set().has( - c10::DispatchKey::NestedTensor); -} - -inline at::native::NestedTensorImpl* get_nested_tensor_impl( - const at::Tensor& tensor) { - TORCH_CHECK( - is_nested_tensor_impl(tensor), - "get_nested_tensor_impl requires a NestedTensor."); - return static_cast( - tensor.unsafeGetTensorImpl()); -} - inline const at::Tensor& get_buffer(const at::Tensor& tensor) { return get_nested_tensor_impl(tensor)->get_buffer(); } @@ -69,11 +63,29 @@ std::vector NestedTensor_unbind( return result_tensors; } -/* - * This result of this function cannot be used by itself. The result needs to - * be wrapped in torch.nested.NestedTensor. - */ -Tensor _nested_tensor( +Tensor& NestedTensor_relu_(Tensor& self) { + at::relu_(const_cast(get_nested_tensor_impl(self)->get_buffer())); + return self; +} + +Tensor NestedTensor_relu(const Tensor& self) { + return map_nt(self, at::relu); +} + +Tensor& NestedTensor_gelu_(Tensor& self, c10::string_view approximate) { + at::gelu_(const_cast(get_nested_tensor_impl(self)->get_buffer()), approximate); + return self; +} + +Tensor NestedTensor_gelu(const Tensor& self, c10::string_view approximate) { + return map_nt( + self, + [approximate](const Tensor& buffer) { + return at::gelu(buffer, approximate); + }); +} + +Tensor nested_tensor( TensorList list, c10::optional dtype, c10::optional layout, diff --git a/aten/src/ATen/native/quantized/QTensor.cpp b/aten/src/ATen/native/quantized/QTensor.cpp index 5fefa3557f4b6c..6e858a3b5c2537 100644 --- a/aten/src/ATen/native/quantized/QTensor.cpp +++ b/aten/src/ATen/native/quantized/QTensor.cpp @@ -15,8 +15,11 @@ Tensor quantize_per_tensor_dynamic( const Tensor& self, ScalarType dtype, bool reduce_range) { - TORCH_CHECK( (dtype == ScalarType::QInt8 || dtype == ScalarType::QUInt8), "dtype ", dtype, "not supported"); + TORCH_CHECK( (dtype == ScalarType::QInt8 || dtype == ScalarType::QUInt8 || dtype == ScalarType::Half), "dtype ", dtype, "not supported"); auto input_contig = self.contiguous(); + if (dtype == ScalarType::Half) { + return input_contig.to(ScalarType::Half); + } float x_min = input_contig.min().item(); float x_max = input_contig.max().item(); diff --git a/aten/src/ATen/native/quantized/TensorFactories.cpp b/aten/src/ATen/native/quantized/TensorFactories.cpp index 08a972eacc3831..aa0fef5df9dc02 100644 --- a/aten/src/ATen/native/quantized/TensorFactories.cpp +++ b/aten/src/ATen/native/quantized/TensorFactories.cpp @@ -66,6 +66,40 @@ Tensor empty_per_channel_affine_quantized( quantizer); } +Tensor empty_unknown_quantized( + IntArrayRef 
size, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory, + c10::optional optional_memory_format) { + // See [Note: hacky wrapper removal for TensorOptions] + TensorOptions options_ = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); + + TORCH_CHECK( + !(options_.has_memory_format() && optional_memory_format.has_value()), + "Cannot set memory_format both in TensorOptions and explicit argument; please delete " + "the redundant setter."); + auto options = options_.merge_memory_format(optional_memory_format); + TORCH_CHECK( + options.has_dtype(), + "Must provide data type for Tensor creation functions."); + QuantizerPtr quantizer = make_unknown_quantizer(typeMetaToScalarType(options.dtype())); + return new_qtensor(size, options, quantizer); +} + +Tensor empty_strided_unknown_quantized( + IntArrayRef size, + IntArrayRef strided, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory) { + + TORCH_CHECK(false, "empty_strided not supported on quantized tensors yet see https://github.com/pytorch/pytorch/issues/74540") + +} + // Provide better error message if dtype is wrong Tensor empty_affine_quantized_other_backends_stub( IntArrayRef, diff --git a/aten/src/ATen/native/quantized/cpu/conv_packed_params.h b/aten/src/ATen/native/quantized/cpu/conv_packed_params.h deleted file mode 100644 index 130be6a0724dd5..00000000000000 --- a/aten/src/ATen/native/quantized/cpu/conv_packed_params.h +++ /dev/null @@ -1,28 +0,0 @@ -#pragma once - -#include -#include - -template -struct ConvPackedParamsBase : public torch::jit::CustomClassHolder { - virtual at::Tensor apply( - const at::Tensor& input, - double output_scale, - int64_t output_zero_point) = 0; - virtual at::Tensor apply_relu( - const at::Tensor& input, - double output_scale, - int64_t output_zero_point) = 0; - virtual at::Tensor apply_dynamic( - const at::Tensor& input, - bool reduce_range) = 0; - - virtual std::tuple> unpack() = 0; - - virtual torch::List stride() const = 0; - virtual torch::List padding() const = 0; - virtual torch::List output_padding() const = 0; - virtual torch::List dilation() const = 0; - virtual int64_t groups() const = 0; - virtual bool transpose() const = 0; -}; diff --git a/aten/src/ATen/native/quantized/cpu/conv_serialization.h b/aten/src/ATen/native/quantized/cpu/conv_serialization.h index cf5c04977b6a13..369f54b4396147 100644 --- a/aten/src/ATen/native/quantized/cpu/conv_serialization.h +++ b/aten/src/ATen/native/quantized/cpu/conv_serialization.h @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -358,6 +359,20 @@ c10::intrusive_ptr> deserialize_conv( ); } #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + return PackedConvWeightsOnednn::prepack( + weight.value(), + bias, + stride, + padding, + output_padding, + dilation, + groups, + transpose + ); + } +#endif // AT_MKLDNN_ENABLED() TORCH_CHECK( false, "Didn't find engine for when deserializing ConvPackedParams: ", diff --git a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp index ab6df06f7b73c3..0a8334b96f7071 100644 --- a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +++ b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp @@ -1,10 +1,10 @@ #include -#include +#include #include #include #include -#include #include +#include #include #include #include @@ -160,9 +160,10 @@ Tensor MakeStridedQTensorCPU( 
allocator->allocate(size_bytes), allocator, /* resizable = */ true); + constexpr auto quantized_cpu_ks = at::DispatchKeySet(at::DispatchKey::QuantizedCPU); auto tensor = detail::make_tensor( storage, - at::DispatchKeySet(at::DispatchKey::QuantizedCPU), + quantized_cpu_ks, dtype, quantizer); get_qtensorimpl(tensor)->set_sizes_and_strides(sizes, strides); @@ -471,6 +472,16 @@ int register_linear_params() { std::move(weight), std::move(bias)); } #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() + if (at::globalContext().qEngine() == at::QEngine::ONEDNN) { + TORCH_CHECK( + weight.scalar_type() == at::kQInt8, + "ONEDNN only supports INT8 bit width currently. Got ", + c10::toString(weight.scalar_type())); + return PackedLinearWeightsOnednn::prepack( + std::move(weight), std::move(bias)); + } +#endif // #if AT_MKLDNN_ENABLED() TORCH_CHECK(false, "Unknown qengine"); }) .def("bias", [](const c10::intrusive_ptr& self) { diff --git a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.h b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.h index 43768658af7e00..c98ef18ec85c60 100644 --- a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.h +++ b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.h @@ -1,9 +1,8 @@ #pragma once #include -#include +#include #include -#include #include #include diff --git a/aten/src/ATen/native/quantized/cpu/onednn_utils.h b/aten/src/ATen/native/quantized/cpu/onednn_utils.h new file mode 100644 index 00000000000000..4ee8e8737fb220 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/onednn_utils.h @@ -0,0 +1,151 @@ +#pragma once + +#include +#if AT_MKLDNN_ENABLED() +#include +#include +#include +#include + +struct PackedLinearWeightsOnednn : public LinearPackedParamsBase { + PackedLinearWeightsOnednn( + std::unique_ptr weight, + c10::optional bias, + at::Tensor orig_weight, + c10::optional orig_bias) + : weight_(std::move(weight)), + bias_(std::move(bias)), + orig_weight_(std::move(orig_weight)), + orig_bias_(std::move(orig_bias)) {} + std::unique_ptr weight_; + c10::optional bias_; + at::Tensor orig_weight_; + c10::optional orig_bias_; + + at::Tensor apply( + at::Tensor input, + double output_scale, + int64_t output_zero_point) override; + at::Tensor apply_relu( + at::Tensor input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_dynamic(at::Tensor input, bool reduce_range=false) override; + at::Tensor apply_dynamic_relu(at::Tensor input, bool reduce_range=false) override; + + std::tuple> unpack() override; + + c10::optional bias() override { + return orig_bias_; + } + + static c10::intrusive_ptr prepack( + at::Tensor weight, + c10::optional bias); + + private: + template + at::Tensor apply_impl( + at::Tensor input, + double output_scale, + int64_t output_zero_point); + + template + at::Tensor apply_dynamic_impl(at::Tensor input, bool reduce_range=false); +}; + +template +struct PackedConvWeightsOnednn : public ConvPackedParamsBase { + PackedConvWeightsOnednn( + std::unique_ptr weight, + c10::optional bias, + at::Tensor orig_weight, + c10::optional orig_bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + uint8_t transpose) + : weight_(std::move(weight)), + bias_(std::move(bias)), + orig_weight_(std::move(orig_weight)), + orig_bias_(std::move(orig_bias)), + stride_(std::move(stride)), + padding_(std::move(padding)), + output_padding_(std::move(output_padding)), + dilation_(std::move(dilation)), + groups_(groups), + transpose_(transpose) {} + + std::unique_ptr weight_; + 
c10::optional bias_; + at::Tensor orig_weight_; + c10::optional orig_bias_; + torch::List stride_; + torch::List padding_; + torch::List output_padding_; + torch::List dilation_; + int64_t groups_; + uint8_t transpose_; + + at::Tensor apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_dynamic( + const at::Tensor& input, + bool reduce_range) override; + + std::tuple> unpack() override; + + static c10::intrusive_ptr> prepack( + at::Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose); + + torch::List stride() const override { + return stride_; + } + + torch::List padding() const override { + return padding_; + } + + torch::List output_padding() const override { + return output_padding_; + } + + torch::List dilation() const override { + return dilation_; + } + + int64_t groups() const override { + return groups_; + } + + bool transpose() const override { + return (bool)transpose_; + } + + private: + template + at::Tensor apply_impl( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point); +}; + +#endif // #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/quantized/cpu/qadd.cpp b/aten/src/ATen/native/quantized/cpu/qadd.cpp index 6aaffff79a22cd..cbca3ba58ef7ef 100644 --- a/aten/src/ATen/native/quantized/cpu/qadd.cpp +++ b/aten/src/ATen/native/quantized/cpu/qadd.cpp @@ -7,10 +7,9 @@ #include #include #include +#include #include -#include - namespace at { namespace native { @@ -217,18 +216,170 @@ Tensor qnnpack_add(Tensor qa, Tensor qb, double scale, int64_t zero_point) { return qy; } -#endif +#endif // USE_PYTORCH_QNNPACK + +#ifdef USE_XNNPACK +C10_ALWAYS_INLINE +enum xnn_status xnnp_create_add_nd( + int8_t azp, + float ascale, + int8_t bzp, + float bscale, + int8_t czp, + float cscale, + int8_t output_min, + int8_t output_max, + uint32_t flags, + xnn_operator_t* op) { + return xnn_create_add_nd_qs8( + azp, /* int8_t input1_zero_point */ + ascale, /* float input1_scale */ + bzp, /* int8_t input2_zero_point */ + bscale, /* float input2_scale */ + czp, /* int8_t output_zero_point */ + cscale, /* float output_scale */ + output_min, /* int8_t output_min */ + output_max, /* int8_t output_max */ + flags, /* uint32_t flags */ + op); /* xnn_operator_t* add_op_out */ +} + +C10_ALWAYS_INLINE +enum xnn_status xnnp_setup_add_nd( + xnn_operator_t op, + const std::vector& a_shape, + const std::vector& b_shape, + const int8_t* da, + const int8_t* db, + int8_t* dc, + pthreadpool_t pt_pool) { + return xnn_setup_add_nd_qs8( + op, /* xnn_operator_t add_op */ + a_shape.size(), /* size_t num_input1_dims */ + a_shape.data(), /* const size_t* input1_shape */ + b_shape.size(), /* size_t num_input2_dims */ + b_shape.data(), /* const size_t* input2_shape */ + da, /* const int8_t* input1 */ + db, /* const int8_t* input2 */ + dc, /* int8_t* output */ + pt_pool); /* pthreadpool_t threadpool */ +} + +template +Tensor xnnp_add(Tensor qa, Tensor qb, double scale, int64_t zero_point) { + using underlying_t = typename scalar_t::underlying; + const string func_name = "xnnp_add()"; + TORCH_CHECK(qa.ndimension() > 0, func_name, ": Got empty input tensor."); + TORCH_CHECK(at::native::xnnpack::available(), func_name, ": XNNPACK is not available") + + // using qa memory format for qb to allow xnnpack kernel 
to flatten all the + // dims + auto qa_mem_format = qa.suggest_memory_format(); + Tensor qa_contig = qa.contiguous(qa_mem_format); + Tensor qb_contig = qb.contiguous(qa_mem_format); + + const auto a_zero_point = qa_contig.q_zero_point(); + const auto b_zero_point = qb_contig.q_zero_point(); + const auto a_scale = qa_contig.q_scale(); + const auto b_scale = qb_contig.q_scale(); + + Tensor qy = at::native::empty_affine_quantized( + at::infer_size_dimvector(qa_contig.sizes(), qb_contig.sizes()), + qa.scalar_type(), + c10::nullopt /* layout */, + kCPU, + c10::nullopt /* pin_memory */, + scale, + zero_point, + qa_mem_format); + + if (qa_contig.size(0) == 0) { + return qy; + } + + xnn_operator_t xnnp_op = nullptr; + xnnpack_operator xnnp_add_operator; + + auto output_max = std::numeric_limits::max(); + auto output_min = std::numeric_limits::min(); + if (ReLUFused) { + /* + * FIXME: use acticationLimits() + * With , MSVC runs into "error C3862: indetifier activationLimits not found". + */ + constexpr int64_t qmin = std::numeric_limits::min(); + constexpr int64_t qmax = std::numeric_limits::max(); + int64_t qvalue = static_cast(zero_point); + qvalue = std::max(qvalue, qmin); + output_min = static_cast(std::min(qvalue, qmax)); + } + + // Create an operator + auto status = xnnp_create_add_nd( + a_zero_point, + a_scale, + b_zero_point, + b_scale, + static_cast(zero_point), + static_cast(scale), + output_min, + output_max, + 0, + &xnnp_op); + xnnp_add_operator = xnnpack_operator(xnnp_op); + TORCH_CHECK( + status == xnn_status_success, + func_name, ": xnn create operator failed(", status,")!"); + + const auto qa_shape = xnnp_utils::get_mem_format_aware_shape(qa_contig); + const auto qb_shape = xnnp_utils::get_mem_format_aware_shape(qb_contig); + + // Setup the operator + status = xnnp_setup_add_nd( + xnnp_add_operator.get(), + qa_shape, + qb_shape, + reinterpret_cast(qa_contig.data_ptr()), + reinterpret_cast(qb_contig.data_ptr()), + reinterpret_cast(qy.data_ptr()), + caffe2::pthreadpool_()); + TORCH_CHECK( + status == xnn_status_success, + func_name, ": xnn setup operator failed(", status,")!"); + + // Run the operator + status = xnn_run_operator( + xnnp_add_operator.get(), /* xnn_operator_t op */ + caffe2::pthreadpool_()); /* pthreadpool_t threadpool */ + TORCH_CHECK( + status == xnn_status_success, + func_name, ": xnn run operator failed(", status,")"); + return qy; +} +#endif // USE_XNNPACK template Tensor qadd(Tensor qa, Tensor qb, double scale, int64_t zero_point) { check_inputs(qa, qb); + + if (at::globalContext().qEngine() == at::QEngine::QNNPACK) { + TORCH_CHECK( + qa.scalar_type() == qb.scalar_type(), + "Both inputs to qadd must have same type"); + +#ifdef USE_XNNPACK + if (qa.scalar_type() == kQInt8) { + return xnnp_add(qa, qb, scale, zero_point); + } +#endif // USE_XNNPACK + #ifdef USE_PYTORCH_QNNPACK - if (at::globalContext().qEngine() == at::QEngine::QNNPACK && - qa.sizes() == qb.sizes() && /* qnnpack does not support boradcasting */ - qa.scalar_type() == kQUInt8 && qb.scalar_type() == kQUInt8) { + if(qa.sizes() == qb.sizes() && /* qnnpack does not support boradcasting */ + qa.scalar_type() == kQUInt8) { return qnnpack_add(qa, qb, scale, zero_point); + } +#endif // USE_PYTORCH_QNNPACK } -#endif auto qc = at::_empty_affine_quantized( qa.sizes(), at::device(kCPU) diff --git a/aten/src/ATen/native/quantized/cpu/qconv.cpp b/aten/src/ATen/native/quantized/cpu/qconv.cpp index 4f8bcd257d5c39..aa77489f74195e 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv.cpp +++ 
b/aten/src/ATen/native/quantized/cpu/qconv.cpp @@ -5,9 +5,12 @@ #include #include #include -#include +#include #include #include +#include +#include +#include #include #include #include @@ -588,22 +591,262 @@ template at::Tensor PackedConvWeight<3>::apply_impl( #ifdef USE_PYTORCH_QNNPACK +#ifdef USE_XNNPACK template -at::Tensor PackedConvWeightsQnnp::apply( - const at::Tensor& input, - double output_scale, - int64_t output_zero_point) { - return apply_impl(input, output_scale, output_zero_point); -} +template +at::Tensor PackedConvWeightsQnnp::apply_impl_xnnp( + const at::Tensor& act, double output_scale, int64_t output_zero_point) { + using underlying_t = typename scalar_t::underlying; -template -at::Tensor PackedConvWeightsQnnp::apply_relu( - const at::Tensor& input, - double output_scale, - int64_t output_zero_point) { - return apply_impl(input, output_scale, output_zero_point); + std::lock_guard lock(qnnp_mutex_); + + const std::string func_name = transpose() + ? "quantized::conv_transpose (xnnpack)" + : "quantized::conv (xnnpack)"; + TORCH_CHECK( + kSpatialDim == 2, + func_name, ": xnnpack does not currently support 3d convolution."); + + /* + * NB: + * [de]conv_prepack prepares weights (values, scale, and zero_points) ahead of + * time during prepack() call assuming the activation will be uint8_t. But it + * may not always be the case. A solution may involve making prepack routine + * aware of the input qdtype. But currently all the pieces are not ready to + * pass that model level info to the prepack function. So, for now, here in + * this function we have to massage weights if we learn the input qdtype is + * not uint8_t. This involves copying and converting uint8_t to int8_t + * whenever necessary. To add to that, since XNNPACK, as of writing this, + * doesn't support per_channel weights for quint8_t, we add following assert + * makes sure we don't run into that case. Also take shortcuts when processing + * weights, which means we have to revisit and fix some weight massging logic + * when we enable the missing feature in XNNPACK. + * + * Table below summarizes how the weights are handled, + * + * .-------------------------------------------------------------------------. + * | input_qdtype | uint8_t | int8_t | + * | per_channel | yes | no | yes | no | + * |-------------------------------------------------------------------------| + * | zero_points | at::zeros()* | orig_zp + 128 | at:zeros()** | orig_zp | + * | scale | dtype = float, no changes needed | + * | values | always processed before passing to XNNPACK | + * .-------------------------------------------------------------------------. + * + * Notes: * - zero_points for uint8_t + per_channel: no support in xnnpack, need + * to fix when support is added. ** - zero_points for int8_t: symmetric + * quantization means XNNPACK will ignore kernel zero point(s). + */ + + if ((std::is_same::value )) { + TORCH_CHECK(!per_channel(), + func_name, ": xnnpack does not currently have per_channel support with activation dtype of c10::quint8." 
+ ); + } + + // More checks + ConvDimChecks( + act.ndimension(), + stride().size(), + padding().size(), + output_padding().size(), + dilation().size(), + func_name, + transpose()); + + const int64_t N = act.size(0); + const int64_t H = act.size(2); + const int64_t W = act.size(3); + const int64_t D = 1; + const int64_t M = bias.size(0); + + const auto act_nhwc = act.contiguous(c10::MemoryFormat::ChannelsLast); + const auto act_input_scale = act_nhwc.q_scale(); + + auto status = xnn_status_invalid_state; + + // Create an operator iff necessary + if (!xnnp_convolution_op || + (!input_scale.has_value() || input_scale.value() != act_input_scale)) { + xnn_operator_t xnnp_op = nullptr; + + // Update the input scale so we may cache the op + input_scale = act_input_scale; + + // create an empty tensor for packing the weights + const at::Tensor weight_contig = + orig_weight.contiguous(c10::MemoryFormat::ChannelsLast); + const float* w_scales_data = w_scales.data_ptr(); + underlying_t w_zp = 0; + at::Tensor weight_tensor; + + if (!per_channel()) { + w_zp = static_cast( + weight_contig.q_zero_point() + + (std::is_same::value ? 128 : 0)); + + weight_tensor = at::native::empty_affine_quantized( + weight_contig.sizes(), + c10::CppTypeToScalarType::value, + c10::nullopt /* layout */, + c10::kCPU, + c10::nullopt /* pin_memory */, + w_scales_data[0], + w_zp, + c10::MemoryFormat::ChannelsLast); + } else { /* per_channel */ + weight_tensor = at::native::empty_per_channel_affine_quantized( + weight_contig.sizes(), + w_scales, + at::zeros(w_scales.sizes(), at::kInt), /* see comment above about w_zp */ + weight_contig.q_per_channel_axis(), + c10::CppTypeToScalarType::value, + c10::nullopt /* layout */, + c10::kCPU, + c10::nullopt /* pin_memory */, + c10::MemoryFormat::ChannelsLast); + } + + // copy from the original weight and take care of dtype change if necessary + at::native::xnnp_utils::q8_copy_int8_weight_and_add_offset( + weight_contig, weight_tensor); + const at::Tensor xnnp_weight = + at::native::xnnp_utils::convert_conv_weights_to_channel_last_tensor< + kSpatialDim>(weight_tensor, groups(), transpose()); + + auto output_min = kReluFused + // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) + ? activationLimits(output_scale, output_zero_point, Activation::RELU).first + : std::numeric_limits::min(); + auto output_max = kReluFused + // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) + ? activationLimits(output_scale, output_zero_point, Activation::RELU).second + : std::numeric_limits::max(); + + + // Original bias was float, so we requantize it here. + at::Tensor qbias; + if (per_channel()) { + auto bias_quant_scales = + weight_contig.q_per_channel_scales() * act_input_scale; + auto bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); + qbias = at::native::quantize_per_channel( + bias, bias_quant_scales, bias_zp, 0, c10::kQInt32); + } else { + qbias = at::native::quantize_per_tensor( + bias, weight_contig.q_scale() * act_input_scale, 0, c10::kQInt32); + } + + status = at::native::xnnp_utils::xnnp_create_convolution2d_nhwc( + padding()[0], + padding()[1], + padding()[0], + padding()[1], + kernel_[0], + kernel_[1], + stride()[0], + stride()[1], + dilation()[0], + dilation()[1], + groups(), + !transpose() ? orig_weight.size(1) : orig_weight.size(0) / groups(), + !transpose() ? orig_weight.size(0) / groups() : orig_weight.size(1), + !transpose() ? orig_weight.size(1) * groups() : orig_weight.size(0), + !transpose() ? 
orig_weight.size(0) : orig_weight.size(1) * groups(), + act_nhwc.q_zero_point(), + act_input_scale, + w_zp, /* will be ignored for Q[SC]8, see comment + above about w_zp*/ + w_scales_data, + reinterpret_cast( + xnnp_weight.template data_ptr()), + reinterpret_cast(qbias.template data_ptr()), + output_zero_point, + output_scale, + output_min, + output_max, + 0, + &xnnp_op, + per_channel(), + transpose()); + + xnnp_convolution_op = xnnpack_operator(xnnp_op); + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn create operator failed(", + status, + ")"); + } + + at::SmallVector output_shape; + const auto input_shape = MakeInputShape(D, H, W); + if (transpose()) { + output_shape = MakeDeConvOutputShape( + N, M, {H, W}, kernel_, stride(), padding(), output_padding(), dilation()); + } else { + output_shape = MakeConvOutputShape( + N, M, input_shape, kernel_, stride(), padding(), dilation()); + } + + if (act_nhwc.numel() > 0) { + TORCH_CHECK( + std::all_of( + output_shape.begin(), + output_shape.end(), + [](int64_t i) { return i > 0; }), + func_name, ": ", kSpatialDim, "d (xnnpack): each dimension of output tensor should be greater than 0.") + } + + // Allocate output Tensor and a buffer for XNNPACK to use + at::Tensor output = at::native::empty_affine_quantized( + output_shape, + c10::CppTypeToScalarType::value, + c10::nullopt /* layout */, + c10::kCPU, + c10::nullopt /* pin_memory */, + output_scale, + output_zero_point, + c10::MemoryFormat::ChannelsLast); + + // Setup the operator + status = at::native::xnnp_utils::xnnp_setup_convolution2d_nhwc( + xnnp_convolution_op.get(), + N, + H, + W, + reinterpret_cast(act_nhwc.template data_ptr()), + reinterpret_cast(output.template data_ptr()), + caffe2::pthreadpool_(), + per_channel(), + transpose(), + output_padding()[0], + output_padding()[1]); + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn setup operator failed(", + status, + ")"); + + // Run the operator + status = xnn_run_operator( + xnnp_convolution_op.get(), /* xnn_operator_t op */ + caffe2::pthreadpool_()); /* pthreadpool_t threadpool */ + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn run operator failed(", + status, + ")"); + + return output; } +#endif // USE_XNNPACK + template template at::Tensor PackedConvWeightsQnnp::apply_impl( @@ -622,7 +865,7 @@ at::Tensor PackedConvWeightsQnnp::apply_impl( func_name, "(qnnpack): Expected activation data type ", toString(c10::kQUInt8), - "but got ", + " but got ", toString(act.scalar_type())); ConvDimChecks( act.ndimension(), stride().size(), padding().size(), @@ -820,6 +1063,61 @@ at::Tensor PackedConvWeightsQnnp::apply_impl( return output; } +#ifdef USE_XNNPACK +bool can_use_xnnp( + c10::ScalarType dtype, + int kSpatialDim, + bool per_channel, + bool transpose) { + if (!at::native::xnnpack::available()) { + return false; + } + bool supported_dtypes = dtype == c10::kQInt8; + bool invalid_config = + (kSpatialDim != 2 /* No support for 3d convolution */ + || (dtype == c10::kQInt8 && transpose && + per_channel)); /* int8_t deconv does not support per-channel */ + if (supported_dtypes && invalid_config) { + /* don't want this to fall through to QNNPACK */ + const std::string func_name = + transpose ? 
"quantized::conv_transpose" : "quantized::conv"; + TORCH_CHECK( + false, + func_name, + " (xnnpack): Unsupported conv config for dtype KQInt8"); + } + return supported_dtypes && !invalid_config; +} +#endif // USE_XNNPACK + +template +at::Tensor PackedConvWeightsQnnp::apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { +#ifdef USE_XNNPACK + if (can_use_xnnp(input.scalar_type(), kSpatialDim, per_channel(), transpose())) { + return apply_impl_xnnp( + input, output_scale, output_zero_point); + } /* fall through for unsupported types, configs, or shapes */ +#endif // USE_XNNPACK + return apply_impl(input, output_scale, output_zero_point); +} + +template +at::Tensor PackedConvWeightsQnnp::apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { +#ifdef USE_XNNPACK + if (can_use_xnnp(input.scalar_type(), kSpatialDim, per_channel(), transpose())) { + return apply_impl_xnnp( + input, output_scale, output_zero_point); + } /* fall through for unsupported types, configs, or shapes */ +#endif // USE_XNNPACK + return apply_impl(input, output_scale, output_zero_point); +} + template at::Tensor PackedConvWeightsQnnp<2>::apply( const at::Tensor& act, double output_scale, @@ -852,6 +1150,177 @@ template at::Tensor PackedConvWeightsQnnp<3>::apply_impl( #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() +template +at::Tensor PackedConvWeightsOnednn::apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +template +at::Tensor PackedConvWeightsOnednn::apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +template +template +at::Tensor PackedConvWeightsOnednn::apply_impl( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point) { + std::string func_name = "quantized::conv"; + if (transpose()) { + func_name += "_transpose"; + } + func_name += std::to_string(kSpatialDim) + "d"; + if (kReluFused) { + func_name += "_relu"; + } + ConvDimChecks( + act.ndimension(), stride().size(), padding().size(), + output_padding().size(), dilation().size(), func_name, transpose()); + TORCH_CHECK(act.scalar_type() == c10::ScalarType::QUInt8, + func_name, " (ONEDNN): data type of input should be QUint8."); + + // src + auto act_contig = act.contiguous(kSpatialDim == 2 ? c10::MemoryFormat::ChannelsLast : c10::MemoryFormat::ChannelsLast3d); + auto src_dims = act_contig.sizes().vec(); + auto src_data_type = dnnl::memory::data_type::u8; + auto src_desc = ideep::tensor::desc(src_dims, src_data_type, + kSpatialDim == 2 ? ideep::format_tag::nhwc : ideep::format_tag::ndhwc); + ideep::tensor src; + src.init(src_desc, act_contig.data_ptr()); + // weights & bias + ideep::tensor& weights = *(weight_.get()); + bool with_bias = bias_.has_value(); + const auto& kernel_size = weights.get_dims(); + // dst + const std::vector& input_size = src.get_dims(); + std::vector output_sizes; + if (transpose()) { + // Prepacked weight format: [o, i, ...] + const int N = act.size(0); // batch size + const int C = act.size(1); // input channels + const int M = weights.get_dim(0); // output channels + const int D = kSpatialDim == 2 ? 
1 : act.size(2); // input depth + const int H = act.size(kSpatialDim); // input height + const int W = act.size(kSpatialDim + 1); // input width + const int KH = weights.get_dim(kSpatialDim); // kernel height + const int KW = weights.get_dim(kSpatialDim + 1); // kernel width + const int KD = kSpatialDim == 2 ? 1 : weights.get_dim(2); // kernel depth + TORCH_CHECK(C == groups() * weights.get_dim(1), // weight: [o, i, ...] + func_name, " (ONEDNN): input channel number should be ", + groups() * weights.get_dim(1), ", but got ", C); + auto output_shape = MakeDeConvOutputShape( + N, + M, + kSpatialDim == 2 ? std::vector{H, W} : std::vector{D, H, W}, + kSpatialDim == 2 ? std::vector{KH, KW} : std::vector{KD, KH, KW}, + stride(), + padding(), + output_padding(), + dilation()); + output_sizes = c10::IntArrayRef(output_shape).vec(); + } else { + output_sizes = at::native::conv_output_size(input_size, kernel_size, padding().vec(), stride().vec(), dilation().vec()); + } + ideep::dims dst_dims = ideep::dims({output_sizes.cbegin(), output_sizes.cend()}); + at::Tensor output = at::_empty_affine_quantized( + dst_dims, + device(c10::kCPU) + .dtype(c10::kQUInt8) + .memory_format(kSpatialDim == 2 ? + c10::MemoryFormat::ChannelsLast : + c10::MemoryFormat::ChannelsLast3d), + output_scale, + output_zero_point, + c10::nullopt); + if (output.numel() == 0) { + return output; + } + ideep::tensor dst({dst_dims, ideep::tensor::data_type::u8, {output.strides().cbegin(), output.strides().cend()}}, + output.data_ptr()); + // Parameters + const ideep::dims& strides = stride().vec(); + const ideep::dims& dilates = dilation().vec(); + const ideep::dims& padding_l = padding().vec(); + const ideep::dims& padding_r = padding().vec(); + const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/act.q_scale()); // Scales of ONEDNN and PyTorch are reciprocal + const ideep::scale_t& weights_scales = weights.get_scale(); + const ideep::scale_t& dst_scales = ideep::scale_t(weights_scales.size(), 1.0/output_scale); // Scales of ONEDNN and PyTorch are reciprocal + const ideep::zero_point_t src_zero_points = ideep::zero_point_t(1, act.q_zero_point()); + const ideep::zero_point_t dst_zero_points = ideep::zero_point_t(1, output_zero_point); + ideep::attr_t op_attr = kReluFused ? ideep::attr_t::fuse_relu() : ideep::attr_t(); + op_attr.set_zero_points(DNNL_ARG_SRC, ideep::utils::tensor_zp_mask(1), {DNNL_RUNTIME_S32_VAL}); // runtime src zero point + if (with_bias) { + // Bias might be modified outside (e.g. by quantization bias correction). + // If so, update the prepacked bias as well. 
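+      // Comparing the raw data handle with the original bias' data_ptr() detects
+      // an externally swapped bias buffer; if they differ, re-point the prepacked
+      // ideep tensor at the new storage before running the convolution below.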
+ if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { + bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); + } + const auto& b = bias_.value(); + if (transpose()) { + ideep::convolution_transpose_forward::compute_v2( + src, weights, b, dst_dims, dst, + strides, padding_l, padding_r, dilates, + groups(), src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, + op_attr, dnnl::algorithm::deconvolution_direct, dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + } else { + ideep::convolution_forward::compute_v2( + src, weights, b, dst_dims, dst, + strides, dilates, padding_l, padding_r, groups(), + src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, + op_attr, dnnl::algorithm::convolution_direct, dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + } + } else { + if (transpose()) { + ideep::convolution_transpose_forward::compute_v2( + src, weights, dst_dims, dst, + strides, padding_l, padding_r, dilates, + groups(), src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, + op_attr, dnnl::algorithm::deconvolution_direct, dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + } else { + ideep::convolution_forward::compute_v2( + src, weights, dst_dims, dst, + strides, dilates, padding_l, padding_r, groups(), + src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, + op_attr, dnnl::algorithm::convolution_direct, dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + } + } + return output; +} + +template at::Tensor PackedConvWeightsOnednn<2>::apply( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +template at::Tensor PackedConvWeightsOnednn<2>::apply_relu( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +template at::Tensor PackedConvWeightsOnednn<3>::apply( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +template at::Tensor PackedConvWeightsOnednn<3>::apply_relu( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +#endif // #if AT_MKLDNN_ENABLED() + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp index ec95748cd42ba1..2f3a6ed8f3cdb0 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp @@ -5,9 +5,10 @@ #include #include #include -#include +#include #include #include +#include #include #include #include @@ -118,6 +119,57 @@ template at::Tensor PackedConvWeightsQnnp<3>::apply_dynamic( #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() + +template +at::Tensor PackedConvWeightsOnednn::apply_dynamic( + const at::Tensor& input, + bool reduce_range) { + + // Find min/max of input + float x_max = 0, x_min = 0; + if (input.numel() > 0) { + x_min = input.min().item(); + x_max = input.max().item(); + } + + // Input tensor is quantized as 8-bit unsigned values + static constexpr int precision = 8; + static constexpr bool is_signed = false; + + // Calculate scale and zero point for quantization of input tensor + auto q_params = quant_utils::ChooseQuantizationParams( + /*min=*/x_min, + /*max=*/x_max, + /*qmin=*/is_signed ? -(1 << (precision - 1)) : 0, + /*qmax=*/ + is_signed ? 
((1 << (precision - 1)) - 1) : (1 << precision) - 1, + /*preserve_sparsity=*/false, + /*force_scale_power_of_two=*/false, + /*reduce_range=*/reduce_range); + + // Quantize input + at::Tensor q_input = at::quantize_per_tensor( + input, q_params.scale, q_params.zero_point, c10::kQUInt8); + + at::Tensor out = + apply_impl(q_input, q_params.scale, q_params.zero_point); + + // TODO: Modify ideep to allow fp32 input & output + // to avoid explicit `quantize - dequantize` + return at::dequantize(out); +} + +template at::Tensor PackedConvWeightsOnednn<2>::apply_dynamic( + const at::Tensor& input, + bool reduce_range); + +template at::Tensor PackedConvWeightsOnednn<3>::apply_dynamic( + const at::Tensor& input, + bool reduce_range); + +#endif // AT_MKLDNN_ENABLED() + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp index 3cb5d9ef1a18cc..85edffef25b982 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp @@ -2,10 +2,11 @@ #include #include -#include +#include #include #include #include +#include #include #include #include @@ -314,6 +315,165 @@ c10::intrusive_ptr> PackedConvWeightsQnnp< bool transpose); #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() +template +c10::intrusive_ptr> PackedConvWeightsOnednn< + kSpatialDim>:: + prepack( + at::Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose) { + TORCH_CHECK( + weight.ndimension() == kSpatialDim + 2, + "Weights are expected to have ", kSpatialDim + 2, " dimensions"); + TORCH_CHECK( + stride.size() == kSpatialDim, + "stride should contain ", kSpatialDim, " elements for ", + kSpatialDim, "D convolution."); + TORCH_CHECK( + padding.size() == kSpatialDim, + "Specify front/top/left padding only. " + "end/bottom/right padding assumed to be equal to front/top/left"); + TORCH_CHECK( + !transpose || output_padding.size() == kSpatialDim, + "quantized::conv_prepack: Specify top/left output padding " + "only. bottom/right padding assumed to be equal to top/left"); + TORCH_CHECK( + dilation.size() == kSpatialDim, + "dilation should contain ", kSpatialDim, " elements for ", + kSpatialDim, "D convolution."); + TORCH_CHECK( + !transpose || std::all_of(output_padding.begin(), output_padding.end(), [](int i) { return i==0; }), + "quantized::conv_prepack: ONEDNN only supports zero output_padding."); + + // Weight + // Format: [OC IC//group KH KW] for conv; [IC OC//group KH KW] for deconv + auto dims = weight.sizes().vec(); + auto strides = stride.vec(); + auto padding_l = padding.vec(); + auto padding_r = padding.vec(); + auto dilates = dilation.vec(); + auto op_attr = ideep::attr_t(); + std::vector wgt_zero_points; + ideep::scale_t wgt_scales; + const int output_channels = transpose ? 
weight.size(1) * groups + : weight.size(0); + const auto qtype = weight.qscheme(); + if (qtype == c10::kPerTensorAffine) { + TORCH_CHECK( + weight.q_zero_point()==0, + "quantized::qconv_prepack: ONEDNN only supports symmetric quantization of weight," + " whose zero point must be 0."); + wgt_zero_points = std::vector(1, weight.q_zero_point()); + wgt_scales = ideep::scale_t(1, 1.0/weight.q_scale()); // Scales of ONEDNN and PyTorch are reciprocal + } else if (qtype == c10::kPerChannelAffine) { + TORCH_CHECK( + !transpose, + "Per Channel Quantization is currently disabled for transposed conv"); + wgt_zero_points.resize(output_channels); + wgt_scales.resize(output_channels); + for (int i = 0; i < output_channels; ++i) { + wgt_zero_points[i] = weight.q_per_channel_zero_points()[i].item(); + TORCH_CHECK( + wgt_zero_points[i]==0, + "quantized::qconv_prepack: ONEDNN only supports symmetric quantization of weight," + " whose zero point must be 0."); + wgt_scales[i] = 1.0f / weight.q_per_channel_scales()[i].item(); // Scales of ONEDNN and PyTorch are reciprocal + } + } else { + TORCH_CHECK(false, "Unsupported qscheme: ", toString(qtype)); + } + + // Set runtime src zero point + auto src_zero_point = {DNNL_RUNTIME_S32_VAL}; + op_attr.set_zero_points(DNNL_ARG_SRC, + ideep::utils::tensor_zp_mask(src_zero_point.size()), + src_zero_point); + at::Tensor weight_copy; + ideep::tensor::desc w_desc; + ideep::dims dims_iohw, dims_giohw; + ideep::tag w_tag = ideep::tag::any; + const bool with_groups = groups > 1; + if (transpose) { + w_desc = ideep::convolution_transpose_forward::expected_weights_desc( + dims, dnnl::memory::data_type::s8, + strides, padding_l, padding_r, dilates, groups, + dnnl::algorithm::deconvolution_direct, dnnl::prop_kind::forward_inference, + ideep::dims(), op_attr); + // convolution_transpose_forward::expected_weights_desc() gives format [i, o, ...], + // but ONEDNN requires [o, i, ...] for computation + dims_iohw = w_desc.get_dims(); + dims_giohw = with_groups ? ideep::utils::group_dims(dims_iohw, groups) : dims_iohw; + std::vector perms(dims_giohw.size(), 0); // for permutation of weight + std::iota(perms.begin(), perms.end(), 0); + w_desc = w_desc.transpose(with_groups, with_groups + 1); + std::swap(perms[with_groups], perms[with_groups + 1]); + weight_copy = weight.reshape(dims_giohw).permute(c10::IntArrayRef(perms)).clone(); + } else { + w_desc = ideep::convolution_forward::expected_weights_desc( + dims, dnnl::memory::data_type::s8, + strides, padding_l, padding_r, dilates, groups, + dnnl::algorithm::convolution_direct, dnnl::prop_kind::forward_inference, + dnnl::memory::data_type::u8, ideep::dims(), op_attr); + weight_copy = weight.clone(); + } + if (with_groups) { + w_tag = kSpatialDim == 2 ? ideep::tag::goihw : ideep::tag::goidhw; + } else { + w_tag = kSpatialDim == 2 ? ideep::tag::oihw : ideep::tag::oidhw; + } + ideep::dims w_dims = with_groups ? ideep::utils::group_dims(w_desc.get_dims(), groups) + : w_desc.get_dims(); + ideep::tensor wgt = ideep::tensor( + ideep::tensor::desc({w_dims, dnnl::memory::data_type::s8, w_tag}, groups), + weight_copy.data_ptr()); + wgt.set_scale(wgt_scales); // Scales are needed for feed_from(). 
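+  // Reorder the plain [OC IC KH KW] weight into the blocked layout chosen by
+  // expected_weights_desc(); feed_from() performs the reorder (and handles the
+  // transposed/deconv case). The reordered tensor is what gets cached below.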
+ ideep::tensor exp_wgt; + exp_wgt.init(w_desc); + exp_wgt.set_scale(wgt_scales); // Also for feed_from() + exp_wgt.feed_from(wgt, transpose); // expect wgt to be in [OC IC KH KW] format + ideep::tensor * packed_weight_p = new ideep::tensor(exp_wgt); + packed_weight_p->set_scale(wgt_scales); + packed_weight_p->set_zero_point(wgt_zero_points); + std::unique_ptr weight_ptr(packed_weight_p); + // Bias + c10::optional onednn_bias{c10::nullopt}; + if (bias.has_value()) { + at::Tensor bias_vec = bias.value(); + TORCH_CHECK(bias_vec.dim() == 1, "bias should be a vector (1D Tensor)"); + TORCH_CHECK( + bias_vec.size(0) == output_channels, + "bias should have K elements: " + std::to_string(output_channels)); + auto bias_desc = ideep::tensor::desc(bias.value().sizes().vec(), dnnl::memory::data_type::f32); + ideep::tensor packed_bias; + packed_bias.init(bias_desc, bias.value().data_ptr()); + onednn_bias = c10::optional(packed_bias); + } + auto ret_ptr = c10::make_intrusive>( + PackedConvWeightsOnednn{ + std::move(weight_ptr), + onednn_bias, + weight, + bias, + stride, + padding, + output_padding, + dilation, + groups, + transpose + }); + return ret_ptr; +} + +template struct PackedConvWeightsOnednn<2>; +template struct PackedConvWeightsOnednn<3>; +#endif // #if AT_MKLDNN_ENABLED() + namespace at { namespace native { namespace { @@ -377,6 +537,14 @@ class QConvPackWeightInt8 final { } #endif +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + return PackedConvWeightsOnednn::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, + transpose); + } +#endif + TORCH_CHECK( false, "Didn't find engine for operation quantized::conv2d_prepack ", @@ -438,8 +606,6 @@ class QConv1dPackWeightInt8 final { } #endif - - #ifdef USE_PYTORCH_QNNPACK if (ctx.qEngine() == at::QEngine::QNNPACK) { return PackedConvWeightsQnnp<2>::prepack( @@ -447,6 +613,15 @@ class QConv1dPackWeightInt8 final { transpose); } #endif + +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + return PackedConvWeightsOnednn<2>::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, + transpose); + } +#endif + TORCH_CHECK( false, "Didn't find engine for operation quantized::conv1d_prepack ", diff --git a/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp b/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp new file mode 100644 index 00000000000000..693e093b120949 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp @@ -0,0 +1,136 @@ +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#ifdef USE_FBGEMM +template +std::tuple> PackedConvWeight< + kSpatialDim>::unpack() { + auto* packed_weights_p = w.get(); + // output channels + const int output_channels = packed_weights_p->outputChannels(); + const int input_channels = packed_weights_p->inputChannels(); + const int groups = packed_weights_p->groups(); + + const int kernel_d = kSpatialDim == 2 ? 1 : kernel[0]; + // R (kernel height) + const int kernel_h = kernel[kSpatialDim - 2]; + // S (kernel width) + const int kernel_w = kernel[kSpatialDim - 1]; + + const int C_per_G = input_channels / groups; + + // Tensor for unpacked weights + // Unpacked format would be physical KRS(C/G) but logical KCRS (channels + // first) because that's how + // ChannelsLast3d is not available now.FBGEMM stores the weights + // TODO: Unify 2d and 3d when ChannelsLast3d is ready. 
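+  // Allocate an empty quantized tensor that matches the weight's qscheme
+  // (per-tensor or per-channel affine); FBGEMM then writes the unpacked
+  // values into it via packed_weights_p->unpack() further down.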
+ at::Tensor unpacked_weights; + if (q_scheme == c10::kPerTensorAffine) { + unpacked_weights = kSpatialDim == 2 + ? at::_empty_affine_quantized( + {output_channels, C_per_G, kernel_h, kernel_w}, + device(c10::kCPU) + .dtype(c10::kQInt8) + .memory_format(c10::MemoryFormat::ChannelsLast), + w_scale[0], + w_zp[0], + c10::nullopt) + : at::native::fbgemm_utils:: + MakeEmptyAffineQuantizedChannelsLast3dTensor( + output_channels, + C_per_G, + kernel_d, + kernel_h, + kernel_w, + device(c10::kCPU).dtype(c10::kQInt8), + w_scale[0], + w_zp[0]); + } else if (q_scheme == c10::kPerChannelAffine) { + TORCH_CHECK( + !transpose(), + "Per Channel Quantization is currently disabled for transposed conv"); + auto scales = at::from_blob( + w_scale.data(), w_scale.size(), device(c10::kCPU).dtype(c10::kFloat)); + auto zero_points = at::from_blob( + w_zp.data(), w_zp.size(), device(c10::kCPU).dtype(c10::kInt)); + unpacked_weights = kSpatialDim == 2 + ? at::_empty_per_channel_affine_quantized( + {output_channels, C_per_G, kernel_h, kernel_w}, + scales.toType(c10::kDouble), + zero_points.toType(c10::kLong), + 0, /* The output channel axis is 0 */ + device(c10::kCPU).dtype(c10::kQInt8), + c10::MemoryFormat::ChannelsLast) + : at::native::fbgemm_utils:: + MakeEmptyPerChannelAffineQuantizedChannelsLast3dTensor( + output_channels, + C_per_G, + kernel_d, + kernel_h, + kernel_w, + device(c10::kCPU).dtype(c10::kQInt8), + scales.toType(c10::kDouble), + zero_points.toType(c10::kLong)); + } else { + TORCH_CHECK(false, "Unsupported qscheme: ", toString(q_scheme)); + } + int8_t* unpacked_weights_p = + reinterpret_cast(unpacked_weights.data_ptr()); + packed_weights_p->unpack(unpacked_weights_p); + if(transpose()){ + unpacked_weights = + at::native::fbgemm_utils::TransposeConvTensorUnpackConversion< + kSpatialDim>(unpacked_weights, groups); + } + return std::tuple>( + unpacked_weights, bias); +} + +template std::tuple> PackedConvWeight< + 2>::unpack(); +template std::tuple> PackedConvWeight< + 3>::unpack(); +#endif // USE_FBGEMM + +#ifdef USE_PYTORCH_QNNPACK +template +std::tuple> PackedConvWeightsQnnp< + kSpatialDim>::unpack() { + TORCH_CHECK( + kSpatialDim == 2, + "QNNPACK only supports conv2d_unpack right " + "now."); + TORCH_CHECK( + orig_weight.defined(), + "Cannot unpack weights. 
" + "Call at::globalContext()::setReleaseOriginalWeights(false) before packing or loading to enable unpacking."); + return std::tuple>(orig_weight, bias); +} + +template std::tuple> PackedConvWeightsQnnp< + 2>::unpack(); +template std::tuple> PackedConvWeightsQnnp< + 3>::unpack(); +#endif // USE_PYTORCH_QNNPACK + +#if AT_MKLDNN_ENABLED() +template +std::tuple> PackedConvWeightsOnednn< + kSpatialDim>::unpack() { + return std::tuple>( + orig_weight_, orig_bias_); +} + +template std::tuple> PackedConvWeightsOnednn< + 2>::unpack(); +template std::tuple> PackedConvWeightsOnednn< + 3>::unpack(); +#endif // #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/quantized/cpu/qlinear.cpp b/aten/src/ATen/native/quantized/cpu/qlinear.cpp index ac055bf74a6e38..d358f23c6af362 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear.cpp @@ -2,8 +2,10 @@ #include #include #include -#include +#include #include +#include +#include #include #include #include @@ -270,6 +272,161 @@ at::Tensor& PackedLinearWeight::apply_relu_out( #endif // USE_FBGEMM #ifdef USE_PYTORCH_QNNPACK + +#ifdef USE_XNNPACK +// TODO: add per_channel support in the future when xnnp supports it +template +at::Tensor PackedLinearWeightsQnnp::apply_impl_xnnp( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { + using underlying_t = typename scalar_t::underlying; + + std::lock_guard lock(qnnp_mutex_); + + const std::string func_name = kReluFused ? "quantized::linear_relu (xnnpack)" + : "quantized::linear (xnnpack)"; + TORCH_CHECK( + input.dim() >= 2, func_name, ": Input tensor rank should be >= 2."); + TORCH_CHECK( + !per_channel(), + func_name, + ": xnnpack does not currently have per_channel support."); + + const auto input_contig = input.contiguous(); + const auto input_scale = input_contig.q_scale(); + + const size_t rows_w = bias_.size(0); + const size_t cols_w = input_contig.size(input_contig.dim() - 1); + + auto status = xnn_status_invalid_state; + + // Create an operator iff not already created + if (!xnnp_linear_op || + (!this->input_scale.has_value() || + this->input_scale.value() != input_scale)) { + // Update the input scale so we may cache the op + this->input_scale = input_scale; + + xnn_operator_t xnnp_op = nullptr; + + const float* weight_scales_data = w_scales.data_ptr(); + + // prepare weights + underlying_t w_zp = static_cast( + orig_weight.q_zero_point() + + (std::is_same::value ? 128 : 0)); + + at::Tensor xnnp_weight = at::_empty_affine_quantized( + orig_weight.sizes(), + c10::CppTypeToScalarType::value, + weight_scales_data[0], + w_zp); + + // copy from the original weight and take care of dtype change if necessary + at::native::xnnp_utils::q8_copy_int8_weight_and_add_offset( + orig_weight, xnnp_weight); + + // Original bias was float, so we requantize it here. + at::Tensor qbias = at::native::quantize_per_tensor( + bias_, orig_weight.q_scale() * input_scale, 0, c10::kQInt32); + + // output limits + auto output_min = kReluFused + // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) + ? activationLimits(output_scale, output_zero_point, Activation::RELU).first + : std::numeric_limits::min(); + auto output_max = kReluFused + // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) + ? 
activationLimits(output_scale, output_zero_point, Activation::RELU).second + : std::numeric_limits::max(); + + // Create an operator + status = at::native::xnnp_utils::xnnp_create_fully_connected_nc( + cols_w, /* input_channels */ + rows_w, /* output_channels */ + cols_w, /* input_stride */ + rows_w, /* output_stride */ + input_contig.q_zero_point(), + input_contig.q_scale(), + w_zp, + weight_scales_data[0], + reinterpret_cast( + xnnp_weight.template data_ptr()), + reinterpret_cast(qbias.data_ptr()), + output_zero_point, + output_scale, + output_min, + output_max, + 0, /* flags */ + &xnnp_op); + xnnp_linear_op = xnnpack_operator(xnnp_op); + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn create operator failed(", + status, + ")"); + } + + /* + * Allocate output Tensor and a buffer for XNNPACK to use + * The resulting matrix here is 2-D, let's view it with the original + * left hand dimensions of the input. Here are two examples: + * 1. If the input tensor is {M, K}, the output tensor is {M, N}. + * 2. If the input tensor is {b, M, K}, the output tensor is {b, M, N}. + */ + std::vector out_sizes = input.sizes().vec(); + out_sizes.back() = static_cast(rows_w); + at::Tensor output = at::native::empty_affine_quantized( + out_sizes, + c10::CppTypeToScalarType::value, + c10::nullopt /* layout */, + c10::kCPU, + c10::nullopt /* pin_memory */, + output_scale, + output_zero_point, + input.suggest_memory_format()); + + // calculate batch_size + size_t rows_input = 1; + for (const auto i : c10::irange(input_contig.dim() - 1)) { + rows_input *= input_contig.size(i); + } + + // Setup the operator + status = at::native::xnnp_utils::xnnp_setup_fully_connected_nc( + xnnp_linear_op.get(), + rows_input, /* batch_size */ + reinterpret_cast( + input_contig.template data_ptr()), + reinterpret_cast(output.template data_ptr()), + caffe2::pthreadpool_()); + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn setup operator failed(", + status, + ")"); + + // Run the opeator + status = xnn_run_operator( + xnnp_linear_op.get(), // Linear op + caffe2::pthreadpool_() // threadpool + ); + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn run operator failed(", + status, + ")"); + + return output; +} +#endif // USE_XNNPACK + template at::Tensor PackedLinearWeightsQnnp::apply_impl( at::Tensor input, @@ -414,10 +571,35 @@ at::Tensor PackedLinearWeightsQnnp::apply_impl( return output; } +#ifdef USE_XNNPACK +bool can_use_xnnp(c10::ScalarType dtype, bool per_channel) { + if(!at::native::xnnpack::available()) { + return false; + } + + bool supported_dtypes = dtype == c10::kQInt8; + bool invalid_config = per_channel; /* xnnp does not currently support + per-channel fully connected op */ + if (supported_dtypes && invalid_config) { + /* don't want this to fall through to QNNPACK */ + TORCH_CHECK( + false, + "quantized::linear (xnnpack): Unsupported config for dtype KQInt8"); + } + return supported_dtypes && !invalid_config; +} +#endif // USE_XNNPACK + at::Tensor PackedLinearWeightsQnnp::apply( at::Tensor input, double output_scale, int64_t output_zero_point) { +#ifdef USE_XNNPACK + if (can_use_xnnp(input.scalar_type(), per_channel())) { + return apply_impl_xnnp( + input, output_scale, output_zero_point); + } /* fall through for unsupported types, configs, or shapes */ +#endif // USE_XNNPACK return apply_impl(std::move(input), output_scale, output_zero_point); } @@ -425,11 +607,92 @@ at::Tensor PackedLinearWeightsQnnp::apply_relu( at::Tensor input, double output_scale, 
int64_t output_zero_point) { +#ifdef USE_XNNPACK + if (can_use_xnnp(input.scalar_type(), per_channel())) { + return apply_impl_xnnp( + input, output_scale, output_zero_point); + } /* fall through for unsupported types, configs, or shapes */ +#endif // USE_XNNPACK return apply_impl(std::move(input), output_scale, output_zero_point); } #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() +template +at::Tensor PackedLinearWeightsOnednn::apply_impl( + at::Tensor input, + double output_scale, + int64_t output_zero_point) { + const int64_t dim = input.dim(); + TORCH_CHECK( + dim != 0, + "qlinear (ONEDNN): input dim should be at least 1, but got 0"); + TORCH_CHECK(input.scalar_type() == c10::ScalarType::QUInt8, + "qlinear (ONEDNN): data type of input should be QUint8."); + + auto input_contig = input.expect_contiguous(); + auto& w = *(weight_.get()); + auto K = input.size(dim - 1), M = input.numel() / K, N = w.get_dim(1); + auto input_dims = {M, K}; + auto input_data_type = dnnl::memory::data_type::u8; + auto input_desc = ideep::tensor::desc(input_dims, input_data_type); + ideep::attr_t op_attr = ReluFused ? ideep::attr_t::fuse_relu() : ideep::attr_t(); + ideep::tensor x(input_desc, input_contig->data_ptr()); + auto dst_dims = {M, N}; + const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/input.q_scale()); + const ideep::scale_t& weights_scales = w.get_scale(); + const ideep::scale_t& dst_scales = ideep::scale_t(1, 1.0/output_scale); // Scales of ONEDNN and PyTorch are reciprocal + const ideep::zero_point_t& src_zero_point = ideep::zero_point_t(1, input.q_zero_point()); + const ideep::zero_point_t& dst_zero_point = ideep::zero_point_t(1, output_zero_point); + // Compute: Use ideep::matmul_forward to support asymmetric quantization + // Allocate output Tensor + at::Tensor output = at::_empty_affine_quantized( + dst_dims, + at::device(c10::kCPU).dtype(c10::kQUInt8), + output_scale, + output_zero_point); + if (output.numel() == 0) { + return output; + } + ideep::tensor y({dst_dims, ideep::tensor::data_type::u8, {output.strides().cbegin(), output.strides().cend()}}, + output.data_ptr()); + if (bias_.has_value()) { + // Bias might be modified outside (e.g. by quantization bias correction). + // If so, update the prepacked bias as well. 
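+    // Same data-handle check as in the conv path: refresh the prepacked bias
+    // if its underlying storage was replaced, then pass it to matmul_forward.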
+ if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { + bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); + } + const auto& b = bias_.value(); + ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, src_scales, weights_scales, dst_scales, + src_zero_point, dst_zero_point, op_attr); + } else { + ideep::matmul_forward::compute_v2(x, w, y, 1.0f, 1.0f, src_scales, weights_scales, dst_scales, + src_zero_point, dst_zero_point, op_attr); + } + auto out_sizes = input.sizes().vec(); + out_sizes.back() = N; + if (output.sizes().vec() == out_sizes) + return output; + return output.reshape(out_sizes); +} + +at::Tensor PackedLinearWeightsOnednn::apply( + at::Tensor input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(std::move(input), output_scale, output_zero_point); +} + +at::Tensor PackedLinearWeightsOnednn::apply_relu( + at::Tensor input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(std::move(input), output_scale, output_zero_point); +} + +#endif // #if AT_MKLDNN_ENABLED() + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp index 676b2f1ce64983..111255726dcf8c 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp @@ -2,8 +2,9 @@ #include #include #include -#include +#include #include +#include #include #include #include @@ -463,6 +464,99 @@ void PackedLinearWeightFp16::set_bias(c10::optional bias) { #endif // USE_FBGEMM +#if AT_MKLDNN_ENABLED() +template +at::Tensor PackedLinearWeightsOnednn::apply_dynamic_impl( + at::Tensor input, + bool reduce_range) { + // Dynamic: fp32 * int8 -> fp32 + using at::Tensor; + + TORCH_CHECK( + input.dim() >= 2, + "The dimension of input tensor should be larger than or equal to 2"); + TORCH_CHECK(input.scalar_type() == c10::ScalarType::Float, + "qlinear_dynamic (ONEDNN): data type of input should be float."); + + // Input -> uint8 + auto input_contig = input.contiguous(); + const int64_t dim = input.dim(); + auto input_reshaped = + dim == 2 ? input : input.reshape({-1, input.size(input.dim() - 1)}); + auto input_dims = input_reshaped.sizes().vec(); + auto input_data_type = dnnl::memory::data_type::f32; + auto input_desc = ideep::tensor::desc(input_dims, input_data_type); + ideep::attr_t op_attr = ReluFused ? 
ideep::attr_t::fuse_relu() : ideep::attr_t(); + ideep::tensor x; + x.init(input_desc, input_contig.data_ptr()); + // Find quantization parameters + float x_max = 0, x_min = 0; + if (input.numel() > 0) { + x_min = input_contig.min().item(); + x_max = input_contig.max().item(); + } + const int precision = 8; + auto q_params = quant_utils::ChooseQuantizationParams( + /*min=*/x_min, + /*max=*/x_max, + /*qmin=*/0, + /*qmax=*/(1 << precision) - 1, + /*preserve_sparsity=*/false, + /*force_scale_power_of_two=*/false, + /*reduce_range=*/reduce_range); + const std::vector& src_zero_point = std::vector(1, q_params.zero_point); + // weights, dst + auto w = *(weight_.get()); + auto dst_dims = {x.get_dim(0), w.get_dim(1)}; + const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/q_params.scale); + const ideep::scale_t& weights_scales = w.get_scale(); + // Compute -> f32 + // Use ideep::matmul_forward instead of ideep::inner_product_forward, + // since the latter does not support asymmetric quantization + // Allocate output Tensor + at::Tensor output = at::empty(dst_dims, input.options().dtype(at::kFloat)); + if (output.numel() == 0) return output; + ideep::tensor y({dst_dims, ideep::tensor::data_type::f32, + {output.strides().cbegin(), output.strides().cend()}}, + output.data_ptr()); + if (bias_.has_value()) { + // Bias might be modified outside (e.g. by quantization bias correction). + // If so, update the prepacked bias as well. + if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { + bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); + } + const ideep::tensor b = bias_.value(); + ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, + src_scales, weights_scales, ideep::scale_t(), + src_zero_point, ideep::zero_point_t(), op_attr); + } else { + ideep::matmul_forward::compute_v2(x, w, y, 1.0f, 1.0f, + src_scales, weights_scales, ideep::scale_t(), + src_zero_point, ideep::zero_point_t(), op_attr); + } + auto out_sizes = input.sizes().vec(); + out_sizes.back() = w.get_dim(1); + if (output.sizes().vec() == out_sizes) + return output; + return output.reshape(out_sizes); +} + +at::Tensor PackedLinearWeightsOnednn::apply_dynamic( + at::Tensor input, + bool reduce_range) { + return apply_dynamic_impl( + std::move(input), reduce_range); +} + +at::Tensor PackedLinearWeightsOnednn::apply_dynamic_relu( + at::Tensor input, + bool reduce_range) { + return apply_dynamic_impl( + std::move(input), reduce_range); +} + +#endif // #if AT_MKLDNN_ENABLED() + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp index 93c54dc1088904..6ca6905119f49e 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp @@ -1,9 +1,9 @@ #include -#include #include #include -#include +#include #include +#include #include #include #include @@ -194,6 +194,80 @@ c10::intrusive_ptr PackedLinearWeightFp16::prepack( } #endif // USE_FBGEMM +#if AT_MKLDNN_ENABLED() +c10::intrusive_ptr PackedLinearWeightsOnednn::prepack( + at::Tensor weight, + c10::optional bias) { + TORCH_CHECK( + weight.dim() == 2, + "The weight tensor for quantized::linear_prepack (onednn) should" + " be 2-dimensional."); + // Weight + std::vector dims = weight.sizes().vec(); + auto N = weight.size(0); + std::vector wgt_zero_points; + ideep::scale_t wgt_scales; + const auto qtype = weight.qscheme(); + if (qtype == c10::kPerTensorAffine) { + 
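// Simplified sketch (not part of the diff) of what the ChooseQuantizationParams
// call above computes for uint8 dynamic quantization. The real helper also
// handles preserve_sparsity, power-of-two scales and reduce_range; this version
// shows only the core affine mapping from an observed [min, max] range.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QParams { double scale; int32_t zero_point; };

QParams choose_qparams_u8(float x_min, float x_max) {
  const int32_t qmin = 0, qmax = 255;
  // The representable range must contain 0 so that zero stays exactly representable.
  x_min = std::min(x_min, 0.f);
  x_max = std::max(x_max, 0.f);
  double scale = (static_cast<double>(x_max) - x_min) / (qmax - qmin);
  if (scale == 0.0) scale = 1.0; // degenerate all-zero input
  // Pick the zero point so that real 0.0 maps onto an integer value.
  auto zero_point = static_cast<int32_t>(std::nearbyint(qmin - x_min / scale));
  zero_point = std::min(std::max(zero_point, qmin), qmax);
  return {scale, zero_point};
}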
TORCH_CHECK( + weight.q_zero_point() == 0, + "quantized::linear_prepack: ONEDNN only supports symmetric quantization of weight," + " whose zero point must be 0, but got ", weight.q_zero_point()); + wgt_zero_points = std::vector(1, weight.q_zero_point()); + wgt_scales = ideep::scale_t(1, 1.0/weight.q_scale()); // Scales of ONEDNN and PyTorch are reciprocal + } else if (qtype == c10::kPerChannelAffine) { + wgt_zero_points.resize(N); + wgt_scales.resize(N); + for (int i = 0; i < N; ++i) { + wgt_zero_points[i] = weight.q_per_channel_zero_points()[i].item(); + TORCH_CHECK( + wgt_zero_points[i] == 0, + "quantized::linear_prepack: ONEDNN only supports symmetric quantization of weight," + " whose zero point must be 0, but got ", wgt_zero_points[i], ", at index ", i); + wgt_scales[i] = 1.0f / weight.q_per_channel_scales()[i].item(); // Scales of ONEDNN and PyTorch are reciprocal + } + } else { + TORCH_CHECK(false, "Unsupported qscheme: ", toString(qtype)); + } + + // Prepack weight + auto weight_copy = weight.clone(); + ideep::tensor wgt = ideep::tensor({dims, dnnl::memory::data_type::s8}, weight_copy.data_ptr()); + wgt.transpose_(0, 1); // ONEDNN requires transposed weight + auto w_desc = ideep::matmul_forward::expected_weights_desc(wgt.get_dims(), dnnl::memory::data_type::s8, + dnnl::memory::data_type::u8); + ideep::tensor exp_wgt(w_desc); + exp_wgt.feed_from(wgt); + ideep::tensor * packed_weight_p = new ideep::tensor(exp_wgt); + packed_weight_p->set_scale(wgt_scales); + packed_weight_p->set_zero_point(wgt_zero_points); + std::unique_ptr weight_ptr(packed_weight_p); + // Bias + c10::optional onednn_bias{c10::nullopt}; + if (bias.has_value()) { + auto& b = bias.value(); + auto bias_size = b.sizes().vec(); + bias_size.insert(bias_size.begin(), 1); + TORCH_CHECK( + bias_size[1] == weight_ptr->get_dim(1), + "bias should have N elements: ", + std::to_string(weight_ptr->get_dim(1)), + ", but got ", bias_size[1]); + auto bias_desc = ideep::tensor::desc(bias_size, dnnl::memory::data_type::f32); + ideep::tensor packed_bias; + packed_bias.init(bias_desc, b.data_ptr()); + onednn_bias = c10::optional(packed_bias); + } + auto ret_ptr = c10::make_intrusive( + PackedLinearWeightsOnednn{ + std::move(weight_ptr), + onednn_bias, + weight, + bias}); + return ret_ptr; +} +#endif // #if AT_MKLDNN_ENABLED() + namespace at { namespace native { @@ -224,6 +298,11 @@ class QLinearPackWeightInt8 final { std::move(weight), std::move(bias)); } #endif +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + return PackedLinearWeightsOnednn::prepack(std::move(weight), std::move(bias)); + } +#endif // #if AT_MKLDNN_ENABLED() TORCH_CHECK( false, "Didn't find engine for operation quantized::linear_prepack ", @@ -238,6 +317,9 @@ class QLinearPackWeightFp16 final { c10::optional bias) { auto& ctx = at::globalContext(); #ifdef USE_FBGEMM + // temporarily convert weight back to fp32, needs to be fixed + // after fbgemm fixes the interface for their prepacking op (take fp16 input0 + weight = weight.to(ScalarType::Float); if (ctx.qEngine() == at::QEngine::FBGEMM) { return PackedLinearWeightFp16::prepack( std::move(weight), std::move(bias)); @@ -251,6 +333,14 @@ class QLinearPackWeightFp16 final { "not supported by QNNPACK"); } #endif // USE_PYTORCH_QNNPACK +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + TORCH_CHECK( + false, + "quantized::linear_prepack_fp16 is currently " + "not supported by ONEDNN"); + } +#endif // #if AT_MKLDNN_ENABLED() TORCH_CHECK( false, "Didn't find engine for operation 
quantized::linear_prepack_fp16 ", @@ -261,63 +351,18 @@ class QLinearPackWeightFp16 final { class QLinearPackWeightInt8Legacy final { public: static Tensor run(at::Tensor weight, c10::optional bias) { - auto& ctx = at::globalContext(); - auto options = weight.options(); - -#ifdef USE_FBGEMM - if (ctx.qEngine() == at::QEngine::FBGEMM) { - auto prepacked = - PackedLinearWeight::prepack(std::move(weight), std::move(bias)); - auto wrapped = - std::make_unique>( - std::move(prepacked)); - return cpp_custom_type_hack::create(std::move(wrapped), options); - } -#endif // USE_FBGEMM -#ifdef USE_PYTORCH_QNNPACK - if (ctx.qEngine() == at::QEngine::QNNPACK) { - auto prepacked = - PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias)); - auto wrapped = - std::make_unique>( - std::move(prepacked)); - return cpp_custom_type_hack::create(std::move(wrapped), options); - } -#endif // USE_PYTORCH_QNNPACK - TORCH_CHECK( - false, - "Didn't find engine for operation quantized::linear_prepack ", - toString(ctx.qEngine())); + TORCH_CHECK(false, + "This model uses an outdated version of quantized.linear_prepack. " + "Please re-export your model using the newer definitions in torch.jit.quantized"); } }; class QLinearPackWeightFp16Legacy final { public: static Tensor run(at::Tensor weight, c10::optional bias) { - auto& ctx = at::globalContext(); -#ifdef USE_FBGEMM - auto options = weight.options(); - if (ctx.qEngine() == at::QEngine::FBGEMM) { - auto prepacked = - PackedLinearWeightFp16::prepack(std::move(weight), std::move(bias)); - auto wrapped = - std::make_unique>( - std::move(prepacked)); - return cpp_custom_type_hack::create(std::move(wrapped), options); - } -#endif // USE_FBGEMM -#ifdef USE_PYTORCH_QNNPACK - if (ctx.qEngine() == at::QEngine::QNNPACK) { - TORCH_CHECK( - false, - "quantized::linear_prepack_fp16 is currently " - "not supported by QNNPACK"); - } -#endif // USE_PYTORCH_QNNPACK - TORCH_CHECK( - false, - "Didn't find engine for operation quantized::linear_prepack_fp16 ", - toString(ctx.qEngine())); + TORCH_CHECK(false, + "This model uses an outdated version of quantized.linear_prepack_fp16. 
" + "Please re-export your model using the newer definitions in torch.jit.quantized"); } }; diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_unpack_impl.cpp similarity index 50% rename from aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp rename to aten/src/ATen/native/quantized/cpu/qlinear_unpack_impl.cpp index 2a34e6748eb433..b7182bf0fa4724 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_unpack_impl.cpp @@ -1,7 +1,8 @@ #include #include #include -#include +#include +#include #include #include #include @@ -74,78 +75,9 @@ std::tuple> PackedLinearWeightFp16:: } #endif // USE_FBGEMM -namespace at { -namespace native { -namespace { - -class QLinearUnpackWeightInt8 final { - public: - static std::tuple> run( - const c10::intrusive_ptr& packed_weight) { - return packed_weight->unpack(); - } -}; - -class QLinearUnpackWeightFp16 final { - public: - static std::tuple> run( - const c10::intrusive_ptr& packed_weight) { - auto& ctx = at::globalContext(); - - TORCH_CHECK( - ctx.qEngine() != at::QEngine::QNNPACK, - "quantized::linear_unpack_fp16 is currently " - "not supported by QNNPACK"); - - return packed_weight->unpack(); - } -}; - -class QLinearUnpackWeightInt8Legacy final { - public: - static std::tuple> run( - const at::Tensor& packed_weight) { - TORCH_WARN_ONCE( - "quantized.linear_unpack(Tensor) is deprecated! Please " - "upgrade your model to use the newer quantized.linear_" - "unpack(LinearPackedParamsBase) overload"); - return cpp_custom_type_hack::cast< - c10::intrusive_ptr>(packed_weight) - ->unpack(); - } -}; - -class QLinearUnpackWeightFp16Legacy final { - public: - static std::tuple> run( - const at::Tensor& packed_weight) { - TORCH_WARN_ONCE( - "quantized.linear_unpack(Tensor) is deprecated! 
Please " - "upgrade your model to use the newer quantized.linear_" - "unpack(LinearPackedParamsBase) overload"); - auto& ctx = at::globalContext(); - - TORCH_CHECK( - ctx.qEngine() != at::QEngine::QNNPACK, - "quantized::linear_unpack_fp16 is currently " - "not supported by QNNPACK"); - - return cpp_custom_type_hack::cast< - c10::intrusive_ptr>(packed_weight) - ->unpack(); - } -}; - -TORCH_LIBRARY_IMPL(quantized, CPU, m) { - m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack.legacy"), TORCH_FN(QLinearUnpackWeightInt8Legacy::run)); - m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack_fp16.legacy"), TORCH_FN(QLinearUnpackWeightFp16Legacy::run)); -} - -TORCH_LIBRARY_IMPL(quantized, CatchAll, m) { - m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack"), TORCH_FN(QLinearUnpackWeightInt8::run)); - m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack_fp16"), TORCH_FN(QLinearUnpackWeightFp16::run)); +#if AT_MKLDNN_ENABLED() +std::tuple> PackedLinearWeightsOnednn::unpack() { + return std::tuple>( + orig_weight_, orig_bias_); } - -} // namespace -} // namespace native -} // namespace at +#endif // #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/quantized/cpu/qmatmul.cpp b/aten/src/ATen/native/quantized/cpu/qmatmul.cpp index 013966a525103e..e42941fd0a35db 100644 --- a/aten/src/ATen/native/quantized/cpu/qmatmul.cpp +++ b/aten/src/ATen/native/quantized/cpu/qmatmul.cpp @@ -1,6 +1,12 @@ #include #include +#ifdef USE_RUY_QMATMUL +#include +#include +#include +#endif + namespace at { namespace native { @@ -21,6 +27,142 @@ inline void check_inputs(const Tensor& qa, const Tensor& qb) { "Both inputs to Matmul must have the same quantization scheme."); } +#ifdef USE_RUY_QMATMUL + +Tensor qmatmul( + const Tensor& qa, + const Tensor& qb, + const double output_scale, + const int64_t output_zero_point) { + check_inputs(qa, qb); + + const int64_t num_dims = qa.dim(); + const int64_t b_num_dims = qb.dim(); + + TORCH_CHECK( + num_dims == b_num_dims, + "MatMul operands should have the same dimensionality. (", num_dims, + " and ", b_num_dims, " provided)"); + TORCH_CHECK( + num_dims >= 2, + "Quantized Matmul currently only suports operands which are at least 2-dimensional. 
(", + num_dims, " provided)"); + + const int64_t m = qa.size(num_dims - 2); + const int64_t k = qa.size(num_dims - 1); + const int64_t b_k = qb.size(num_dims - 2); + const int64_t n = qb.size(num_dims - 1); + + TORCH_CHECK( + b_k == k, + "For Quantized Matmul, the size of tensor a (", k, + ") at dimension ", num_dims - 1, " must match the size of tensor b (", + b_k, ") at dimension ", num_dims - 2, "."); + + std::vector out_size_vec(num_dims); + size_t num_matmuls = 1; + for (int64_t i = 0; i < num_dims - 2; i++) { + const int64_t dim = qa.size(i); + const int64_t qb_dim = qb.size(i); + + TORCH_CHECK( + dim == qb_dim, + "For Quantized Matmul, the size of tensor a (", dim, + ") must match the size of tensor b (", qb_dim, + ") at dimension ", i); + + out_size_vec[i] = dim; + num_matmuls *= dim; + } + out_size_vec[num_dims - 2] = m; + out_size_vec[num_dims - 1] = n; + + Tensor out = at::_empty_affine_quantized( + IntArrayRef(out_size_vec), + at::device(kCPU) + .dtype(qa.scalar_type()) + .memory_format(qa.suggest_memory_format()), + output_scale, + output_zero_point, + c10::nullopt); + + const Tensor& qa_contig = qa.contiguous(); + const Tensor& qb_contig = qb.contiguous(); + + AT_DISPATCH_QINT_BYTE_TYPES(qa.scalar_type(), "qmatmul", [&] { + using underlying_t = typename scalar_t::underlying; + + const underlying_t* qa_data = reinterpret_cast( + qa_contig.data_ptr()); + const underlying_t* qb_data = reinterpret_cast( + qb_contig.data_ptr()); + underlying_t* out_data = + reinterpret_cast(out.data_ptr()); + + const size_t qa_stride = m * k; + const size_t qb_stride = k * n; + const size_t out_stride = m * n; + + auto matmuls = [&](int64_t begin, int64_t end) { + + ruy::Matrix qa_matrix; + ruy::MakeSimpleLayout( + m, k, ruy::Order::kRowMajor, qa_matrix.mutable_layout()); + qa_matrix.set_zero_point(qa.q_zero_point()); + + ruy::Matrix qb_matrix; + ruy::MakeSimpleLayout( + k, n, ruy::Order::kRowMajor, qb_matrix.mutable_layout()); + qb_matrix.set_zero_point(qb.q_zero_point()); + + ruy::Matrix out_matrix; + ruy::MakeSimpleLayout( + m, n, ruy::Order::kRowMajor, out_matrix.mutable_layout()); + out_matrix.set_zero_point(output_zero_point); + + // Requantization explanation: + // https://github.com/google/gemmlowp/blob/e844ffd17118c1e17d94e1ba4354c075a4577b88/doc/quantization.md + const double requantization_scale_inv = + (qa.q_scale() * qb.q_scale()) / output_scale; + + ruy::MulParams mul_params; + + int multiplier_fixedpoint; + int multiplier_exponent; + ruy_utils::quantize_multiplier(requantization_scale_inv, + &multiplier_fixedpoint, + &multiplier_exponent); + mul_params.set_multiplier_fixedpoint(multiplier_fixedpoint); + mul_params.set_multiplier_exponent(multiplier_exponent); + + const underlying_t* qa_subtensor = qa_data + begin * qa_stride; + const underlying_t* qb_subtensor = qb_data + begin * qb_stride; + underlying_t* out_subtensor = out_data + begin * out_stride; + + for (int64_t i = begin; i < end; i++) { + qa_matrix.set_data(qa_subtensor); + qb_matrix.set_data(qb_subtensor); + out_matrix.set_data(out_subtensor); + ruy::Mul(qa_matrix, + qb_matrix, + mul_params, + ruy_utils::get_ruy_context(), + &out_matrix); + + qa_subtensor += qa_stride; + qb_subtensor += qb_stride; + out_subtensor += out_stride; + } + }; + + at::parallel_for(0, num_matmuls, 1, matmuls); + }); + + return out; +} + +#else // ifdef USE_RUY_QMATMUL + Tensor qmatmul( const Tensor& qa, const Tensor& qb, @@ -34,6 +176,8 @@ Tensor qmatmul( rc, output_scale, output_zero_point, qa.scalar_type()); } +#endif // ifdef USE_RUY_QMATMUL + 
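// Reference sketch (not part of the diff): the per-slice arithmetic the ruy-backed
// qmatmul above performs. ruy stays in integer arithmetic and requantizes the
// int32 accumulator with (qa_scale * qb_scale) / output_scale; this plain scalar
// version shows the same math without the fixed-point machinery. Names are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> qmatmul_reference(
    const std::vector<int8_t>& a, int32_t a_zp, double a_scale,
    const std::vector<int8_t>& b, int32_t b_zp, double b_scale,
    int64_t m, int64_t k, int64_t n,
    double out_scale, int32_t out_zp) {
  const double requant = (a_scale * b_scale) / out_scale;
  std::vector<int8_t> out(static_cast<size_t>(m * n));
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      int32_t acc = 0; // int32 accumulation over zero-point-adjusted operands
      for (int64_t p = 0; p < k; ++p) {
        acc += (static_cast<int32_t>(a[i * k + p]) - a_zp) *
               (static_cast<int32_t>(b[p * n + j]) - b_zp);
      }
      int32_t q = static_cast<int32_t>(std::nearbyint(acc * requant)) + out_zp;
      out[i * n + j] = static_cast<int8_t>(std::min(127, std::max(-128, q)));
    }
  }
  return out;
}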
TORCH_LIBRARY_IMPL(quantized, QuantizedCPU, m) { m.impl(TORCH_SELECTIVE_NAME("quantized::matmul"), TORCH_FN(qmatmul)); } diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake b/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake index 30cc61dc17fb76..4a86d641e41237 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake @@ -10,7 +10,7 @@ project(googletest-download NONE) include(ExternalProject) ExternalProject_Add(googletest - URL https://github.com/google/googletest/archive/release-1.8.0.zip + URL https://github.com/google/googletest/archive/release-1.10.0.zip URL_HASH SHA256=f3ed3b58511efd272eb074a3a6d6fb79d7c2e6a0e374323d1e6bcbcc1ef141bf SOURCE_DIR "${CONFU_DEPENDENCIES_SOURCE_DIR}/googletest" BINARY_DIR "${CONFU_DEPENDENCIES_BINARY_DIR}/googletest" diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake index 30cc61dc17fb76..4a86d641e41237 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake @@ -10,7 +10,7 @@ project(googletest-download NONE) include(ExternalProject) ExternalProject_Add(googletest - URL https://github.com/google/googletest/archive/release-1.8.0.zip + URL https://github.com/google/googletest/archive/release-1.10.0.zip URL_HASH SHA256=f3ed3b58511efd272eb074a3a6d6fb79d7c2e6a0e374323d1e6bcbcc1ef141bf SOURCE_DIR "${CONFU_DEPENDENCIES_SOURCE_DIR}/googletest" BINARY_DIR "${CONFU_DEPENDENCIES_BINARY_DIR}/googletest" diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack_utils.h b/aten/src/ATen/native/quantized/cpu/qnnpack_utils.h index 1f6d6f1d910561..60ea7822a76056 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack_utils.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack_utils.h @@ -6,8 +6,8 @@ #include #include -#include -#include +#include +#include #include #include @@ -40,6 +40,7 @@ struct PackedLinearWeightsQnnp : public LinearPackedParamsBase { orig_weight(std::move(orig_weight)), bias_(at::native::mobile::allocate_padded_contiguous_if_needed( bias, bias.suggest_memory_format())), + per_channel_(this->orig_weight.qscheme() == at::kPerChannelAffine), input_scale(std::move(input_scale)), w_scales(w_scales), w_zero_points(std::move(w_zps)) {} @@ -47,6 +48,7 @@ struct PackedLinearWeightsQnnp : public LinearPackedParamsBase { std::unique_ptr w; at::Tensor orig_weight; at::Tensor bias_; + bool per_channel_; c10::optional input_scale; at::Tensor w_scales; std::vector w_zero_points; @@ -74,8 +76,23 @@ struct PackedLinearWeightsQnnp : public LinearPackedParamsBase { at::Tensor weight, c10::optional bias); + bool per_channel() const { + return per_channel_; + } + private: std::mutex qnnp_mutex_; + +#ifdef USE_XNNPACK + xnnpack_operator xnnp_linear_op; + + template + at::Tensor apply_impl_xnnp( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point); +#endif // USE_XNNPACK + template at::Tensor apply_impl( at::Tensor input, @@ -112,6 +129,7 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { dilation_(std::move(dilation)), groups_(groups), transpose_(transpose), + is_per_channel_(is_per_channel), input_scale(input_scale), kernel_(std::move(kernel)), w_scales(w_scale), @@ -200,7 +218,7 @@ struct PackedConvWeightsQnnp : 
public ConvPackedParamsBase { convolution->input_padding_height = padding_[kSpatialDim - 2]; convolution->input_padding_width = padding_[kSpatialDim - 1]; convolution->input_padding_depth = kSpatialDim == 3 ? padding_[0] : 0; - convolution->per_channel = is_per_channel; + convolution->per_channel = is_per_channel_; convolution->transpose = transpose_; const uint32_t kr = pytorch_qnnp_params.q8conv.kr; @@ -260,6 +278,9 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { } std::unique_ptr convolution_op; + #ifdef USE_XNNPACK + xnnpack_operator xnnp_convolution_op; + #endif // USE_XNNPACK std::unique_ptr w; at::Tensor orig_weight; at::Tensor bias; @@ -269,6 +290,7 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { torch::List dilation_; int64_t groups_; bool transpose_; + bool is_per_channel_; c10::optional input_scale; std::vector kernel_; at::Tensor w_scales; @@ -326,6 +348,10 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { return transpose_; } + bool per_channel() const { + return is_per_channel_; + } + private: std::mutex qnnp_mutex_; template @@ -333,6 +359,14 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { const at::Tensor& input, double output_scale, int64_t output_zero_point); + +#ifdef USE_XNNPACK + template + at::Tensor apply_impl_xnnp( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point); +#endif // USE_XNNPACK }; enum class Activation : uint8_t { NONE = 0, RELU = 1 }; diff --git a/aten/src/ATen/native/quantized/cpu/qupsample_bilinear2d.cpp b/aten/src/ATen/native/quantized/cpu/qupsample_bilinear2d.cpp index ab30cd7d381010..d9a871a591bfde 100644 --- a/aten/src/ATen/native/quantized/cpu/qupsample_bilinear2d.cpp +++ b/aten/src/ATen/native/quantized/cpu/qupsample_bilinear2d.cpp @@ -178,7 +178,7 @@ using at::native::upsample::get_scale_value; Tensor upsample_bilinear2d_quantized_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, bool align_corners, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); diff --git a/aten/src/ATen/native/quantized/cpu/qupsample_nearest2d.cpp b/aten/src/ATen/native/quantized/cpu/qupsample_nearest2d.cpp index 377ef15790b137..a8cd6abec7e44a 100644 --- a/aten/src/ATen/native/quantized/cpu/qupsample_nearest2d.cpp +++ b/aten/src/ATen/native/quantized/cpu/qupsample_nearest2d.cpp @@ -202,7 +202,7 @@ Tensor _upsample_nearest_exact2d_quantized_cpu( Tensor upsample_nearest2d_quantized_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_h = get_scale_value(scale_factors, 0); @@ -212,7 +212,7 @@ Tensor upsample_nearest2d_quantized_cpu( Tensor _upsample_nearest_exact2d_quantized_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_h = get_scale_value(scale_factors, 0); diff --git a/aten/src/ATen/native/quantized/cpu/qupsample_nearest3d.cpp b/aten/src/ATen/native/quantized/cpu/qupsample_nearest3d.cpp index db4077ef432887..d2e83542133674 100644 --- a/aten/src/ATen/native/quantized/cpu/qupsample_nearest3d.cpp +++ b/aten/src/ATen/native/quantized/cpu/qupsample_nearest3d.cpp @@ -232,7 +232,7 @@ Tensor _upsample_nearest_exact3d_quantized_cpu( Tensor 
upsample_nearest3d_quantized_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); @@ -243,7 +243,7 @@ Tensor upsample_nearest3d_quantized_cpu( Tensor _upsample_nearest_exact3d_quantized_cpu( const Tensor& input, - c10::optional output_size, + at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { auto osize = compute_output_size(input.sizes(), output_size, scale_factors); auto scale_d = get_scale_value(scale_factors, 0); diff --git a/aten/src/ATen/native/quantized/cpu/ruy_utils.cpp b/aten/src/ATen/native/quantized/cpu/ruy_utils.cpp new file mode 100644 index 00000000000000..d0164f7363524e --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/ruy_utils.cpp @@ -0,0 +1,37 @@ +#ifdef USE_RUY_QMATMUL + +#include +#include + +namespace at { +namespace native { +namespace ruy_utils { + +static thread_local ruy::Context context; + +ruy::Context* get_ruy_context() { + return &context; +} + +// Adopted from Ruy: +// https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/ruy/test.h#L1602 +void quantize_multiplier(double scale, + int* multiplier_fixedpoint, + int* multiplier_exponent) { + TORCH_CHECK(scale > 0, "Quantization scale (", scale, ") must be positive."); + const double q = std::frexp(scale, multiplier_exponent); + auto q_fixed = static_cast(std::round(q * (1ll << 31))); + TORCH_CHECK(q_fixed <= (1ll << 31)); + if (q_fixed == (1ll << 31)) { + q_fixed /= 2; + ++*multiplier_exponent; + } + TORCH_CHECK(q_fixed <= std::numeric_limits::max()); + *multiplier_fixedpoint = static_cast(q_fixed); +} + +} // namespace ruy_utils +} // namespace native +} // namesplace + +#endif // USE_RUY_QMATMUL diff --git a/aten/src/ATen/native/quantized/cpu/ruy_utils.h b/aten/src/ATen/native/quantized/cpu/ruy_utils.h new file mode 100644 index 00000000000000..aeb332af4ecae3 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/ruy_utils.h @@ -0,0 +1,21 @@ +#pragma once + +#ifdef USE_RUY_QMATMUL + +#include + +namespace at { +namespace native { +namespace ruy_utils { + +ruy::Context* get_ruy_context(); + +void quantize_multiplier(double scale, + int* multiplier_fixedpoint, + int* multiplier_exponent); + +} // namespace ruy_utils +} // namespace native +} // namesplace + +#endif // USE_RUY_QMATMUL diff --git a/aten/src/ATen/native/quantized/cpu/xnnpack_utils.cpp b/aten/src/ATen/native/quantized/cpu/xnnpack_utils.cpp new file mode 100644 index 00000000000000..8f81c8ea8d5e81 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/xnnpack_utils.cpp @@ -0,0 +1,89 @@ +#ifdef USE_XNNPACK + +#include +#include +#include +#include + +namespace at { +namespace native { +namespace xnnp_utils { + +std::vector get_mem_format_aware_shape(const at::Tensor& in) { + const auto mem_format = in.suggest_memory_format(); + const auto& sizes = in.sizes(); + std::vector ret(sizes.begin(), sizes.end()); + if (mem_format == c10::MemoryFormat::ChannelsLast) { + // NCHW -> NHWC + // 0123 -> 0231 + ret[1] = sizes[2]; /* H */ + ret[2] = sizes[3]; /* W */ + ret[3] = sizes[1]; /* C */ + } else if (mem_format == c10::MemoryFormat::ChannelsLast3d) { + // NCDHW -> NDHWC + // 01234 -> 02341 + ret[1] = sizes[2]; /* D */ + ret[2] = sizes[3]; /* H */ + ret[3] = sizes[4]; /* W */ + ret[4] = sizes[1]; /* C */ + } + return ret; +} + +template +void q8_copy_int8_weight_and_add_offset(const at::Tensor& in, at::Tensor& out) { + 
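// Illustrative sketch (not part of the diff): what the quantize_multiplier helper
// above produces and how such a fixed-point multiplier is applied. std::frexp
// splits the requantization scale into a Q0.31 mantissa and a power-of-two
// exponent, so x * scale can be evaluated as (x * fixedpoint) * 2^(exponent - 31)
// in integer-friendly form. This demo only checks the decomposition numerically.
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  const double scale = 0.00437;   // e.g. (qa_scale * qb_scale) / output_scale
  int exponent = 0;
  const double q = std::frexp(scale, &exponent);  // scale = q * 2^exponent, q in [0.5, 1)
  const auto fixedpoint = static_cast<std::int64_t>(std::round(q * (1ll << 31)));
  const std::int64_t x = 12345;                   // an int32 accumulator value
  const double approx =
      static_cast<double>(x * fixedpoint) * std::ldexp(1.0, exponent - 31);
  std::printf("exact %.9f approx %.9f\n", x * scale, approx);
  return 0;
}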
using T = typename PT::underlying; + static constexpr auto offset = std::is_same::value ? 128 : 0; + TORCH_CHECK( + in.scalar_type() == c10::kQInt8, + "q8_copy_int8_weight_and_add_offset: Expected input weight data type ", + toString(c10::kQInt8), + " but got ", + toString(in.scalar_type())) + const int8_t* in_ptr = + reinterpret_cast(in.data_ptr()); + T* out_ptr = reinterpret_cast(out.data_ptr()); + + for (const auto i : c10::irange(in.numel())) { + out_ptr[i] = static_cast(static_cast(in_ptr[i]) + offset); + } +} + +template void q8_copy_int8_weight_and_add_offset( + const at::Tensor& in, + at::Tensor& out); +template void q8_copy_int8_weight_and_add_offset( + const at::Tensor& in, + at::Tensor& out); + +/* + * Stolen from fbgemm_utils::ConvertConvWeightsToChannelLastTensor to avoid + * dependence on USE_FBGEMM. Reorder weights to the format xnnpack expects. + * TODO: add a 3d variant. + */ +template <> +Tensor convert_conv_weights_to_channel_last_tensor<2>( + const at::Tensor& src, + int groups, + bool transpose) { + return transpose ? + // 2D conv transpose weight transform + // IC OC/G KH KW -> G OC/G KH KW IC/G + [&]() { + auto ic_g_oc_g_hw_tensors = src.chunk(groups); + for (auto& tensor : ic_g_oc_g_hw_tensors) { + tensor = tensor.unsqueeze(0); + } + auto fused_tensor = at::cat(ic_g_oc_g_hw_tensors); + set_quantizer_(fused_tensor, src.quantizer()); + return fused_tensor.permute({0, 2, 3, 4, 1}) + .contiguous(c10::MemoryFormat::Contiguous); + }() + // 2d conv weight transform + : src.contiguous(c10::MemoryFormat::ChannelsLast); +} +} // namespace xnnp_utils +} // namespace native +} // namespace at + +#endif // USE_XNNPACK diff --git a/aten/src/ATen/native/quantized/cpu/xnnpack_utils.h b/aten/src/ATen/native/quantized/cpu/xnnpack_utils.h new file mode 100644 index 00000000000000..78f325263f4fc0 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/xnnpack_utils.h @@ -0,0 +1,279 @@ +#pragma once + +#ifdef USE_XNNPACK +#include + +#include +#include + +using xnnpack_operator = at::native::xnnpack::Operator; + +namespace at { +namespace native { +namespace xnnp_utils { + +/* + * Return shape in the same order as the memory format + * e.g. channels_last will return NHWC instead of NCHW + */ +std::vector get_mem_format_aware_shape(const at::Tensor& in); + +/* + * Input is always int8_t, output can be [int8_t, uint8_t]. + * input + offset = output + * int8_t + 128 = uint8_t + * int8_t + 0 = int8_t + */ +template +void q8_copy_int8_weight_and_add_offset(const at::Tensor& in, at::Tensor& out); + +template +Tensor convert_conv_weights_to_channel_last_tensor( + const at::Tensor& src, + int groups, + bool transpose); + +/* + * Series of create wrapper functions to call xnn_create_[de]conv* functions. 
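// Tiny sketch (not part of the diff): the offset copy q8_copy_int8_weight_and_add_offset
// performs. When the destination weight type is uint8, every signed int8 value is
// shifted by +128 into the unsigned range; when the destination stays int8 the offset is 0.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> int8_weights_to_uint8(const std::vector<int8_t>& w) {
  std::vector<uint8_t> out(w.size());
  for (size_t i = 0; i < w.size(); ++i) {
    out[i] = static_cast<uint8_t>(static_cast<int16_t>(w[i]) + 128); // -128..127 -> 0..255
  }
  return out;
}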
+ */ +C10_ALWAYS_INLINE +enum xnn_status xnnp_create_convolution2d_nhwc( + uint32_t pad_top, + uint32_t pad_right, + uint32_t pad_bottom, + uint32_t pad_left, + uint32_t kernel_h, + uint32_t kernel_w, + uint32_t stride_h, + uint32_t stride_w, + uint32_t dilation_h, + uint32_t dilation_w, + uint32_t groups, + size_t group_input_channels, + size_t group_output_channels, + size_t ip_chan_stride, + size_t op_chan_stride, + int8_t izp, + float ip_scale, + int8_t kzp, + const float* k_scales, + const int8_t* kernel, + const int32_t* bias, + int8_t ozp, + float op_scale, + int8_t op_min, + int8_t op_max, + uint32_t flags, + xnn_operator_t* op, + bool per_channel, + bool transpose) { + /* Symmetric quantization forces kzp = 0 */ + TORCH_CHECK(!kzp, "XNNPACK Q[SC]8 conv kernels expects kernel zero point to be zero." + "But got: ", kzp); + + if (transpose) { + TORCH_CHECK(!per_channel, "XNNPACK Q[SC]8 does not have a per channel deconvolution!"); + return xnn_create_deconvolution2d_nhwc_qs8( + pad_top, /* uint32_t output_padding_top */ + pad_right, /* uint32_t output_padding_right */ + pad_bottom, /* uint32_t output_padding_bottom */ + pad_left, /* uint32_t output_padding_left */ + kernel_h, /* uint32_t kernel_height */ + kernel_w, /* uint32_t kernel_width */ + stride_h, /* uint32_t stride_height */ + stride_w, /* uint32_t stride_width */ + dilation_h, /* uint32_t dilation_height */ + dilation_w, /* uint32_t dilation_width */ + groups, /* uint32_t groups */ + group_input_channels, /* size_t group_input_channels */ + group_output_channels, /* size_t group_output_channels */ + ip_chan_stride, /* size_t input_pixel_stride */ + op_chan_stride, /* size_t output_pixel_stride */ + izp, /* int8_t input_zero_point */ + ip_scale, /* float input_scale */ + k_scales[0], /* float kernel_scale */ + kernel, /* const int8_t* kernel */ + bias, /* const int32_t* bias */ + ozp, /* int8_t output_zero_point */ + op_scale, /* float output_scale */ + op_min, /* int8_t output_min */ + op_max, /* int8_t output_max */ + flags, /* uint32_t flags */ + op); /* xnn_operator_t* deconvolution_op_out */ + + } + + if (!per_channel) { + return xnn_create_convolution2d_nhwc_qs8( + pad_top, /* uint32_t input_padding_top */ + pad_right, /* uint32_t input_padding_right */ + pad_bottom, /* uint32_t input_padding_bottom */ + pad_left, /* uint32_t input_padding_left */ + kernel_h, /* uint32_t kernel_height */ + kernel_w, /* uint32_t kernel_width */ + stride_h, /* uint32_t subsampling_height */ + stride_w, /* uint32_t subsampling_width */ + dilation_h, /* uint32_t dilation_height */ + dilation_w, /* uint32_t dilation_width */ + groups, /* uint32_t groups */ + group_input_channels, /* size_t group_input_channels */ + group_output_channels, /* size_t group_output_channels*/ + ip_chan_stride, /* size_t input_channel_stride */ + op_chan_stride, /* size_t output_channel_stride */ + izp, /* int8_t input_zero_point */ + ip_scale, /* float input_scale */ + k_scales[0], /* float kernel_scale */ + kernel, /* const int8_t* kernel */ + bias, /* const int32_t* bias */ + ozp, /* int8_t output_zero_point */ + op_scale, /* float output_scale */ + op_min, /* int8_t output_min */ + op_max, /* int8_t output_max */ + flags, /* uint32_t flags */ + op); /* xnn_operator_t* convolution_op_out */ + } else { /* per_channel */ + return xnn_create_convolution2d_nhwc_qc8( + pad_top, /* uint32_t input_padding_top */ + pad_right, /* uint32_t input_padding_right */ + pad_bottom, /* uint32_t input_padding_bottom */ + pad_left, /* uint32_t input_padding_left */ + kernel_h, 
/* uint32_t kernel_height */ + kernel_w, /* uint32_t kernel_width */ + stride_h, /* uint32_t subsampling_height */ + stride_w, /* uint32_t subsampling_width */ + dilation_h, /* uint32_t dilation_height */ + dilation_w, /* uint32_t dilation_width */ + groups, /* uint32_t groups */ + group_input_channels, /* size_t group_input_channels */ + group_output_channels, /* size_t group_output_channels*/ + ip_chan_stride, /* size_t input_channel_stride */ + op_chan_stride, /* size_t output_channel_stride */ + izp, /* int8_t input_zero_point */ + ip_scale, /* float input_scale */ + k_scales, /* const float* kernel_scale */ + kernel, /* const int8_t* kernel */ + bias, /* const int32_t* bias */ + ozp, /* int8_t output_zero_point */ + op_scale, /* float output_scale */ + op_min, /* int8_t output_min */ + op_max, /* int8_t output_max */ + flags, /* uint32_t flags */ + op); /* xnn_operator_t* convolution_op_out */ + } +} + +/* + * Series of setup wrapper functions to call xnn_setup_[de]conv* functions. + */ +C10_ALWAYS_INLINE +enum xnn_status xnnp_setup_convolution2d_nhwc( + xnn_operator_t op, + size_t batch, + size_t in_h, + size_t in_w, + const int8_t* inp, + int8_t* outp, + pthreadpool_t pt_pool, + bool per_channel = false, + bool transpose = false, + uint32_t adj_h = 0, + uint32_t adj_w = 0) { + if(transpose) { + TORCH_CHECK(!per_channel, "XNNPACK Q[SC]8 does not have a per channel deconvolution!"); + return xnn_setup_deconvolution2d_nhwc_qs8( + op, /* xnn_operator_t deconvolution_op */ + batch, /* size_t batch_size */ + in_h, /* size_t input_height */ + in_w, /* size_t input_width */ + adj_h, /* uint32_t adjustment_height */ + adj_w, /* uint32_t adjustment_width */ + inp, /* const int8_t* input */ + outp, /* int8_t* output */ + pt_pool); /* pthreadpool_t threadpool */ + } + + if (!per_channel) { + return xnn_setup_convolution2d_nhwc_qs8( + op, /* xnn_operator_t convolution_op */ + batch, /* size_t batch_size */ + in_h, /* size_t input_height */ + in_w, /* size_t input_width */ + inp, /* const int8_t* input */ + outp, /* int8_t* output */ + pt_pool); /* pthreadpool_t threadpool */ + } else { /* per_channel */ + return xnn_setup_convolution2d_nhwc_qc8( + op, /* xnn_operator_t convolution_op */ + batch, /* size_t batch_size */ + in_h, /* size_t input_height */ + in_w, /* size_t input_width */ + inp, /* const int8_t* input */ + outp, /* int8_t* output */ + pt_pool); /* pthreadpool_t threadpool */ + } +} + + +/* + * Series of wrapper functions to call xnn_create* and xnn_setup* + * functions for linear + */ +C10_ALWAYS_INLINE +enum xnn_status xnnp_create_fully_connected_nc( + size_t input_channels, + size_t output_channels, + size_t input_stride, + size_t output_stride, + int8_t input_zero_point, + float input_scale, + int8_t kernel_zero_point, + float kernel_scale, + const int8_t* kernel, + const int32_t* bias, + int8_t output_zero_point, + float output_scale, + int8_t output_min, + int8_t output_max, + uint32_t flags, + xnn_operator_t* fully_connected_op_out) { + /* Symmetric quantization forces kzp = 0 */ + TORCH_CHECK(!kernel_zero_point, "XNNPACK QS8 linear kernel expects kernel zero point to be zero." 
+ "But got: ", kernel_zero_point); + return xnn_create_fully_connected_nc_qs8( + input_channels, /* size_t input_channels */ + output_channels, /* size_t output_channels */ + input_stride, /* size_t input_stride */ + output_stride, /* size_t output_stride */ + input_zero_point, /* int8_t input_zero_point */ + input_scale, /* float input_scale */ + kernel_scale, /* float kernel_scale */ + kernel, /* const int8_t* kernel */ + bias, /* const int32_t* bias */ + output_zero_point, /* int8_t output_zero_point */ + output_scale, /* float output_scale */ + output_min, /* int8_t output_min */ + output_max, /* int8_t output_max */ + flags, /* uint32_t flags */ + fully_connected_op_out); /* xnn_operator_t* fully_connected_op_out */ +} + +C10_ALWAYS_INLINE +enum xnn_status xnnp_setup_fully_connected_nc( + xnn_operator_t fully_connected_op, + size_t batch_size, + const int8_t* input, + int8_t* output, + pthreadpool_t threadpool) { + return xnn_setup_fully_connected_nc_qs8( + fully_connected_op, /* xnn_operator_t fully_connected_op */ + batch_size, /* size_t batch_size */ + input, /* const int8_t* input */ + output, /* int8_t* output */ + threadpool); /* pthreadpool_t threadpool */ +} + +} // namespace xnnp_utils +} // namespace native +} // namespace at + +#endif // USE_XNNPACK diff --git a/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp b/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp new file mode 100644 index 00000000000000..e81814d28e1581 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp @@ -0,0 +1,224 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() +#include +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace at { +namespace native { +namespace { +// FIXME: make this thread-safe by reusing the benchmark cache in Conv_v7.cpp +namespace { +struct CacheKey { + uint8_t input_a_alignment; + uint8_t input_b_alignment; + uint8_t output_alignment; + bool kReluFused; +}; +std::unordered_map, at::native::ParamsEqual> execution_plan_cache; +} + +// TODO: this is also in qadd.cpp and some other cpp files in quantized/cpu/. I think we should +// move everything into a utilities file in quantized/ directory later. +inline void check_inputs(const Tensor& qa, const Tensor& qb) { + TORCH_CHECK( + qa.qscheme() == kPerTensorAffine, + "Only per tensor quantization is suported in Add."); + TORCH_CHECK( + qa.qscheme() == qb.qscheme(), + "Both inputs to Add must have the same quantization shceme."); + TORCH_CHECK( + qa.scalar_type() == qb.scalar_type(), + "Add operands should have same data type."); +} + +// currently we only support int8 symmetric (zero_point = 0 for inputs and output) quantized add +// We implement relu ( (a_int8 + b_int8 * ( b_scale/a_scale) ) ) * ( a_scale / out_scale ) +// which requires 4 cudnn ops (2 multiplication, 1 addition, and 1 relu ops) +// Multiplication ops: rhs_mult_op, requant_op +// Addition op: add_op +// Relu op: relu_op +template +Tensor add(Tensor qa, Tensor qb, double output_scale, int64_t output_zero_point) { + if (qa.numel() == 0) { + return Tensor{}; + } + // TODO: add shape checking when broadcasted add is supported. For now we assume the input tensors are the same shape + TORCH_CHECK(qa.sizes() == qb.sizes(), "Quantized cudnn add currently expects both input tensors to be the same shape"); + + check_inputs(qa, qb); + + // cudnn expects tensors to be at least 3D. 
So we will prepend dummy dimensions if the input tensors are not at least 3D + auto orig_sizes = qa.sizes().vec(); + if (qa.dim() < 3) { + std::vector new_sizes(3, 1); + // cudnn expects leading dimensions to be the dummy dimensions + new_sizes.back() = qa.sizes().back(); + if (qa.dim() == 2) { + new_sizes[1] = qa.size(0); + } + qa = qa.view(new_sizes); + qb = qb.view(new_sizes); + } + + at::Tensor add_output = at::empty(qa.sizes(), at::device(at::kCUDA).dtype(at::kFloat)); + at::Tensor quantized_output = at::_empty_affine_quantized( + qa.sizes(), + at::device(at::kCUDA).dtype(at::ScalarType::QInt8), + output_scale, + output_zero_point); + // TODO: When cudnn enables support for broadcasting, we can remove this tensor + at::Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat)); + requantize_multiplier_tensor.fill_(qa.q_scale() / output_scale); + at::Tensor rhs_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat)); + rhs_multiplier_tensor.fill_(qb.q_scale() / qa.q_scale()); + + cudnnHandle_t handle = at::native::getCudnnHandle(); + CacheKey key; + bool deterministic{true}; + bool allow_tf32{false}; + key.kReluFused = kReluFused; + key.input_a_alignment = cudnn_utils::getAlignment(qa); + key.input_b_alignment = cudnn_utils::getAlignment(qb); + key.output_alignment = cudnn_utils::getAlignment(add_output); + + auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor plan_desc) { + auto workspace_size = 0; + auto workspace = at::empty({workspace_size}, qa.options().dtype(at::kByte)); + std::vector data_ptrs; + std::vector uids; + data_ptrs.reserve(8); + uids.reserve(8); + data_ptrs = {reinterpret_cast(qb.data_ptr()), rhs_multiplier_tensor.data_ptr(), add_output.data_ptr(), + reinterpret_cast(qa.data_ptr()), add_output.data_ptr(), requantize_multiplier_tensor.data_ptr(), + reinterpret_cast(quantized_output.data_ptr())}; + uids = {'b', 'm', 'c', 'a', 'p', 'r', 'q'}; + if (kReluFused) { + data_ptrs.emplace_back(add_output.data_ptr()), + uids.emplace_back('f'); + } + + auto variantPack = cudnn_frontend::VariantPackBuilder() + .setWorkspacePointer(workspace.data_ptr()) + .setDataPointers(uids.size(), data_ptrs.data()) + .setUids(uids.size(), uids.data()) + .build(); + auto variant_pack_desc = variantPack.get_raw_desc(); + AT_CUDNN_CHECK(cudnnBackendExecute(handle, plan_desc->get_backend_descriptor(), variant_pack_desc)); + }; + + auto search = execution_plan_cache.find(key); + if (search != execution_plan_cache.end()) { + cudnn_frontend::ManagedOpaqueDescriptor plan_desc = search->second; + run(plan_desc); + return quantized_output.view(orig_sizes); + } + + // computes qb_int8 * ( qb_scale/qa_scale ) + auto rhs_mult_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(cudnn_utils::getTensorDescriptor(qb.sizes(), qb.strides(), CUDNN_DATA_INT8, 'b', key.input_b_alignment)) + .setbDesc(cudnn_utils::getTensorDescriptor(rhs_multiplier_tensor, 'm', cudnn_utils::getAlignment(rhs_multiplier_tensor))) + .setyDesc(cudnn_utils::getTensorDescriptor(add_output, 'c', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(add_output))) + .build(); + + // add_op computes (qa_int8 + qb_int8 * ( qb_scale/qa_scale ) ) + // add_output is a fp32 tensor for accumulation purposes + auto add_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(rhs_mult_op.getOutputTensor()) + 
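// Reference sketch (not part of the diff): the per-element math that the cuDNN
// graph built here (rhs_mult_op -> add_op -> optional relu_op -> requant_op)
// evaluates for this symmetric int8 add, where all zero points are 0:
//   out = clamp(round(relu(a + b * (b_scale / a_scale)) * (a_scale / out_scale)))
// Names are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstdint>

int8_t qadd_reference(int8_t a, double a_scale,
                      int8_t b, double b_scale,
                      double out_scale, bool relu_fused) {
  double acc = a + b * (b_scale / a_scale);   // accumulate in "a" units (the fp32 tensor in the graph)
  if (relu_fused) acc = std::max(acc, 0.0);
  double q = std::nearbyint(acc * (a_scale / out_scale)); // requantize to the output scale
  return static_cast<int8_t>(std::min(127.0, std::max(-128.0, q)));
}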
.setbDesc(cudnn_utils::getTensorDescriptor(qa.sizes(), qa.strides(), CUDNN_DATA_INT8, 'a', key.input_a_alignment)) + .setyDesc(cudnn_utils::getTensorDescriptor(add_output, 'p', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseAddDescriptor(at::native::getCudnnDataType(add_output))) + .build(); + + // relu_op computes + // relu( (qa_int8 + qb_int8 * ( qb_scale/qa_scale ) ) ) + // output is a fp32 tensor + c10::optional relu_op; + if (kReluFused) { + // we use inplace operation here where the output is assigned to the input + relu_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(add_op.getOutputTensor()) + .setyDesc(cudnn_utils::getTensorDescriptor(add_output, 'f', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseReluDescriptor(at::native::getCudnnDataType(add_output))) + .build()); + } + + // requant_op computes + // (a_int8 + b_int8 * ( b_scale/a_scale) ) * a_scale / out_scale + auto requant_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(kReluFused ? relu_op.value().getOutputTensor() : add_op.getOutputTensor()) + .setbDesc(cudnn_utils::getTensorDescriptor(requantize_multiplier_tensor, 'r', cudnn_utils::getAlignment(requantize_multiplier_tensor))) + .setyDesc(cudnn_utils::getTensorDescriptor(quantized_output.sizes(), quantized_output.strides(), CUDNN_DATA_INT8, 'q', cudnn_utils::getAlignment(quantized_output))) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(requantize_multiplier_tensor))) + .build(); + + std::vector ops{&rhs_mult_op, &add_op}; + if (kReluFused) { + ops.emplace_back(&(relu_op.value())); + } + ops.emplace_back(&requant_op); + + auto opGraph = cudnn_frontend::OperationGraphBuilder() + .setHandle(handle) + .setOperationGraph(ops.size(), ops.data()) + .build(); + // std::cout << "opGraph: " << opGraph.describe() << std::endl; + + auto heuristics = cudnn_frontend::EngineHeuristicsBuilder() + .setOperationGraph(opGraph) + .setHeurMode(CUDNN_HEUR_MODE_INSTANT) + .build(); + auto fallback = cudnn_frontend::EngineFallbackListBuilder() + .setOperationGraph(opGraph) + .setOperation(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .build(); + + auto& engine_configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount()); + auto& fallback_list = fallback.getFallbackList(); + + cudnn_frontend::EngineConfigList filtered_configs; + cudnn_utils::filterEngineConfigs(engine_configs, filtered_configs, deterministic, allow_tf32, at::kChar); + cudnn_utils::filterEngineConfigs(fallback_list, filtered_configs, deterministic, allow_tf32, at::kChar); + for (auto &cfg : engine_configs) { + try { + auto plan = cudnn_frontend::ExecutionPlanBuilder() + .setHandle(handle) + .setEngineConfig(cfg) + .build(); + auto plan_desc = plan.get_desc(); + run(plan_desc); + execution_plan_cache[key] = plan_desc; + return quantized_output.view(orig_sizes); + } catch (cudnn_frontend::cudnnException &e) {std::cout << "cudnn error:" << e.what() << std::endl;} catch(c10::CuDNNError &e) { std::cout << "other error" << e.what() << std::endl;} + } + + TORCH_CHECK(false, "Unable to find an engine to execute this computation"); +} + +TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::add"), TORCH_FN(add)); + m.impl(TORCH_SELECTIVE_NAME("quantized::add_relu"), TORCH_FN(add)); +} + +} // namespace +} // namespace native +} // namespace at + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git 
a/aten/src/ATen/native/quantized/cudnn/Conv.cpp b/aten/src/ATen/native/quantized/cudnn/Conv.cpp index a96e6d571261ce..abd555557ffe60 100644 --- a/aten/src/ATen/native/quantized/cudnn/Conv.cpp +++ b/aten/src/ATen/native/quantized/cudnn/Conv.cpp @@ -8,57 +8,25 @@ #if HAS_CUDNN_V8() -#include #include -#include #include +#include #include #include +#include +#include #include -#include #include +#include #include -#include #include +#include +#include -namespace at { namespace native{ - -namespace { - -uint8_t getAlignment(const Tensor &t) { - // alignment are in bytes - uint8_t alignment = 1; - uintptr_t address = reinterpret_cast(t.data_ptr()); - while (address % alignment == 0 && alignment < 16) alignment *= 2; - return alignment; -} - -cudnn_frontend::Tensor getTensorDescriptor(const Tensor &t, int64_t id, uint8_t alignment) { - auto shape = t.sizes(); - auto strides = t.strides(); - return cudnn_frontend::TensorBuilder() - .setDim(shape.size(), shape.data()) - .setStrides(strides.size(), strides.data()) - .setId(id) - .setAlignment(alignment) - .setDataType(getCudnnDataType(t)) - .build(); -} - -cudnn_frontend::Tensor getTensorDescriptor(const IntArrayRef& shape, const IntArrayRef& strides, cudnnDataType_t cudnn_dtype, int64_t id, uint8_t alignment) { - return cudnn_frontend::TensorBuilder() - .setDim(shape.size(), shape.data()) - .setStrides(strides.size(), strides.data()) - .setId(id) - .setAlignment(alignment) - .setDataType(cudnn_dtype) - .build(); -} - -// TODO: there is a table from input dtype and weight dtype to operator dtype, +// TODO: there is a table from input dtype and weight dtype to operator qdtype, // we can derive the operator dtype based on input dtype -cudnn_frontend::ConvDesc_v8 getConvDescriptor(cudnnDataType_t dataType, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation) { +cudnn_frontend::ConvDesc_v8 getConvDescriptor(cudnnDataType_t dataType, c10::IntArrayRef padding, c10::IntArrayRef stride, c10::IntArrayRef dilation) { uint64_t convDim = stride.size(); return cudnn_frontend::ConvDescBuilder() .setDataType(dataType) @@ -71,99 +39,18 @@ cudnn_frontend::ConvDesc_v8 getConvDescriptor(cudnnDataType_t dataType, IntArray .build(); } -// TODO: there is a table from input dtype to operator dtype, we can derive -// the operator dtype based on input dtype -cudnn_frontend::PointWiseDesc_v8 getPointWiseMulDescriptor(cudnnDataType_t dataType) { - return cudnn_frontend::PointWiseDescBuilder() - .setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_MUL) - .setMathPrecision(dataType) - .build(); -} - -// TODO: there is a table from input dtype to operator dtype, we can derive -// the operator dtype based on input dtype -cudnn_frontend::PointWiseDesc_v8 getPointWiseAddDescriptor(cudnnDataType_t dataType) { - return cudnn_frontend::PointWiseDescBuilder() - .setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_ADD) - .setMathPrecision(dataType) - .build(); -} - -// TODO: there is a table from input dtype to operator dtype, we can derive -// the operator dtype based on input dtype -cudnn_frontend::PointWiseDesc_v8 getPointWiseReluDescriptor(cudnnDataType_t dataType) { - return cudnn_frontend::PointWiseDescBuilder() - .setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_RELU_FWD) - .setMathPrecision(dataType) - .build(); -} - -void filterEngineConfigs( - cudnn_frontend::EngineConfigList &from, - cudnn_frontend::EngineConfigList &to, - bool deterministic, bool allow_tf32, c10::ScalarType scalar_type) -{ - auto filter = [=](cudnnBackendDescriptor_t c) { - if (deterministic) { - if 
(cudnn_frontend::hasNumericalNote(c)) return true; - } - if (scalar_type == kFloat || scalar_type == kChar || !allow_tf32) { - if (cudnn_frontend::hasNumericalNote(c)) return true; - if (cudnn_frontend::hasNumericalNote(c)) return true; - } - return false; - }; - cudnn_frontend::filter(from, to, filter); -} - -cudnn_frontend::ExecutionPlan -get_execplan_from_heuristics_else_fall_back(cudnn_frontend::OperationGraph&& opGraph, cudnnHandle_t handle_) { - auto heuristics = cudnn_frontend::EngineHeuristicsBuilder() - .setOperationGraph(opGraph) - .setHeurMode(CUDNN_HEUR_MODE_INSTANT) - .build(); - - // std::cout << "Heuristic has " << heuristics.getEngineConfigCount() << " configurations " << std::endl; - auto& engine_config = heuristics.getEngineConfig(heuristics.getEngineConfigCount()); - - // Try engine configs returned by the heuristics and pick up the first one that works. - for (auto& ecfg : engine_config) { - try { - auto plan = cudnn_frontend::ExecutionPlanBuilder() - .setHandle(handle_) - .setEngineConfig(ecfg, opGraph.getTag()) - .build(); - return plan; - } catch (cudnn_frontend::cudnnException& e) { - continue; - } - } - - { - auto total_engines = opGraph.getEngineCount(); - // std::cout << opGraph.describe() << " has " << total_engines << " engines." << std::endl; - auto engine = cudnn_frontend::EngineBuilder().setGlobalEngineIdx(0).setOperationGraph(opGraph).build(); - // std::cout << engine.describe() << std::endl; - - auto engine_config = cudnn_frontend::EngineConfigBuilder().setEngine(engine).build(); - // std::cout << engine_config.describe() << std::endl; - - return cudnn_frontend::ExecutionPlanBuilder().setHandle(handle_).setEngineConfig(engine_config).build(); - } -} - +// FIXME: make this thread-safe by reusing the benchmark cache in Conv_v7.cpp +namespace { struct CacheKey { - ConvolutionParams params; + at::native::ConvolutionParams params; uint8_t input_alignment; uint8_t weight_alignment; uint8_t output_alignment; // default to -1 when no bias int8_t bias_alignment; }; - -// FIXME: make this thread-safe by reusing the benchmark cache in Conv_v7.cpp -std::unordered_map, ParamsEqual> execution_plan_cache; - +std::unordered_map, at::native::ParamsEqual> execution_plan_cache; +} // TODO: we can use cudnn_frontend::ExecutionPlanCache when it supports caching // multiple operators // reference: https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/conv_sample.cpp#L293 @@ -175,9 +62,9 @@ at::SmallVector MakeConvOutputShape( int M, // output channels const std::array& input_image_shape, const std::vector& kernel, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef dilation); + const torch::List& stride, + const torch::List& padding, + const torch::List& dilation); template <> at::SmallVector MakeConvOutputShape<2>( @@ -185,9 +72,9 @@ at::SmallVector MakeConvOutputShape<2>( int M, // output channels const std::array& input_image_shape, const std::vector& kernel, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef dilation) { + const torch::List& stride, + const torch::List& padding, + const torch::List& dilation) { const int H = input_image_shape[0]; const int W = input_image_shape[1]; const int64_t Y_H = @@ -197,94 +84,82 @@ at::SmallVector MakeConvOutputShape<2>( return {N, M, Y_H, Y_W}; } + // the parameter quantized_output is a quantized tensor +template template -void raw_cudnn_convolution_forward_out( - const Tensor& quantized_output, - const Tensor& input, - const Tensor& weight, - const c10::optional &bias, - IntArrayRef padding, - IntArrayRef stride, 
- IntArrayRef dilation, - int64_t groups, - bool benchmark, - bool deterministic, - bool allow_tf32, - float bias_multiplier, - float requantize_multiplier -) { - TORCH_CHECK(!benchmark, "not supported yet"); +void PackedConvWeightCudnn::apply_impl_helper(const at::Tensor& quantized_output, const at::Tensor& input, double output_scale) { if (quantized_output.numel() == 0) { return; } - - Tensor conv_output = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); + at::Tensor conv_output = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); // TODO: combine empty & fill_ using full_like or full - Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); + at::Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); + auto act_scale = input.q_scale(); + auto weight_scale = orig_weight_.q_scale(); + auto requantize_multiplier = act_scale * weight_scale / output_scale; requantize_multiplier_tensor.fill_(requantize_multiplier); c10::optional bias_multiplier_tensor; - c10::optional after_scales_bias; - c10::optional after_add; c10::optional broadcasted_bias; - c10::optional after_relu; - if (bias.has_value()) { + if (bias_.has_value()) { // the input bias is a 1-D tensor whose size is the same as the size of the second dimension of quantized_output. // we need to add trailing dimensions in order to properly broadcast bias, otherwise broadcast_to will fail. // the number of trailling dimensions is quantized_output.dim() - 2, so the new size of the broadcast_bias // becomes quantized_output.dim() - 2 + 1. 
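A minimal plain-ATen sketch of the reshape-and-broadcast step described in the comment above (illustrative only; it assumes a 4-D ChannelsLast output of shape [N, M, H, W] and a 1-D bias of length M, and the helper name is hypothetical):

#include <ATen/ATen.h>
#include <vector>

// Mirrors the broadcast logic described above: a 1-D bias [M] is reshaped to
// [M, 1, 1] (dim() - 1 entries) and then broadcast against [N, M, H, W];
// trailing-dimension alignment makes the two shapes compatible.
at::Tensor broadcast_bias_like_output(const at::Tensor& bias,
                                      const at::Tensor& quantized_output) {
  std::vector<int64_t> new_size(quantized_output.dim() - 1, 1);
  new_size[0] = bias.size(0);
  return bias.reshape(new_size)
             .broadcast_to(quantized_output.sizes())
             .contiguous(c10::MemoryFormat::ChannelsLast);
}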
nothing needs to be done for the leading dimensions std::vector new_size(quantized_output.dim() - 1, 1); - new_size[0] = bias.value().size(0); - broadcasted_bias = bias.value().reshape(new_size); + new_size[0] = bias_.value().size(0); + broadcasted_bias = bias_.value().reshape(new_size); broadcasted_bias.value() = broadcasted_bias.value().broadcast_to(quantized_output.sizes()); broadcasted_bias.value() = broadcasted_bias.value().contiguous(c10::MemoryFormat::ChannelsLast); bias_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); + auto bias_multiplier = 1.0 / (act_scale * weight_scale); bias_multiplier_tensor.value().fill_(bias_multiplier); - after_scales_bias = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); - after_add = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); - } - if (kReluFused) { - after_relu = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast); } - cudnnHandle_t handle = getCudnnHandle(); + cudnnHandle_t handle = at::native::getCudnnHandle(); CacheKey key; - setConvolutionParams(&key.params, input, weight, padding, stride, dilation, groups, deterministic, allow_tf32); + bool deterministic{true}; + bool allow_tf32{false}; + auto padding_vec = padding_.vec(); + auto stride_vec = stride_.vec(); + auto dilation_vec = dilation_.vec(); + setConvolutionParams(&key.params, input, orig_weight_, padding_vec, stride_vec, dilation_vec, groups_, deterministic, allow_tf32); + // operator datatype needs to be int32 for int8 convolution, but we can // set the datatype for output tensor to int32 or fp32 key.params.dataType = CUDNN_DATA_INT32; - key.input_alignment = getAlignment(input); - key.output_alignment = getAlignment(conv_output); - key.weight_alignment = getAlignment(weight); - if (bias.has_value()) { - key.bias_alignment = getAlignment(broadcasted_bias.value()); + key.input_alignment = cudnn_utils::getAlignment(input); + key.output_alignment = cudnn_utils::getAlignment(conv_output); + key.weight_alignment = cudnn_utils::getAlignment(orig_weight_); + if (bias_.has_value()) { + key.bias_alignment = cudnn_utils::getAlignment(broadcasted_bias.value()); } else { key.bias_alignment = -1; } auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor plan_desc) { auto workspace_size = 0; - auto workspace = at::empty({workspace_size}, input.options().dtype(kByte)); + auto workspace = at::empty({workspace_size}, input.options().dtype(at::kByte)); std::vector data_ptrs; std::vector uids; data_ptrs.reserve(10); uids.reserve(10); data_ptrs = {reinterpret_cast(input.data_ptr()), conv_output.data_ptr(), - reinterpret_cast(weight.data_ptr()), + reinterpret_cast(orig_weight_.data_ptr()), requantize_multiplier_tensor.data_ptr(), reinterpret_cast(quantized_output.data_ptr())}; uids = {'x', 'y', 'w', 's', 'r'}; - if (bias.has_value()) { + if (bias_.has_value()) { data_ptrs.insert(data_ptrs.end(), {broadcasted_bias.value().data_ptr(), bias_multiplier_tensor.value().data_ptr(), - after_scales_bias.value().data_ptr(), after_add.value().data_ptr()}); + broadcasted_bias.value().data_ptr(), conv_output.data_ptr()}); uids.insert(uids.end(), {'b', 'c', 'd', 'e'}); if (kReluFused) { - data_ptrs.emplace_back(after_relu.value().data_ptr()), + data_ptrs.emplace_back(conv_output.data_ptr()), uids.emplace_back('f'); } } else { if (kReluFused) { - 
data_ptrs.emplace_back(after_relu.value().data_ptr()); + data_ptrs.emplace_back(conv_output.data_ptr()); uids.emplace_back('f'); } } @@ -307,41 +182,40 @@ void raw_cudnn_convolution_forward_out( // where act_fp32 and w_fp32 are the input and weight variables, resp. // output is a fp32 tensor auto conv_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR) - .setxDesc(getTensorDescriptor(input, 'x', key.input_alignment)) - .setyDesc(getTensorDescriptor(conv_output, 'y', key.output_alignment)) - .setwDesc(getTensorDescriptor(weight, 'w', key.weight_alignment)) - .setcDesc(getConvDescriptor(key.params.dataType, padding, stride, dilation)) + .setxDesc(cudnn_utils::getTensorDescriptor(input.sizes(), input.strides(), CUDNN_DATA_INT8, 'x', key.input_alignment)) + .setyDesc(cudnn_utils::getTensorDescriptor(conv_output, 'y', key.output_alignment)) + .setwDesc(cudnn_utils::getTensorDescriptor(orig_weight_.sizes(), orig_weight_.strides(), CUDNN_DATA_INT8, 'w', key.weight_alignment)) + .setcDesc(getConvDescriptor(key.params.dataType, padding_vec, stride_vec, dilation_vec)) .build(); // std::cout << "operator:" << conv_op.describe() << std::endl; c10::optional bias_mult_op; c10::optional sum_conv_bias_op; - if (bias.has_value()) { + if (bias_.has_value()) { // we can't directly assign bias_mult_op becauase operator= is deleted for cudnn_frontend::Operation; // alternatively, I think we can use std::unique_ptr and dynamically allocate these builder ops // but here, we chose to do it statically. c10::optional::emplace() enables this approach - // TODO: can we assign the result back into bias and get rid of after_scales_bias? pending NVIDIA response // bias_mult_op computes bias_fp32 / (act_scale * w_scale) or bias_fp32 * (1 / (act_scale * w_scale)) // where bias_multiplier = (1 / (act_scale * w_scale)) // output is a fp32 tensor + // we use inplace operation here where the output is assigned to the input bias_mult_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) - .setxDesc(getTensorDescriptor(broadcasted_bias.value(), 'b', getAlignment(broadcasted_bias.value()))) - .setbDesc(getTensorDescriptor(bias_multiplier_tensor.value(), 'c', getAlignment(bias_multiplier_tensor.value()))) - .setyDesc(getTensorDescriptor(after_scales_bias.value(), 'd', getAlignment(after_scales_bias.value()))) - .setpwDesc(getPointWiseMulDescriptor(getCudnnDataType(bias_multiplier_tensor.value()))) + .setxDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'b', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setbDesc(cudnn_utils::getTensorDescriptor(bias_multiplier_tensor.value(), 'c', cudnn_utils::getAlignment(bias_multiplier_tensor.value()))) + .setyDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'd', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(bias_multiplier_tensor.value()))) .build()); - // TODO: can we assign the result back into conv_output and get rid of after_add? - // computes (act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)]) - // where the 1st and 2nd summands is conv_output and after_scales_bias, resp. + // where the 1st and 2nd summands is conv_output and broadcasted_bias, resp. 
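The pointwise ops assembled below (bias multiply, add, optional relu, requantize) implement the scale algebra spelled out later in this file; as an unfused, plain-ATen reference of the same arithmetic (a sketch only: unit stride/padding/groups, symmetric quantization with zero zero-points, and illustrative names):

#include <ATen/ATen.h>

// Eager-mode reference for int8 conv + bias + (relu) + requantize.
// act_int8 / w_int8 are int8 integer representations; bias_fp32 is fp32.
at::Tensor requantize_reference(const at::Tensor& act_int8, double act_scale,
                                const at::Tensor& w_int8, double w_scale,
                                const at::Tensor& bias_fp32,
                                double out_scale, bool relu_fused) {
  // integer-domain convolution, accumulated in fp32
  at::Tensor acc = at::conv2d(act_int8.to(at::kFloat), w_int8.to(at::kFloat));
  // fold the fp32 bias into the integer domain: bias_fp32 / (act_scale * w_scale)
  acc = acc + bias_fp32.reshape({1, -1, 1, 1}) / (act_scale * w_scale);
  if (relu_fused) {
    acc = acc.relu();
  }
  // requantize_multiplier = act_scale * w_scale / out_scale
  return (acc * (act_scale * w_scale / out_scale)).round().clamp(-128, 127).to(at::kChar);
}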
// output is a fp32 tensor + // we use inplace operation here where the output is assigned to the input sum_conv_bias_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(conv_op.getOutputTensor()) - .setbDesc(getTensorDescriptor(after_scales_bias.value(), 'd', getAlignment(after_scales_bias.value()))) - .setyDesc(getTensorDescriptor(after_add.value(), 'e', getAlignment(after_add.value()))) - .setpwDesc(getPointWiseAddDescriptor(getCudnnDataType(after_scales_bias.value()))) + .setbDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'd', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setyDesc(cudnn_utils::getTensorDescriptor(conv_output, 'e', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseAddDescriptor(at::native::getCudnnDataType(broadcasted_bias.value()))) .build()); } @@ -349,13 +223,13 @@ void raw_cudnn_convolution_forward_out( // or relu(act_int8 * w_int8) if bias is not present. // output is a fp32 tensor c10::optional relu_op; - std::shared_ptr tensor2requant_ptr = bias.has_value() ? sum_conv_bias_op.value().getOutputTensor() : conv_op.getOutputTensor(); + std::shared_ptr tensor2requant_ptr = bias_.has_value() ? sum_conv_bias_op.value().getOutputTensor() : conv_op.getOutputTensor(); if (kReluFused) { - // TODO: can we assign the result back into conv_output and get rid of after_relu? + // we use inplace operation here where the output is assigned to the input relu_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(tensor2requant_ptr) - .setyDesc(getTensorDescriptor(after_relu.value(), 'f', getAlignment(after_relu.value()))) - .setpwDesc(getPointWiseReluDescriptor(getCudnnDataType(after_relu.value()))) + .setyDesc(cudnn_utils::getTensorDescriptor(conv_output, 'f', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseReluDescriptor(at::native::getCudnnDataType(conv_output))) .build()); } @@ -364,14 +238,14 @@ void raw_cudnn_convolution_forward_out( // output is a fp32 tensor auto requant_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(kReluFused ? 
relu_op.value().getOutputTensor() : tensor2requant_ptr) - .setbDesc(getTensorDescriptor(requantize_multiplier_tensor, 's', getAlignment(requantize_multiplier_tensor))) - .setyDesc(getTensorDescriptor(quantized_output.sizes(), quantized_output.strides(), CUDNN_DATA_INT8, 'r', getAlignment(quantized_output))) - .setpwDesc(getPointWiseMulDescriptor(getCudnnDataType(requantize_multiplier_tensor))) + .setbDesc(cudnn_utils::getTensorDescriptor(requantize_multiplier_tensor, 's', cudnn_utils::getAlignment(requantize_multiplier_tensor))) + .setyDesc(cudnn_utils::getTensorDescriptor(quantized_output.sizes(), quantized_output.strides(), CUDNN_DATA_INT8, 'r', cudnn_utils::getAlignment(quantized_output))) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(requantize_multiplier_tensor))) .build(); // std::cout << "operator:" << requant_op.describe() << std::endl; std::vector ops{&conv_op}; - if (bias.has_value()) { + if (bias_.has_value()) { ops.emplace_back(&(bias_mult_op.value())); ops.emplace_back(&(sum_conv_bias_op.value())); } @@ -399,8 +273,8 @@ void raw_cudnn_convolution_forward_out( auto& fallback_list = fallback.getFallbackList(); cudnn_frontend::EngineConfigList filtered_configs; - filterEngineConfigs(engine_configs, filtered_configs, deterministic, allow_tf32, input.scalar_type()); - filterEngineConfigs(fallback_list, filtered_configs, deterministic, allow_tf32, input.scalar_type()); + cudnn_utils::filterEngineConfigs(engine_configs, filtered_configs, deterministic, allow_tf32, at::kChar); + cudnn_utils::filterEngineConfigs(fallback_list, filtered_configs, deterministic, allow_tf32, at::kChar); for (auto &cfg : engine_configs) { try { @@ -412,7 +286,7 @@ void raw_cudnn_convolution_forward_out( run(plan_desc); execution_plan_cache[key] = plan_desc; return; - } catch (cudnn_frontend::cudnnException &e) {std::cout << "cudnn error:" << e.what() << std::endl;} catch(CuDNNError &e) { std::cout << "other error" << e.what() << std::endl;} + } catch (cudnn_frontend::cudnnException &e) {std::cout << "cudnn error:" << e.what() << std::endl;} catch(c10::CuDNNError &e) { std::cout << "other error" << e.what() << std::endl;} } TORCH_CHECK(false, "Unable to find an engine to execute this computation"); @@ -436,94 +310,90 @@ out_int8 = (act_fp32 * w_fp32 + [bias_fp32]) / out_scale + out_zero_point = (act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)]) / (out_scale / (act_scale * w_scale)) = requantize((act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)]), out_scale / (act_scale * w_scale)) */ -template -Tensor raw_cudnn_convolution_forward( - const Tensor& act, - const Tensor& weight, - c10::optional bias, - IntArrayRef padding, - IntArrayRef stride, - IntArrayRef dilation, - int64_t groups, - bool benchmark, - bool deterministic, - bool allow_tf32, - float bias_multiplier, - float requantize_multiplier, +template +template +at::Tensor PackedConvWeightCudnn::apply_impl( + const at::Tensor& act, double output_scale, int64_t output_zero_point) { - // TODO: add dimension validations for input/weight/bias const int N = act.size(0); const int D = kSpatialDim == 3 ? 
act.size(2) : 1; const int H = act.size(kSpatialDim); const int W = act.size(kSpatialDim + 1); - const int M = weight.size(0); // output channels - std::vector kernel_size = {weight.size(2), weight.size(3)}; - at::SmallVector output_shape{MakeConvOutputShape(N, M, {H, W}, - kernel_size, stride, padding, dilation)}; - Tensor quantized_output = at::_empty_affine_quantized( + const int M = orig_weight_.size(0); // output channels + std::vector kernel_size = {orig_weight_.size(2), orig_weight_.size(3)}; + at::SmallVector output_shape = MakeConvOutputShape(N, M, {H, W}, + kernel_size, stride_, padding_, dilation_); + at::Tensor quantized_output = at::_empty_affine_quantized( output_shape, - at::device(at::kCUDA).dtype(ScalarType::QInt8), + at::device(at::kCUDA).dtype(at::ScalarType::QInt8), output_scale, output_zero_point, at::MemoryFormat::ChannelsLast); - raw_cudnn_convolution_forward_out( - quantized_output, act, weight, bias, - padding, stride, dilation, groups, - benchmark, - deterministic, - allow_tf32, - bias_multiplier, - requantize_multiplier); - + // requantization + // out_int8 = act_int8 * weight_int8 * act_scale * w_scale / output_scale + apply_impl_helper( + quantized_output, act, output_scale); return quantized_output; } +template +at::Tensor PackedConvWeightCudnn::apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +template +at::Tensor PackedConvWeightCudnn::apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +template at::Tensor PackedConvWeightCudnn<2>::apply( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +template at::Tensor PackedConvWeightCudnn<2>::apply_relu( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point); + +namespace at { +namespace native { +namespace { template class QConvInt8 final { public: - static Tensor run( - Tensor act, - Tensor weight, - c10::optional bias, - torch::List stride, - torch::List padding, - torch::List dilation, - int64_t groups, + static at::Tensor run( + at::Tensor act, + const c10::intrusive_ptr>& packed_weight, double output_scale, int64_t output_zero_point) { act = act.contiguous(c10::MemoryFormat::ChannelsLast); - weight = weight.contiguous(c10::MemoryFormat::ChannelsLast); - // requantization - // out_int8 = act_int8 * weight_int8 * act_scale * w_scale / output_scale - auto act_scale = act.q_scale(); - auto weight_scale = weight.q_scale(); - auto requantize_multiplier = act_scale * weight_scale / output_scale; - auto bias_multiplier = 1.0 / (act_scale * weight_scale); - // TODO: check all zero_points are zero/all tensors are symmetrically quantized - return raw_cudnn_convolution_forward( - act.int_repr(), weight.int_repr(), bias, - IntArrayRef(padding.vec()), IntArrayRef(stride.vec()), IntArrayRef(dilation.vec()), groups, - false /* benchmark */, - true /* deterministic */, - false /* allow_tf32 */, - bias_multiplier, - requantize_multiplier, - output_scale, - output_zero_point - ); + if (kReluFused) { + return packed_weight->apply_relu(act, output_scale, output_zero_point); + } else { + return packed_weight->apply(act, output_scale, output_zero_point); + } } }; TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { - m.impl(TORCH_SELECTIVE_NAME("quantized::conv2d_cudnn"), QConvInt8<2, false>::run); - m.impl(TORCH_SELECTIVE_NAME("quantized::conv2d_relu_cudnn"), QConvInt8<2, 
true>::run); + m.impl(TORCH_SELECTIVE_NAME("quantized::conv2d.new"), QConvInt8<2, false>::run); + m.impl(TORCH_SELECTIVE_NAME("quantized::conv2d_relu.new"), QConvInt8<2, true>::run); } } // namespace -}} // at::native +} // namespace native +} // namespace at + #endif // HAS_CUDNN_V8 #endif // AT_CUDNN_ENABLED diff --git a/aten/src/ATen/native/quantized/cudnn/Linear.cpp b/aten/src/ATen/native/quantized/cudnn/Linear.cpp new file mode 100644 index 00000000000000..e4579bfc826bcf --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/Linear.cpp @@ -0,0 +1,345 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +// TODO: there is a table from input dtype and weight dtype to operator dtype, +// we can derive the operator dtype based on input dtype +cudnn_frontend::MatMulDesc_v8 getLinearDescriptor(cudnnDataType_t dataType) { + return cudnn_frontend::MatMulDescBuilder() + .setMathPrecision(dataType) + .build(); +} + +struct CacheKey { + uint8_t input_alignment; + uint8_t weight_alignment; + uint8_t output_alignment; + // default to -1 when no bias + int8_t bias_alignment; +}; + +// FIXME: make this thread-safe by reusing the benchmark cache in Conv_v7.cpp +namespace { +std::unordered_map, at::native::ParamsEqual> execution_plan_cache; +} +// TODO: we can use cudnn_frontend::ExecutionPlanCache when it supports caching +// multiple operators +// reference: https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/conv_sample.cpp#L293 +//static cudnn_frontend::ExecutionPlanCache plan_cache("sample_cache"); + +// currently we only support int8 symmetric (zero_point = 0 for inputs and output) quantized linear op +// We implement relu(act_int8 * transpose(w_int8) + [bias_fp32/(act_scale * w_scale] ) * ( act_scale * w_scale / out_scale ) +// which requires 5 cudnn ops (1 matmul, 2 multiplication, 1 add, and 1 relu ops) +// matmul op: linear_op +// Multiplication ops: rhs_mult_op, requant_op +// Addition op: add_op +// Relu op: relu_op +template +void PackedLinearWeightCudnn::apply_impl_helper(const at::Tensor& quantized_output, const at::Tensor& input, double output_scale) { + if (quantized_output.numel() == 0) { + return; + } + at::Tensor linear_output = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat)); + auto act_scale = input.q_scale(); + auto weight_scale = orig_weight.q_scale(); + auto requantize_multiplier = act_scale * weight_scale / output_scale; + at::Tensor requantize_multiplier_tensor = at::full(quantized_output.sizes(), requantize_multiplier, at::device(at::kCUDA).dtype(at::kFloat)); + requantize_multiplier_tensor.fill_(requantize_multiplier); + c10::optional bias_multiplier_tensor; + c10::optional broadcasted_bias; + if (bias_.has_value()) { + // the input bias is a 1-D tensor whose size is the same as the size of the second dimension of quantized_output. + // we need to add trailing dimensions in order to properly broadcast bias, otherwise broadcast_to will fail. + // the number of trailling dimensions is quantized_output.dim() - 2. 
We also prepend a leading dimension for clarity + std::vector new_size(quantized_output.dim(), 1); + new_size[1] = bias_.value().size(0); + broadcasted_bias = bias_.value().reshape(new_size); + broadcasted_bias.value() = broadcasted_bias.value().broadcast_to(quantized_output.sizes()); + bias_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat)); + auto bias_multiplier = 1.0 / (act_scale * weight_scale); + bias_multiplier_tensor.value().fill_(bias_multiplier); + } + + cudnnHandle_t handle = at::native::getCudnnHandle(); + CacheKey key; + bool deterministic{true}; + bool allow_tf32{false}; + + key.input_alignment = cudnn_utils::getAlignment(input); + key.output_alignment = cudnn_utils::getAlignment(linear_output); + key.weight_alignment = cudnn_utils::getAlignment(orig_weight); + if (bias_.has_value()) { + key.bias_alignment = cudnn_utils::getAlignment(broadcasted_bias.value()); + } else { + key.bias_alignment = -1; + } + // the matmul operation is input * transpose(weight), so we will work with the transposed weight + auto weight_transposed = transpose(orig_weight, 0, 1); + // cudnn expects tensors to be at least 3D. weight_transposed is currently 2D. we will create a 3D view + // by prepending a leading dummy dimension (cudnn expects leading dimensions to be the dummy dimensions) + std::vector new_sizes(3, 1); + new_sizes.back() = weight_transposed.size(1); + new_sizes[1] = weight_transposed.size(0); + weight_transposed = weight_transposed.view(new_sizes); + // TODO: remove this with int8 matmul is supported + auto input_fp = input.int_repr().to(at::kFloat); + auto weight_fp = weight_transposed.int_repr().to(at::kFloat); + + auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor plan_desc) { + auto workspace_size = 0; + auto workspace = at::empty({workspace_size}, input.options().dtype(at::kByte)); + std::vector data_ptrs; + std::vector uids; + data_ptrs.reserve(10); + uids.reserve(10); + data_ptrs = {input_fp.data_ptr(), linear_output.data_ptr(), + weight_fp.data_ptr(), + requantize_multiplier_tensor.data_ptr(), + reinterpret_cast(quantized_output.data_ptr())}; + uids = {'x', 'y', 'w', 's', 'r'}; + if (bias_.has_value()) { + data_ptrs.insert(data_ptrs.end(), {broadcasted_bias.value().data_ptr(), bias_multiplier_tensor.value().data_ptr(), + broadcasted_bias.value().data_ptr(), linear_output.data_ptr()}); + uids.insert(uids.end(), {'b', 'c', 'd', 'e'}); + if (kReluFused) { + data_ptrs.emplace_back(linear_output.data_ptr()), + uids.emplace_back('f'); + } + } else { + if (kReluFused) { + data_ptrs.emplace_back(linear_output.data_ptr()); + uids.emplace_back('f'); + } + } + auto variantPack = cudnn_frontend::VariantPackBuilder() + .setWorkspacePointer(workspace.data_ptr()) + .setDataPointers(uids.size(), data_ptrs.data()) + .setUids(uids.size(), uids.data()) + .build(); + auto variant_pack_desc = variantPack.get_raw_desc(); + AT_CUDNN_CHECK(cudnnBackendExecute(handle, plan_desc->get_backend_descriptor(), variant_pack_desc)); + }; + + auto search = execution_plan_cache.find(key); + if (search != execution_plan_cache.end()) { + cudnn_frontend::ManagedOpaqueDescriptor plan_desc = search->second; + run(plan_desc); + return; + } + + // linear_op computes act_int8 * tranpose(w_int8) (matrix multiplication) + // where act_int8 and w_int8 are the input and weight variables, resp. 
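The find-or-build flow around execution_plan_cache (used a few lines above and again at the end of this function) is a plain memoization pattern; a generic sketch, with PlanDesc standing in for cudnn_frontend::ManagedOpaqueDescriptor and the hash/equality functors playing the role of at::native::ParamsHash / ParamsEqual:

#include <unordered_map>

// Build a plan once per (shape, alignment, ...) configuration, then reuse it.
// Not thread-safe, mirroring the FIXME on the cache declared above.
template <class Key, class PlanDesc, class Hash, class Equal,
          class BuildFn, class RunFn>
void run_with_plan_cache(std::unordered_map<Key, PlanDesc, Hash, Equal>& cache,
                         const Key& key, BuildFn build, RunFn run) {
  auto it = cache.find(key);
  if (it != cache.end()) {
    run(it->second);         // fast path: reuse the previously built plan
    return;
  }
  PlanDesc plan = build();   // slow path: try engine configs until one builds
  run(plan);
  cache.emplace(key, plan);
}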
+ // output is a fp32 tensor + auto linear_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR) + // TODO: make these 2 CUDNN_DATA_INT8 when cudnn enables int8 matmul + // .setaMatDesc(cudnn_utils::getTensorDescriptor(input.sizes(), input.strides(), CUDNN_DATA_FLOAT, 'x', key.input_alignment)) + .setaMatDesc(cudnn_utils::getTensorDescriptor(input_fp.sizes(), input_fp.strides(), CUDNN_DATA_FLOAT, 'x', key.input_alignment)) + // .setbMatDesc(cudnn_utils::getTensorDescriptor(orig_weight.sizes(), orig_weight.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment)) + .setbMatDesc(cudnn_utils::getTensorDescriptor(weight_fp.sizes(), weight_fp.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment)) + .setcMatDesc(cudnn_utils::getTensorDescriptor(linear_output, 'y', key.output_alignment)) + .setmatmulDesc(getLinearDescriptor(CUDNN_DATA_FLOAT)) // is this right? should it be float? + .build(); + // std::cout << "operator:" << linear_op.describe() << std::endl; + + c10::optional bias_mult_op; + c10::optional sum_linear_bias_op; + if (bias_.has_value()) { + // we can't directly assign bias_mult_op becauase operator= is deleted for cudnn_frontend::Operation; + // alternatively, I think we can use std::unique_ptr and dynamically allocate these builder ops + // but here, we chose to do it statically. c10::optional::emplace() enables this approach + + // bias_mult_op computes bias_fp32 / (act_scale * w_scale) or bias_fp32 * (1 / (act_scale * w_scale)) + // where bias_multiplier = (1 / (act_scale * w_scale)) + // output is a fp32 tensor + // we use inplace operation here where the output is assigned to the input + bias_mult_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'b', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setbDesc(cudnn_utils::getTensorDescriptor(bias_multiplier_tensor.value(), 'c', cudnn_utils::getAlignment(bias_multiplier_tensor.value()))) + .setyDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'd', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(bias_multiplier_tensor.value()))) + .build()); + + // computes (act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)]) + // where the 1st and 2nd summands is linear_output and broadcasted_bias, resp. + // output is a fp32 tensor + // we use inplace operation here where the output is assigned to the input + sum_linear_bias_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(linear_op.getOutputTensor()) + .setbDesc(cudnn_utils::getTensorDescriptor(broadcasted_bias.value(), 'd', cudnn_utils::getAlignment(broadcasted_bias.value()))) + .setyDesc(cudnn_utils::getTensorDescriptor(linear_output, 'e', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseAddDescriptor(at::native::getCudnnDataType(broadcasted_bias.value()))) + .build()); + } + + // relu_op computes relu(act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)] + // or relu(act_int8 * w_int8) if bias is not present. + // output is a fp32 tensor + c10::optional relu_op; + std::shared_ptr tensor2requant_ptr = bias_.has_value() ? 
sum_linear_bias_op.value().getOutputTensor() : linear_op.getOutputTensor(); + if (kReluFused) { + // we use inplace operation here where the output is assigned to the input + relu_op.emplace(cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(tensor2requant_ptr) + .setyDesc(cudnn_utils::getTensorDescriptor(linear_output, 'f', key.output_alignment)) + .setpwDesc(cudnn_utils::getPointWiseReluDescriptor(at::native::getCudnnDataType(linear_output))) + .build()); + } + + // requant_op computes relu(act_int8 * w_int8 + [bias_fp32/(act_scale * w_scale)]) / (out_scale / (act_scale * w_scale)) + // or relu(act_int8 * w_int8) / (out_scale / (act_scale * w_scale))) if bias is not present. + // output is a fp32 tensor + auto requant_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) + .setxDesc(kReluFused ? relu_op.value().getOutputTensor() : tensor2requant_ptr) + .setbDesc(cudnn_utils::getTensorDescriptor(requantize_multiplier_tensor, 's', cudnn_utils::getAlignment(requantize_multiplier_tensor))) + .setyDesc(cudnn_utils::getTensorDescriptor(quantized_output.sizes(), quantized_output.strides(), CUDNN_DATA_INT8, 'r', cudnn_utils::getAlignment(quantized_output))) + .setpwDesc(cudnn_utils::getPointWiseMulDescriptor(at::native::getCudnnDataType(requantize_multiplier_tensor))) + .build(); + // // std::cout << "operator:" << requant_op.describe() << std::endl; + + std::vector ops{&linear_op}; + if (bias_.has_value()) { + ops.emplace_back(&(bias_mult_op.value())); + ops.emplace_back(&(sum_linear_bias_op.value())); + } + if (kReluFused) { + ops.emplace_back(&(relu_op.value())); + } + ops.emplace_back(&requant_op); + + auto opGraph = cudnn_frontend::OperationGraphBuilder() + .setHandle(handle) + .setOperationGraph(ops.size(), ops.data()) + .build(); + // std::cout << "opGraph: " << opGraph.describe() << std::endl; + + auto heuristics = cudnn_frontend::EngineHeuristicsBuilder() + .setOperationGraph(opGraph) + .setHeurMode(CUDNN_HEUR_MODE_INSTANT) + .build(); + auto fallback = cudnn_frontend::EngineFallbackListBuilder() + .setOperationGraph(opGraph) + .setOperation(CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR) + .build(); + + auto& engine_configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount()); + auto& fallback_list = fallback.getFallbackList(); + + cudnn_frontend::EngineConfigList filtered_configs; + cudnn_utils::filterEngineConfigs(engine_configs, filtered_configs, deterministic, allow_tf32, at::kChar); + cudnn_utils::filterEngineConfigs(fallback_list, filtered_configs, deterministic, allow_tf32, at::kChar); + + for (auto &cfg : engine_configs) { + try { + auto plan = cudnn_frontend::ExecutionPlanBuilder() + .setHandle(handle) + .setEngineConfig(cfg) + .build(); + auto plan_desc = plan.get_desc(); + run(plan_desc); + execution_plan_cache[key] = plan_desc; + return; + } catch (cudnn_frontend::cudnnException &e) {std::cout << "cudnn error:" << e.what() << std::endl;} catch(c10::CuDNNError &e) { std::cout << "other error" << e.what() << std::endl;} + } + + TORCH_CHECK(false, "Unable to find an engine to execute this computation"); +} + +// output Tensor will be a clampped int8 Tensor +// both act and weight will be int8 Tensor +// Numerics are the same as conv (see aten/src/ATen/native/quantized/Conv.cpp): +template +at::Tensor PackedLinearWeightCudnn::apply_impl( + const at::Tensor& act, + double output_scale, + int64_t output_zero_point) { + std::vector original_output_shape{act.sizes().vec()}; // 2D + 
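cudnn's matmul expects operands that are at least 3-D (leading size-1 dimensions act as batch dimensions), which is why apply_impl views its 2-D tensors up to 3-D and views the result back down, as the statements below do. A compact plain-ATen restatement of the shapes involved (illustrative; fp32 at::matmul stands in for the cudnn graph and the names are not the diff's):

#include <ATen/ATen.h>

// act: [B, K], weight: [N, K] (out_features x in_features)
at::Tensor linear_3d_view_demo(const at::Tensor& act, const at::Tensor& weight) {
  auto act3d = act.view({1, act.size(0), act.size(1)});                            // [1, B, K]
  auto w_t3d = weight.t().contiguous().view({1, weight.size(1), weight.size(0)});  // [1, K, N]
  auto out3d = at::matmul(act3d, w_t3d);                                           // [1, B, N]
  return out3d.view({act.size(0), weight.size(0)});                                // back to [B, N]
}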
original_output_shape.back() = orig_weight.size(0); // output channels + // cudnn expects tensors to be at least 3D. we will prepend a dummy dimension for quantized_output + std::vector output_shape(3, 1); + output_shape[1] = original_output_shape[0]; + output_shape[2] = original_output_shape[1]; + at::Tensor quantized_output = at::_empty_affine_quantized( + output_shape, + at::device(at::kCUDA).dtype(at::ScalarType::QInt8), + output_scale, + output_zero_point); + // cudnn expects tensors to be at least 3D. act is currently 2D. we will create a 3D view + std::vector new_sizes(3, 1); + // cudnn expects leading dimensions to be the dummy dimensions + new_sizes.back() = act.sizes().back(); + new_sizes[1] = act.size(0); + apply_impl_helper( + quantized_output, act.view(new_sizes), output_scale); + return quantized_output.view(original_output_shape); +} + +at::Tensor PackedLinearWeightCudnn::apply( + at::Tensor input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +at::Tensor PackedLinearWeightCudnn::apply_relu( + at::Tensor input, + double output_scale, + int64_t output_zero_point) { + return apply_impl(input, output_scale, output_zero_point); +} + +namespace at { +namespace native { +namespace { + +template +class QLinearInt8 final { + public: + static at::Tensor run( + at::Tensor act, + const c10::intrusive_ptr& packed_weight, + double output_scale, + int64_t output_zero_point) { + // TODO: if act is more than 2D, I think we should flatten the first n-1 dimensions? + // TODO: check all zero_points are zero/all tensors are symmetrically quantized + if (kReluFused) { + return packed_weight->apply_relu(act, output_scale, output_zero_point); + } else { + return packed_weight->apply(act, output_scale, output_zero_point); + } + } +}; + +TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::linear"), QLinearInt8::run); + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_relu"), QLinearInt8::run); +} + +} // namespace +} // namespace native +} // namespace at + + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/Pooling.cpp b/aten/src/ATen/native/quantized/cudnn/Pooling.cpp new file mode 100644 index 00000000000000..747be7a831d895 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/Pooling.cpp @@ -0,0 +1,212 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace at { +namespace native { +namespace { +// TODO: This function is the same as that of qpool.cpp. 
We should refactor this into quantized directory +// so that we don't need to duplicate the function +void check_maxpool2d_params( + IntArrayRef kernel_size, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation) { + TORCH_CHECK(kernel_size.size() == 1 || kernel_size.size() == 2, + "Expected 1d or 2d kernel size, got ", kernel_size.size()); + TORCH_CHECK(stride.empty() || stride.size() == 2, + "Expected no strides or 2d strides, got", stride.size()); + TORCH_CHECK(padding.size() == 1 || padding.size() == 2, + "Expected 1d or 2d padding, got ", padding.size()); + TORCH_CHECK(dilation.size() == 1 || dilation.size() == 2, + "Expected 1d or 2d dilation, got ", dilation.size()); +} +} + +// Currently we support 4D and 3D input (qx) tensors, the latter of which is supported for +// legacy reasons. The first dimension of a 4D input tensor is the batch size. +// For a 3D tensor, there is no batch size dimension -- it can be viewed as a single batch. +// cudnn's 2D pooling operation requires the input and output to be 4D tensors, so we must cast +// any 3D tensors to 4D prior to using cudnn +// This implementation currently uses the v7 cudnn APIs as v8 cudnn APIs are not yet available for +// pooling operations. +// Consult https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnPoolingForward for +// documentation on the APIs +// Currently, it appears there is no cudnn support for dilated pooling -- we will +// submit a feature request for this with cudnn +// TODO: ideally, we would like to use structured kernel support here so we do not have to repeat +// the input checks, however, that would require us to implement max_pool2d_with_indices_out_quantized_cuda +// based on how the dispatch table is currently constructed in native_functions.yaml. currently, +// there is no support for producing indices with cudnn max pooling, so until that becomes available, this cannot be done. +Tensor quantized_max_pool2d_cudnn( + const Tensor& qx, + IntArrayRef kernel_size, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + bool ceil_mode) { + check_maxpool2d_params( + kernel_size, + stride, + padding, + dilation); + if (stride.empty()) { + stride = kernel_size; + } + auto ndim = qx.dim(); + TORCH_CHECK( + ndim == 3 || ndim == 4, "Expecting the input tensor of rank 3 or 4."); + TORCH_CHECK( + kernel_size.size() == 2, + "quantized_max_pool2d_cudnn(): Expected kernel_size to be 2-dimensional: got ", + kernel_size.size()); + TORCH_CHECK( + stride.size() == 2, + "quantized_max_pool2d_cudnn(): Expected stride to be 2-dimensional: got ", + stride.size()); + TORCH_CHECK( + dilation.size() == 2, + "quantized_max_pool2d_cudnn(): Expected dilation to be 2-dimensional: got ", + dilation.size()); + TORCH_CHECK( + dilation[0] == 1 && dilation[1] == 1, + "quantized_max_pool2d_cudnn(): Expected dilation=[1, 1] (cudnn does not currently support dilation[i] != 1), got", + dilation); + TORCH_CHECK( + padding.size() == 2, + "quantized_max_pool2d_cudnn(): Expected padding to be 2-dimensional: got ", + padding.size()); + + auto input = qx; + if (ndim == 4) { + input = qx.contiguous(MemoryFormat::ChannelsLast); + } else { // 3D + std::vector new_sizes{1, qx.size(0), qx.size(1), qx.size(2)}; + input = qx.view(new_sizes); + } + int batch_size = input.size(0); + int64_t inC = input.size(1); + int64_t inH = input.size(2); + int64_t inW = input.size(3); + // Check output dimensions. 
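The outH/outW values computed just below come from pooling_output_shape; for reference, a standalone sketch of the usual formula (assuming symmetric padding and non-negative sizes; this mirrors the standard convention rather than quoting the helper verbatim):

#include <cstdint>

int64_t pooling_output_size(int64_t in, int64_t kernel, int64_t pad,
                            int64_t stride, int64_t dilation, bool ceil_mode) {
  int64_t span = in + 2 * pad - dilation * (kernel - 1) - 1;
  int64_t out = (ceil_mode ? (span + stride - 1) / stride : span / stride) + 1;
  // with ceil_mode, make sure the last window starts inside the (left-)padded input
  if (ceil_mode && (out - 1) * stride >= in + pad) {
    --out;
  }
  return out;
}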
+ int64_t padH = padding[0]; + int64_t padW = padding[1]; + int64_t kH = kernel_size[0]; + int64_t kW = kernel_size[1]; + int64_t strideH = stride[0]; + int64_t strideW = stride[1]; + TORCH_CHECK( + kH > 0 && kW > 0, + "qnnpack_maxpool2d(): kernel_size should be greater than zero."); + TORCH_CHECK( + strideH > 0 && strideW > 0, + "qnnpack_maxpool2d(): strides should be greater than zero."); + int64_t dilationH = dilation[0]; + int64_t dilationW = dilation[1]; + int64_t outC = inC; + int64_t outH = pooling_output_shape(inH, kH, padH, strideH, dilationH, ceil_mode); + int64_t outW = pooling_output_shape(inW, kW, padW, strideW, dilationW, ceil_mode); + TORCH_CHECK(outH > 0 && outW > 0, + "Given input size: (", + inC, "x", inH, "x", inW, + "). Calculated output size: (", + outC, "x", outH, "x", outW, + "). Output size is too small."); + + std::vector output_shape; + if (ndim == 3) { + // cudnn requires 4D input and output for 2D pooling, so we prepend a dummy dimension + // whose size represents the batch size (1) + output_shape = {1, outC, outH, outW}; + } else { + output_shape = {batch_size, outC, outH, outW}; + } + auto qy = at::_empty_affine_quantized( + output_shape, + at::device(at::kCUDA).dtype(at::ScalarType::QInt8), + input.q_scale(), + input.q_zero_point(), + (ndim == 4 ? MemoryFormat::ChannelsLast : MemoryFormat::Contiguous)); + + cudnnHandle_t handle = getCudnnHandle(); + cudnnPoolingDescriptor_t poolingDesc; + AT_CUDNN_CHECK_WITH_SHAPES(cudnnCreatePoolingDescriptor(&poolingDesc)); + AT_CUDNN_CHECK_WITH_SHAPES(cudnnSetPooling2dDescriptor( + poolingDesc, + CUDNN_POOLING_MAX_DETERMINISTIC, + CUDNN_NOT_PROPAGATE_NAN, + kernel_size[0], // kernel height + kernel_size[1], // kernel width + padding[0], // vertical padding + padding[1], // horizontal padding + stride[0], // vertical stride + stride[1])); // horizontal stride + + auto dataType = getCudnnDataType(input); + float one{1}; + float zero{0.0}; + TensorDescriptor xDesc; + at::MemoryFormat memory_format = (ndim == 4 ? at::MemoryFormat::ChannelsLast : at::MemoryFormat::Contiguous); + xDesc.set(input, memory_format); + TensorDescriptor yDesc; + yDesc.set(qy, memory_format); + cudnnPoolingForward(handle, + poolingDesc, + &one, + xDesc.desc(), + reinterpret_cast(input.data_ptr()), + &zero, + yDesc.desc(), + reinterpret_cast(qy.data_ptr())); + + // recall we casted our input and output to 4D if qx was 3D, so we recast it back to 3D prior to returning + return (ndim == 3 ? qy.view(std::vector(output_shape.begin() + 1, output_shape.end())) : qy); +} + +// Keep the registry in the anonymous namespace. 
+namespace { +template +class QMaxPool_arr_args final { + public: + static Tensor run( + Tensor qx, + std::vector kernel_size, + std::vector stride, + std::vector padding, + std::vector dilation, + bool ceil_mode) { + TORCH_CHECK(kSpatialDim == 2, "quantized max pool is only valid for 2D") + return quantized_max_pool2d_cudnn(qx, kernel_size, stride, padding, + dilation, ceil_mode); + } +}; + +TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::max_pool2d"), TORCH_FN(QMaxPool_arr_args<2>::run)); +} + +} // namespace +} // namespace native +} // namespace at + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/conv_prepack.cpp b/aten/src/ATen/native/quantized/cudnn/conv_prepack.cpp new file mode 100644 index 00000000000000..70c05f33cc1aa8 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/conv_prepack.cpp @@ -0,0 +1,151 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +template +c10::intrusive_ptr> PackedConvWeightCudnn< + kSpatialDim>:: + prepack( + at::Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose) { + TORCH_CHECK(weight.qscheme() == c10::kPerTensorAffine, "Unsupported qscheme: ", toString(weight.qscheme())); + TORCH_CHECK( + weight.ndimension() == kSpatialDim + 2, + "Weights are expected to have ", + kSpatialDim + 2, + " dimensions"); + TORCH_CHECK( + stride.size() == kSpatialDim, + "stride should contain ", + kSpatialDim, + " elements for ", + kSpatialDim, + "D convolution."); + TORCH_CHECK( + padding.size() == kSpatialDim, + "quantized::conv_prepack (cudnn): Specify front/top/left padding only. " + "end/bottom/right padding assumed to be equal to front/top/left"); + TORCH_CHECK( + !transpose || output_padding.size() == kSpatialDim, + "quantized::conv_prepack: Specify top/left output padding " + "only. bottom/right padding assumed to be equal to top/left"); + TORCH_CHECK( + dilation.size() == kSpatialDim, + "quantized::conv_prepack (cudnn): dilation should contain ", + kSpatialDim, + " elements for ", + kSpatialDim, + "D convolution."); + const int output_channels = transpose ? weight.size(1) * groups + : weight.size(0); + const auto qtype = weight.qscheme(); + if (bias.has_value()) { + TORCH_CHECK(bias.value().dim() == 1, "bias should be a vector (1D Tensor)"); + TORCH_CHECK( + bias.value().size(0) == output_channels, + "bias should have K elements: " + std::to_string(output_channels)); + // TODO: we create a broadcasted_bias tensor later so I think we don't need to make this contiguous here. + // we will revisit this when nvidia adds proper support for broadcasting + // bias_contig = bias->contiguous(); + } + + auto ret_ptr = c10::make_intrusive>( + weight.contiguous(c10::MemoryFormat::ChannelsLast), // TODO: this assumes 2D I think. make it more general? 
+ bias, + stride, + padding, + output_padding, + dilation, + groups, + transpose, + qtype); + return ret_ptr; +} + +template +c10::intrusive_ptr> PackedConvWeightCudnn< + 2>:: + prepack( + at::Tensor weight, + c10::optional bias_in, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose); + +namespace at { +namespace native { +namespace { + +template +class QConvPackWeightInt8Cudnn final { + public: + static c10::intrusive_ptr> run_conv( + Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List dilation, + int64_t groups) { + torch::List output_padding; + output_padding.reserve(kSpatialDim); + for (const auto idx : c10::irange(kSpatialDim)) { + (void)idx; //Suppress unused variable warning + output_padding.push_back((int64_t)0); + } + return _run(weight, bias, stride, padding, output_padding, dilation, groups, + /*transpose=*/false); + } + + private: + static c10::intrusive_ptr> _run( + Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose) { + return PackedConvWeightCudnn::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, + transpose); + } +}; + +TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::conv2d_prepack"), TORCH_FN(QConvPackWeightInt8Cudnn<2>::run_conv)); +} + +} // namespace +} // namespace native +} // namespace at + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/conv_unpack_impl.cpp b/aten/src/ATen/native/quantized/cudnn/conv_unpack_impl.cpp new file mode 100644 index 00000000000000..ca9611dca89066 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/conv_unpack_impl.cpp @@ -0,0 +1,28 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include + +#include + +template +std::tuple> PackedConvWeightCudnn< + kSpatialDim>::unpack() { + return std::tuple>{orig_weight_, bias_}; +} + +template std::tuple> PackedConvWeightCudnn< + 2>::unpack(); + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/linear_prepack.cpp b/aten/src/ATen/native/quantized/cudnn/linear_prepack.cpp new file mode 100644 index 00000000000000..3541ce9b7d80a5 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/linear_prepack.cpp @@ -0,0 +1,63 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include +#include +#include + +c10::intrusive_ptr PackedLinearWeightCudnn::prepack( + at::Tensor weight, + c10::optional bias) { + TORCH_CHECK(weight.qscheme() == c10::kPerTensorAffine, "Unsupported qscheme: ", toString(weight.qscheme())); + const int output_channels = weight.size(0); + const auto qtype = weight.qscheme(); + if (bias.has_value()) { + TORCH_CHECK(bias.value().dim() == 1, "bias should be a vector (1D Tensor)"); + TORCH_CHECK( + bias.value().size(0) == output_channels, + "bias should have K elements: " + std::to_string(output_channels)); + } + + auto ret_ptr = c10::make_intrusive( + weight, + bias, + qtype); + return ret_ptr; +} + +namespace at { +namespace native { +namespace { + +class QLinearPackWeightInt8Cudnn final { 
+ public: + static c10::intrusive_ptr run( + at::Tensor weight, + c10::optional bias) { + return PackedLinearWeightCudnn::prepack(std::move(weight), std::move(bias)); + } +}; + +TORCH_LIBRARY_IMPL(quantized, QuantizedCUDA, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_prepack"), TORCH_FN(QLinearPackWeightInt8Cudnn::run)); +} + + +} // namespace +} // namespace native +} // namespace at + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/linear_unpack_impl.cpp b/aten/src/ATen/native/quantized/cudnn/linear_unpack_impl.cpp new file mode 100644 index 00000000000000..ebf77b0294d872 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/linear_unpack_impl.cpp @@ -0,0 +1,23 @@ +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include + +#include + +std::tuple> PackedLinearWeightCudnn::unpack() { + return std::tuple>{orig_weight, bias_}; +} + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/cudnn/utils.h b/aten/src/ATen/native/quantized/cudnn/utils.h new file mode 100644 index 00000000000000..c5fdcd99f122d7 --- /dev/null +++ b/aten/src/ATen/native/quantized/cudnn/utils.h @@ -0,0 +1,304 @@ +#pragma once +/* +This file contains some of the auxiliary functions used by both Conv.cpp & Linear.cpp (introduced in a later PR) +*/ + +#ifdef USE_CUDA +#include // for the definition of AT_CUDNN_ENABLED + +#if AT_CUDNN_ENABLED() + +#include + +#if HAS_CUDNN_V8() + +#include +#include +#include +#include +#include +#include + +struct TORCH_API PackedLinearWeightCudnn : public LinearPackedParamsBase { + PackedLinearWeightCudnn( + at::Tensor orig_weight, + c10::optional bias, + c10::QScheme q_scheme) + : orig_weight(std::move(orig_weight)), + bias_(std::move(bias)), + q_scheme(std::move(q_scheme)) {} + + at::Tensor apply( + at::Tensor input, + double output_scale, + int64_t output_zero_point) override; + at::Tensor apply_relu( + at::Tensor input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_dynamic(at::Tensor input, bool reduce_range = false) override { + throw std::runtime_error( + "apply_relu_out is not implemented for this packed " + "parameter type"); + } + at::Tensor apply_dynamic_relu(at::Tensor input, bool reduce_range = false) override { + throw std::runtime_error( + "apply_relu_out is not implemented for this packed " + "parameter type"); + } + + std::tuple> unpack() override; + + c10::optional bias() override { + return bias_; + } + + static c10::intrusive_ptr prepack( + at::Tensor weight, + c10::optional bias); + + private: + at::Tensor orig_weight; + c10::optional bias_; + c10::QScheme q_scheme; + + template + at::Tensor apply_impl( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point); + + template + void apply_impl_helper( + const at::Tensor& quantized_output, + const at::Tensor& input, + double output_scale); +}; + +template +struct TORCH_API PackedConvWeightCudnn : public ConvPackedParamsBase { + PackedConvWeightCudnn( + at::Tensor orig_weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose, + c10::QScheme q_scheme) + : orig_weight_(std::move(orig_weight)), + bias_(std::move(bias)), + stride_(std::move(stride)), + padding_(std::move(padding)), + 
output_padding_(std::move(output_padding)), + dilation_(std::move(dilation)), + groups_(groups), + transpose_(transpose), + q_scheme_(q_scheme) {} + + at::Tensor apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) override; + + at::Tensor apply_dynamic( + const at::Tensor& input, + bool reduce_range) { + TORCH_CHECK(false, "apply_dynamic is currently not reported"); + } + + at::Tensor apply_dynamic_relu( + const at::Tensor& input, + bool reduce_range) { + TORCH_CHECK(false, "apply_dynamic_relu is currently not reported"); + } + + std::tuple> unpack() override; + + static c10::intrusive_ptr> prepack( + at::Tensor weight, + c10::optional bias, + torch::List stride, + torch::List padding, + torch::List output_padding, + torch::List dilation, + int64_t groups, + bool transpose); + + const float* GetBiasData(at::Tensor* bias); + + torch::List stride() const override { + return stride_; + } + + torch::List padding() const override { + return padding_; + } + + torch::List output_padding() const override { + return output_padding_; + } + + torch::List dilation() const override { + return dilation_; + } + + int64_t groups() const override { + return groups_; + } + + bool transpose() const override { + return transpose_; + } + + private: + at::Tensor orig_weight_; + c10::optional bias_; + torch::List stride_; + torch::List padding_; + torch::List output_padding_; + torch::List dilation_; + int64_t groups_; + bool transpose_; + c10::QScheme q_scheme_; + + template + at::Tensor apply_impl( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point); + + template + void apply_impl_helper( + const at::Tensor& quantized_output, + const at::Tensor& input, + double output_scale); +}; + +namespace cudnn_utils { +namespace { + +uint8_t getAlignment(const at::Tensor &t) { + // alignment are in bytes + uint8_t alignment = 1; + uintptr_t address = reinterpret_cast(t.data_ptr()); + while (address % alignment == 0 && alignment < 16) alignment *= 2; + return alignment; +} + +cudnn_frontend::Tensor getTensorDescriptor(const at::Tensor &t, int64_t id, uint8_t alignment) { + auto shape = t.sizes(); + auto strides = t.strides(); + return cudnn_frontend::TensorBuilder() + .setDim(shape.size(), shape.data()) + .setStrides(strides.size(), strides.data()) + .setId(id) + .setAlignment(alignment) + .setDataType(at::native::getCudnnDataType(t)) + .build(); +} + +cudnn_frontend::Tensor getTensorDescriptor(const c10::IntArrayRef& shape, const c10::IntArrayRef& strides, cudnnDataType_t cudnn_dtype, int64_t id, uint8_t alignment) { + return cudnn_frontend::TensorBuilder() + .setDim(shape.size(), shape.data()) + .setStrides(strides.size(), strides.data()) + .setId(id) + .setAlignment(alignment) + .setDataType(cudnn_dtype) + .build(); +} + +// TODO: there is a table from input dtype to operator dtype, we can derive +// the operator dtype based on input dtype +cudnn_frontend::PointWiseDesc_v8 getPointWiseMulDescriptor(cudnnDataType_t dataType) { + return cudnn_frontend::PointWiseDescBuilder() + .setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_MUL) + .setMathPrecision(dataType) + .build(); +} + +// TODO: there is a table from input dtype to operator dtype, we can derive +// the operator dtype based on input dtype +cudnn_frontend::PointWiseDesc_v8 getPointWiseAddDescriptor(cudnnDataType_t dataType) { + return cudnn_frontend::PointWiseDescBuilder() + 
.setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_ADD) + .setMathPrecision(dataType) + .build(); +} + +// TODO: there is a table from input dtype to operator dtype, we can derive +// the operator dtype based on input dtype +cudnn_frontend::PointWiseDesc_v8 getPointWiseReluDescriptor(cudnnDataType_t dataType) { + return cudnn_frontend::PointWiseDescBuilder() + .setMode(cudnnPointwiseMode_t::CUDNN_POINTWISE_RELU_FWD) + .setMathPrecision(dataType) + .build(); +} + + +void filterEngineConfigs( + cudnn_frontend::EngineConfigList &from, + cudnn_frontend::EngineConfigList &to, + bool deterministic, bool allow_tf32, c10::ScalarType scalar_type) +{ + auto filter = [=](cudnnBackendDescriptor_t c) { + if (deterministic) { + if (cudnn_frontend::hasNumericalNote(c)) return true; + } + if (scalar_type == at::kFloat || scalar_type == at::kChar || !allow_tf32) { + if (cudnn_frontend::hasNumericalNote(c)) return true; + if (cudnn_frontend::hasNumericalNote(c)) return true; + } + return false; + }; + cudnn_frontend::filter(from, to, filter); +} + + +cudnn_frontend::ExecutionPlan get_execplan_from_heuristics_else_fall_back(cudnn_frontend::OperationGraph&& opGraph, cudnnHandle_t handle_) { + auto heuristics = cudnn_frontend::EngineHeuristicsBuilder() + .setOperationGraph(opGraph) + .setHeurMode(CUDNN_HEUR_MODE_INSTANT) + .build(); + + // std::cout << "Heuristic has " << heuristics.getEngineConfigCount() << " configurations " << std::endl; + auto& engine_config = heuristics.getEngineConfig(heuristics.getEngineConfigCount()); + + // Try engine configs returned by the heuristics and pick up the first one that works. + for (auto& ecfg : engine_config) { + try { + auto plan = cudnn_frontend::ExecutionPlanBuilder() + .setHandle(handle_) + .setEngineConfig(ecfg, opGraph.getTag()) + .build(); + return plan; + } catch (cudnn_frontend::cudnnException& e) { + continue; + } + } + + { + auto total_engines = opGraph.getEngineCount(); + // std::cout << opGraph.describe() << " has " << total_engines << " engines." << std::endl; + auto engine = cudnn_frontend::EngineBuilder().setGlobalEngineIdx(0).setOperationGraph(opGraph).build(); + // std::cout << engine.describe() << std::endl; + + auto engine_config = cudnn_frontend::EngineConfigBuilder().setEngine(engine).build(); + // std::cout << engine_config.describe() << std::endl; + + return cudnn_frontend::ExecutionPlanBuilder().setHandle(handle_).setEngineConfig(engine_config).build(); + } +} +} // anonymous +} // cudnn_utils + +#endif // HAS_CUDNN_V8 +#endif // AT_CUDNN_ENABLED +#endif // USE_CUDA diff --git a/aten/src/ATen/native/quantized/library.cpp b/aten/src/ATen/native/quantized/library.cpp index 74486fc7ee0c5d..b1106bc1f616d6 100644 --- a/aten/src/ATen/native/quantized/library.cpp +++ b/aten/src/ATen/native/quantized/library.cpp @@ -1,7 +1,6 @@ #include -#include -#include +#include #include #include @@ -189,11 +188,6 @@ TORCH_LIBRARY(quantized, m) { m.def(TORCH_SELECTIVE_SCHEMA("quantized::relu6(Tensor qx, bool inplace=False) -> Tensor")); m.def(TORCH_SELECTIVE_SCHEMA("quantized::leaky_relu(Tensor qx, Scalar negative_slope, bool inplace, float output_scale, int output_zero_point) -> Tensor")); m.def(TORCH_SELECTIVE_SCHEMA("quantized::sigmoid(Tensor qx, float output_scale, int output_zero_point) -> Tensor")); - - // quantized ops implemented in cudnn, with QuantizedCUDA dispatch - // TODO: use the same signature as quantized::conv2d - m.def(TORCH_SELECTIVE_SCHEMA("quantized::conv2d_cudnn(Tensor act, Tensor weight, Tensor? 
bias, int[] stride, int[] padding, int[] dilation, int groups, float output_scale, int output_zero_point) -> Tensor")); - m.def(TORCH_SELECTIVE_SCHEMA("quantized::conv2d_relu_cudnn(Tensor act, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups, float output_scale, int output_zero_point) -> Tensor")); } // According to #33294: The "_" prefix registration will be diff --git a/aten/src/ATen/native/quantized/cpu/packed_params.h b/aten/src/ATen/native/quantized/packed_params.h similarity index 71% rename from aten/src/ATen/native/quantized/cpu/packed_params.h rename to aten/src/ATen/native/quantized/packed_params.h index 85d6ffcde17e1c..64d8ec840c4646 100644 --- a/aten/src/ATen/native/quantized/cpu/packed_params.h +++ b/aten/src/ATen/native/quantized/packed_params.h @@ -1,5 +1,6 @@ #pragma once +#include #include struct LinearPackedParamsBase : public torch::jit::CustomClassHolder { @@ -71,3 +72,27 @@ struct LinearPackedParamsBase : public torch::jit::CustomClassHolder { "parameter type"); } }; + +template +struct ConvPackedParamsBase : public torch::jit::CustomClassHolder { + virtual at::Tensor apply( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) = 0; + virtual at::Tensor apply_relu( + const at::Tensor& input, + double output_scale, + int64_t output_zero_point) = 0; + virtual at::Tensor apply_dynamic( + const at::Tensor& input, + bool reduce_range) = 0; + + virtual std::tuple> unpack() = 0; + + virtual torch::List stride() const = 0; + virtual torch::List padding() const = 0; + virtual torch::List output_padding() const = 0; + virtual torch::List dilation() const = 0; + virtual int64_t groups() const = 0; + virtual bool transpose() const = 0; +}; diff --git a/aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp b/aten/src/ATen/native/quantized/qconv_unpack.cpp similarity index 63% rename from aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp rename to aten/src/ATen/native/quantized/qconv_unpack.cpp index e4855062e360d9..062fc8a0522aca 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp +++ b/aten/src/ATen/native/quantized/qconv_unpack.cpp @@ -1,124 +1,21 @@ +/* +The dispatch registrations at the end of this file applies to fbgemm, qnnpack, and cudnn backends. +The correct unpack backend function is determined using runtime polymorphism through the packed_weight pointer, +which is of type intrusive_ptr> and points to either a PackedConvWeightsQnnp, +PackedConvWeights (Fbgemm), or PackedConvWeightsCudnn at runtime, which all inherit from ConvPackedParamsBase. +The implementations for the unpack functions can be found in /cpu/qconv_unpack_impl.cpp, for fbgemm&qnnpack +and /cudnn/conv_unpack_impl.cpp, for cudnn. +*/ + #include -#include #include #include #include #include +#include #include -#include - -#ifdef USE_FBGEMM -template -std::tuple> PackedConvWeight< - kSpatialDim>::unpack() { - auto* packed_weights_p = w.get(); - // output channels - const int output_channels = packed_weights_p->outputChannels(); - const int input_channels = packed_weights_p->inputChannels(); - const int groups = packed_weights_p->groups(); - - const int kernel_d = kSpatialDim == 2 ? 
1 : kernel[0]; - // R (kernel height) - const int kernel_h = kernel[kSpatialDim - 2]; - // S (kernel width) - const int kernel_w = kernel[kSpatialDim - 1]; - - const int C_per_G = input_channels / groups; - - // Tensor for unpacked weights - // Unpacked format would be physical KRS(C/G) but logical KCRS (channels - // first) because that's how - // ChannelsLast3d is not available now.FBGEMM stores the weights - // TODO: Unify 2d and 3d when ChannelsLast3d is ready. - at::Tensor unpacked_weights; - if (q_scheme == c10::kPerTensorAffine) { - unpacked_weights = kSpatialDim == 2 - ? at::_empty_affine_quantized( - {output_channels, C_per_G, kernel_h, kernel_w}, - device(c10::kCPU) - .dtype(c10::kQInt8) - .memory_format(c10::MemoryFormat::ChannelsLast), - w_scale[0], - w_zp[0], - c10::nullopt) - : at::native::fbgemm_utils:: - MakeEmptyAffineQuantizedChannelsLast3dTensor( - output_channels, - C_per_G, - kernel_d, - kernel_h, - kernel_w, - device(c10::kCPU).dtype(c10::kQInt8), - w_scale[0], - w_zp[0]); - } else if (q_scheme == c10::kPerChannelAffine) { - TORCH_CHECK( - !transpose(), - "Per Channel Quantization is currently disabled for transposed conv"); - auto scales = at::from_blob( - w_scale.data(), w_scale.size(), device(c10::kCPU).dtype(c10::kFloat)); - auto zero_points = at::from_blob( - w_zp.data(), w_zp.size(), device(c10::kCPU).dtype(c10::kInt)); - unpacked_weights = kSpatialDim == 2 - ? at::_empty_per_channel_affine_quantized( - {output_channels, C_per_G, kernel_h, kernel_w}, - scales.toType(c10::kDouble), - zero_points.toType(c10::kLong), - 0, /* The output channel axis is 0 */ - device(c10::kCPU).dtype(c10::kQInt8), - c10::MemoryFormat::ChannelsLast) - : at::native::fbgemm_utils:: - MakeEmptyPerChannelAffineQuantizedChannelsLast3dTensor( - output_channels, - C_per_G, - kernel_d, - kernel_h, - kernel_w, - device(c10::kCPU).dtype(c10::kQInt8), - scales.toType(c10::kDouble), - zero_points.toType(c10::kLong)); - } else { - TORCH_CHECK(false, "Unsupported qscheme: ", toString(q_scheme)); - } - int8_t* unpacked_weights_p = - reinterpret_cast(unpacked_weights.data_ptr()); - packed_weights_p->unpack(unpacked_weights_p); - if(transpose()){ - unpacked_weights = - at::native::fbgemm_utils::TransposeConvTensorUnpackConversion< - kSpatialDim>(unpacked_weights, groups); - } - return std::tuple>( - unpacked_weights, bias); -} - -template std::tuple> PackedConvWeight< - 2>::unpack(); -template std::tuple> PackedConvWeight< - 3>::unpack(); -#endif // USE_FBGEMM - -#ifdef USE_PYTORCH_QNNPACK -template -std::tuple> PackedConvWeightsQnnp< - kSpatialDim>::unpack() { - TORCH_CHECK( - kSpatialDim == 2, - "QNNPACK only supports conv2d_unpack right " - "now."); - TORCH_CHECK( - orig_weight.defined(), - "Cannot unpack weights. 
" - "Call at::globalContext()::setReleaseOriginalWeights(false) before packing or loading to enable unpacking."); - return std::tuple>(orig_weight, bias); -} - -template std::tuple> PackedConvWeightsQnnp< - 2>::unpack(); -template std::tuple> PackedConvWeightsQnnp< - 3>::unpack(); -#endif // USE_PYTORCH_QNNPACK +#include namespace at { namespace native { @@ -154,6 +51,12 @@ class QConvUnpackWeightsInt8 final { } #endif +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + return packed_weight->unpack(); + } +#endif + TORCH_CHECK( false, "Didn't find engine for operation quantized::conv2d_unpack ", @@ -185,6 +88,15 @@ class QConv1dUnpackWeightsInt8 final { } #endif +#if AT_MKLDNN_ENABLED() + if (ctx.qEngine() == at::QEngine::ONEDNN) { + std::tie(weight, bias) = packed_weight->unpack(); + at::Tensor new_weight = weight.clone(); + new_weight.squeeze_(quant_utils::kConv1dSqueezeDim + 2); + return std::tuple>(new_weight, bias); + } +#endif + TORCH_CHECK( false, "Didn't find engine for operation quantized::conv1d_unpack ", @@ -252,7 +164,7 @@ unpack_quantized_prepacked_sizes_conv2d(const IValue& ivalue) { at::Tensor weight; c10::optional bias; std::tie(weight, bias) = params->unpack(); - c10::optional bias_sizes = c10::nullopt; + at::OptionalIntArrayRef bias_sizes = c10::nullopt; if (bias && bias->defined()) { bias_sizes = bias->sizes(); } diff --git a/aten/src/ATen/native/quantized/qlinear_unpack.cpp b/aten/src/ATen/native/quantized/qlinear_unpack.cpp new file mode 100644 index 00000000000000..cfcd0589f03cec --- /dev/null +++ b/aten/src/ATen/native/quantized/qlinear_unpack.cpp @@ -0,0 +1,77 @@ +/* +The dispatch registrations at the end of this file applies to fbgemm, qnnpack, and cudnn backends. +The correct unpack backend function is determined using runtime polymorphism through the packed_weight pointer, +which is of type intrusive_ptr and points to either a PackedLinearWeightsQnnp, +PackedLinearWeights (Fbgemm), or PackedLinearWeightsCudnn at runtime, which all inherit from LinearPackedParamsBase. +The implementations for the unpack functions can be found in /cpu/qlinear_unpack_impl.cpp, for fbgemm&qnnpack +and /cudnn/linear_unpack_impl.cpp, for cudnn. +*/ +#include +#include +#include +#include +#include +#include + +namespace at { +namespace native { +namespace { + +class QLinearUnpackWeightInt8 final { + public: + static std::tuple> run( + const c10::intrusive_ptr& packed_weight) { + return packed_weight->unpack(); + } +}; + +class QLinearUnpackWeightFp16 final { + public: + static std::tuple> run( + const c10::intrusive_ptr& packed_weight) { + auto& ctx = at::globalContext(); + + TORCH_CHECK( + ctx.qEngine() != at::QEngine::QNNPACK, + "quantized::linear_unpack_fp16 is currently " + "not supported by QNNPACK"); + + return packed_weight->unpack(); + } +}; + +class QLinearUnpackWeightInt8Legacy final { + public: + static std::tuple> run( + const at::Tensor& packed_weight) { + TORCH_CHECK(false, + "quantized.linear_unpack(Tensor) is unsupported! Please " + "upgrade your model to use the newer quantized.linear_" + "unpack(LinearPackedParamsBase) overload"); + } +}; + +class QLinearUnpackWeightFp16Legacy final { + public: + static std::tuple> run( + const at::Tensor& packed_weight) { + TORCH_CHECK(false, + "quantized.linear_unpack(Tensor) is unsupported! 
Please " + "upgrade your model to use the newer quantized.linear_" + "unpack(LinearPackedParamsBase) overload"); + } +}; + +TORCH_LIBRARY_IMPL(quantized, CPU, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack.legacy"), TORCH_FN(QLinearUnpackWeightInt8Legacy::run)); + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack_fp16.legacy"), TORCH_FN(QLinearUnpackWeightFp16Legacy::run)); +} + +TORCH_LIBRARY_IMPL(quantized, CatchAll, m) { + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack"), TORCH_FN(QLinearUnpackWeightInt8::run)); + m.impl(TORCH_SELECTIVE_NAME("quantized::linear_unpack_fp16"), TORCH_FN(QLinearUnpackWeightFp16::run)); +} + +} // namespace +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp index f91d9648e7db9d..24a90826bc1a7e 100644 --- a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp +++ b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp @@ -9,6 +9,7 @@ #include #include #include +#include #ifndef AT_PER_OPERATOR_HEADERS #include @@ -56,29 +57,51 @@ void _validate_sparse_csr_tensor_args(const Tensor& crow_indices, const Tensor& // Shape and Strides invariants TORCH_CHECK( - size.size() == 2, - "size of a CSR tensor must be of length 2, but got: ", + size.size() >= 2, + "size of a batched CSR tensor must have length >= 2, but got: ", size.size()); TORCH_CHECK( - crow_indices.dim() == 1, - "crow_indices must have dim=1 but got crow_indices.dim()=", + crow_indices.dim() >= 1, + "crow_indices must have dim >= 1 but got crow_indices.dim() = ", crow_indices.dim()); TORCH_CHECK( - col_indices.dim() == 1, - "col_indices must have dim=1 but got col_indices.dim()=", + col_indices.dim() >= 1, + "col_indices must have dim >= 1 but got col_indices.dim() = ", col_indices.dim()); TORCH_CHECK( - values.dim() == 1, - "values must have dim=1 but got values.dim()=", + values.dim() >= 1, + "values must have dim >= 1 but got values.dim() = ", values.dim()); - // Note, this check also enforces `crow_indices.numel() >= 1` + + TORCH_CHECK( + crow_indices.dim() == col_indices.dim(), + "Number of dimensions of crow_indices and col_indices must be the same."); + TORCH_CHECK( + crow_indices.dim() == values.dim(), + "Number of dimensions of indices and values must be the same."); + TORCH_CHECK( + static_cast(crow_indices.dim()) == size.size() - 1, + "Number of dimensions of indices must be one less than the number of dimensions of the provided size."); + + // All batch sizes must be the same + auto batch_size = size.slice(0, size.size() - 2); + auto crow_indices_batch_size = crow_indices.sizes().slice(0, crow_indices.dim() - 1); + auto col_indices_batch_size = col_indices.sizes().slice(0, col_indices.dim() - 1); + auto values_batch_size = values.sizes().slice(0, values.dim() - 1); + TORCH_CHECK( + batch_size == crow_indices_batch_size && + batch_size == col_indices_batch_size && + batch_size == values_batch_size, + "All batch dimensions of the provided size, indices, and values must be the same."); + + // Note, this check also enforces `crow_indices.size(-1) >= 1` TORCH_CHECK( - crow_indices.numel() == (size[0] + 1), - "crow_indices.numel() must be size(0) + 1, but got: ", - crow_indices.numel()); + crow_indices.size(-1) == (size[size.size() - 2] + 1), + "crow_indices.size(-1) must be equal to size[-2] + 1 (that is ", size[size.size() - 2] + 1, "), but got: ", + crow_indices.size(-1)); TORCH_CHECK( col_indices.numel() == values.numel(), - "col_indices and values must have equal sizes, but got 
col_indices.numel(): ", + "col_indices and values must have the same number of elements, but got col_indices.numel(): ", col_indices.numel(), ", values.numel(): ", values.numel()); @@ -86,22 +109,28 @@ void _validate_sparse_csr_tensor_args(const Tensor& crow_indices, const Tensor& // Indices invariants AT_DISPATCH_INDEX_TYPES(crow_indices.scalar_type(), "csr_construct_check", [&] { Tensor crow_indices_cpu = crow_indices.to(kCPU); - auto crow_indices_accessor = crow_indices_cpu.accessor(); - TORCH_CHECK( - crow_indices_accessor[0] == 0, "0th value of crow_indices must be 0."); - - TORCH_CHECK( - crow_indices_accessor[crow_indices.numel() - 1] == col_indices.numel(), - "last value of crow_indices should be equal to the length of col_indices."); - - for (int i = 1; i <= size[0]; i++) { + auto crow_indices_data_ptr = crow_indices_cpu.data_ptr(); + auto batch_stride = crow_indices_cpu.dim() >= 2 ? crow_indices_cpu.stride(-2) : 0; + for (const auto batch_id : c10::irange(batchCount(crow_indices_cpu))) { + TORCH_CHECK( + crow_indices_data_ptr[batch_id*batch_stride] == 0, + "(Batch element ", batch_id, ") ", + ": 0th value of crow_indices must be 0, but it is ", crow_indices_data_ptr[batch_id*batch_stride]); TORCH_CHECK( - crow_indices_accessor[i - 1] <= crow_indices_accessor[i], - "at position i = ", i, ", this condition crow_indices[i - 1] <= crow_indices[i] fails"); + crow_indices_data_ptr[batch_id*batch_stride + crow_indices.size(-1) - 1] == col_indices.size(-1), + "(Batch element ", batch_id, ") ", + "last value of crow_indices should be equal to the length of col_indices."); + + for (int i = 1; i <= size[size.size() - 2]; i++) { + TORCH_CHECK( + crow_indices_data_ptr[batch_id*batch_stride + i - 1] <= crow_indices_data_ptr[batch_id*batch_stride + i], + "(Batch element ", batch_id, ") ", + "at position i = ", i, ", the condition crow_indices[i - 1] <= crow_indices[i] fails"); + } } if (col_indices.numel() > 0) { TORCH_CHECK(0 <= col_indices.min().item(), "col_indices.min() should be greater or equal to zero"); - TORCH_CHECK(size[1] > col_indices.max().item(), "size(1) should be greater than col_indices.max()"); + TORCH_CHECK(size[size.size() - 1] > col_indices.max().item(), "size[-1] should be greater than col_indices.max()"); } }); @@ -213,13 +242,10 @@ Tensor sparse_csr_tensor( c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); - std::array size = {0, 0}; - if (col_indices.numel() > 0) { - AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "csr_construct_check", [&] { - size[0] = crow_indices.numel() - 1; - size[1] = col_indices.max().item() + 1; - }); - } + // std::array size = {0, 0}; + auto size = DimVector(IntArrayRef(col_indices.sizes().data(), col_indices.dim() - 1)); + size.push_back(crow_indices.size(-1) - 1); + size.push_back(col_indices.max().item() + 1); at::native::_validate_sparse_csr_tensor_args(crow_indices, col_indices, values, size); @@ -243,16 +269,21 @@ Tensor empty_sparse_csr( c10::optional optional_memory_format) { check_size_nonnegative(size); - TORCH_CHECK(size.size() == 2, "torch.empty: Only 2D sparse CSR tensors are supported."); + TORCH_CHECK(size.size() >= 2, "torch.empty: Only batched sparse CSR matrices are supported, but got size ", size); TORCH_INTERNAL_ASSERT_DEBUG_ONLY(layout == Layout::SparseCsr); - auto rows = size[0]; + auto rows = size[size.size() - 2]; int64_t nnz = 0; + auto crow_indices_size = 
DimVector(size.slice(0, size.size() - 2)); + crow_indices_size.push_back(rows + 1); + auto col_indices_values_size = DimVector(size.slice(0, size.size() - 2)); + col_indices_values_size.push_back(nnz); + TensorOptions options = TensorOptions().dtype(ScalarType::Long).layout(Layout::Strided).device(device).pinned_memory(pin_memory); - auto crow_indices = at::empty({rows + 1}, options); - auto col_indices = at::empty({nnz}, options); - auto values = at::empty({nnz}, options.dtype(dtype)); + auto crow_indices = at::empty(crow_indices_size, options); + auto col_indices = at::empty(col_indices_values_size, options); + auto values = at::empty(col_indices_values_size, options.dtype(dtype)); return at::native::_sparse_csr_tensor_unsafe( crow_indices, @@ -270,13 +301,13 @@ const Tensor& resize_sparse_csr_( IntArrayRef size, c10::optional optional_memory_format) { check_size_nonnegative(size); - TORCH_CHECK(size.size() == 2, "torch.resize_: Only 2D sparse CSR tensors are supported."); + TORCH_CHECK(size.size() >= 2, "torch.resize_: Only batched sparse CSR matrices are supported, but got size ", size); TORCH_CHECK( - self.size(1) <= size[1], + self.size(-1) <= size[size.size() - 1], "torch.resize_: Resizing columns of sparse CSR tensors to a smaller value is not supported. ", "The original number of columns is ", - self.size(1), - " while the requested new number of columns is ", size[1], "."); + self.size(-1), + " while the requested new number of columns is ", size[size.size() - 1], "."); get_sparse_csr_impl(self)->resize_(self._nnz(), size); return self; } diff --git a/aten/src/ATen/native/sparse/SparseCsrTensorMath.cpp b/aten/src/ATen/native/sparse/SparseCsrTensorMath.cpp index d5d9ead612edee..6cccaf098d4434 100644 --- a/aten/src/ATen/native/sparse/SparseCsrTensorMath.cpp +++ b/aten/src/ATen/native/sparse/SparseCsrTensorMath.cpp @@ -1,16 +1,17 @@ #define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include #include #include #include #include #include +#include #include #include #include #include #include #include +#include #include #ifndef AT_PER_OPERATOR_HEADERS @@ -22,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -89,10 +91,11 @@ #include #include #include +#include #include #include -#include #include +#include #endif #include @@ -100,19 +103,22 @@ namespace at { namespace meta { -TORCH_META_FUNC(_convert_indices_from_coo_to_csr) ( - const Tensor& self, const int64_t size, const bool out_int32 -) { +TORCH_META_FUNC(_convert_indices_from_coo_to_csr) +(const Tensor& self, const int64_t size, const bool out_int32) { TORCH_CHECK(self.dim() <= 1, "Input is supposed to be a vector"); ScalarType scalar_type = out_int32 ? ScalarType::Int : ScalarType::Long; - c10::TensorOptions options = TensorOptions().device(self.options().device()).dtype(scalar_type); + c10::TensorOptions options = + TensorOptions().device(self.options().device()).dtype(scalar_type); set_output(size + 1, options); } -TORCH_META_FUNC(_convert_indices_from_csr_to_coo) ( - const Tensor& crow_indices, const Tensor& col_indices, const bool out_int32, const bool transpose -) { - TORCH_CHECK(crow_indices.dim() == 1, "crow_indices is supposed to be a vector"); +TORCH_META_FUNC(_convert_indices_from_csr_to_coo) +(const Tensor& crow_indices, + const Tensor& col_indices, + const bool out_int32, + const bool transpose) { + TORCH_CHECK( + crow_indices.dim() == 1, "crow_indices is supposed to be a vector"); TORCH_CHECK(col_indices.dim() == 1, "col_indices is supposed to be a vector"); ScalarType scalar_type = out_int32 ? 
ScalarType::Int : ScalarType::Long; c10::TensorOptions options = crow_indices.options().dtype(scalar_type); @@ -126,7 +132,10 @@ namespace { constexpr int64_t GRAIN_SIZE = at::internal::GRAIN_SIZE; template -void convert_indices_from_coo_to_csr_cpu(const Tensor& result, const Tensor& input, const int64_t size) { +void convert_indices_from_coo_to_csr_cpu( + const Tensor& result, + const Tensor& input, + const int64_t size) { int64_t numel = input.numel(); const input_t* data_in = input.data_ptr(); output_t* data_out = result.data_ptr(); @@ -175,7 +184,7 @@ Tensor& unary_op_out(F op_out, const Tensor& self, Tensor& result) { return result; } -template +template Tensor& unary_op_inplace(Tensor& self, const F& op_inplace, Args&&... args) { TORCH_INTERNAL_ASSERT(self.is_sparse_csr()); @@ -185,7 +194,11 @@ Tensor& unary_op_inplace(Tensor& self, const F& op_inplace, Args&&... args) { } template -void convert_indices_from_csr_to_coo_cpu(const Tensor& indices, const Tensor& crow_indices, const Tensor& col_indices, const bool transpose=false) { +void convert_indices_from_csr_to_coo_cpu( + const Tensor& indices, + const Tensor& crow_indices, + const Tensor& col_indices, + const bool transpose = false) { int64_t nrows = crow_indices.numel() - 1; if (nrows == 0) { indices.zero_(); @@ -194,16 +207,18 @@ void convert_indices_from_csr_to_coo_cpu(const Tensor& indices, const Tensor& cr auto crow_indices_ = crow_indices.expect_contiguous(); const input_t* crow_indices_data_in = crow_indices_->data_ptr(); TORCH_INTERNAL_ASSERT(indices.is_contiguous()); - auto row0 = indices.select(0, transpose?1:0); - auto row1 = indices.select(0, transpose?0:1); + auto row0 = indices.select(0, transpose ? 1 : 0); + auto row1 = indices.select(0, transpose ? 0 : 1); output_t* data_out = row0.data_ptr(); row1.copy_(*col_indices.expect_contiguous()); at::parallel_for(0, nrows, GRAIN_SIZE, [&](int64_t start, int64_t end) { for (const auto i : c10::irange(start, end)) { - std::fill(&data_out[crow_indices_data_in[i]], &data_out[crow_indices_data_in[i + 1]], static_cast(i)); + std::fill( + &data_out[crow_indices_data_in[i]], + &data_out[crow_indices_data_in[i + 1]], + static_cast(i)); } }); - } } // end anonymous namespace @@ -222,26 +237,27 @@ inline Tensor get_result_tensor_for_unary_op(F op, const Tensor& input) { // To handle type promotion for inputs to unary ops, // we first get the result from the underlined op, and use the result - // to create a sparse CSR tensor, which is used as the input to the out= variant + // to create a sparse CSR tensor, which is used as the input to the out= + // variant auto result_values = op(values); auto result = at::native::_sparse_csr_tensor_unsafe( - input.crow_indices().clone(), - input.col_indices().clone(), - result_values, - input.sizes(), - result_values.scalar_type(), - input.layout(), - result_values.device()); + input.crow_indices().clone(), + input.col_indices().clone(), + result_values, + input.sizes(), + result_values.scalar_type(), + input.layout(), + result_values.device()); return result; } -} +} // namespace static constexpr bool is_mkl_supported() { #ifdef _MSC_VER return false; -#elif __APPLE__ || __MACH__ +#elif __APPLE__ || __MACH__ return false; #else return true; @@ -249,41 +265,46 @@ static constexpr bool is_mkl_supported() { } // Only accept squares sparse matrices or dense input as a vector -// TODO: Check what happens with MKL, the output error reported with non square matrices tends to be high -// See: https://github.com/pytorch/pytorch/issues/58770 +// TODO: Check 
what happens with MKL, the output error reported with non square +// matrices tends to be high See: +// https://github.com/pytorch/pytorch/issues/58770 bool is_square_or_vec(int64_t dim_i, int64_t dim_j, int64_t dim_k) { - return (dim_i == dim_k && dim_k == dim_j) || (dim_i == dim_j && dim_k == 1); + return (dim_i == dim_k && dim_k == dim_j) || (dim_i == dim_j && dim_k == 1); } -Tensor& normal_sparse_csr_(Tensor& self, double mean, double std, c10::optional gen) { +Tensor& normal_sparse_csr_( + Tensor& self, + double mean, + double std, + c10::optional gen) { return unary_op_inplace(self, &Tensor::normal_, mean, std, gen); } /* Implementation of Unary Ufuncs, those supported for Sparse CSR Layout * Only simple funcs, with 0->0 correspondence are currently supported. */ -#define CREATE_UNARY_UFUNC_OUT(op_name) \ - Tensor& op_name##_sparse_csr_out(const Tensor& self, Tensor& result) { \ - return unary_op_out(&at::op_name##_outf, self, result); \ +#define CREATE_UNARY_UFUNC_OUT(op_name) \ + Tensor& op_name##_sparse_csr_out(const Tensor& self, Tensor& result) { \ + return unary_op_out(&at::op_name##_outf, self, result); \ } -#define CREATE_UNARY_UFUNC_FUNCTIONAL(op_name) \ - Tensor op_name##_sparse_csr(const Tensor& self) { \ - return get_result_tensor_for_unary_op(&at::op_name, self); \ +#define CREATE_UNARY_UFUNC_FUNCTIONAL(op_name) \ + Tensor op_name##_sparse_csr(const Tensor& self) { \ + return get_result_tensor_for_unary_op(&at::op_name, self); \ } -#define CREATE_UNARY_UFUNC_INPLACE(op_name) \ - Tensor& op_name##_sparse_csr_(Tensor& self) { \ - return unary_op_inplace(self, &Tensor::op_name##_); \ +#define CREATE_UNARY_UFUNC_INPLACE(op_name) \ + Tensor& op_name##_sparse_csr_(Tensor& self) { \ + return unary_op_inplace(self, &Tensor::op_name##_); \ } -#define CREATE_UNARY_UFUNC(op_name) \ - CREATE_UNARY_UFUNC_OUT(op_name); \ - CREATE_UNARY_UFUNC_FUNCTIONAL(op_name); \ +#define CREATE_UNARY_UFUNC(op_name) \ + CREATE_UNARY_UFUNC_OUT(op_name); \ + CREATE_UNARY_UFUNC_FUNCTIONAL(op_name); \ CREATE_UNARY_UFUNC_INPLACE(op_name); -#define CREATE_UNARY_UFUNC_NO_INPLACE(op_name) \ - CREATE_UNARY_UFUNC_OUT(op_name); \ +#define CREATE_UNARY_UFUNC_NO_INPLACE(op_name) \ + CREATE_UNARY_UFUNC_OUT(op_name); \ CREATE_UNARY_UFUNC_FUNCTIONAL(op_name); // Exhaustive list of the unary ufuncs supported by sparse CSR @@ -339,8 +360,12 @@ CREATE_UNARY_UFUNC_FUNCTIONAL(isnan); CREATE_UNARY_UFUNC_FUNCTIONAL(isinf); template -void addmm_out_sparse_csr_native_cpu(const Tensor& sparse, const Tensor& dense, const Tensor& r, Scalar alpha, Scalar beta) { - +void addmm_out_sparse_csr_native_cpu( + const Tensor& sparse, + const Tensor& dense, + const Tensor& r, + Scalar alpha, + Scalar beta) { auto dim_i = sparse.size(0); auto dim_k = dense.size(1); @@ -350,41 +375,46 @@ void addmm_out_sparse_csr_native_cpu(const Tensor& sparse, const Tensor& dense, scalar_t cast_alpha = alpha.to(); r.mul_(beta); - AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "csr_mm_crow_indices", [&]() { - auto csr_accessor = csr.accessor(); - auto col_indices_accessor = col_indices.accessor(); - - auto values_accessor = values.accessor(); - scalar_t* dense_ptr = dense.data_ptr(); - scalar_t* r_ptr = r.data_ptr(); - - int64_t dense_stride0 = dense.stride(0); - int64_t dense_stride1 = dense.stride(1); - int64_t r_stride0 = r.stride(0); - int64_t r_stride1 = r.stride(1); - - at::parallel_for( - 0, - dim_i, - internal::GRAIN_SIZE, - [&](int64_t irow_start, int64_t irow_end) { - for (index_t h = irow_start; h < irow_end; ++h) { - index_t i_start = 
csr_accessor[h]; - index_t i_end = csr_accessor[h+1]; - for (index_t i = i_start; i < i_end; i++) { - scalar_t val = values_accessor[i]; - index_t col = col_indices_accessor[i]; - at::native::cpublas::axpy(dim_k, - cast_alpha * val, - dense_ptr + col * dense_stride0, dense_stride1, - r_ptr + h * r_stride0, r_stride1); + AT_DISPATCH_INDEX_TYPES( + col_indices.scalar_type(), "csr_mm_crow_indices", [&]() { + auto csr_accessor = csr.accessor(); + auto col_indices_accessor = col_indices.accessor(); + + auto values_accessor = values.accessor(); + scalar_t* dense_ptr = dense.data_ptr(); + scalar_t* r_ptr = r.data_ptr(); + + int64_t dense_stride0 = dense.stride(0); + int64_t dense_stride1 = dense.stride(1); + int64_t r_stride0 = r.stride(0); + int64_t r_stride1 = r.stride(1); + + at::parallel_for( + 0, + dim_i, + internal::GRAIN_SIZE, + [&](int64_t irow_start, int64_t irow_end) { + for (index_t h = irow_start; h < irow_end; ++h) { + index_t i_start = csr_accessor[h]; + index_t i_end = csr_accessor[h + 1]; + for (index_t i = i_start; i < i_end; i++) { + scalar_t val = values_accessor[i]; + index_t col = col_indices_accessor[i]; + at::native::cpublas::axpy( + dim_k, + cast_alpha * val, + dense_ptr + col * dense_stride0, + dense_stride1, + r_ptr + h * r_stride0, + r_stride1); + } } - } - }); - }); + }); + }); } // Functions for matrix multiplication. +// result = beta * self + alpha (mat1 @ mat2) Tensor& addmm_out_sparse_csr_cpu( const Tensor& self, const Tensor& mat1, @@ -392,62 +422,61 @@ Tensor& addmm_out_sparse_csr_cpu( const Scalar& beta, const Scalar& alpha, Tensor& result) { - TORCH_INTERNAL_ASSERT(mat1.is_sparse_csr()); - // TODO: remove this, there are no codegenerated checks for devices yet - TORCH_CHECK( - !self.is_cuda(), - "Expected all tensors to be on the same device. addmm expected 't' to be CPU tensor, but got CUDA tensor"); - TORCH_CHECK( - !result.is_cuda(), - "Expected all tensors to be on the same device. addmm: expected 'out' to be CPU tensor, but got CUDA tensor"); - TORCH_CHECK( - !mat1.is_cuda(), - "Expected all tensors to be on the same device. addmm: expected 'mat1' to be a CPU tensor, but got a CUDA tensor"); - TORCH_CHECK( - !mat2.is_cuda(), - "Expected all tensors to be on the same device. 
addmm: expected 'mat2' to be a CPU tensor, but got a CUDA tensor"); + sparse::impl::_check_is_cpu(self, "self"); + sparse::impl::_check_is_cpu(mat1, "mat1"); + sparse::impl::_check_is_cpu(mat2, "mat2"); + sparse::impl::_check_is_cpu(result, "result"); - // All the checks are from addmm_out_cuda_impl (ATen/native/cuda/Blas.cpp) and TORCH_META_FUNC(addmm) (ATen/native/LinearAlgebra.cpp) + // All the checks are from addmm_out_cuda_impl (ATen/native/cuda/Blas.cpp) and + // TORCH_META_FUNC(addmm) (ATen/native/LinearAlgebra.cpp) // TODO: remove code duplication and unify code - TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); - TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); + sparse::impl::_check_dim(mat1, 2, "mat1"); + sparse::impl::_check_dim(mat2, 2, "mat2"); + TORCH_CHECK( - mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (", - mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); - - IntArrayRef mat1_sizes = mat1.sizes(); - IntArrayRef mat2_sizes = mat2.sizes(); - IntArrayRef self__sizes; - c10::MaybeOwned self_; - if (&result != &self && self.layout() == kStrided) { - self_ = expand_size(self, {mat1_sizes[0], mat2_sizes[1]}, "addmm"); - self__sizes = self_->sizes(); + mat1.size(1) == mat2.size(0), "mat1 and mat2 shapes cannot be multiplied (", + mat1.size(0), "x", mat1.size(1), " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); + + c10::MaybeOwned self_; + // Don't expand self if this is an in-place operation + if (&result == &self) { + self_ = c10::MaybeOwned::borrowed(self); } else { - self_ = c10::MaybeOwned::borrowed(self); - self__sizes = self_->sizes(); + self_ = expand_size(self, {mat1.size(0), mat2.size(1)}, "addmm"); } - TORCH_CHECK(((self_->dim() == 2) && (self_->sizes()[0] == mat1.sizes()[0]) && (self_->sizes()[1] == mat2.sizes()[1])), - "The input tensor must be a matrix with size ", mat1.sizes()[0], "x", mat2.sizes()[1], ", but got a ", self_->dim(), - "-D tensor with size ", self__sizes[0], "x", self__sizes[1]); + + TORCH_CHECK(((self_->dim() == 2) && + (self_->size(0) == mat1.size(0)) && + (self_->size(1) == mat2.size(1))), + "The input tensor must be a matrix with size ", + mat1.size(0), + "x", + mat2.size(1), + ", but got a ", + self_->dim(), + "-D tensor with size ", + self_->size(0), + "x", + self_->size(1)); if (&result != &self) { if (result.layout() == kStrided) { - at::native::resize_output(result, self__sizes); + at::native::resize_output(result, self_->sizes()); } else { - at::native::resize_as_sparse_csr_(result, *self_); + result.resize_as_sparse_(*self_); } result.copy_(*self_); } - IntArrayRef result_sizes = result.sizes(); - if ((result_sizes[0] == 0) || (result_sizes[1] == 0)) { + if (result.numel() == 0) { return result; } - if (mat1._nnz() == 0 && mat2.layout() == kStrided) { - // According to docs, when beta==0 values in self should be ignored. nans and infs should not propagate + if (sparse::impl::_is_sparse_and_zero(mat1) || sparse::impl::_is_sparse_and_zero(mat2)) { + // According to docs, when beta==0 values in self should be ignored. + // nans and infs should not propagate if (beta.toComplexDouble() == 0.) { result.zero_(); } else { @@ -456,26 +485,19 @@ Tensor& addmm_out_sparse_csr_cpu( return result; } - if (mat2.is_sparse_csr() && (mat1._nnz() == 0 || mat2._nnz() == 0)) { - if (beta.toComplexDouble() == 0.) 
{ - result.values().zero_(); - } else { - result.values().mul_(beta); - } - return result; - } - #if !AT_USE_MKL_SPARSE() - if (mat2.is_sparse_csr() && result.is_sparse_csr()) { - TORCH_CHECK( - false, - "Calling addmm on sparse CPU tensors requires Linux platform. ", - "Please use PyTorch built with MKL on Linux."); - } - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.layout() == kStrided); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "addmm_sparse_dense", [&] { - addmm_out_sparse_csr_native_cpu(mat1, mat2, result, alpha, beta); - }); + TORCH_CHECK( + (mat1.is_sparse_csr() || + (mat2.is_sparse_csr() && result.is_sparse_csr())), + false, + "Calling addmm on sparse CPU tensors requires Linux platform. ", + "Please use PyTorch built with MKL on Linux."); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.layout() == kStrided); + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES( + result.scalar_type(), "addmm_sparse_dense", [&] { + addmm_out_sparse_csr_native_cpu( + mat1, mat2, result, alpha, beta); + }); #else sparse::impl::mkl::addmm_out_sparse_csr(mat1, mat2, beta, alpha, result); #endif @@ -507,17 +529,36 @@ Tensor& _sparse_csr_mm_out( return at::addmm_out(result, zero, mat1, mat2, 0.0, 1.0); } -Tensor _sparse_csr_mm( - const Tensor& mat1, - const Tensor& mat2) { - Tensor zero; +Tensor _sparse_csr_mm(const Tensor& mat1, const Tensor& mat2) { if (mat1.is_sparse_csr() && mat2.is_sparse_csr()) { + // Return sparse // TODO: replace with at::zeros when it's implemented for sparse csr - zero = at::empty({mat1.size(0), mat2.size(1)}, mat2.options()); - } else { - zero = at::zeros({mat1.size(0), mat2.size(1)}, mat2.options()); + return at::addmm( + at::empty({mat1.size(0), mat2.size(1)}, mat2.options()), + mat1, + mat2, + 0.0, + 1.0); + } + if (mat1.is_sparse_csr() && mat2.layout() == c10::kStrided) { + // Return dense + return at::addmm( + at::zeros({mat1.size(0), mat2.size(1)}, mat2.options()), + mat1, + mat2, + 0.0, + 1.0); } - return at::addmm(zero, mat1, mat2, 0.0, 1.0); + if (mat1.layout() == c10::kStrided && mat2.is_sparse_csr()) { + // Return dense + return at::addmm( + at::zeros({mat1.size(0), mat2.size(1)}, mat1.options()), + mat1, + mat2, + 0.0, + 1.0); + } + TORCH_INTERNAL_ASSERT(false, "Shouldn't get here. Please open an issue."); } Tensor _sparse_csr_addmm( @@ -533,14 +574,20 @@ Tensor _sparse_csr_addmm( } // Functions for element-wise addition. -Tensor add_sparse_csr(const Tensor& self, const Tensor& other, const Scalar& alpha) { +Tensor add_sparse_csr( + const Tensor& self, + const Tensor& other, + const Scalar& alpha) { auto commonDtype = at::result_type(self, other); alpha_check(commonDtype, alpha); Tensor result = at::empty({0, 0}, self.options().dtype(commonDtype)); return at::add_out(result, self, other, alpha); // redispatch! } -Tensor& add_sparse_csr_(Tensor& self, const Tensor& other, const Scalar& alpha) { +Tensor& add_sparse_csr_( + Tensor& self, + const Tensor& other, + const Scalar& alpha) { return at::add_out(self, self, other, alpha); // redispatch! 
} @@ -584,13 +631,10 @@ void add_out_dense_sparse_csr_cpu( " in add operation"); auto src_values = src.values(); - auto src_crow_indices = src.crow_indices(); - auto src_col_indices = src.col_indices(); resize_output(out, dense.sizes()); Tensor resultBuffer = out; - Tensor valuesBuffer = src_values.to(commonDtype); if (out.scalar_type() != commonDtype) { resultBuffer = dense.to(commonDtype); @@ -598,36 +642,54 @@ void add_out_dense_sparse_csr_cpu( resultBuffer.copy_(dense); } + if (src._nnz() == 0) { + return; + } + + auto valuesBuffer = src_values.to(commonDtype).view({-1, src_values.size(-1)}); + resultBuffer = resultBuffer.view({-1, out.size(-2), out.size(-1)}); + auto src_crow_indices = src.crow_indices().view({-1, src.crow_indices().size(-1)}); + auto src_col_indices = src.col_indices().view({-1, src.col_indices().size(-1)}); + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kHalf, kBool, kBFloat16, + kHalf, + kBool, + kBFloat16, commonDtype, "add_out_op2_sparse_csr", - [&valuesBuffer, &resultBuffer, &alpha, &src_crow_indices, &src_col_indices]() { + [&valuesBuffer, + &resultBuffer, + &alpha, + &src_crow_indices, + &src_col_indices]() { AT_DISPATCH_INDEX_TYPES( src_crow_indices.scalar_type(), "csr_add_out_crow_indices", - [&valuesBuffer, &resultBuffer, &alpha, &src_crow_indices, &src_col_indices]() { - auto values_accessor = valuesBuffer.accessor(); + [&valuesBuffer, + &resultBuffer, + &alpha, + &src_crow_indices, + &src_col_indices]() { + auto batch_count = resultBuffer.dim() > 2 ? resultBuffer.size(-3) : 1; + auto values_accessor = valuesBuffer.accessor(); scalar_t* out_ptr = resultBuffer.data_ptr(); scalar_t cast_value = alpha.to(); auto crow_indices_accessor = - src_crow_indices.accessor(); + src_crow_indices.accessor(); auto col_indices_accessor = - src_col_indices.accessor(); - auto out_strides0 = resultBuffer.strides()[0]; - auto out_strides1 = resultBuffer.strides()[1]; - - for (index_t irow = 0; irow < src_crow_indices.size(0) - 1; - ++irow) { - index_t start_index = crow_indices_accessor[irow]; - index_t end_index = crow_indices_accessor[irow + 1]; - - for (index_t i = start_index; i < end_index; ++i) { - auto icol = col_indices_accessor[i]; - auto index = resultBuffer.storage_offset() + irow * out_strides0 + - icol * out_strides1; - out_ptr[index] += cast_value * values_accessor[i]; + src_col_indices.accessor(); + auto out_strides = resultBuffer.strides(); + + for (const auto batch_idx : c10::irange(batch_count)) { + for (const auto irow : c10::irange(src_crow_indices.size(-1) - 1)) { + index_t start_index = crow_indices_accessor[batch_idx][irow]; + index_t end_index = crow_indices_accessor[batch_idx][irow + 1]; + for (const auto i : c10::irange(start_index, end_index)) { + auto icol = col_indices_accessor[batch_idx][i]; + auto index = batch_idx * out_strides[0] + irow * out_strides[1] + icol * out_strides[2]; + out_ptr[index] += cast_value * values_accessor[batch_idx][i]; + } } } }); @@ -657,32 +719,583 @@ Tensor& add_out_sparse_csr_cpu( return out; } -TORCH_IMPL_FUNC(_convert_indices_from_coo_to_csr_structured_cpu) ( - const Tensor& input, const int64_t size, const bool out_int32, const Tensor& result -) { +TORCH_IMPL_FUNC(_convert_indices_from_coo_to_csr_structured_cpu) +(const Tensor& input, + const int64_t size, + const bool out_int32, + const Tensor& result) { if (out_int32) { - AT_DISPATCH_INTEGRAL_TYPES(input.scalar_type(), "convert_indices_from_coo_to_csr_cpu", [&] { - convert_indices_from_coo_to_csr_cpu(result, input, size); - }); + AT_DISPATCH_INTEGRAL_TYPES( + 
input.scalar_type(), "convert_indices_from_coo_to_csr_cpu", [&] { + convert_indices_from_coo_to_csr_cpu( + result, input, size); + }); } else { - AT_DISPATCH_INTEGRAL_TYPES(input.scalar_type(), "convert_indices_from_coo_to_csr_cpu", [&] { - convert_indices_from_coo_to_csr_cpu(result, input, size); - }); + AT_DISPATCH_INTEGRAL_TYPES( + input.scalar_type(), "convert_indices_from_coo_to_csr_cpu", [&] { + convert_indices_from_coo_to_csr_cpu( + result, input, size); + }); } } -TORCH_IMPL_FUNC(_convert_indices_from_csr_to_coo_structured_cpu) ( - const Tensor& crow_indices, const Tensor& col_indices, const bool out_int32, const bool transpose, const Tensor& result -) { +TORCH_IMPL_FUNC(_convert_indices_from_csr_to_coo_structured_cpu) +(const Tensor& crow_indices, + const Tensor& col_indices, + const bool out_int32, + const bool transpose, + const Tensor& result) { if (out_int32) { - AT_DISPATCH_INTEGRAL_TYPES(crow_indices.scalar_type(), "convert_indices_from_csr_to_coo_cpu", [&] { - convert_indices_from_csr_to_coo_cpu(result, crow_indices, col_indices, transpose); - }); + AT_DISPATCH_INTEGRAL_TYPES( + crow_indices.scalar_type(), "convert_indices_from_csr_to_coo_cpu", [&] { + convert_indices_from_csr_to_coo_cpu( + result, crow_indices, col_indices, transpose); + }); } else { - AT_DISPATCH_INTEGRAL_TYPES(crow_indices.scalar_type(), "convert_indices_from_csr_to_coo_cpu", [&] { - convert_indices_from_csr_to_coo_cpu(result, crow_indices, col_indices, transpose); - }); + AT_DISPATCH_INTEGRAL_TYPES( + crow_indices.scalar_type(), "convert_indices_from_csr_to_coo_cpu", [&] { + convert_indices_from_csr_to_coo_cpu( + result, crow_indices, col_indices, transpose); + }); + } +} + +/* + * Based on + * https://github.com/scipy/scipy/blob/8a64c938ddf1ae4c02a08d2c5e38daeb8d061d38/scipy/sparse/sparsetools/csr.h + */ +template +void _csr_to_block_csr_cpu_kernel( + const I n_row, + const I n_col, + const I R, + const I C, + const I* input_crow_indices, + const I* input_col_indices, + const T* input_values, + I* result_crow_indices, + I* result_col_indices, + T* result_values) { + // All blocks are possible, that is, may be allocated if a single non-zero + // value lives within them. Otherwise they're not. + + // Allocate pointers for all possible column blocks plus 1 + std::vector blocks(n_col / C + 1, (T*)0); + + assert(n_row % R == 0); + assert(n_col % C == 0); + + // Major assumptions + // 1. Blocks must be square + + // Number of blocks along rows + I n_brow = n_row / R; + // Number of blocks along columns + // I n_bcol = n_col / C; + + // Number of elements per block + I RC = R * C; + // Number of blocks overall + I n_blks = 0; + + result_crow_indices[0] = 0; + + // Iterate over blocks along rows + for (I block_i = 0; block_i < n_brow; block_i++) { + // Iterate over rows within block + for (I r = 0; r < R; r++) { + I i = R * block_i + r; // row index + for (I jj = input_crow_indices[i]; jj < input_crow_indices[i + 1]; jj++) { + I j = input_col_indices[jj]; // column index + + // Block corresponding to column index + I block_j = j / C; + // Column within block + I c = j % C; + + if (blocks[block_j] == 0) { + blocks[block_j] = result_values + RC * n_blks; + result_col_indices[n_blks] = block_j; + n_blks++; + } + + // Specific blocks entries should not be visited more than once. + // Scipy code does an addition here. Why? 
+ *(blocks[block_j] + C * r + c) = input_values[jj]; + } + } + + for (I jj = input_crow_indices[R * block_i]; + jj < input_crow_indices[R * (block_i + 1)]; + jj++) { + blocks[input_col_indices[jj] / C] = 0; + } + + result_crow_indices[block_i + 1] = n_blks; + } +} + +/* + * Based on + * https://github.com/scipy/scipy/blob/8a64c938ddf1ae4c02a08d2c5e38daeb8d061d38/scipy/sparse/sparsetools/csr.h + */ +template +I csr_count_blocks( + const I n_row, + const I n_col, + const I R, + const I C, + const I Ap[], + const I Aj[]) { + std::vector mask(n_col / C + 1, -1); + I n_blks = 0; + for (I i = 0; i < n_row; i++) { + I bi = i / R; + for (I jj = Ap[i]; jj < Ap[i + 1]; jj++) { + I bj = Aj[jj] / C; + if (mask[bj] != bi) { + mask[bj] = bi; + n_blks++; + } + } } + return n_blks; +} + +Tensor _csr_to_block_csr_cpu(const Tensor& self, IntArrayRef blocksize) { + TORCH_CHECK( + blocksize[0] == blocksize[1], + "blocks must be square. ", + "Got (", + blocksize[0], + ", ", + blocksize[1], + ") instead."); + TORCH_CHECK( + self.size(0) % blocksize[0] == 0 && self.size(1) % blocksize[1] == 0, + "Block sparse CSR Tensors must have a size that is an ", + "integral multiple of their block size. ", + "Got Tensor of size (", + self.size(0), + ", ", + self.size(1), + ") with block size (", + blocksize[0], + ", ", + blocksize[1], + ") instead."); + Tensor input_values = self.values().contiguous(); + Tensor input_crow_indices = self.crow_indices().contiguous(); + Tensor input_col_indices = self.col_indices().contiguous(); + + // First we determine the number of blocks needed. For each given block, if it + // contains a non-zero element we will allocate values and indices for it. + int64_t num_blocks; + int64_t n_row = self.size(0); + int64_t n_col = self.size(1); + AT_DISPATCH_INDEX_TYPES( + input_crow_indices.scalar_type(), "_csr_to_block_csr_cpu", [&] { + num_blocks = csr_count_blocks( + n_row, + n_col, + blocksize[0], + blocksize[1], + input_crow_indices.data_ptr(), + input_col_indices.data_ptr()); + }); + + Tensor result_values = + input_values.new_zeros({num_blocks, blocksize[0], blocksize[1]}); + Tensor result_crow_indices = + input_crow_indices.new_empty({(n_row / blocksize[0]) + 1}); + Tensor result_col_indices = input_col_indices.new_empty({num_blocks}); + + // Next we copy over non-zero elements into the allocated blocks. 
+ AT_DISPATCH_INDEX_TYPES( + input_crow_indices.scalar_type(), "_csr_to_block_csr_cpu", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES( + input_values.scalar_type(), "_csr_to_block_csr_cpu", [&] { + _csr_to_block_csr_cpu_kernel( + n_row, + n_col, + blocksize[0], + blocksize[1], + input_crow_indices.data_ptr(), + input_col_indices.data_ptr(), + input_values.data_ptr(), + result_crow_indices.data_ptr(), + result_col_indices.data_ptr(), + result_values.data_ptr()); + }); + }); + return at::native::_sparse_csr_tensor_unsafe( + result_crow_indices, + result_col_indices, + result_values, + self.sizes(), + result_values.scalar_type(), + self.layout(), + result_values.device()); +} + +Tensor _csr_to_block_csr(const Tensor& self, IntArrayRef blocksize) { + Tensor self_values = self.values(); + Tensor self_crow_indices = self.crow_indices(); + Tensor self_col_indices = self.col_indices(); + Tensor cpu_result = _csr_to_block_csr_cpu( + _sparse_csr_tensor_unsafe(self_crow_indices.cpu(), + self_col_indices.cpu(), + self_values.cpu(), + self.sizes(), + self_values.scalar_type(), + self.layout(), + self_values.device()), + blocksize); + Tensor result_values = cpu_result.values().to(self_values.options()); + Tensor result_crow_indices = cpu_result.crow_indices().to(self_crow_indices.options()); + Tensor result_col_indices = cpu_result.col_indices().to(self_col_indices.options()); + return at::native::_sparse_csr_tensor_unsafe( + result_crow_indices, + result_col_indices, + result_values, + self.sizes(), + result_values.scalar_type(), + self.layout(), + result_values.device()); +} + +/* + Reductions on sparse CSR tensors using masked semantics. + + - A CSR tensor is a 2D tensor that is specified by a 3-tuple + (crow_indices, col_indices, values). + + - To support a reduction operator on a CSR tensor, define: + +template +struct Reduction...Op { + inline scalar_t operator()(const scalar_t& a, const scalar_t& b) const { + return a ... b; + } + inline scalar_t identity() const { return ...; } +}; + +Tensor _sparse_csr_..._cpu(const Tensor& input, IntArrayRef dims_to_sum, bool keepdim, c10::optional dtype) { + ... + result = reduce_sparse_csr_cpu_template(input_, dims_to_sum, keepdim, Reduction...Op()); + ... + return result; +} + + and add the following + + - func: _sparse_csr_op.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor + dispatch: + SparseCsrCUDA: _sparse_csr_..._cpu + + to native_functions.yaml + + Use ReductionAddOp and _sparse_csr_sum implementation as an example. + + - Since a CSR tensor dimensionality is always 2, only reductions + with keepdim=True can be supported. 
+ +*/ + +namespace { + +template +Tensor reduce_sparse_csr_dim0_cpu_template(const Tensor& sparse, ReductionOp rop) { + /* + Consider the following sparse tensor: + + 1 * * * * + * * * 2 * + * * 3 * * + * * * * * + 4 * 5 * * + + that has CSR representation + + crow_indices = [0, 1, 2, 3, 3, 5] + col_indices = [0, 3, 2, 0, 2] + values = [1, 2, 3, 4, 5] + + Reduction with dim=0 results: + + rop(1,4) * rop(3,5) 2 * + + that has CSR representation + + new_crow_indices = [0, 3] + new_col_indices = [0, 2, 3] + new_values = [rop(1, 4], rop(3, 5), 2] + + In general, the CSR representation data can be computed as follows: + + new_col_indices, col_map = col_indices.unique(sorted=True, return_inverse=True) + nnz = new_col_indices.numel() + new_crow_indices = [0, nnz] + new_values.resize(nnz); new_values.fill_(identity) + for i in range(col_indices.numel()): + new_values[col_map[i]] = rop(new_values[col_map[i], values[i]) + */ + + Tensor col_indices = sparse.col_indices(); + Tensor values = sparse.values(); + auto numel = values.numel(); + Tensor new_col_indices; + Tensor columns_map; + + /* + Calling at::_unique constitutes the main bottleneck of this + function. However, it is still about 5x faster than using the + invariant: + csr.sum(dim=0) == csr.transpose(0, 1).sum(dim=1) + */ + std::tie(new_col_indices, columns_map) = at::_unique(col_indices, true, true); + auto nnz = new_col_indices.numel(); + + Tensor new_crow_indices = at::empty({2}, col_indices.options()); + new_crow_indices[0] = 0; + new_crow_indices[1] = nnz; + + Tensor new_values = at::empty({nnz}, values.options()); + new_values.fill_(rop.identity()); + + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "reduce_sparse_csr_dim0_cpu_indices", + [&]() { + index_t* columns_map_ptr = columns_map.data_ptr(); + scalar_t* values_ptr = values.data_ptr(); + scalar_t* new_values_ptr = new_values.data_ptr(); + + // There is no point in parallelizing the following for-loop + // because about 99.3% of the computation time is spent in the + // at::_unique call above. 
+ for (int64_t i=0; i +Tensor reduce_sparse_csr_dim1_cpu_template(const Tensor& sparse, ReductionOp rop) { + /* + Consider the following sparse tensor: + + 1 * * * * + * * * 2 * + * * 3 * * + * * * * * + 4 * 5 * * + + that has CSR representation + + crow_indices = [0, 1, 2, 3, 3, 5] + col_indices = [0, 3, 2, 0, 2] + values = [1, 2, 3, 4, 5] + + Reduction with dim=1 results: + + 1 + 2 + 3 + * + rop(4, 5) + + that has CSR representation + + new_crow_indices = [0, 1, 2, 3, 3, 4] + new_col_indices = [0, 0, 0, 0] + new_values = [1, 2, 3, rop(4, 5)] + + In general, the result CSR data can be computed as follows: + + new_crow_indices = [0] + for i in range(1, nrows+1): + new_crow_indices[i] = new_crow_indices[i-1] + (crow_indices[i] == crow_indices[i-1]) + nnz = new_crow_indices[-1] + new_col_indices = zeros(nnz) + new_values.resize(nnz) + j = -1 + for i in range(1, nrows+1): + if crow_indices[i] == crow_indices[i-1]: + continue + j += 1 + new_values[j] = rop(values[crow_indices[i] : crow_indices[i-1]]) + */ + + Tensor crow_indices = sparse.crow_indices(); + auto ioptions = crow_indices.options(); + Tensor values = sparse.values(); + auto nrows = sparse.size(0); + + Tensor new_crow_indices = at::empty({crow_indices.numel()}, ioptions); + Tensor new_col_indices = at::empty({}, ioptions); + Tensor new_values = at::empty({}, values.options()); + Tensor row_map = at::empty({nrows}, ioptions); + + AT_DISPATCH_INDEX_TYPES(crow_indices.scalar_type(), "reduce_sparse_csr_dim1_cpu_indices", + [&]() { + index_t* crow_indices_ptr = crow_indices.data_ptr(); + index_t* new_crow_indices_ptr = new_crow_indices.data_ptr(); + index_t* row_map_ptr = row_map.data_ptr(); + int64_t nnz = 0; + new_crow_indices_ptr[0] = 0; + for(int64_t i=0; i(); + scalar_t* new_values_ptr = new_values.data_ptr(); + + at::parallel_for( + 0, + nrows, + internal::GRAIN_SIZE, + [&](int64_t irow_start, int64_t irow_end) { + index_t i_end = crow_indices_ptr[irow_start]; + for (index_t h = irow_start; h < irow_end; ++h) { + index_t i_start = i_end; + i_end = crow_indices_ptr[h+1]; + if (i_start != i_end) { + scalar_t res = values_ptr[i_start]; + for (index_t i = i_start + 1; i < i_end; i++) { + res = rop(res, values_ptr[i]); + } + new_values_ptr[row_map_ptr[h]] = res; + } + } + }); + }); + + return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values, + {sparse.size(0), 1}, + new_values.scalar_type(), + sparse.layout(), + new_values.device()); +} + +template +Tensor reduce_sparse_csr_dim01_cpu_template(const Tensor& sparse, ReductionOp rop) { + + auto ioptions = sparse.col_indices().options(); + Tensor values = sparse.values(); + auto numel = values.numel(); + auto nnz = std::min(1, numel); + + /* TODO: we can likely do about 3x better than parallel_reduce: + +In [2]: t=torch.randn(5000, 5000).to_sparse_csr() + +In [3]: %timeit torch._sparse_csr_sum(t, dim=(0, 1), keepdim=True) +3.39 ms ± 898 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) + +In [4]: %timeit torch.sum(t.values()) +1.07 ms ± 291 ns per loop (mean ± std. dev. 
of 7 runs, 1000 loops each) + */ + scalar_t* values_ptr = values.data_ptr(); + scalar_t value = at::parallel_reduce( + 0, + numel, + internal::GRAIN_SIZE, + rop.identity(), + [&](int64_t i_start, int64_t i_end, scalar_t identity) { + scalar_t res = identity; + for (int64_t i=i_start; i{0, nnz}, ioptions); + Tensor new_values; + if (numel > 0) { + new_values = at::empty({1}, values.options()); + new_values.fill_(value); + } else { + new_values = at::empty({}, values.options()); + } + return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values, + {1, std::min(1, sparse.size(1))}, + new_values.scalar_type(), + sparse.layout(), + new_values.device()); +} + +template +Tensor reduce_sparse_csr_cpu_template(const Tensor& sparse, std::vector dims, ReductionOp rop) { + if (dims.size() == 1) { + if (dims[0] == 0) { + return reduce_sparse_csr_dim0_cpu_template(sparse, rop); + } else { + TORCH_INTERNAL_ASSERT(dims[0] == 1); + return reduce_sparse_csr_dim1_cpu_template(sparse, rop); + } + } else if (dims.size() == 2) { + TORCH_INTERNAL_ASSERT(((dims[0] == 0 && dims[1] == 1) || (dims[0] == 1 && dims[1] == 0))); + return reduce_sparse_csr_dim01_cpu_template(sparse, rop); + } + TORCH_INTERNAL_ASSERT(dims.size() == 0); + // effective after gh-29137 has been resolved + return sparse.clone(); +} + +template +Tensor reduce_sparse_csr_cpu_template(const Tensor& sparse, IntArrayRef dims_to_sum, bool keepdim, ReductionOp rop) { + TORCH_INTERNAL_ASSERT(sparse.is_sparse_csr()); + TORCH_CHECK(keepdim, "reduction operations on CSR tensors with keepdim=False is unsupported"); + TORCH_INTERNAL_ASSERT(sparse.device() == kCPU); + + const int64_t input_dim = sparse.dim(); + TORCH_INTERNAL_ASSERT(input_dim == 2); + auto dims = dims_to_sum.vec(); + maybe_wrap_dims(dims, input_dim); + if (dims.size() == 0) { + // after gh-29137 is resolved, delete this if-block + dims.emplace_back(0); + dims.emplace_back(1); + } + return reduce_sparse_csr_cpu_template(sparse, dims, rop); +} + +template +struct ReductionAddOp { + inline scalar_t operator()(const scalar_t& a, const scalar_t& b) const { + return a + b; + } + inline scalar_t identity() const { return 0; } +}; + +} // namespace + +Tensor _sparse_csr_sum_cpu(const Tensor& input, IntArrayRef dims_to_sum, bool keepdim, c10::optional dtype) { + ScalarType dtype_ = dtype.value_or(input.scalar_type()); + Tensor input_ = input.to(dtype_); + Tensor result; + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + kHalf, kBFloat16, input_.scalar_type(), "_sparse_csr_sum_cpu", + [&] { + result = reduce_sparse_csr_cpu_template(input_, dims_to_sum, keepdim, ReductionAddOp()); + }); + return result; } } // namespace native diff --git a/aten/src/ATen/native/sparse/SparseCsrTensorMath.h b/aten/src/ATen/native/sparse/SparseCsrTensorMath.h new file mode 100644 index 00000000000000..f23b1ff18e884c --- /dev/null +++ b/aten/src/ATen/native/sparse/SparseCsrTensorMath.h @@ -0,0 +1,63 @@ +#pragma once + +#include +#include + +namespace at { +namespace native { +namespace sparse { +namespace impl { + +// Returns true if all entries of self are zero +// TODO: This has potential to be a generic helper +inline bool _is_sparse_and_zero(const Tensor& self) { + if (self.is_sparse_csr() || self.is_sparse()) { + if (self._nnz() == 0) { + return true; + } + } + return false; +} + +inline void _check_is_cpu(const Tensor& self, c10::string_view name) { + TORCH_CHECK( + self.is_cpu(), + "Expected all tensors to be on the same device. 
addmm expected '", + name, + "' to be CPU tensor, but got ", + self.device(), + " tensor"); +} + +inline void _check_is_cuda(const Tensor& self, c10::string_view name) { + TORCH_CHECK( + self.is_cuda(), + "Expected all tensors to be on the same device. addmm expected '", + name, + "' to be CUDA tensor, but got ", + self.device(), + " tensor"); +} + +inline void _check_dim(const Tensor& self, int64_t target_dim, c10::string_view name) { + if (target_dim == 2) { + TORCH_CHECK( + self.dim() == target_dim, + name, " must be a matrix, ", + "got ", self.dim(), "-D tensor"); + } + TORCH_CHECK( + self.dim() == target_dim, + "Expected ", + name, + " to be of dimension ", + target_dim, + " but got ", + self.dim(), + " instead."); +} + +} +} +} +} diff --git a/aten/src/ATen/native/sparse/SparseTensor.cpp b/aten/src/ATen/native/sparse/SparseTensor.cpp index 5f2edff7db40a4..256a17f22c23c4 100644 --- a/aten/src/ATen/native/sparse/SparseTensor.cpp +++ b/aten/src/ATen/native/sparse/SparseTensor.cpp @@ -569,15 +569,6 @@ SparseTensor sparse_csr_to_sparse(const Tensor& self) { // NB: Dropped the resizeNd variants -Tensor sparse_to_dense( - const SparseTensor& self, - c10::optional dtype) { - TORCH_CHECK( - !dtype.has_value(), "dtype argument is not supported by sparse_to_dense"); - Tensor dst = at::zeros(self.sizes(), self.options().layout(kStrided)); - return dst.add_(self); -} - SparseTensor& copy_sparse_wrapper_( Tensor& self, const Tensor& src, @@ -664,8 +655,8 @@ SparseTensor _coalesce_sparse_cpu(const SparseTensor& self) { auto indicesBufferAccessor = indicesBuffer.accessor(); int64_t i = -1; - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(at::ScalarType::BFloat16, at::ScalarType::Half, values.scalar_type(), - "coalesce", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(at::ScalarType::BFloat16, at::ScalarType::Half, at::ScalarType::Bool, values.scalar_type(), + "coalesce", [&] { int64_t prev = -1; int64_t blockSize = values.stride(0); scalar_t* values_ptr = values.data_ptr(); diff --git a/aten/src/ATen/native/sparse/SparseTensorMath.cpp b/aten/src/ATen/native/sparse/SparseTensorMath.cpp index f98ab775926bbe..0c45074f8a1eb8 100644 --- a/aten/src/ATen/native/sparse/SparseTensorMath.cpp +++ b/aten/src/ATen/native/sparse/SparseTensorMath.cpp @@ -707,6 +707,34 @@ Tensor& mul_sparse_(Tensor& self, const Tensor& other) { return at::mul_out(self, self, other); // redispatch! } +Tensor& mul_out_sparse_csr(const Tensor& t_, const Tensor& src_, Tensor& r) { + // // TODO: Use a specialized CSR kernel for performance if needed + TORCH_CHECK(t_.is_sparse_csr() || (t_.layout() == c10::kStrided && t_.dim() == 0), "mul(dense, sparse_csr) is not supported"); + TORCH_CHECK(src_.is_sparse_csr() || (src_.layout() == c10::kStrided && src_.dim() == 0), "mul(sparse_csr, dense) is not supported"); + TORCH_CHECK(r.is_sparse_csr(), "Expected result Tensor to be of format CSR"); + Tensor t = t_.to_sparse(); + Tensor src = src_.to_sparse(); + Tensor tmp_result = t.mul(src); + auto r_sparse_csr = tmp_result.to_sparse_csr(); + r.resize_as_sparse_(r_sparse_csr); + r.copy_(r_sparse_csr); + return r; +} + +Tensor mul_sparse_csr(const Tensor& self, const Tensor& other) { + auto commonDtype = at::result_type(self, other); + TORCH_CHECK(self.is_sparse_csr(), "mul(dense, sparse_csr) is not supported"); + TORCH_CHECK(other.is_sparse_csr(), "mul(sparse_csr, dense) is not supported"); + auto result_options = self.options().dtype(commonDtype); + // CSR is 2d! 
+ Tensor result = at::empty({0, 0}, result_options); + return at::mul_out(result, self, other); // redispatch! +} + +Tensor& mul_sparse_csr_(Tensor& self, const Tensor& other) { + return at::mul_out(self, self, other); // redispatch! +} + SparseTensor& mul_out_sparse_cpu(const Tensor& t_, const Tensor& src_, SparseTensor& r) { if (src_.dim() == 0) { return mul_out_sparse_zerodim(r, t_, src_); diff --git a/aten/src/ATen/native/sparse/cuda/SparseBlas.cpp b/aten/src/ATen/native/sparse/cuda/SparseBlas.cpp index 6a8b7253fbfc62..722582c3cbdbb7 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseBlas.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseBlas.cpp @@ -3,6 +3,7 @@ #include #include #include +#include #ifndef AT_PER_OPERATOR_HEADERS #include @@ -103,6 +104,7 @@ Tensor sparse_sampled_addmm_sparse_csr_cuda( return result; } +// result = beta * self + alpha * (mat1 @ mat2) Tensor& addmm_out_sparse_csr_cuda( const Tensor& self, const Tensor& mat1, @@ -110,65 +112,63 @@ Tensor& addmm_out_sparse_csr_cuda( const Scalar& beta, const Scalar& alpha, Tensor& result) { - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(mat1.is_sparse_csr()); + sparse::impl::_check_is_cuda(self, "self"); + sparse::impl::_check_is_cuda(mat1, "mat1"); + sparse::impl::_check_is_cuda(mat2, "mat2"); + sparse::impl::_check_is_cuda(result, "result"); // Same checks as in TORCH_META_FUNC(addmm) at // aten/src/ATen/native/LinearAlgebra.cpp - TORCH_CHECK( - mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); - TORCH_CHECK( - mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); + sparse::impl::_check_dim(mat1, 2, "mat1"); + sparse::impl::_check_dim(mat2, 2, "mat2"); - IntArrayRef mat1_sizes = mat1.sizes(); - IntArrayRef mat2_sizes = mat2.sizes(); TORCH_CHECK( - mat1_sizes[1] == mat2_sizes[0], - "mat1 and mat2 shapes cannot be multiplied (", - mat1_sizes[0], - "x", - mat1_sizes[1], - " and ", - mat2_sizes[0], - "x", - mat2_sizes[1], - ")"); + mat1.size(1) == mat2.size(0), "mat1 and mat2 shapes cannot be multiplied (", + mat1.size(0), "x", mat1.size(1), " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); // From addmm_out_cuda_impl at ATen/native/cuda/Blas.cpp // TODO: remove code duplication and unify code // There were undefined symbol problems, // when using the same function for CUDA and SparseCsrCUDA dispatch keys // Also structured kernels do not support sparse output - IntArrayRef self__sizes; - c10::MaybeOwned self_; - if (&result != &self && self.layout() == kStrided) { - self_ = expand_size(self, {mat1_sizes[0], mat2_sizes[1]}, "addmm"); - self__sizes = self_->sizes(); + c10::MaybeOwned self_; + // Don't expand self if this is an in-place operation + if (&result == &self) { + self_ = c10::MaybeOwned::borrowed(self); } else { - self_ = c10::MaybeOwned::borrowed(self); - self__sizes = self_->sizes(); - TORCH_CHECK(result.dim() == 2, "tensors must be 2-D"); - TORCH_CHECK( - self__sizes[0] == mat1_sizes[0], "self_ dim 0 must match mat1 dim 0"); - TORCH_CHECK( - self__sizes[1] == mat2_sizes[1], "self_ dim 1 must match mat2 dim 1"); + self_ = expand_size(self, {mat1.size(0), mat2.size(1)}, "addmm"); } + sparse::impl::_check_dim(*self_, 2, "self"); + TORCH_CHECK(((self_->dim() == 2) && + (self_->size(0) == mat1.size(0)) && + (self_->size(1) == mat2.size(1))), + "The input tensor must be a matrix with size ", + mat1.size(0), + "x", + mat2.size(1), + ", but got a ", + self_->dim(), + "-D tensor with size ", + self_->size(0), + "x", + self_->size(1)); + if (&result != &self) { if (result.layout() == 
kStrided) { - at::native::resize_output(result, self__sizes); + at::native::resize_output(result, self_->sizes()); } else { - at::native::resize_as_sparse_csr_(result, *self_); + result.resize_as_sparse_(*self_); } result.copy_(*self_); } - IntArrayRef result_sizes = result.sizes(); - if ((result_sizes[0] == 0) || (result_sizes[1] == 0)) { + if (result.numel() == 0) { return result; } - if (mat1._nnz() == 0 && mat2.layout() == kStrided) { - // According to docs, when beta==0 values in self should be ignored + if (sparse::impl::_is_sparse_and_zero(mat1) || sparse::impl::_is_sparse_and_zero(mat2)) { + // According to docs, when beta==0 values in self should be ignored. // nans and infs should not propagate if (beta.toComplexDouble() == 0.) { result.zero_(); @@ -178,15 +178,6 @@ Tensor& addmm_out_sparse_csr_cuda( return result; } - if (mat2.is_sparse_csr() && (mat1._nnz() == 0 || mat2._nnz() == 0)) { - if (beta.toComplexDouble() == 0.) { - result.values().zero_(); - } else { - result.values().mul_(beta); - } - return result; - } - sparse::impl::cuda::addmm_out_sparse_csr(mat1, mat2, beta, alpha, result); return result; } diff --git a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp index f5396757ab7ce9..7cfe1248fb6243 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp @@ -120,6 +120,15 @@ void inline col_indices_and_values_resize_(const Tensor& input, int64_t nnz) { input.sizes()); } +void inline bsrsv2_bsrsm2_may_need_to_sync() { +#if defined(CUSPARSE_VERSION) && CUSPARSE_VERSION < 11703 + // cusparse bsrsv2 and bsrsm2 have a synchronization issue that may cause illegal memory access in cuda <= 11.6.x + // See https://github.com/pytorch/pytorch/issues/71297 + ::c10::cuda::device_synchronize(); +#endif + // else: do nothing! +} + void block_sparse_triangular_solve_vec( const at::sparse_csr::SparseCsrTensor& A, const Tensor& B, @@ -230,6 +239,8 @@ void block_sparse_triangular_solve_vec( X_->data_ptr(), CUSPARSE_SOLVE_POLICY_NO_LEVEL, work_data.get()); + + bsrsv2_bsrsm2_may_need_to_sync(); }); if (!X.is_same(*X_)) { X.copy_(*X_); @@ -360,6 +371,8 @@ void block_sparse_triangular_solve_mat( ldx, CUSPARSE_SOLVE_POLICY_NO_LEVEL, work_data.get()); + + bsrsv2_bsrsm2_may_need_to_sync(); }); if (!X.is_same(*X_)) { X.copy_(*X_); @@ -793,19 +806,23 @@ void spgemm( } // anonymous namespace void addmm_out_sparse_csr( - const at::sparse_csr::SparseCsrTensor& mat1, + const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, const Tensor& result) { - if (mat2.layout() == kStrided && result.layout() == kStrided) { + if (mat1.is_sparse_csr() && mat2.layout() == kStrided && result.layout() == kStrided) { return spmm(mat1, mat2, beta, alpha, result); - } else if (mat2.is_sparse_csr() && result.is_sparse_csr()) { + } + if (mat1.layout() == kStrided && mat2.is_sparse_csr() && result.layout() == kStrided) { + // TODO: We can use cuSPARSE's transposition flags once we have CSC support. 
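+    // Until then, use (mat1 @ mat2)^T == mat2^T @ mat1^T: with the CSR mat2
+    // transposed into the sparse operand and the strided mat1 as the dense
+    // operand, the existing spmm path can write the product into result^T.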
+ return spmm(mat2.transpose(0, 1), mat1.transpose(0, 1), beta, alpha, result.transpose(0, 1)); + } + if (mat1.is_sparse_csr() && mat2.is_sparse_csr() && result.is_sparse_csr()) { return spgemm(mat1, mat2, beta, alpha, result); - } else { - TORCH_CHECK(false, "addmm: computation on CUDA is not implemented for ", - result.layout(), " + ", mat1.layout(), " @ ", mat2.layout()); } + TORCH_CHECK(false, "addmm: computation on CUDA is not implemented for ", + result.layout(), " + ", mat1.layout(), " @ ", mat2.layout()); } /* @@ -965,6 +982,24 @@ void add_out_sparse_csr( auto B_col_indices_ptr = B_col_indices.data_ptr(); auto C_col_indices_ptr = C_col_indices.data_ptr(); + // Windows compilers don't support nested macros + // so we need this lambda outside of the + // AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES + auto fix_nnz = [ +#if AT_ROCM_ENABLED() + &C_crow_indices, + &m +#endif + ](int nnz) -> int { +// For some reason POINTER_MODE_HOST is not working here +// Let's extract manually the nnz from the C_crow_indices +#if AT_ROCM_ENABLED() + return std::max({nnz, C_crow_indices.narrow(-1, m, 1).item()}); +#else + return nnz; +#endif + }; + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES( C.scalar_type(), "add_out_sparse_csr_cuda_impl", [&] { auto beta_ = beta.to(); @@ -1025,6 +1060,8 @@ void add_out_sparse_csr( &nnzC, work_data.get()); + nnzC = fix_nnz(nnzC); + // Resize result using nnz information from cusparse col_indices_and_values_resize_(C, nnzC); C_col_indices = C.col_indices(); diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDAApplyUtils.cuh b/aten/src/ATen/native/sparse/cuda/SparseCUDAApplyUtils.cuh index a8c622639ee6e3..2a266319212a79 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDAApplyUtils.cuh +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDAApplyUtils.cuh @@ -2,6 +2,7 @@ #include #include +#include #include namespace at { namespace native { @@ -304,7 +305,7 @@ __global__ void indexSparseIntersectionKernel( // } template -C10_LAUNCH_BOUNDS_1(C10_WARP_SIZE*4) +C10_LAUNCH_BOUNDS_1(num_threads()) __global__ void coalesceValuesKernel( int64_t *segment_offsets, int64_t *value_indices, Dtype *values, Dtype *newValues, @@ -328,7 +329,6 @@ __global__ void coalesceValuesKernel( for (int row = begin; row < end; row++) { const int valueRow = ((int) value_indices[row]) * stride; - #pragma unroll for (int ii = 0; ii < SZ; ii++) { @@ -351,6 +351,56 @@ __global__ void coalesceValuesKernel( } } +// coalesceValuesKernel when Dtype/Acctype is bool. Can be eliminated using +// `if constexpr` when CUDA codes will be compiled under C++-17, see +// gh-56055 for blockers. +template +C10_LAUNCH_BOUNDS_1(C10_WARP_SIZE*4) +__global__ void coalesceValuesKernel( + int64_t *segment_offsets, int64_t *value_indices, + bool *values, bool *newValues, + int64_t nnz, int64_t newNnz, int64_t stride) { + + int seg = blockIdx.x * 4 + threadIdx.y; + + // Number of values processed by each thread (grain size) + const int SZ = 4; + + if (seg < newNnz) { + const int newValueRow = seg * stride; + const int begin = segment_offsets[seg]; + const int end = (seg < newNnz - 1) ? 
segment_offsets[seg + 1] : nnz; + const int startFeature = threadIdx.x + blockIdx.y * blockDim.x * SZ; + bool tmp[SZ]; + #pragma unroll + for (int ii = 0; ii < SZ; ii++) { + tmp[ii] = 0; + } + for (int row = begin; row < end; row++) { + const int valueRow = ((int) value_indices[row]) * stride; + + #pragma unroll + for (int ii = 0; ii < SZ; ii++) + { + int featureDim = startFeature + ii * C10_WARP_SIZE; + if (featureDim < stride) + { + tmp[ii] |= values[valueRow + featureDim]; + } + } + } + #pragma unroll + for (int ii = 0; ii < SZ; ii++) + { + int featureDim = startFeature + ii * C10_WARP_SIZE; + if (featureDim < stride) + { + newValues[newValueRow + featureDim] = tmp[ii]; + } + } + } +} + } // namespace apply }} // namespace at::native diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDATensor.cu b/aten/src/ATen/native/sparse/cuda/SparseCUDATensor.cu index 30e7d873b39cf8..dc5a2acf2da1a5 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDATensor.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDATensor.cu @@ -142,10 +142,11 @@ SparseTensor _coalesce_sparse_cuda(const SparseTensor& self) { const int SZ = 4; values = values.contiguous(); int64_t stride = c10::multiply_integers(values.sizes().slice(1)); - dim3 grid(ceil_div(newNnz, (int64_t) SZ), ceil_div(stride, (int64_t) C10_WARP_SIZE*SZ)); - dim3 block(C10_WARP_SIZE, SZ); - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - at::ScalarType::Half, at::ScalarType::BFloat16, values.scalar_type(), "coalesce_sparse_cuda", [&] { + int warp_size = at::cuda::warp_size(); + dim3 grid(ceil_div(newNnz, (int64_t) SZ), ceil_div(stride, (int64_t) warp_size*SZ)); + dim3 block(warp_size, SZ); + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( + at::ScalarType::Half, at::ScalarType::BFloat16, at::ScalarType::Bool, values.scalar_type(), "coalesce_sparse_cuda", [&] { using cuda_accscalar_t = acc_type; apply::coalesceValuesKernel<<>>( uniqueOffsets.data_ptr(), diff --git a/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu b/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu index c13984f2d92ff6..09663a8c0768d2 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu @@ -16,8 +16,12 @@ #else #include #include +#include +#include #include #include +#include +#include #endif #include @@ -29,6 +33,7 @@ #include #include +#include #include #include #include @@ -159,18 +164,26 @@ Tensor& add_out_dense_sparse_csr_cuda( " in add operation"); Tensor src_values = src.values(); - Tensor src_crow_indices = src.crow_indices(); - Tensor src_col_indices = src.col_indices(); resize_output(output, dense.sizes()); Tensor resultBuffer = output; - Tensor valuesBuffer = src_values.to(commonDtype); + if (output.scalar_type() != commonDtype) { resultBuffer = dense.to(commonDtype); } else if (!is_same_tensor(output, dense)) { resultBuffer.copy_(dense); } + + if (src._nnz() == 0) { + return output; + } + + auto valuesBuffer = src_values.to(commonDtype).view({-1, src_values.size(-1)}); + resultBuffer = resultBuffer.view({-1, output.size(-2), output.size(-1)}); + auto src_crow_indices = src.crow_indices().view({-1, src.crow_indices().size(-1)}); + auto src_col_indices = src.col_indices().view({-1, src.col_indices().size(-1)}); + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( kHalf, kBool, kBFloat16, commonDtype, @@ -180,6 +193,7 @@ Tensor& add_out_dense_sparse_csr_cuda( src_crow_indices.scalar_type(), "csr_add_out_crow_indices", [&valuesBuffer, &resultBuffer, &alpha, &src_crow_indices, &src_col_indices]() { + auto 
batch_count = resultBuffer.dim() > 2 ? resultBuffer.size(-3) : 1; scalar_t* values_accessor = valuesBuffer.data_ptr(); scalar_t* out_ptr = resultBuffer.data_ptr(); scalar_t cast_value = alpha.to(); @@ -189,8 +203,11 @@ Tensor& add_out_dense_sparse_csr_cuda( int64_t out_storage_offset = resultBuffer.storage_offset(); auto out_strides = resultBuffer.strides(); - int64_t out_strides0 = out_strides[0]; - int64_t out_strides1 = out_strides[1]; + auto out_strides0 = out_strides[0]; + auto out_strides1 = out_strides[1]; + auto crow_stride0 = src_crow_indices.stride(0); + auto col_stride0 = src_col_indices.stride(0); + auto val_stride0 = valuesBuffer.stride(0); cudaStream_t stream = at::cuda::getCurrentCUDAStream(); at::cuda::ThrustAllocator allocator; @@ -200,24 +217,29 @@ Tensor& add_out_dense_sparse_csr_cuda( thrust::for_each( policy, thrust::make_counting_iterator(int64_t(0)), - thrust::make_counting_iterator(int64_t(src_crow_indices.size(0) - 1)), + thrust::make_counting_iterator(int64_t(src_crow_indices.size(-1) - 1)), [values_accessor, crow_indices_accessor, col_indices_accessor, out_ptr, - out_storage_offset, - out_strides0, cast_value, - out_strides1 + out_strides0, + out_strides1, + crow_stride0, + col_stride0, + val_stride0, + batch_count ]__device__(int64_t irow) { - index_t start_index = crow_indices_accessor[irow]; - index_t end_index = crow_indices_accessor[irow + 1]; + for (index_t batch_idx = 0; batch_idx < batch_count; batch_idx++) { + index_t start_index = crow_indices_accessor[batch_idx*crow_stride0 + irow]; + index_t end_index = crow_indices_accessor[batch_idx*crow_stride0 + irow + 1]; for (index_t i = start_index; i < end_index; ++i) { - auto icol = col_indices_accessor[i]; - auto index = out_storage_offset + irow * out_strides0 + icol * out_strides1; - out_ptr[index] += cast_value * values_accessor[i]; + auto icol = col_indices_accessor[batch_idx*col_stride0 + i]; + auto index = batch_idx * out_strides0 + irow * out_strides1 + icol; + out_ptr[index] += cast_value * values_accessor[batch_idx*val_stride0 + i]; } + } }); }); }); @@ -275,5 +297,321 @@ TORCH_IMPL_FUNC(_convert_indices_from_csr_to_coo_structured_cuda) ( } } + /* + Reductions on sparse CSR tensors using masked semantics. + + - To support a reduction operator on a CSR tensor with CUDA storage, define + +template +struct Reduction...Op { + __device__ __forceinline__ scalar_t operator()(const scalar_t a, const scalar_t b) const { + return a ... b; + } + __device__ __forceinline__ scalar_t identity() const { return ...; } + __forceinline__ scalar_t identity_cpu() const { return ...; } +}; + + +Tensor _sparse_csr_..._cuda(const Tensor& input, IntArrayRef dims_to_sum, bool keepdim, c10::optional dtype) { + ... + result = reduce_sparse_csr_cuda_template(input_, dims_to_sum, keepdim, Reduction...Op()); + ... + return result; +} + + and add the following + + - func: _sparse_csr_op.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? 
dtype=None) -> Tensor + dispatch: + SparseCsrCUDA: _sparse_csr_..._cuda + + to native_functions.yaml + */ + +namespace { + +template +__global__ void reduce_sparse_csr_dim0_cuda_kernel(scalar_t* new_values, + const index_t* new_col_indices, + const int64_t new_nnz, + const scalar_t* values, + const index_t* col_indices, + const int64_t nnz, + ReductionOp rop + ) { + int64_t tid = blockDim.x * blockIdx.x + threadIdx.x; + if (tid < new_nnz) { + index_t col = new_col_indices[tid]; + scalar_t v = rop.identity(); + for (int64_t j=0; j < nnz; j++) { + if (col == col_indices[j]) { + v = rop(v, values[j]); + } + } + new_values[tid] = v; + } +} + +template +Tensor reduce_sparse_csr_dim0_cuda_template(const Tensor& sparse, ReductionOp rop) { + /* + Consider the following sparse tensor: + + 1 * * * * + * * * 2 * + * * 3 * * + * * * * * + 4 * 5 * * + + that has CSR representation + + crow_indices = [0, 1, 2, 3, 3, 5] + col_indices = [0, 3, 2, 0, 2] + values = [1, 2, 3, 4, 5] + + Reduction with dim=0 results: + + rop(1,4) * rop(3,5) 2 * + + that has CSR representation + + new_crow_indices = [0, 3] + new_col_indices = [0, 2, 3] + new_values = [rop(1, 4], rop(3, 5), 2] + + In general, the CSR representation data can be computed as follows: + + nnz = col_indices.numel() + new_col_indices = col_indices.unique(sorted=True, return_inverse=False) + new_nnz = new_col_indices.numel() + new_crow_indices = [0, new_nnz] + new_values.resize(new_nnz) + + for i in range(new_nnz): + v = identity + col = new_col_indices[i] + for j in range(nnz): + if col == col_indices[j]: + v = rop(v, values[j]) + new_values[i] = v + + Notice this algorithm is different from the one used on CPU data. + */ + + Tensor col_indices = sparse.col_indices(); + Tensor values = sparse.values(); + auto ncols = sparse.size(1); + auto nnz = col_indices.numel(); + Tensor new_col_indices; + + std::tie(new_col_indices, std::ignore) = at::_unique(col_indices, true, false); + auto new_nnz = new_col_indices.numel(); + Tensor new_crow_indices = at::tensor(ArrayRef{0, new_nnz}, col_indices.options()); + Tensor new_values = at::empty({new_nnz}, values.options()); + + scalar_t* values_ptr = values.data_ptr(); + scalar_t* new_values_ptr = new_values.data_ptr(); + int64_t THREADS = at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock; + int64_t BLOCKS = (new_nnz + THREADS) / THREADS; + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "reduce_sparse_csr_dim0_cuda_indices", + [&]() { + index_t* col_indices_ptr = col_indices.data_ptr(); + index_t* new_col_indices_ptr = new_col_indices.data_ptr(); + reduce_sparse_csr_dim0_cuda_kernel<<>>(new_values_ptr, + new_col_indices_ptr, + new_nnz, + values_ptr, + col_indices_ptr, + nnz, + rop + ); + }); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values, + {1, ncols}, + new_values.scalar_type(), + sparse.layout(), + new_values.device()); +} + +template +__global__ void reduce_crow_indices_dim1_cuda_kernel(index_t* new_crow_indices, + index_t* row_map, + const index_t* crow_indices, + const int64_t nrows + ) { + int64_t nnz = 0; + new_crow_indices[0] = 0; + for(int64_t i=0; i +__global__ void reduce_sparse_csr_dim1_cuda_kernel(scalar_t* new_values, + const scalar_t* values, + const index_t* crow_indices, + const index_t* row_map, + const int64_t nrows, + ReductionOp rop + ) { + int64_t tid = blockDim.x * blockIdx.x + threadIdx.x; + if (tid < nrows) { + index_t i_start = 
crow_indices[tid]; + index_t i_end = crow_indices[tid+1]; + if (i_start != i_end) { + scalar_t acc = rop.identity(); + for (index_t i = i_start; i < i_end; i++) { + acc = rop(acc, values[i]); + } + new_values[row_map[tid]] = acc; + } + } +} + +template +Tensor reduce_sparse_csr_dim1_cuda_template(const Tensor& sparse, ReductionOp rop) { + /* + The algorithm of computing reduce of a CSR tensor along the last + dimension is explained in the comment of the + reduce_sparse_csr_dim1_cpu_template function. + */ + Tensor crow_indices = sparse.crow_indices(); + auto ioptions = crow_indices.options(); + Tensor values = sparse.values(); + auto nrows = sparse.size(0); + auto numel = values.numel(); + + Tensor new_crow_indices = at::empty({crow_indices.numel()}, ioptions); + Tensor new_col_indices = at::empty({}, ioptions); + Tensor new_values = at::empty({}, values.options()); + Tensor row_map = at::empty({nrows}, ioptions); + + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); + int64_t THREADS = at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock; + int64_t BLOCKS = (nrows + THREADS) / THREADS; + + AT_DISPATCH_INDEX_TYPES(crow_indices.scalar_type(), "reduce_sparse_csr_dim1_cuda_indices", + [&]() { + index_t* crow_indices_ptr = crow_indices.data_ptr(); + index_t* new_crow_indices_ptr = new_crow_indices.data_ptr(); + index_t* row_map_ptr = row_map.data_ptr(); + reduce_crow_indices_dim1_cuda_kernel<<<1, 1, 0, stream>>>(new_crow_indices_ptr, + row_map_ptr, + crow_indices_ptr, + nrows); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + index_t new_nnz = new_crow_indices[-1].item(); + new_col_indices.resize_(new_nnz); + new_col_indices.fill_(index_t(0)); + new_values.resize_(new_nnz); + + scalar_t* values_ptr = values.data_ptr(); + scalar_t* new_values_ptr = new_values.data_ptr(); + reduce_sparse_csr_dim1_cuda_kernel<<>>(new_values_ptr, + values_ptr, + crow_indices_ptr, + row_map_ptr, + nrows, + rop); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + }); + + return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values, + {sparse.size(0), 1}, + new_values.scalar_type(), + sparse.layout(), + new_values.device()); +} + +template +Tensor reduce_sparse_csr_dim01_cuda_template(const Tensor& sparse, ReductionOp rop) { + + auto ioptions = sparse.col_indices().options(); + Tensor values = sparse.values(); + auto numel = values.numel(); + auto nnz = std::min(1, numel); + + Tensor new_values; + if (numel > 0) { + new_values = at::empty({1}, values.options()); + auto iter = TensorIterator::reduce_op(new_values, values); + gpu_reduce_kernel(iter, func_wrapper(rop), rop.identity_cpu()); + } else { + new_values = at::empty({}, values.options()); + } + Tensor new_col_indices = at::zeros({nnz}, ioptions); + Tensor new_crow_indices = at::tensor(ArrayRef{0, nnz}, ioptions); + return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values, + {1, std::min(1, sparse.size(1))}, + new_values.scalar_type(), + sparse.layout(), + new_values.device()); +} + +template +Tensor reduce_sparse_csr_cuda_template(const Tensor& sparse, std::vector dims, ReductionOp rop) { + if (dims.size() == 1) { + if (dims[0] == 0) { + return reduce_sparse_csr_dim0_cuda_template(sparse, rop); + } else { + TORCH_INTERNAL_ASSERT(dims[0] == 1); + return reduce_sparse_csr_dim1_cuda_template(sparse, rop); + } + } else if (dims.size() == 2) { + TORCH_INTERNAL_ASSERT(((dims[0] == 0 && dims[1] == 1) || (dims[0] == 1 && dims[1] == 0))); + return reduce_sparse_csr_dim01_cuda_template(sparse, rop); + } + 
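+  // Remaining case: an empty dim list. The IntArrayRef overload below
+  // currently expands an empty list to {0, 1}, so this branch only becomes
+  // reachable once gh-29137 is resolved.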
TORCH_INTERNAL_ASSERT(dims.size() == 0); + // effective after gh-29137 has been resolved + return sparse.clone(); +} + +template +Tensor reduce_sparse_csr_cuda_template(const Tensor& sparse, IntArrayRef dims_to_sum, bool keepdim, ReductionOp rop) { + TORCH_INTERNAL_ASSERT(sparse.is_sparse_csr()); + TORCH_CHECK(keepdim, "reduction operations on CSR tensors with keepdim=False is unsupported"); + TORCH_INTERNAL_ASSERT(sparse.is_cuda()); + + const int64_t input_dim = sparse.dim(); + TORCH_INTERNAL_ASSERT(input_dim == 2); + auto dims = dims_to_sum.vec(); + maybe_wrap_dims(dims, input_dim); + if (dims.size() == 0) { + // after gh-29137 is resolved, delete this if-block + dims.emplace_back(0); + dims.emplace_back(1); + } + return reduce_sparse_csr_cuda_template(sparse, dims, rop); +} + +template +struct ReductionAddOp { + __device__ __forceinline__ scalar_t operator()(const scalar_t a, const scalar_t b) const { + return a + b; + } + __device__ __forceinline__ scalar_t identity() const { return 0; } + __forceinline__ scalar_t identity_cpu() const { return 0; } +}; + +} // namespace + +Tensor _sparse_csr_sum_cuda(const Tensor& input, IntArrayRef dims_to_sum, bool keepdim, c10::optional dtype) { + ScalarType dtype_ = dtype.value_or(input.scalar_type()); + Tensor input_ = input.to(dtype_); + Tensor result; + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + kHalf, kBFloat16, input_.scalar_type(), "_sparse_csr_sum_cuda", + [&] { + result = reduce_sparse_csr_cuda_template(input_, dims_to_sum, keepdim, ReductionAddOp()); + }); + return result; +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ts_native_functions.yaml b/aten/src/ATen/native/ts_native_functions.yaml new file mode 100644 index 00000000000000..6650387757d394 --- /dev/null +++ b/aten/src/ATen/native/ts_native_functions.yaml @@ -0,0 +1,177 @@ +backend: Lazy +cpp_namespace: torch::lazy +full_codegen: + - _adaptive_avg_pool2d + - _adaptive_avg_pool2d_backward + - _log_softmax + - _log_softmax_backward_data + - _softmax + - _softmax_backward_data + - abs + - add.Tensor + - addcdiv + - addcmul + - addmm + - arange.start_out + - all + - any + - avg_pool2d + - avg_pool2d_backward + - baddbmm + - bernoulli + - bernoulli_.float + - binary_cross_entropy + - binary_cross_entropy_backward + - bitwise_and.Tensor + - bitwise_or.Tensor + - bmm + - cat + - clamp + - clamp_min + - constant_pad_nd + - convolution + - convolution_backward + - cos + - cumsum + - div.Tensor + - div.Tensor_mode + - elu + - elu_backward + - embedding + - embedding_dense_backward + - eq.Scalar + - eq.Tensor + - exp + - flip + - floor + - frac + - gather + - ge.Scalar + - ge.Tensor + - gelu + - gelu_backward + - glu + - glu_backward + - grid_sampler_2d + - grid_sampler_2d_backward + - gt.Scalar + - gt.Tensor + - hardsigmoid + - index_select + - kl_div_backward + - l1_loss_backward + - le.Scalar + - le.Tensor + - leaky_relu + - leaky_relu_backward + - log + - log2 + - logdet + - log_sigmoid_backward + - log_sigmoid_forward + - lt.Scalar + - lt.Tensor + - masked_fill_.Scalar + - masked_fill_.Tensor + - max + - max.dim + - max_pool2d_with_indices + - max_pool2d_with_indices_backward + - maximum + - mean + - mean.dim + - min + - minimum + - mm + - mul.Tensor + - mv + - native_dropout + - native_dropout_backward + - native_layer_norm + - native_layer_norm_backward + - ne.Scalar + - ne.Tensor + - neg + - nll_loss_backward + - nll_loss_forward + - nll_loss2d_backward + - nll_loss2d_forward + - norm.ScalarOpt_dim + - pow.Tensor_Scalar + - pow.Tensor_Tensor + - random_ + 
- random_.from + - random_.to + - reciprocal + - relu + - relu_ + - remainder.Tensor + - repeat + - rsqrt + - scatter_add + - sgn + - sigmoid + - sigmoid_backward + - silu + - smooth_l1_loss + - smooth_l1_loss_backward + - softplus + - softplus_backward + - sort + - sqrt + - stack + - std + - std.dim + - std.correction + - sub.Tensor + - sum + - sum.dim_IntList + - tanh + - tanh_backward + - threshold + - threshold_backward + - topk + - trace + - tril + - triu + - trunc + - upsample_bilinear2d + - upsample_bilinear2d_backward + - upsample_nearest2d + - upsample_nearest2d_backward + - zero_ +supported: + - as_strided + - as_strided_ + - clone + - _copy_from + - _copy_from_and_resize + - diagonal + - empty.memory_format + - empty_strided + - expand + - fill_.Scalar + - native_batch_norm + - native_batch_norm_backward + - normal_ + - max_pool3d_with_indices + - max_pool3d_with_indices_backward + - permute + - select.int + - slice.Tensor + - squeeze + - squeeze.dim + - squeeze_ + - squeeze_.dim + - t + - t_ + - _to_copy + - transpose.int + - transpose_ + - unsqueeze + - unsqueeze_ + - view + - alias + - _unsafe_view +autograd: + - max_pool3d diff --git a/aten/src/ATen/native/vulkan/ops/Gru.cpp b/aten/src/ATen/native/vulkan/ops/Gru.cpp index 9052b43189d00c..8b0e99ab00bd7b 100644 --- a/aten/src/ATen/native/vulkan/ops/Gru.cpp +++ b/aten/src/ATen/native/vulkan/ops/Gru.cpp @@ -1,5 +1,5 @@ -#include -#include +#include +#include namespace at { namespace native { @@ -95,6 +95,151 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { #endif /* USE_VULKAN_API */ } // namespace + +std::vector pack_linear_op_contexts( + const std::vector& params_cpu, + int64_t num_layers) { + TORCH_CHECK(params_cpu.size() == 4 * num_layers, "Vulkan gru expects 'params_cpu' size to be 4 * 'num_layers'."); + std::vector linear_op_contexts; + for (int64_t i = 0; i < num_layers; ++i) { + const auto& w_ih = params_cpu.at(i * 4); + const auto& w_hh = params_cpu.at(i * 4 + 1); + const auto& b_ih = params_cpu.at(i * 4 + 2); + const auto& b_hh = params_cpu.at(i * 4 + 3); + const auto& h_in = w_ih.size(0) / 3; + + const auto& w_i_rzn = w_ih.split(h_in); + const auto& w_h_rzn = w_hh.split(h_in); + const auto& b_i_rzn = b_ih.split(h_in); + const auto& b_h_rzn = b_hh.split(h_in); + + const auto& w_ir = w_i_rzn[0]; + const auto& w_iz = w_i_rzn[1]; + const auto& w_in = w_i_rzn[2]; + const auto& w_hr = w_h_rzn[0]; + const auto& w_hz = w_h_rzn[1]; + const auto& w_hn = w_h_rzn[2]; + const auto& b_ir = b_i_rzn[0]; + const auto& b_iz = b_i_rzn[1]; + const auto& b_in = b_i_rzn[2]; + const auto& b_hr = b_h_rzn[0]; + const auto& b_hz = b_h_rzn[1]; + const auto& b_hn = b_h_rzn[2]; + + linear_op_contexts.emplace_back(LinearOpContext::create(w_ir.t(), b_ir)); + linear_op_contexts.emplace_back(LinearOpContext::create(w_hr.t(), b_hr)); + linear_op_contexts.emplace_back(LinearOpContext::create(w_iz.t(), b_iz)); + linear_op_contexts.emplace_back(LinearOpContext::create(w_hz.t(), b_hz)); + linear_op_contexts.emplace_back(LinearOpContext::create(w_in.t(), b_in)); + linear_op_contexts.emplace_back(LinearOpContext::create(w_hn.t(), b_hn)); + } + return linear_op_contexts; +} + +GruOpContext::GruOpContext( + const std::vector& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) + : packed_{pack_linear_op_contexts(params_cpu, num_layers), has_biases, num_layers, dropout, train, bidirectional, batch_first}, + unpacked_{params_cpu, has_biases, num_layers, dropout, train, bidirectional, 
batch_first} { + TORCH_INTERNAL_ASSERT(packed_.has_biases, "Vulkan gru expects 'has_biases' to be true."); + TORCH_INTERNAL_ASSERT(!packed_.train, "Vulkan gru expects 'train' to be false."); + TORCH_INTERNAL_ASSERT(!packed_.bidirectional, "Vulkan gru expects 'bidirectional' to be false."); + TORCH_INTERNAL_ASSERT(packed_.batch_first, "Vulkan gru expects 'batch_first' to be true."); + TORCH_INTERNAL_ASSERT(packed_.dropout < std::numeric_limits::epsilon()*1000, "Vulkan gru expects 'dropout' to be 0.0."); +} + +GruOpContext GruOpContext::create( + const std::vector& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return GruOpContext{ + params_cpu, + has_biases, + num_layers, + dropout, + train, + bidirectional, + batch_first + }; +} + +std::tuple GruOpContext::run( + const Tensor & input_vk, // input sequence (vulkan) + const Tensor & hx_vk) const { // initial hidden state (vulkan) + TORCH_INTERNAL_ASSERT(input_vk.sizes().size() == 3, "Vulkan gru expects 'input_vk' dims to be 3."); + TORCH_INTERNAL_ASSERT(hx_vk.sizes().size() == 3, "Vulkan gru expects 'hx_vk' dims to be 3."); + + const int64_t linear_op_contexts_per_layer = 6; // (b_ir, w_ir), (b_hr, w_hr), (b_iz, w_iz), (b_hz, w_hz), (b_in, w_in), (b_hn, w_hn) + std::vector h_n_list; // hidden output + + // reshape to 2D due to Vulkan at::mm op accepts only 2D + auto x = input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + + for (int64_t i = 0; i < packed_.num_layers; ++i) { + // extract each hidden state and squeeze into 2D dim + auto h = at::slice(hx_vk, 0, i, i + 1, 1); + h = h.reshape({h.size(0) * h.size(1), h.size(2)}); + + const auto& cxt_ir = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 0]; + const auto& cxt_hr = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 1]; + const auto& cxt_iz = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 2]; + const auto& cxt_hz = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 3]; + const auto& cxt_in = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 4]; + const auto& cxt_hn = packed_.linear_op_contexts[i * linear_op_contexts_per_layer + 5]; + + const auto& r = at::sigmoid(cxt_ir.run(x, 1.0f, 1.0f) + cxt_hr.run(h, 1.0f, 1.0f)); + const auto& z = at::sigmoid(cxt_iz.run(x, 1.0f, 1.0f) + cxt_hz.run(h, 1.0f, 1.0f)); + const auto& n = at::tanh(cxt_in.run(x, 1.0f, 1.0f) + r * (cxt_hn.run(h, 1.0f, 1.0f))); + h = (z * (-1) + 1) * n + z * h; + x = h; // next input + h_n_list.emplace_back(h.reshape({1, 1, h.size(0), h.size(1)})); // 2D to 4D for cat op + } + + auto h_n = at::cat(h_n_list, 1); + h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); + return std::tuple(x, h_n); +} + +GruOpContext::State GruOpContext::unpack() const { + return GruOpContext::State{ + unpacked_.params_cpu, + unpacked_.has_biases, + unpacked_.num_layers, + unpacked_.dropout, + unpacked_.train, + unpacked_.bidirectional, + unpacked_.batch_first, + }; +} + +c10::intrusive_ptr gru_prepack( + std::vector&& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return c10::make_intrusive(GruOpContext::create( + params_cpu, has_biases, num_layers, dropout, train, bidirectional, batch_first)); +} + +std::tuple gru_run( + const Tensor& input_vk, + const Tensor& hx_vk, + const c10::intrusive_ptr& context) { + return context->run(input_vk, hx_vk); +} + 
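+// Typical usage of the custom ops registered in Register.cpp (mirroring the
+// vulkan_api_test coverage added in this change); the tensor names here are
+// illustrative only:
+//
+//   auto ctx = gru_prepack({w_ih_l0, w_hh_l0, b_ih_l0, b_hh_l0},
+//       /*has_biases=*/true, /*num_layers=*/1, /*dropout=*/0.0,
+//       /*train=*/false, /*bidirectional=*/false, /*batch_first=*/true);
+//   std::tuple<Tensor, Tensor> out = gru_run(input_vk, hx_vk, ctx);
+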
} // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Gru.h b/aten/src/ATen/native/vulkan/ops/Gru.h new file mode 100644 index 00000000000000..8000aa449ca4f2 --- /dev/null +++ b/aten/src/ATen/native/vulkan/ops/Gru.h @@ -0,0 +1,85 @@ +#pragma once + +#ifdef USE_VULKAN_API + +#include +#include +#include + +namespace at { +namespace native { +namespace vulkan { +namespace ops { + +class GruOpContext final : public torch::jit::CustomClassHolder { + public: + static GruOpContext create( + const std::vector& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first); + + using State = std::tuple, bool, int64_t, double, bool, bool, bool>; + + std::tuple run( + const Tensor& input_vk, + const Tensor & hx_vk) const; + State unpack() const; + + private: + GruOpContext( + const std::vector& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first); + + private: + struct { + std::vector linear_op_contexts; // {{ op context for b_ir, w_ir, op context for b_hr, w_hr, + // op context for b_iz, w_iz, op context for b_hz, w_hz, + // op context for b_in, w_in, op context for b_hn, w_hn,}, ...} + bool has_biases{}; + int64_t num_layers{}; + double dropout{}; + bool train{}; + bool bidirectional{}; + bool batch_first{}; + } packed_; + + struct { + std::vector params_cpu; // weights/biases (cpu) + bool has_biases{}; + int64_t num_layers{}; + double dropout{}; + bool train{}; + bool bidirectional{}; + bool batch_first{}; + } unpacked_; +}; + +c10::intrusive_ptr gru_prepack( + std::vector&& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first); + +std::tuple gru_run( + const Tensor& input_vk, + const Tensor & hx_vk, + const c10::intrusive_ptr& context); + +} // namespace ops +} // namespace vulkan +} // namespace native +} // namespace at + +#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/Register.cpp b/aten/src/ATen/native/vulkan/ops/Register.cpp index 4b90fc8696e1ff..942836cf6838a4 100644 --- a/aten/src/ATen/native/vulkan/ops/Register.cpp +++ b/aten/src/ATen/native/vulkan/ops/Register.cpp @@ -2,6 +2,7 @@ #include #include +#include #include #include #include @@ -28,9 +29,9 @@ TORCH_LIBRARY(vulkan, m) { std::move(std::get<2>(state)), std::move(std::get<3>(state)), std::move(std::get<4>(state)), - std::move(std::get<5>(state)), - std::move(std::get<6>(state)), - std::move(std::get<7>(state))); + std::get<5>(state), + std::get<6>(state), + std::get<7>(state)); }); m.class_("TransposeConv2dOpContext") .def_pickle( @@ -47,9 +48,9 @@ TORCH_LIBRARY(vulkan, m) { std::move(std::get<3>(state)), std::move(std::get<4>(state)), std::move(std::get<5>(state)), - std::move(std::get<6>(state)), - std::move(std::get<7>(state)), - std::move(std::get<8>(state))); + std::get<6>(state), + std::get<7>(state), + std::get<8>(state)); }); m.class_("LinearOpContext") .def_pickle( @@ -62,6 +63,23 @@ TORCH_LIBRARY(vulkan, m) { return linear_prepack( std::move(std::get<0>(state)), std::move(std::get<1>(state))); }); + m.class_("GruOpContext") + .def_pickle( + // __getstate__ + [](const c10::intrusive_ptr& context) { + return context->unpack(); + }, + // __setstate__ + [](GruOpContext::State state) { + return gru_prepack( + std::move(std::get<0>(state)), + std::get<1>(state), + 
std::get<2>(state), + std::get<3>(state), + std::get<4>(state), + std::get<5>(state), + std::get<6>(state)); + }); } TORCH_LIBRARY(vulkan_prepack, m) { @@ -87,18 +105,33 @@ TORCH_LIBRARY(vulkan_prepack, m) { m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::linear_run(Tensor X, " "__torch__.torch.classes.vulkan.LinearOpContext BW_prepack) -> Tensor Y")); + m.def(TORCH_SELECTIVE_SCHEMA( + "vulkan_prepack::gru_prepack(Tensor[] params_cpu, " + "bool has_biases, " + "int num_layers, " + "float dropout, " + "bool train, " + "bool bidirectional, " + "bool batch_first) " + "-> __torch__.torch.classes.vulkan.GruOpContext")); + m.def(TORCH_SELECTIVE_SCHEMA( + "vulkan_prepack::gru_run(Tensor input_vk, " + "Tensor hx_vk, " + "__torch__.torch.classes.vulkan.GruOpContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); } TORCH_LIBRARY_IMPL(vulkan_prepack, CPU, m) { m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_prepack"), TORCH_FN(conv2d_clamp_prepack)); m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_prepack"), TORCH_FN(conv2d_transpose_clamp_prepack)); m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::linear_prepack"), TORCH_FN(linear_prepack)); + m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::gru_prepack"), TORCH_FN(gru_prepack)); } TORCH_LIBRARY_IMPL(vulkan_prepack, Vulkan, m) { m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_run"), TORCH_FN(conv2d_clamp_run)); m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_run"), TORCH_FN(conv2d_transpose_clamp_run)); m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::linear_run"), TORCH_FN(linear_run)); + m.impl(TORCH_SELECTIVE_NAME("vulkan_prepack::gru_run"), TORCH_FN(gru_run)); } Tensor convolution( diff --git a/aten/src/ATen/native/xnnpack/Convolution.cpp b/aten/src/ATen/native/xnnpack/Convolution.cpp index 3deb352d76ecd9..278e35280c4020 100644 --- a/aten/src/ATen/native/xnnpack/Convolution.cpp +++ b/aten/src/ATen/native/xnnpack/Convolution.cpp @@ -27,7 +27,7 @@ namespace { // TODO: Decouple and improve error handling and messages. bool available( const Tensor& weight, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation, @@ -189,7 +189,7 @@ ContextConv2D create( TORCH_CHECK( available( weight_nhwc, - (bias.has_value() && bias->defined()) ? c10::optional(bias->sizes()) : c10::nullopt, + (bias.has_value() && bias->defined()) ? at::OptionalIntArrayRef(bias->sizes()) : c10::nullopt, padding_expanded, stride_expanded, dilation_expanded, @@ -433,7 +433,7 @@ unpack_prepacked_sizes_conv2d(const IValue& ivalue) { const auto& bias = std::get<1>(tuple); return IValue(std::make_tuple( std::get<0>(tuple).sizes(), - (bias && bias->defined()) ? c10::optional(bias->sizes()) : c10::nullopt, + (bias && bias->defined()) ? 
at::OptionalIntArrayRef(bias->sizes()) : c10::nullopt, std::get<2>(tuple), std::get<3>(tuple), std::get<4>(tuple), @@ -452,7 +452,7 @@ Tensor conv2d_transpose_clamp_run( bool use_convolution2d( const Tensor& input, const Tensor& weight, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation, diff --git a/aten/src/ATen/native/xnnpack/Engine.h b/aten/src/ATen/native/xnnpack/Engine.h index 71ed262297b310..9d5c0e4594acfe 100644 --- a/aten/src/ATen/native/xnnpack/Engine.h +++ b/aten/src/ATen/native/xnnpack/Engine.h @@ -13,7 +13,7 @@ namespace xnnpack { bool use_convolution2d( const Tensor& input, const Tensor& weight, - const c10::optional bias_sizes_opt, + const at::OptionalIntArrayRef bias_sizes_opt, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation, diff --git a/aten/src/ATen/native/xnnpack/Linear.cpp b/aten/src/ATen/native/xnnpack/Linear.cpp index 13fd04aad5a6a9..3f7ae681f95501 100644 --- a/aten/src/ATen/native/xnnpack/Linear.cpp +++ b/aten/src/ATen/native/xnnpack/Linear.cpp @@ -187,7 +187,7 @@ unpack_prepacked_sizes_linear(const IValue& ivalue) { const auto& bias = std::get<1>(tuple); return IValue(std::make_tuple( std::get<0>(tuple).sizes(), - (bias && bias->defined()) ? c10::optional(bias->sizes()) : c10::nullopt)); + (bias && bias->defined()) ? at::OptionalIntArrayRef(bias->sizes()) : c10::nullopt)); } } // namespace linear diff --git a/aten/src/ATen/native/xnnpack/Shim.cpp b/aten/src/ATen/native/xnnpack/Shim.cpp index 89fffa024aeff7..32ddfb4b852557 100644 --- a/aten/src/ATen/native/xnnpack/Shim.cpp +++ b/aten/src/ATen/native/xnnpack/Shim.cpp @@ -31,7 +31,7 @@ bool available() { bool use_convolution2d( const Tensor&, const Tensor&, - const c10::optional, + const at::OptionalIntArrayRef, const IntArrayRef, const IntArrayRef, const IntArrayRef, diff --git a/aten/src/ATen/ops/from_blob.h b/aten/src/ATen/ops/from_blob.h index 558ab57e900fba..f7599e70ea0558 100644 --- a/aten/src/ATen/ops/from_blob.h +++ b/aten/src/ATen/ops/from_blob.h @@ -26,7 +26,7 @@ class TORCH_API TensorMaker { public: using ContextDeleter = DeleterFnPtr; - TensorMaker& strides(optional value) noexcept { + TensorMaker& strides(OptionalIntArrayRef value) noexcept { strides_ = value; return *this; @@ -79,7 +79,7 @@ class TORCH_API TensorMaker { void* data_; IntArrayRef sizes_; - optional strides_{}; + OptionalIntArrayRef strides_{}; optional storage_offset_{}; std::function deleter_{}; std::unique_ptr ctx_{nullptr, detail::noopDelete}; diff --git a/aten/src/ATen/ops/tensor.h b/aten/src/ATen/ops/tensor.h index 3369eaf2502caa..2f72b7ef026379 100644 --- a/aten/src/ATen/ops/tensor.h +++ b/aten/src/ATen/ops/tensor.h @@ -1,6 +1,6 @@ #pragma once #include -#include +#include namespace at { diff --git a/aten/src/ATen/quantized/Quantizer.cpp b/aten/src/ATen/quantized/Quantizer.cpp index aa589819435693..4a1bac8bc4c161 100644 --- a/aten/src/ATen/quantized/Quantizer.cpp +++ b/aten/src/ATen/quantized/Quantizer.cpp @@ -417,4 +417,23 @@ Tensor from_blob_quantized_per_channel_affine( return qtensor; } +Tensor UnknownQuantizer::quantize(const Tensor& tensor) { + TORCH_INTERNAL_ASSERT(false, "cannot call quantize on UnknownQuantizer"); +} +Tensor UnknownQuantizer::dequantize(const Tensor& qtensor) { + TORCH_INTERNAL_ASSERT(false, "cannot call dequantize on UnknownQuantizer"); +} +Tensor& UnknownQuantizer::dequantize_out(Tensor& rtensor, const Tensor& qtensor) { + TORCH_INTERNAL_ASSERT(false, "cannot 
call dequantize_out on UnknownQuantizer"); +} +QScheme UnknownQuantizer::qscheme() const { + TORCH_INTERNAL_ASSERT(false, "cannot call qscheme on UnknownQuantizer"); +} +bool UnknownQuantizer::equalTo(QuantizerPtr other) const{ + TORCH_INTERNAL_ASSERT(false, "cannot call equalTo on UnknownQuantizer"); +} +QuantizerPtr make_unknown_quantizer(ScalarType scalar_type) { + return c10::make_intrusive(scalar_type); +} + } // namespace at diff --git a/aten/src/ATen/quantized/Quantizer.h b/aten/src/ATen/quantized/Quantizer.h index 5d9c7111f19eb0..05bd39b71223a0 100644 --- a/aten/src/ATen/quantized/Quantizer.h +++ b/aten/src/ATen/quantized/Quantizer.h @@ -18,6 +18,23 @@ namespace at { +/** + * UnknownQuantizer is a placeholder quantizer for functions that implement + * quantization in a two step process. First a tensor is allocated but with + * unknown quantizer, and then the quantization kernel decides what the final + * quantizer will be. + */ +struct TORCH_API UnknownQuantizer : public Quantizer { + explicit UnknownQuantizer(ScalarType scalar_type) + : Quantizer(scalar_type) {} + + Tensor quantize(const Tensor& tensor) override; + Tensor dequantize(const Tensor& qtensor) override; + Tensor& dequantize_out(Tensor& rtensor, const Tensor& qtensor) override; + QScheme qscheme() const override; + bool equalTo(QuantizerPtr other) const override; +}; + /** * UniformQuantizer is the parent class for all uniform quantizers. * These quantization scheme will map float value uniformly to @@ -80,7 +97,7 @@ struct TORCH_API PerTensorAffineQuantizer : public AffineQuantizer { return zero_point_; } - bool equalTo(QuantizerPtr other) override { + bool equalTo(QuantizerPtr other) const override { if (!other.get() || other->qscheme() != kPerTensorAffine) { return false; } @@ -139,7 +156,7 @@ struct TORCH_API PerChannelAffineQuantizer : public AffineQuantizer { Tensor dequantize(const Tensor& qtensor) override; Tensor& dequantize_out(Tensor& rtensor, const Tensor& qtensor) override; - bool equalTo(QuantizerPtr other) override { + bool equalTo(QuantizerPtr other) const override { if (!other.get() || other->qscheme() != kPerChannelAffine) { return false; } @@ -190,7 +207,7 @@ struct TORCH_API PerChannelAffineFloatQParamsQuantizer : public PerChannelAffine Tensor dequantize(const Tensor& qtensor) override; Tensor& dequantize_out(Tensor& rtensor, const Tensor& qtensor) override; - bool equalTo(QuantizerPtr other) override { + bool equalTo(QuantizerPtr other) const override { if (!other.get() || other->qscheme() != kPerChannelAffineFloatQParams) { return false; } @@ -222,6 +239,8 @@ TORCH_API QuantizerPtr make_per_channel_affine_quantizer( int64_t axis, ScalarType scalar_type); +TORCH_API QuantizerPtr make_unknown_quantizer(ScalarType scalar_type); + // Create a Quantized Tensor given arguments for normal Tensor and a quantizer TORCH_API Tensor new_qtensor( IntArrayRef sizes, diff --git a/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp b/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp new file mode 100644 index 00000000000000..1a5b4a452592d9 --- /dev/null +++ b/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp @@ -0,0 +1,9 @@ +// ${generated_comment} +${includes} +${native_functions_include} + +${namespace_prologue} + +${native_function_definitions} + +${namespace_epilogue} diff --git a/aten/src/ATen/templates/Functions.h b/aten/src/ATen/templates/Functions.h index 3313b90d51b035..7ff718892d669a 100644 --- a/aten/src/ATen/templates/Functions.h +++ b/aten/src/ATen/templates/Functions.h @@ -62,6 +62,7 @@ 
#include #include #include +#include #include #include #include diff --git a/aten/src/ATen/templates/LazyIr.h b/aten/src/ATen/templates/LazyIr.h new file mode 100644 index 00000000000000..1ee90e66cc6ced --- /dev/null +++ b/aten/src/ATen/templates/LazyIr.h @@ -0,0 +1,19 @@ +#pragma once + +// This file contains autogenerated LazyTensor IR nodes +${lazy_ir_sysinc} +${lazy_ir_inc} + +${namespace_prologue} +using at::operator<<; + +// kNullValue is used to contribute a static hash value any time +// a node has an Optional input that is nullopt. It is important +// to differentiate between HASH(nullopt, something) and HASH(something, nullopt), +// and using kNullValue in the hash function in the order of arguments +// serves this purpose. +static const torch::lazy::Value kNullValue = torch::lazy::Value(); + +${ir_declarations} + +${namespace_epilogue} diff --git a/aten/src/ATen/templates/NativeMetaFunctions.h b/aten/src/ATen/templates/NativeMetaFunctions.h index c83830f1eb1087..8e5d165fb70aa1 100644 --- a/aten/src/ATen/templates/NativeMetaFunctions.h +++ b/aten/src/ATen/templates/NativeMetaFunctions.h @@ -3,6 +3,7 @@ // ${generated_comment} #include +#include #include #include diff --git a/aten/src/ATen/templates/Operator.h b/aten/src/ATen/templates/Operator.h index 15434af15bae33..ee51847a4369fb 100644 --- a/aten/src/ATen/templates/Operator.h +++ b/aten/src/ATen/templates/Operator.h @@ -3,6 +3,7 @@ // ${generated_comment} #include +#include #include #include @@ -16,6 +17,7 @@ template class optional; template class List; +class ITensorListRef; class Stream; class Scalar; struct Storage; @@ -29,6 +31,7 @@ class Tensor; struct Dimname; struct Generator; using TensorList = c10::ArrayRef; +using ITensorListRef = c10::ITensorListRef; using DimnameList = c10::ArrayRef; using c10::Stream; using c10::Storage; diff --git a/aten/src/ATen/templates/Operators.h b/aten/src/ATen/templates/Operators.h index 3dc55a677106e3..a5a52ed1896d6b 100644 --- a/aten/src/ATen/templates/Operators.h +++ b/aten/src/ATen/templates/Operators.h @@ -17,6 +17,7 @@ and see NOTE [TORCH_ASSERT_ONLY_METHOD_OPERATORS]. 
#endif +#include #include #include #include diff --git a/aten/src/ATen/templates/RegisterDispatchKey.cpp b/aten/src/ATen/templates/RegisterDispatchKey.cpp index f9fa6ab244022d..df00c0d0e4a321 100644 --- a/aten/src/ATen/templates/RegisterDispatchKey.cpp +++ b/aten/src/ATen/templates/RegisterDispatchKey.cpp @@ -62,12 +62,12 @@ namespace { ${dispatch_anonymous_definitions} -TORCH_LIBRARY_IMPL(aten, ${DispatchKey}, m) { - ${dispatch_registrations} -} +${static_init_dispatch_registrations} } // anonymous namespace +${deferred_dispatch_registrations} + namespace ${dispatch_namespace} { ${dispatch_namespaced_definitions} diff --git a/aten/src/ATen/templates/TensorBody.h b/aten/src/ATen/templates/TensorBody.h index aa85ac1d30496f..c1f377b09b248c 100644 --- a/aten/src/ATen/templates/TensorBody.h +++ b/aten/src/ATen/templates/TensorBody.h @@ -32,8 +32,10 @@ #include #include #include +#include #include + #include namespace c10{ @@ -340,6 +342,10 @@ class TORCH_API Tensor: public TensorBase { return to(options().device(DeviceType::Metal), /*non_blocking*/ false, /*copy*/ false); } + Tensor meta() const { + return to(options().device(DeviceType::Meta), /*non_blocking*/ false, /*copy*/ false); + } + // ~~~~~ Autograd API ~~~~~ /// \fn bool is_leaf() const; diff --git a/aten/src/ATen/templates/TensorMethods.cpp b/aten/src/ATen/templates/TensorMethods.cpp index 29a43a657bb325..be9d94406e74cb 100644 --- a/aten/src/ATen/templates/TensorMethods.cpp +++ b/aten/src/ATen/templates/TensorMethods.cpp @@ -15,7 +15,7 @@ namespace at { return this->unsafeGetTensorImpl()->data_ptr_impl(); \ } - AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(DEFINE_CAST) + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_CAST) AT_FORALL_QINT_TYPES(DEFINE_CAST) #undef DEFINE_CAST @@ -25,7 +25,7 @@ namespace at { return item().to##name(); \ } - AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(DEFINE_ITEM) + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_ITEM) #undef DEFINE_ITEM } //namespace at diff --git a/aten/src/ATen/test/cuda_atomic_ops_test.cu b/aten/src/ATen/test/cuda_atomic_ops_test.cu index 54d43ffec019cf..d5d261440064bf 100644 --- a/aten/src/ATen/test/cuda_atomic_ops_test.cu +++ b/aten/src/ATen/test/cuda_atomic_ops_test.cu @@ -1,6 +1,7 @@ #include #include #include +#include #include #include @@ -25,6 +26,24 @@ __global__ void mul_test_kernel(T * a, T * sum) { gpuAtomicMul(&sum[idx], a[idx]); } +template +__global__ void max_test_kernel(T * a, T * max) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int a_idx = (tid) % (arraysize * factor); + int idx = a_idx / factor; + + gpuAtomicMax(&max[idx], a[a_idx]); +} + +template +__global__ void min_test_kernel(T * a, T * min) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int a_idx = (tid) % (arraysize * factor); + int idx = a_idx / factor; + + gpuAtomicMin(&min[idx], a[a_idx]); +} + template void test_atomic_add() { dim3 dimBlock(blocksize, 1); @@ -75,7 +94,7 @@ void test_atomic_mul() { for (int i = 0; i < arraysize; ++i) { a[i] = 2; sum[i] = 2; - answer[i] = pow(sum[i], static_cast(factor)); + answer[i] = pow(sum[i], static_cast(factor + 1)); } cudaMalloc((void**)&ad, arraysize * sizeof(T)); @@ -97,7 +116,88 @@ void test_atomic_mul() { cudaFree(sumd); } +template +void test_atomic_max() { + dim3 dimBlock(blocksize, 1); + dim3 dimGrid(1, 1); + + T *ad, *sumd; + + std::vector a(arraysize * factor); + std::vector sum(arraysize); + std::vector answer(arraysize); + + int j; + for (int i = 0; i < arraysize * factor; ++i) { + a[i] = i; + if (i % factor == 0) { + j = 
i / factor; + sum[j] = std::numeric_limits::lowest(); + answer[j] = (j + 1) * factor - 1; + } + } + + cudaMalloc((void**)&ad, arraysize * factor * sizeof(T)); + cudaMalloc((void**)&sumd, arraysize * sizeof(T)); + + cudaMemcpy(ad, a.data(), arraysize * factor * sizeof(T), cudaMemcpyHostToDevice); + cudaMemcpy(sumd, sum.data(), arraysize * sizeof(T), cudaMemcpyHostToDevice); + + max_test_kernel<<>>(ad, sumd); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + + cudaMemcpy(sum.data(), sumd, arraysize * sizeof(T), cudaMemcpyDeviceToHost); + + for (int i = 0; i < arraysize; ++i) { + ASSERT_EQ(sum[i], answer[i]) << typeid(T).name(); + } + + cudaFree(ad); + cudaFree(sumd); +} + +template +void test_atomic_min() { + dim3 dimBlock(blocksize, 1); + dim3 dimGrid(1, 1); + + T *ad, *sumd; + + std::vector a(arraysize * factor); + std::vector sum(arraysize); + std::vector answer(arraysize); + + int j; + for (int i = 0; i < arraysize * factor; ++i) { + a[i] = i; + if (i % factor == 0) { + j = i / factor; + sum[j] = std::numeric_limits::max(); + answer[j] = j * factor; + } + } + + cudaMalloc((void**)&ad, arraysize * factor * sizeof(T)); + cudaMalloc((void**)&sumd, arraysize * sizeof(T)); + + cudaMemcpy(ad, a.data(), arraysize * factor * sizeof(T), cudaMemcpyHostToDevice); + cudaMemcpy(sumd, sum.data(), arraysize * sizeof(T), cudaMemcpyHostToDevice); + + min_test_kernel<<>>(ad, sumd); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + + cudaMemcpy(sum.data(), sumd, arraysize * sizeof(T), cudaMemcpyDeviceToHost); + + for (int i = 0; i < arraysize; ++i) { + ASSERT_EQ(sum[i], answer[i]) << typeid(T).name(); + } + + cudaFree(ad); + cudaFree(sumd); +} + TEST(TestAtomicOps, TestAtomicAdd) { + if (!at::cuda::is_available()) return; test_atomic_add(); test_atomic_add(); test_atomic_add(); @@ -113,8 +213,25 @@ TEST(TestAtomicOps, TestAtomicAdd) { } TEST(TestAtomicOps, DISABLED_ON_WINDOWS(TestAtomicMul)) { + if (!at::cuda::is_available()) return; test_atomic_mul(); test_atomic_mul(); test_atomic_mul(); test_atomic_mul(); } + +TEST(TestAtomicOps, DISABLED_ON_WINDOWS(TestAtomicMax)) { + if (!at::cuda::is_available()) return; + test_atomic_max(); + test_atomic_max(); + test_atomic_max(); + test_atomic_max(); +} + +TEST(TestAtomicOps, DISABLED_ON_WINDOWS(TestAtomicMin)) { + if (!at::cuda::is_available()) return; + test_atomic_min(); + test_atomic_min(); + test_atomic_min(); + test_atomic_min(); +} diff --git a/aten/src/ATen/test/cuda_half_test.cu b/aten/src/ATen/test/cuda_half_test.cu index a55d9458e85131..aa1644c94b764f 100644 --- a/aten/src/ATen/test/cuda_half_test.cu +++ b/aten/src/ATen/test/cuda_half_test.cu @@ -76,6 +76,13 @@ __device__ void test(){ assert(::abs(::isnan(Half(0.0)) - ::isnan(0.0f)) <= threshold); assert(::abs(::isinf(Half(0.0)) - ::isinf(0.0f)) <= threshold); #endif + + // test complex<32> + Half real = 3.0f; + Half imag = -10.0f; + auto complex = c10::complex(real, imag); + assert(complex.real() == real); + assert(complex.imag() == imag); } __global__ void kernel(){ diff --git a/aten/src/ATen/test/half_test.cpp b/aten/src/ATen/test/half_test.cpp index 652823e8e9b1e2..02ccb8b6ce5dc3 100644 --- a/aten/src/ATen/test/half_test.cpp +++ b/aten/src/ATen/test/half_test.cpp @@ -164,3 +164,11 @@ TEST(TestHalf, CommonMath) { assert(std::abs(std::isinf(Half(0.0)) - std::isinf(0.0f)) <= threshold); #endif } + +TEST(TestHalf, ComplexHalf) { + Half real = 3.0f; + Half imag = -10.0f; + auto complex = c10::complex(real, imag); + assert(complex.real() == real); + assert(complex.imag() == imag); +} diff --git 
a/aten/src/ATen/test/vulkan_api_test.cpp b/aten/src/ATen/test/vulkan_api_test.cpp index 7001677d8dd5e2..d792613411eaf7 100644 --- a/aten/src/ATen/test/vulkan_api_test.cpp +++ b/aten/src/ATen/test/vulkan_api_test.cpp @@ -2,6 +2,7 @@ #include #include +#include #include // TODO: These functions should move to a common place. @@ -64,7 +65,7 @@ void showRtol(const at::Tensor& a, const at::Tensor& b) { } -static void gen_allpermutations(std::vector>& out, std::vector in, int i) { +static void gen_allpermutations(std::vector>& out, std::vector in, unsigned i) { // generate all permutations of a given dims if (i == in.size()) { out.push_back(in); @@ -137,6 +138,31 @@ static void clone_test(const std::vector& size, c10::optional +inline std::vector makeStack(Inputs&&... inputs) { + return {std::forward(inputs)...}; +} + +template +inline std::vector callOpByHandle( + const c10::OperatorHandle& op, + Args... args) { + auto stack = makeStack(std::forward(args)...); + c10::Dispatcher::singleton().callBoxed(op, &stack); + return stack; +} + +template +inline std::vector callOpByName( + const char* func_name, + const char* overload_name, + Args... args) { + const c10::optional op_handle = + c10::Dispatcher::singleton().findSchema({func_name, overload_name}); + assert(op_handle.has_value()); + return callOpByHandle(op_handle.value(), std::forward(args)...); +} + } // namespace namespace { @@ -2962,6 +2988,203 @@ TEST(VulkanAPITest, gru_invalidinputs_exceptions) { has_biases, num_layers, 1.0, train, bidirectional, batch_first); }, ::c10::Error); } + +TEST(VulkanAPITest, gru_prepack_success) { + // Guard + if (!at::is_vulkan_available()) { + return; + } + + // Arrange + const int H_in = 384; // input_size + const int H_out = 384; // hidden_size + const int num_layers = 2; + const double gru_dropout = .0; + const bool has_biases = true; + const bool train = false; + const bool bidirectional = false; + const bool batch_first = true; + const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + + c10::List weight_ih_l; // shape (3 * hidden_size, input_size) + c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) + c10::List bias_ih_l; // shape (3 * hidden_size) + c10::List bias_hh_l; // shape (3 * hidden_size) + for (int i = 0; i < num_layers; ++i) { + weight_ih_l.emplace_back(at::rand({3 * H_out, H_in}, at::device(at::kCPU).dtype(at::kFloat))); + weight_hh_l.emplace_back(at::rand({3 * H_out, H_out}, at::device(at::kCPU).dtype(at::kFloat))); + bias_ih_l.emplace_back(at::rand({3 * H_out}, at::device(at::kCPU).dtype(at::kFloat))); + bias_hh_l.emplace_back(at::rand({3 * H_out}, at::device(at::kCPU).dtype(at::kFloat))); + } + + // put this guard here to run inference inststead of training + // to avoid the following error: + // C++ exception with description "0INTERNAL ASSERT FAILED at "xplat/caffe2/aten/src/ATen/core/boxing/KernelFunction.cpp":31, please report a bug to PyTorch. aten::gru.input has kernels registered to both CompositeImplicitAutograd and a backend mapped to AutogradOther. This makes the backend kernel unreachable; the dispatcher will always prefer the CompositeImplicitAutograd lowering (see Note [Ambiguity in AutogradOther kernel]). If you want to override CompositeImplicitAutograd, please open an issue to request a dedicated Autograd dispatch key for the backend. 
+ // If you only want to run inference instead of training, add `c10::InferenceMode mode;` before model.forward(). Note this guard is only available in C++ but not Python at present. + c10::InferenceMode mode; + + // Act + const auto out_cpu = at::gru(in_cpu, h0_cpu, + { weight_ih_l[0], weight_hh_l[0], bias_ih_l[0], bias_hh_l[0], weight_ih_l[1], weight_hh_l[1], bias_ih_l[1], bias_hh_l[1] }, + has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); + + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); + auto out_vulkan = callOpByName( + "vulkan_prepack::gru_run", + "", + in_cpu.vulkan(), h0_cpu.vulkan(), prepack[0]); + + auto cpu_output = std::get<0>(out_cpu); + auto cpu_hidden = std::get<1>(out_cpu); + auto vulkan_output = out_vulkan[0].toTensor(); + auto vulkan_hidden = out_vulkan[1].toTensor(); + + // Assert + const auto check_output = almostEqual(cpu_output, vulkan_output.cpu()); + if (!check_output) { + showRtol(cpu_output, vulkan_output.cpu()); + } + ASSERT_TRUE(check_output); + + const auto check_hidden = almostEqual(cpu_hidden, vulkan_hidden.cpu()); + if (!check_hidden) { + showRtol(cpu_hidden, vulkan_hidden.cpu()); + } + ASSERT_TRUE(check_hidden); +} + +TEST(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { + // Guard + if (!at::is_vulkan_available()) { + return; + } + + // Arrange + const int H_in = 384; // input_size + const int H_out = 384; // hidden_size + const int num_layers = 2; + const double gru_dropout = .0; + const bool has_biases = true; + const bool train = false; + const bool bidirectional = false; + const bool batch_first = true; + const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + + c10::List weight_ih_l; // shape (3 * hidden_size, input_size) + c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) + c10::List bias_ih_l; // shape (3 * hidden_size) + c10::List bias_hh_l; // shape (3 * hidden_size) + for (int i = 0; i < num_layers; ++i) { + weight_ih_l.emplace_back(at::rand({3 * H_out, H_in}, at::device(at::kCPU).dtype(at::kFloat))); + weight_hh_l.emplace_back(at::rand({3 * H_out, H_out}, at::device(at::kCPU).dtype(at::kFloat))); + bias_ih_l.emplace_back(at::rand({3 * H_out}, at::device(at::kCPU).dtype(at::kFloat))); + bias_hh_l.emplace_back(at::rand({3 * H_out}, at::device(at::kCPU).dtype(at::kFloat))); + } + + // put this guard here to run inference inststead of training + // to avoid the following error: + // C++ exception with description "0INTERNAL ASSERT FAILED at "xplat/caffe2/aten/src/ATen/core/boxing/KernelFunction.cpp":31, please report a bug to PyTorch. aten::gru.input has kernels registered to both CompositeImplicitAutograd and a backend mapped to AutogradOther. This makes the backend kernel unreachable; the dispatcher will always prefer the CompositeImplicitAutograd lowering (see Note [Ambiguity in AutogradOther kernel]). If you want to override CompositeImplicitAutograd, please open an issue to request a dedicated Autograd dispatch key for the backend. + // If you only want to run inference instead of training, add `c10::InferenceMode mode;` before model.forward(). Note this guard is only available in C++ but not Python at present. 
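(Aside on the guard used next: `c10::InferenceMode` is an RAII object, so enabling it only for the scope that runs the model is enough to route dispatch away from the ambiguous Autograd path described in the comment above. A minimal sketch of that pattern, using a hypothetical `run_inference` helper rather than anything from this test, might look like:

    #include <ATen/ATen.h>
    #include <c10/core/InferenceMode.h>

    // While `guard` is alive, autograd bookkeeping is skipped, so ops are
    // dispatched to their backend kernels instead of the Autograd lowering
    // that triggers the error quoted above.
    at::Tensor run_inference(const at::Tensor& input, const at::Tensor& weight) {
      c10::InferenceMode guard; // enabled for this scope only
      return at::relu(at::matmul(input, weight));
    }
)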
+ c10::InferenceMode mode; + + // Act: incorrect # of weights/biases + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1) }), + has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); + }, ::c10::Error); + + // Act: non-3D input tensor + EXPECT_THROW({ + const auto in_cpu_2d = at::rand({1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); + auto out_vulkan = callOpByName( + "vulkan_prepack::gru_run", + "", + in_cpu_2d.vulkan(), h0_cpu.vulkan(), prepack[0]); + }, ::c10::Error); + + // Act: non-3D hidden tensor + EXPECT_THROW({ + const auto h0_cpu_2d = at::rand({num_layers, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); + auto out_vulkan = callOpByName( + "vulkan_prepack::gru_run", + "", + in_cpu.vulkan(), h0_cpu_2d.vulkan(), prepack[0]); + }, ::c10::Error); + + // Act: has_biases should be true + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + false, num_layers, gru_dropout, train, bidirectional, batch_first); + }, ::c10::Error); + + // Act: train should be false + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, true, bidirectional, batch_first); + }, ::c10::Error); + + // Act: bidirectional should be false + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, train, true, batch_first); + }, ::c10::Error); + + // Act: batch_first should be true + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, gru_dropout, train, bidirectional, false); + }, ::c10::Error); + + // Act: dropout should be 0.0 + EXPECT_THROW({ + auto prepack = callOpByName( + "vulkan_prepack::gru_prepack", + "", + std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), + weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), + has_biases, num_layers, 1.0, train, bidirectional, batch_first); + }, ::c10::Error); +} + } // 
namespace #endif /* USE_VULKAN_API */ diff --git a/aten/tools/run_tests.sh b/aten/tools/run_tests.sh index 4a724fa9400856..5b0c02c2846a46 100755 --- a/aten/tools/run_tests.sh +++ b/aten/tools/run_tests.sh @@ -64,6 +64,9 @@ fi if [[ -x ./cuda_cub_test ]]; then ./cuda_cub_test fi +if [[ -x ./cuda_atomic_ops_test ]]; then + ./cuda_atomic_ops_test +fi if [ "$VALGRIND" == "ON" ]; then valgrind --suppressions="$VALGRIND_SUP" --error-exitcode=1 ./basic --gtest_filter='-*CUDA' if [[ -x ./tensor_interop_test ]]; then diff --git a/benchmarks/cpp/nvfuser/CMakeLists.txt b/benchmarks/cpp/nvfuser/CMakeLists.txt index b566e6a359e907..3779616ee969f2 100644 --- a/benchmarks/cpp/nvfuser/CMakeLists.txt +++ b/benchmarks/cpp/nvfuser/CMakeLists.txt @@ -10,6 +10,8 @@ if(USE_CUDA) instance_norm.cpp layer_norm.cpp layer_norm_backward.cpp + rms_norm.cpp + rms_norm_backward.cpp lstm_cell.cpp reduction.cpp softmax.cpp diff --git a/benchmarks/cpp/nvfuser/instance_norm.cpp b/benchmarks/cpp/nvfuser/instance_norm.cpp index 007291d75f5f13..2c0cee0b06c75c 100644 --- a/benchmarks/cpp/nvfuser/instance_norm.cpp +++ b/benchmarks/cpp/nvfuser/instance_norm.cpp @@ -14,12 +14,18 @@ using namespace torch::jit::fuser::cuda; -static void setupInstanceNorm(Fusion* fusion, DataType dtype) { +static void setupInstanceNorm( + Fusion* fusion, + DataType dtype, + bool channels_last_3d = false) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); FusionGuard fg(fusion); auto input = makeContigTensor(4, dtype); + if (channels_last_3d) { + input = makeContigTensor(5, dtype); + } auto weight = makeContigTensor(1, dtype); auto bias = makeContigTensor(1, dtype); auto running_mean = makeContigTensor(1, DataType::Float); @@ -51,7 +57,8 @@ static void setupInstanceNorm(Fusion* fusion, DataType dtype) { running_var, kTraining, momentum_ptr, - eps_ptr); + eps_ptr, + channels_last_3d); auto output = unaryOp(UnaryOpType::Relu, norm.output); @@ -67,7 +74,8 @@ static void setupInstanceNorm(Fusion* fusion, DataType dtype) { static void NvFuserScheduler_InstanceNorm( benchmark::State& benchmark_state, FusionExecutorCache* fusion_executor_cache, - DataType dtype) { + DataType dtype, + bool channels_last_3d = false) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); std::vector input_shape{ @@ -76,17 +84,25 @@ static void NvFuserScheduler_InstanceNorm( benchmark_state.range(1), benchmark_state.range(1)}; + std::vector input_shape_3d{ + benchmark_state.range(0), + benchmark_state.range(1), + benchmark_state.range(1), + benchmark_state.range(1), + benchmark_state.range(2)}; + // inputs at::manual_seed(0); auto options = at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0); auto fp32_options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0); - at::Tensor at_x = at::randn(input_shape, options); - at::Tensor at_weight = at::ones({input_shape[1]}, options); - at::Tensor at_bias = at::zeros({input_shape[1]}, options); - at::Tensor at_mean = at::zeros({input_shape[1]}, fp32_options); - at::Tensor at_var = at::ones({input_shape[1]}, fp32_options); + at::Tensor at_x = + at::randn(channels_last_3d ? 
input_shape_3d : input_shape, options); + at::Tensor at_weight = at::ones({benchmark_state.range(2)}, options); + at::Tensor at_bias = at::zeros({benchmark_state.range(2)}, options); + at::Tensor at_mean = at::zeros({benchmark_state.range(2)}, fp32_options); + at::Tensor at_var = at::ones({benchmark_state.range(2)}, fp32_options); std::vector aten_inputs = { at_x, at_weight, at_bias, at_mean, at_var}; @@ -94,9 +110,11 @@ static void NvFuserScheduler_InstanceNorm( runBenchmarkIterations(benchmark_state, fusion_executor_cache, aten_inputs); - const size_t kSize = - input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; - const size_t kChannels = input_shape[1]; + const size_t kSize = channels_last_3d + ? input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * + input_shape[4] + : input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; + const size_t kChannels = benchmark_state.range(2); // Read: x, weight, bias, running_mean, running_var // Write: y, running_mean, running_var @@ -108,7 +126,8 @@ static void NvFuserScheduler_InstanceNorm( static void Baseline_InstanceNorm( benchmark::State& benchmark_state, - DataType dtype) { + DataType dtype, + bool channels_last_3d = false) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); std::vector input_shape{ @@ -116,6 +135,14 @@ static void Baseline_InstanceNorm( benchmark_state.range(2), benchmark_state.range(1), benchmark_state.range(1)}; + std::vector input_shape_3d{ + benchmark_state.range(0), + benchmark_state.range(2), + benchmark_state.range(1), + benchmark_state.range(1), + benchmark_state.range(1), + }; + const float kMomentum = 0.1; const float kEps = 1e-5; const auto aten_dtype = data_type_to_aten(dtype); @@ -126,10 +153,15 @@ static void Baseline_InstanceNorm( at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0); at::Tensor at_x = at::randn(input_shape, options); - at::Tensor at_weight = at::ones({input_shape[1]}, options); - at::Tensor at_bias = at::zeros({input_shape[1]}, options); - at::Tensor at_mean = at::zeros({input_shape[1]}, fp32_options); - at::Tensor at_var = at::ones({input_shape[1]}, fp32_options); + if (channels_last_3d) { + at_x = at::randn( + input_shape_3d, + options.memory_format(c10::MemoryFormat::ChannelsLast3d)); + } + at::Tensor at_weight = at::ones({benchmark_state.range(2)}, options); + at::Tensor at_bias = at::zeros({benchmark_state.range(2)}, options); + at::Tensor at_mean = at::zeros({benchmark_state.range(2)}, fp32_options); + at::Tensor at_var = at::ones({benchmark_state.range(2)}, fp32_options); auto ato_weight = c10::optional(at_weight); auto ato_bias = c10::optional(at_bias); @@ -159,9 +191,11 @@ static void Baseline_InstanceNorm( cudaDeviceSynchronize(); } - const size_t kSize = - input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; - const size_t kChannels = input_shape[1]; + const size_t kSize = channels_last_3d + ? 
input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * + input_shape[4] + : input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; + const size_t kChannels = benchmark_state.range(2); // Read: x, weight, bias, running_mean, running_var // Write: y, running_mean, running_var @@ -181,6 +215,11 @@ static void Baseline_InstanceNorm_fp16(benchmark::State& benchmark_state) { Baseline_InstanceNorm(benchmark_state, DataType::Half); } +static void Baseline_InstanceNorm_fp32_channels_last_3d( + benchmark::State& benchmark_state) { + Baseline_InstanceNorm(benchmark_state, DataType::Float, true); +} + //------------------------------------------------------------------------------ NVFUSER_BENCHMARK_DEFINE( @@ -195,6 +234,43 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm_fp32) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_InstanceNorm3d_channels_last_fp32, + setupInstanceNorm, + NvFuserScheduler_InstanceNorm, + DataType::Float, + true); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32) + ->RangeMultiplier(2) + ->Ranges({{1, 8}, {128, 128}, {32, 32}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32) + ->RangeMultiplier(2) + ->Ranges({{1, 8}, {64, 64}, {64, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32) + ->RangeMultiplier(2) + ->Ranges({{1, 8}, {32, 32}, {128, 128}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32) + ->RangeMultiplier(2) + ->Ranges({{1, 8}, {16, 16}, {256, 256}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32) + ->RangeMultiplier(2) + ->Ranges({{1, 8}, {4, 8}, {320, 320}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + NVFUSER_BENCHMARK_DEFINE( NvFuserScheduler_InstanceNorm_fp16, setupInstanceNorm, @@ -220,4 +296,28 @@ BENCHMARK(Baseline_InstanceNorm_fp16) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); +BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d) + ->RangeMultiplier(2) + ->Ranges({{2, 8}, {128, 128}, {32, 32}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d) + ->RangeMultiplier(2) + ->Ranges({{2, 8}, {64, 64}, {64, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d) + ->RangeMultiplier(2) + ->Ranges({{2, 8}, {16, 16}, {256, 256}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d) + ->RangeMultiplier(2) + ->Ranges({{2, 8}, {4, 8}, {320, 320}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + //------------------------------------------------------------------------------ diff --git a/benchmarks/cpp/nvfuser/layer_norm.cpp b/benchmarks/cpp/nvfuser/layer_norm.cpp index 7500ac8525b6b5..bdbc7ec6ac0a8b 100644 --- a/benchmarks/cpp/nvfuser/layer_norm.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm.cpp @@ -46,8 +46,8 @@ static void setupLayerNorm(Fusion* fusion, DataType dtype) { auto output = layer_norm_results.output; - if (dtype == DataType::Half) { - output = castOp(DataType::Half, output); + if (dtype != DataType::Float) { + output = castOp(dtype, output); } fusion->addOutput(output); diff --git 
a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp index 045465e712539f..fe95c01048f2b4 100644 --- a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp @@ -61,13 +61,12 @@ static void setupLayerNorm_BWD(Fusion* fusion, DataType dtype) { auto layer_norm_results = layer_norm_backward( grad_out, input, {1}, mean, rstd, weight, bias, {true, true, true}); - if (dtype == DataType::Half) { + if (dtype != DataType::Float) { layer_norm_results.grad_input = - castOp(DataType::Half, layer_norm_results.grad_input); - layer_norm_results.grad_bias = - castOp(DataType::Half, layer_norm_results.grad_bias); + castOp(dtype, layer_norm_results.grad_input); + layer_norm_results.grad_bias = castOp(dtype, layer_norm_results.grad_bias); layer_norm_results.grad_weight = - castOp(DataType::Half, layer_norm_results.grad_weight); + castOp(dtype, layer_norm_results.grad_weight); } fusion->addOutput(layer_norm_results.grad_input); diff --git a/benchmarks/cpp/nvfuser/rms_norm.cpp b/benchmarks/cpp/nvfuser/rms_norm.cpp new file mode 100644 index 00000000000000..9c46896366ccf0 --- /dev/null +++ b/benchmarks/cpp/nvfuser/rms_norm.cpp @@ -0,0 +1,171 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include "utils.h" + +using namespace torch::jit::fuser::cuda; + +//------------------------------------------------------------------------------ + +static void setupRMSNorm(Fusion* fusion, DataType dtype) { + TORCH_INTERNAL_ASSERT( + dtype == DataType::Float || dtype == DataType::Half || + dtype == DataType::BFloat16); + + FusionGuard fg(fusion); + + const int kReductionAxis = 2; + const float kEps = 1e-6; + + Double* eps_ptr = IrBuilder::create(kEps); + + // setup fusion + auto input = makeContigTensor(3, dtype); + auto weight = makeContigTensor(1, dtype); + + fusion->addInput(input); + fusion->addInput(weight); + + if (dtype == DataType::Half) { + input = castOp(DataType::Float, input); + weight = castOp(DataType::Float, weight); + } + + auto rms_norm_results = rms_norm(input, 1, weight, eps_ptr); + + auto output = rms_norm_results.output; + + if (dtype != DataType::Float) { + output = castOp(dtype, output); + } + + fusion->addOutput(output); +} + +static void NvFuserScheduler_RMSNorm( + benchmark::State& benchmark_state, + FusionExecutorCache* fusion_executor_cache, + DataType dtype) { + TORCH_INTERNAL_ASSERT( + dtype == DataType::Float || dtype == DataType::Half || + dtype == DataType::BFloat16); + + std::vector input_shape{8, benchmark_state.range(0), 1024}; + const float kEps = 1e-6; + + // inputs + at::manual_seed(0); + auto options = + at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0); + at::Tensor input = at::randn(input_shape, options); + at::Tensor weight = at::randn({input_shape[2]}, options); + + std::vector aten_inputs({input, weight}); + + runBenchmarkIterations(benchmark_state, fusion_executor_cache, aten_inputs); + + benchmark_state.SetBytesProcessed( + int64_t(benchmark_state.iterations()) * + (2 * input.numel() + weight.numel()) * int64_t(dataTypeSize(dtype))); +} + +//------------------------------------------------------------------------------ + +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_fp32, + setupRMSNorm, + NvFuserScheduler_RMSNorm, + DataType::Float); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp32) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + 
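For context on what the benchmarks above measure: RMS norm scales every element of a reduction slice by the reciprocal root-mean-square of that slice, y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i]. A standalone reference sketch over a single row (an illustration of the math only, not the nvfuser implementation) could be:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Reference RMS norm for one row: rms(x) = sqrt(mean(x^2) + eps),
    // y[i] = x[i] / rms(x) * w[i]. The fusions above compute this per
    // slice of the reduction axis on the GPU.
    std::vector<float> rms_norm_row(
        const std::vector<float>& x,
        const std::vector<float>& w,
        float eps = 1e-6f) {
      double sum_sq = 0.0;
      for (float v : x) {
        sum_sq += static_cast<double>(v) * v;
      }
      const float rrms =
          1.0f / std::sqrt(static_cast<float>(sum_sq / x.size()) + eps);
      std::vector<float> y(x.size());
      for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * rrms * w[i];
      }
      return y;
    }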
+NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp32) + ->RangeMultiplier(2) + ->Ranges({{18, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp32) + ->RangeMultiplier(2) + ->Ranges({{22, 44}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp32) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_fp16, + setupRMSNorm, + NvFuserScheduler_RMSNorm, + DataType::Half); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp16) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp16) + ->RangeMultiplier(2) + ->Ranges({{18, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp16) + ->RangeMultiplier(2) + ->Ranges({{22, 44}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_fp16) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_bf16, + setupRMSNorm, + NvFuserScheduler_RMSNorm, + DataType::BFloat16); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_bf16) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_bf16) + ->RangeMultiplier(2) + ->Ranges({{18, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_bf16) + ->RangeMultiplier(2) + ->Ranges({{22, 44}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_bf16) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); diff --git a/benchmarks/cpp/nvfuser/rms_norm_backward.cpp b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp new file mode 100644 index 00000000000000..3bd66b412b97ea --- /dev/null +++ b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp @@ -0,0 +1,165 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include "utils.h" + +using namespace torch::jit::fuser::cuda; + +//------------------------------------------------------------------------------ + +static void setupRMSNorm_BWD(Fusion* fusion, DataType dtype) { + FusionGuard fg(fusion); + + TORCH_INTERNAL_ASSERT( + dtype == DataType::Float || dtype == DataType::Half || + dtype == DataType::BFloat16); + + const int kReductionAxis = 2; + Double* eps_ptr = IrBuilder::create(1e-6); + + // setup fusion + auto grad_out = makeContigTensor(3, dtype); + auto input = makeContigTensor(3, dtype); + auto weight = makeContigTensor(1, dtype); + auto rstd = TensorViewBuilder() + .contiguity({false, false, false}) + .shape({-1, -1, 1}) + .dtype(dtype) + .build(); + + fusion->addInput(grad_out); + fusion->addInput(input); + fusion->addInput(weight); + fusion->addInput(rstd); + + if (dtype == DataType::Half) { + grad_out = castOp(DataType::Float, grad_out); + input = castOp(DataType::Float, input); + weight = castOp(DataType::Float, weight); + rstd = castOp(DataType::Float, rstd); + } + + auto rms_norm_results = + rms_norm_backward(grad_out, input, {1}, rstd, weight, {true, true, true}); + + if (dtype != DataType::Float) { + rms_norm_results.grad_input = castOp(dtype, 
rms_norm_results.grad_input); + rms_norm_results.grad_weight = castOp(dtype, rms_norm_results.grad_weight); + } + + fusion->addOutput(rms_norm_results.grad_input); + fusion->addOutput(rms_norm_results.grad_weight); +} + +static void NvFuserScheduler_RMSNorm_BWD( + benchmark::State& benchmark_state, + FusionExecutorCache* fusion_executor_cache, + DataType dtype) { + TORCH_INTERNAL_ASSERT( + dtype == DataType::Float || dtype == DataType::Half || + dtype == DataType::BFloat16); + + std::vector input_shape{8, benchmark_state.range(0), 1024}; + + // inputs + at::manual_seed(0); + auto options = + at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0); + at::Tensor grad_out = at::randn(input_shape, options); + at::Tensor input = at::randn(input_shape, options); + at::Tensor weight = at::randn({input_shape[2]}, options); + at::Tensor rstd = at::randn({input_shape[0], input_shape[1], 1}, options); + + std::vector aten_inputs({grad_out, input, weight, rstd}); + + runBenchmarkIterations(benchmark_state, fusion_executor_cache, aten_inputs); + + benchmark_state.SetBytesProcessed( + int64_t(benchmark_state.iterations()) * + (3 * input.numel() + weight.numel() + rstd.numel()) * + int64_t(dataTypeSize(dtype))); +} + +//------------------------------------------------------------------------------ + +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_BWD_fp32, + setupRMSNorm_BWD, + NvFuserScheduler_RMSNorm_BWD, + DataType::Float); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp32) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp32) + ->RangeMultiplier(2) + ->Ranges({{28, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp32) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_BWD_fp16, + setupRMSNorm_BWD, + NvFuserScheduler_RMSNorm_BWD, + DataType::Half); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp16) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp16) + ->RangeMultiplier(2) + ->Ranges({{28, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_fp16) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_DEFINE( + NvFuserScheduler_RMSNorm_BWD_bf16, + setupRMSNorm_BWD, + NvFuserScheduler_RMSNorm_BWD, + DataType::BFloat16); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_bf16) + ->RangeMultiplier(2) + ->Ranges({{16, 64}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_bf16) + ->RangeMultiplier(2) + ->Ranges({{28, 56}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); + +NVFUSER_BENCHMARK_RUN(NvFuserScheduler_RMSNorm_BWD_bf16) + ->RangeMultiplier(2) + ->Ranges({{24, 48}}) + ->Unit(benchmark::kMicrosecond) + ->UseManualTime(); diff --git a/benchmarks/fastrnns/fuser.py b/benchmarks/fastrnns/fuser.py index e1daab594c5083..29d395055296b5 100644 --- a/benchmarks/fastrnns/fuser.py +++ b/benchmarks/fastrnns/fuser.py @@ -4,18 +4,18 @@ def set_fuser(fuser_name, executor_name): assert fuser_name in ['te', 'old', 'none', 'default'] if fuser_name == 'te': torch._C._jit_set_profiling_executor(True) - 
torch._C._jit_set_profiling_mode(True) + torch._C._get_graph_executor_optimize(True) torch._C._jit_override_can_fuse_on_cpu(False) torch._C._jit_override_can_fuse_on_gpu(True) torch._C._jit_set_texpr_fuser_enabled(True) elif fuser_name == 'old': torch._C._jit_set_profiling_executor(False) - torch._C._jit_set_profiling_mode(False) + torch._C._get_graph_executor_optimize(False) torch._C._jit_override_can_fuse_on_gpu(True) torch._C._jit_set_texpr_fuser_enabled(False) elif fuser_name == 'none': torch._C._jit_set_profiling_executor(False) - torch._C._jit_set_profiling_mode(False) + torch._C._get_graph_executor_optimize(False) torch._C._jit_override_can_fuse_on_gpu(False) torch._C._jit_override_can_fuse_on_cpu(False) torch._C._jit_set_texpr_fuser_enabled(False) @@ -25,12 +25,11 @@ def set_fuser(fuser_name, executor_name): # --executor overrides settings of --fuser if executor_name == 'profiling': torch._C._jit_set_profiling_executor(True) - torch._C._jit_set_profiling_mode(True) + torch._C._get_graph_executor_optimize(True) elif executor_name == 'simple': - torch._C._jit_set_profiling_executor(True) - torch._C._jit_set_profiling_mode(False) + torch._C._get_graph_executor_optimize(False) elif executor_name == 'legacy': torch._C._jit_set_profiling_executor(False) - torch._C._jit_set_profiling_mode(False) + torch._C._get_graph_executor_optimize(True) elif executor_name == 'default': pass diff --git a/benchmarks/operator_benchmark/benchmark_core.py b/benchmarks/operator_benchmark/benchmark_core.py index 4248e4776f22bd..16a66d5cf92be5 100644 --- a/benchmarks/operator_benchmark/benchmark_core.py +++ b/benchmarks/operator_benchmark/benchmark_core.py @@ -200,8 +200,8 @@ def _print_header(self): print("# {}".format(self.args.operators)) def _print_perf_result(self, reported_run_time_us, test_case): - if self.args.ai_pep_format: - # Output for AI-PEP + if self.args.report_aibench: + # Output for AIBench # Print out per iteration execution time instead of avg time return test_name = '_'.join([test_case.framework, test_case.test_config.test_name]) @@ -288,7 +288,7 @@ def _measure_time(self, launch_test, test_case, iters, print_per_iter): report_run_time = 1e6 * run_time_sec / iters time_trace.append(report_run_time) # Print out the time spent in each epoch in ms - if self.args.ai_pep_format: + if self.args.report_aibench: mode = "JIT" if self.use_jit else "Eager" test_name = '_'.join([test_case.framework, test_case.test_config.test_name, mode]) print("PyTorchObserver " + json.dumps( diff --git a/benchmarks/operator_benchmark/benchmark_runner.py b/benchmarks/operator_benchmark/benchmark_runner.py index b9347364428eac..3e998e6ceb4ea2 100644 --- a/benchmarks/operator_benchmark/benchmark_runner.py +++ b/benchmarks/operator_benchmark/benchmark_runner.py @@ -89,12 +89,12 @@ def parse_args(): ) parser.add_argument( - "--ai_pep_format", + "--report_aibench", type=benchmark_utils.str2bool, nargs='?', const=True, default=False, - help="Print result when running on AI-PEP" + help="Print result when running on AIBench" ) parser.add_argument( diff --git a/benchmarks/operator_benchmark/pt/qinterpolate_test.py b/benchmarks/operator_benchmark/pt/qinterpolate_test.py index ec58e6e6a7dd5f..764274f925810e 100644 --- a/benchmarks/operator_benchmark/pt/qinterpolate_test.py +++ b/benchmarks/operator_benchmark/pt/qinterpolate_test.py @@ -44,7 +44,7 @@ def init(self, M, N, K, dtype, mode, scale, contig): zero_point=zero_point, dtype=dtype) if not contig: - permute_dims = list(range(q_input.ndim))[::-1] + permute_dims = 
list(range(self.q_input.ndim))[::-1] self.q_input = self.q_input.permute(permute_dims) self.inputs = { diff --git a/benchmarks/static_runtime/test_static_module.cc b/benchmarks/static_runtime/test_static_module.cc index be634a48def71d..85c5e7832735b6 100644 --- a/benchmarks/static_runtime/test_static_module.cc +++ b/benchmarks/static_runtime/test_static_module.cc @@ -1529,3 +1529,82 @@ TEST(ForceNonEmptyOutputs, TwoSubBlocks) { } } } + +TEST(EliminateExtraPermuteOps, FusesCorrectly) { + const auto src = R"JIT( + def forward(self, x): + y = torch.permute(x, (0, 2, 1)) + z = torch.sum(y, dim=-1) + return z + )JIT"; + torch::jit::Module mod("m"); + mod.define(src); + + auto graph = mod.get_method("forward").graph(); + // turn the ListConstruct(%constant) into proper constant lists + ConstantPropagation(graph); + EliminateExtraPermuteOps(graph); + + EXPECT_FALSE(hasNodeWithKind(graph, "aten::permute")); + auto* sum = getNodeWithKind(graph, "aten::sum"); + ASSERT_NE(sum, nullptr); + auto dim = toIValue(sum->input(1)); + ASSERT_TRUE(dim.has_value() && dim->isIntList()); + EXPECT_EQ(dim->toIntList(), c10::List{1}); +} + +TEST(EliminateExtraPermuteOps, DoesNotFuseWrongDim) { + const auto src = R"JIT( + def forward(self, x): + y = torch.permute(x, (0, 2, 1)) + z = torch.sum(y, dim=1) + return z + )JIT"; + torch::jit::Module mod("m"); + mod.define(src); + + auto graph = mod.get_method("forward").graph(); + // turn the ListConstruct(%constant) into proper constant lists + ConstantPropagation(graph); + EliminateExtraPermuteOps(graph); + + EXPECT_TRUE(hasNodeWithKind(graph, "aten::permute")); +} + +TEST(EliminateExtraPermuteOps, DoesNotFuseNonConstantDim) { + const auto src = R"JIT( + def forward(self, x, dim: int): + y = torch.permute(x, (0, 2, 1)) + z = torch.sum(y, dim=dim) + return z + )JIT"; + torch::jit::Module mod("m"); + mod.define(src); + + auto graph = mod.get_method("forward").graph(); + // turn the ListConstruct(%constant) into proper constant lists + ConstantPropagation(graph); + EliminateExtraPermuteOps(graph); + + EXPECT_TRUE(hasNodeWithKind(graph, "aten::permute")); +} + +TEST(UseSplitAndSqueeze, Fusion) { + const auto src = R"IR( + graph(%x: Tensor): + %dim: int = prim::Constant[value=1]() + %split_size: int = prim::Constant[value=1]() + %split: Tensor[] = aten::split(%x, %split_size, %dim) + %a: Tensor, %b: Tensor = prim::ListUnpack(%split) + %c: Tensor = aten::squeeze(%a, %dim) + %d: Tensor = aten::squeeze(%b, %dim) + return (%c, %d) + )IR"; + auto graph = getGraphFromIR(src); + UseSplitAndSqueeze(graph); + EXPECT_TRUE( + hasNodeWithKind(graph, "static_runtime::fused_split_and_squeeze")); + EXPECT_FALSE(hasNodeWithKind(graph, "aten::split")); + EXPECT_FALSE(hasNodeWithKind(graph, "aten::squeeze")); + EXPECT_FALSE(hasNodeWithKind(graph, "prim::ListUnpack")); +} diff --git a/benchmarks/static_runtime/test_static_runtime.cc b/benchmarks/static_runtime/test_static_runtime.cc index b64e3d8d0d6f5f..7ef02659cc8bfc 100644 --- a/benchmarks/static_runtime/test_static_runtime.cc +++ b/benchmarks/static_runtime/test_static_runtime.cc @@ -172,6 +172,108 @@ TEST(StaticRuntime, Clamp) { testStaticRuntime(clamp_script_2, {a, min_t, max_t}, {b, max_t1, min_t1}); } +TEST(StaticRuntime, LenWithTuple) { + const auto src = R"IR( + graph(%input : int[]): + %res : int = aten::len(%input) + return (%res) + )IR"; + + testStaticRuntime(src, {c10::List(4)}); +} + +TEST(StaticRuntime, LenWithTensor) { + const auto src = R"IR( + graph(%input : Tensor): + %res : int = aten::len(%input) + return (%res) + )IR"; + + 
testStaticRuntime(src, {at::randn({2, 2, 2})}); +} + +TEST(StaticRuntime, LenWithStr) { + const auto src = R"IR( + graph(%input : str): + %res : int = aten::len(%input) + return (%res) + )IR"; + + testStaticRuntime(src, {"static_runtime"}); +} + +TEST(StaticRuntime, LenWithDict_str) { + const auto script = R"JIT( + def forward(self, input: Dict[str, str]): + return len(input) + )JIT"; + + c10::Dict dict; + dict.insert("abc", "123"); + dict.insert("def", "456"); + testStaticRuntime(script, {dict}); +} + +TEST(StaticRuntime, LenWithDict_int) { + const auto script = R"JIT( + def forward(self, input: Dict[int, int]): + return len(input) + )JIT"; + + c10::Dict dict; + dict.insert(0, 1); + dict.insert(2, 3); + testStaticRuntime(script, {dict}); +} + +TEST(StaticRuntime, LenWithDict_bool) { + const auto script = R"JIT( + def forward(self, input: Dict[bool, bool]): + return len(input) + )JIT"; + + c10::Dict dict; + dict.insert(true, false); + dict.insert(false, true); + testStaticRuntime(script, {dict}); +} + +TEST(StaticRuntime, LenWithDict_float) { + const auto script = R"JIT( + def forward(self, input: Dict[float, float]): + return len(input) + )JIT"; + + c10::Dict dict; + dict.insert(0.1, 0.9); + dict.insert(0.8, 0.18); + testStaticRuntime(script, {dict}); +} + +TEST(StaticRuntime, LenWithDict_complex) { + const auto script = R"JIT( + def forward(self, input: Dict[complex, complex]): + return len(input) + )JIT"; + + c10::Dict, c10::complex> dict; + dict.insert(0.1, 0.4); + dict.insert(0.9, 0.45); + testStaticRuntime(script, {dict}); +} + +TEST(StaticRuntime, LenWithDict_Tensor) { + const auto script = R"JIT( + def forward(self, input: Dict[Tensor, Tensor]): + return len(input) + )JIT"; + + c10::Dict dict; + dict.insert(at::randn({1, 2}), at::randn({1, 2})); + dict.insert(at::randn({1, 2}), at::randn({1, 2})); + testStaticRuntime(script, {dict}); +} + TEST(StaticRuntime, Logit) { // no nnc const auto logit_script_1 = R"JIT( @@ -304,13 +406,6 @@ TEST(StaticRuntime, LayerNorm) { return torch.layer_norm(input, normalized_shape, None, None, 1e-05, False).clone() )JIT"; -#ifdef FBCODE_CAFFE2 - script::Module module("module"); - module.define(layer_norm_with_weights); - torch::jit::StaticModule smodule(module); - ASSERT_EQ(getNodeWithKind(smodule, "aten::layer_norm"), nullptr); - ASSERT_NE(getNodeWithKind(smodule, "static_runtime::layer_norm"), nullptr); -#endif const auto a = torch::rand({1, 2, 2, 2}); const auto b = torch::rand({3, 2, 2, 2}); for (int normalized_size : {2, 3}) { @@ -1170,13 +1265,23 @@ TEST(StaticRuntime, Full) { return (a.clone()) )JIT"; - auto dtype = at::ScalarType::Int; auto cpu = at::Device(DeviceType::CPU); c10::List size0{2, 5}; - std::vector args{size0, 4, dtype, at::kStrided, cpu, false}; + std::vector args{ + size0, 4, at::ScalarType::Int, at::kStrided, cpu, false}; + std::vector args1{ + size0, 4, at::ScalarType::Float, at::kStrided, cpu, false}; c10::List size1{5, 6}; - std::vector args2{size1, 5, dtype, at::kStrided, cpu, false}; + std::vector args2{ + size1, 5, at::ScalarType::Float, at::kStrided, cpu, false}; testStaticRuntime(full_script, args); + testStaticRuntime( + full_script, + args, + args1, + /*use_allclose=*/false, + /*use_equalnan=*/false, + /*check_resize=*/false); testStaticRuntime(full_script, args, args2); } @@ -1202,16 +1307,157 @@ TEST(StaticRuntime, FullLike) { auto a = at::randn({2, 3}); auto b = at::randn({3, 4, 2}); - auto dtype = at::ScalarType::Int; auto cpu = at::Device(DeviceType::CPU); std::vector args{ - a, 4, dtype, at::kStrided, cpu, 
false, c10::MemoryFormat::Contiguous}; + a, + 4, + at::ScalarType::Int, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; + std::vector args1{ + a, + 4, + at::ScalarType::Float, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; std::vector args2{ - b, 4, dtype, at::kStrided, cpu, false, c10::MemoryFormat::Contiguous}; + b, + 4, + at::ScalarType::Float, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; testStaticRuntime(full_like_script, args); + testStaticRuntime( + full_like_script, + args, + args1, + /*use_allclose=*/false, + /*use_equalnan=*/false, + /*check_resize=*/false); testStaticRuntime(full_like_script, args, args2); } +TEST(StaticRuntime, Ones) { + const auto script = R"JIT( + def forward(self, + size: List[int], + dtype: Optional[int], + layout: Optional[int], + device: Optional[Device], + pin_memory: Optional[bool]): + a = torch.ones(size, + dtype=dtype, + layout=layout, + device=device, + pin_memory=pin_memory) + return (a.clone()) + )JIT"; + + auto dtype = at::ScalarType::Int; + auto cpu = at::Device(DeviceType::CPU); + c10::List size0{2, 5}; + std::vector args{size0, dtype, at::kStrided, cpu, false}; + c10::List size1{5, 6}; + std::vector args2{size1, dtype, at::kStrided, cpu, false}; + testStaticRuntime(script, args); + testStaticRuntime(script, args, args2); +} + +TEST(StaticRuntime, OnesLike) { + const auto script = R"JIT( + def forward(self, + input: Tensor, + dtype: Optional[int], + layout: Optional[int], + device: Optional[Device], + pin_memory: Optional[bool], + memory_format: Optional[int]): + a = torch.ones_like(input, + dtype=dtype, + layout=layout, + device=device, + pin_memory=pin_memory, + memory_format=memory_format) + return (a.clone()) + )JIT"; + + auto cpu = at::Device(DeviceType::CPU); + auto input0 = at::randn({2, 5}); + std::vector args{ + input0, + at::ScalarType::Int, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; + std::vector args1{ + input0, + at::ScalarType::Float, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; + auto input1 = at::randn({5, 6}); + std::vector args2{ + input1, + at::ScalarType::Float, + at::kStrided, + cpu, + false, + c10::MemoryFormat::Contiguous}; + testStaticRuntime(script, args); + testStaticRuntime( + script, + args, + args1, + /*use_allclose=*/false, + /*use_equalnan=*/false, + /*check_resize=*/false); + testStaticRuntime(script, args, args2); +} + +TEST(StaticRuntime, Zeros) { + const auto script = R"JIT( + def forward(self, + size: List[int], + dtype: Optional[int], + layout: Optional[int], + device: Optional[Device], + pin_memory: Optional[bool]): + a = torch.zeros(size, + dtype=dtype, + layout=layout, + device=device, + pin_memory=pin_memory) + return (a.clone()) + )JIT"; + + auto cpu = at::Device(DeviceType::CPU); + c10::List size0{2, 5}; + std::vector args{ + size0, at::ScalarType::Int, at::kStrided, cpu, false}; + std::vector args1{ + size0, at::ScalarType::Float, at::kStrided, cpu, false}; + c10::List size1{5, 6}; + std::vector args2{ + size1, at::ScalarType::Float, at::kStrided, cpu, false}; + testStaticRuntime(script, args); + testStaticRuntime( + script, + args, + args1, + /*use_allclose=*/false, + /*use_equalnan=*/false, + /*check_resize=*/false); + testStaticRuntime(script, args, args2); +} + TEST(StaticRuntime, Linear) { const auto linear_script = R"JIT( def forward(self, inp: Tensor, weights: Tensor, bias: Optional[Tensor]) -> Tensor: @@ -1442,6 +1688,28 @@ TEST(StaticRuntime, Index) { 
testStaticRuntime(index_with_two_tensors_script, args_c, args_d); } +TEST(StaticRuntime, IndexSelect) { + const std::string script = R"IR( + graph(%self: Tensor, %dim: int, %index: Tensor): + %bias: None = prim::Constant() + %ret = aten::index_select(%self, %dim, %index) + %cloned = aten::clone(%ret, %bias) + return (%cloned) + )IR"; + + auto self0 = at::rand({6}); + auto dim0 = 0; + auto index0 = at::randint(0, 5, {6}, torch::kInt32); + std::vector args{self0, dim0, index0}; + testStaticRuntime(script, args); + + auto self1 = at::rand({128}); + auto dim1 = 0; + auto index1 = at::randint(0, 127, {127}, torch::kInt32); + std::vector args2{self1, dim1, index1}; + testStaticRuntime(script, args, args2); +} + TEST(StaticRuntime, ClampMin) { const auto clamp_min_int_script = R"JIT( def forward(self, a: Tensor, b: int): @@ -1784,6 +2052,27 @@ TEST(StaticRuntime, QuantizedLinearDynamicFp16) { {input_2, weight_2}); } +TEST(StaticRuntime, QuantizedLinearReluDynamicFp16) { + const std::string quantized_linear_relu_dynamic_fp16_script = R"IR( + graph(%input: Tensor, %weights: Tensor): + %bias: None = prim::Constant() + %packed_params = quantized::linear_prepack_fp16(%weights, %bias) + %output = quantized::linear_relu_dynamic_fp16(%input, %packed_params) + %ret = aten::clone(%output, %bias) + return (%output) + )IR"; + at::Tensor weight = torch::randn({3, 2}, torch::kFloat); + at::Tensor input = torch::randn({3, 2}, torch::kFloat); + + at::Tensor weight_2 = torch::randn({4, 3}, torch::kFloat); + at::Tensor input_2 = torch::randn({5, 3}, torch::kFloat); + + testStaticRuntime( + quantized_linear_relu_dynamic_fp16_script, + {input, weight}, + {input_2, weight_2}); +} + TEST(StaticRuntime, VarStack) { const auto var_stack_script = R"JIT( def forward(self, inp1: Tensor, inp2: Tensor, dim: int): @@ -2745,3 +3034,148 @@ TEST(StaticRuntime, IfThenElse) { testStaticRuntime(src, args1); testStaticRuntime(src, args2); } + +TEST(StaticRuntime, EmptyIfBlock) { + const auto src = + R"JIT( + def forward(self, cond: bool, a: Tensor, b: Tensor): + l = [] + if cond: + l.append((a + b).clone()) + return l + )JIT"; + + testStaticRuntime(src, {true, at::rand(1), at::rand({1, 2})}); + testStaticRuntime(src, {false, at::rand(1), at::rand({1, 2})}); +} + +TEST(StaticRuntime, EmptyNestedIfBlock) { + const auto src = + R"JIT( + def forward(self, cond: bool, a: Tensor, b: Tensor): + l = [] + if cond: + if cond: + l.append((a + b).clone()) + return l + )JIT"; + + testStaticRuntime(src, {true, at::rand(1), at::rand({1, 2})}); + testStaticRuntime(src, {false, at::rand(1), at::rand({1, 2})}); +} + +TEST(StaticRuntime, StackEmpty) { + const auto src = R"JIT( + def forward(self): + x = torch.stack([]) + return x + )JIT"; + + torch::jit::Module mod("mod"); + mod.define(src); + + torch::jit::StaticModule smod(mod); + EXPECT_THROW(smod({}), c10::Error); +} + +TEST(StaticRuntime, ConcatEmpty) { + const auto src = R"JIT( + def forward(self): + x = torch.concat([]) + return x + )JIT"; + + torch::jit::Module mod("mod"); + mod.define(src); + + torch::jit::StaticModule smod(mod); + EXPECT_THROW(smod({}), c10::Error); +} + +TEST(StaticRuntime, IntImplicit) { + const auto src = R"IR( + graph(%a: Tensor): + %y: int = aten::IntImplicit(%a) + return (%y) + )IR"; + testStaticRuntime(src, {at::tensor({1}, at::kInt).squeeze()}); +} + +TEST(StaticRuntime, IntImplicit_ThrowOnBadInputs) { + const auto src = R"IR( + graph(%a: Tensor): + %y: int = aten::IntImplicit(%a) + return (%y) + )IR"; + auto graph = getGraphFromIR(src); + torch::jit::StaticModule 
smod(graph); + // Not 0D tensor + EXPECT_THROW(smod({at::tensor({1, 2}, at::kInt)}), std::runtime_error); + // Wrong dtype + EXPECT_THROW( + smod({at::tensor({1}, at::kFloat).squeeze()}), std::runtime_error); +} + +TEST(StaticRuntime, Select) { + const auto src = R"IR( + graph(%a: Tensor, %dim: int, %index: int): + %none: NoneType = prim::Constant() + %b: Tensor = aten::select(%a, %dim, %index) + %c: Tensor = aten::clone(%b, %none) + return (%c) + )IR"; + testStaticRuntime(src, {at::randn({2, 2}), 0, 1}); +} + +TEST(StaticRuntime, ReshapeAs) { + const auto src = R"JIT( + def forward(self, a, b): + return a.reshape_as(b).clone() + )JIT"; + testStaticRuntime(src, {at::randn({2, 2}), at::randn({4})}); +} + +TEST(StaticRuntime, MoveCtor) { + auto mod = getDeepAndWideSciptModel(); + std::vector args{ + at::randn({1, 1, 32}), at::randn({1, 1, 32}), at::randn({1, 50})}; + + torch::jit::StaticModule smod(mod); + + torch::jit::StaticRuntime runtime(smod); + auto expected = runtime(args); + + torch::jit::StaticRuntime new_runtime(std::move(runtime)); + auto actual = new_runtime(args); + compareResults(expected, actual); +} + +TEST(StaticRuntime, SingleBlockIfReturnList) { + const auto src = R"JIT( + def forward(self, a, b, cond: bool): + lst = [] + if cond: + lst.append(a + b) + return lst + )JIT"; + std::vector args1{at::randn({1}), at::randn({1}), true}; + std::vector args2{at::randn({42, 42}), at::randn({42, 42}), false}; + testStaticRuntime(src, args1, args2); +} + +TEST(StaticRuntime, NestedBlockIfReturnList) { + const auto src = R"JIT( + def forward(self, a, b, cond1: bool, cond2: bool): + if cond1: + lst = [] + if cond2: + lst.append(a + b) + lst.append(a * b) + return lst + return [] + )JIT"; + std::vector args1{at::randn({1}), at::randn({1}), true, true}; + std::vector args2{ + at::randn({42, 42}), at::randn({42, 42}), true, false}; + testStaticRuntime(src, args1, args2); +} diff --git a/benchmarks/static_runtime/test_utils.cc b/benchmarks/static_runtime/test_utils.cc index 6b0794d4ab9292..7e0733fbc8af43 100644 --- a/benchmarks/static_runtime/test_utils.cc +++ b/benchmarks/static_runtime/test_utils.cc @@ -146,11 +146,13 @@ void compareTensorLists( } } +} // namespace + void compareResults( const IValue& expect, const IValue& actual, - const bool use_allclose = false, - const bool use_equalnan = false) { + const bool use_allclose, + const bool use_equalnan) { if (expect.isTensor()) { VLOG(2) << "expect " << expect.toTensor() << std::endl; VLOG(2) << "output " << actual.toTensor() << std::endl; @@ -198,8 +200,6 @@ void compareResults( } } -} // namespace - at::Tensor getTensor(const at::IValue& ival) { if (ival.isTensor()) { return ival.toTensor(); diff --git a/benchmarks/static_runtime/test_utils.h b/benchmarks/static_runtime/test_utils.h index cb0a5a4a8c2ed9..27efd4d7d42efc 100644 --- a/benchmarks/static_runtime/test_utils.h +++ b/benchmarks/static_runtime/test_utils.h @@ -53,6 +53,12 @@ void compareResultsWithJIT( const bool use_allclose = false, const bool use_equalnan = false); +void compareResults( + const IValue& expect, + const IValue& actual, + const bool use_allclose = false, + const bool use_equalnan = false); + } // namespace test } // namespace jit } // namespace torch diff --git a/benchmarks/tensorexpr/__main__.py b/benchmarks/tensorexpr/__main__.py index f243ff5b61051e..63a1462d33d14f 100644 --- a/benchmarks/tensorexpr/__main__.py +++ b/benchmarks/tensorexpr/__main__.py @@ -137,7 +137,7 @@ def main(): torch._C._jit_set_profiling_executor(True) 
torch._C._jit_set_texpr_fuser_enabled(True) torch._C._jit_override_can_fuse_on_gpu(True) - torch._C._jit_set_profiling_mode(True) + torch._C._get_graph_executor_optimize(True) elif args.cuda_fuser == "old": import torch torch._C._jit_set_profiling_executor(False) @@ -148,7 +148,7 @@ def main(): torch._C._jit_set_profiling_executor(True) torch._C._jit_set_texpr_fuser_enabled(False) torch._C._jit_set_nvfuser_enabled(True) - torch._C._jit_set_profiling_mode(True) + torch._C._get_graph_executor_optimize(True) else : raise ValueError("Undefined fuser: {}".format(args.cuda_fuser)) diff --git a/binaries/CMakeLists.txt b/binaries/CMakeLists.txt index a98754eea2c390..b683ee002280c9 100644 --- a/binaries/CMakeLists.txt +++ b/binaries/CMakeLists.txt @@ -4,6 +4,7 @@ if(INTERN_BUILD_MOBILE) caffe2_binary_target("speed_benchmark.cc") else() caffe2_binary_target("speed_benchmark_torch.cc") + caffe2_binary_target("load_benchmark_torch.cc") if(NOT BUILD_LITE_INTERPRETER) caffe2_binary_target("compare_models_torch.cc") endif() diff --git a/binaries/load_benchmark_torch.cc b/binaries/load_benchmark_torch.cc new file mode 100644 index 00000000000000..330955657ece6e --- /dev/null +++ b/binaries/load_benchmark_torch.cc @@ -0,0 +1,93 @@ +/** + * Copyright (c) 2016-present, Facebook, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include + +#include +#include "caffe2/core/timer.h" +#include "caffe2/utils/string_utils.h" +#include +#include +#include +#include +#include + +#include + +#include +using namespace std::chrono; + +C10_DEFINE_string(model, "", "The given torch script model to benchmark."); +C10_DEFINE_int(iter, 10, "The number of iterations to run."); +C10_DEFINE_bool( + report_pep, + true, + "Whether to print performance stats for AI-PEP."); + +int main(int argc, char** argv) { + c10::SetUsageMessage( + "Run model load time benchmark for pytorch model.\n" + "Example usage:\n" + "./load_benchmark_torch" + " --model=" + " --iter=20"); + if (!c10::ParseCommandLineFlags(&argc, &argv)) { + std::cerr << "Failed to parse command line flags!" << std::endl; + return 1; + } + + std::cout << "Starting benchmark." 
<< std::endl; + CAFFE_ENFORCE( + FLAGS_iter >= 0, + "Number of main runs should be non negative, provided ", + FLAGS_iter, + "."); + + caffe2::Timer timer; + std::vector times; + + for (int i = 0; i < FLAGS_iter; ++i) { + auto start = high_resolution_clock::now(); + +#if BUILD_LITE_INTERPRETER + auto module = torch::jit::_load_for_mobile(FLAGS_model); +#else + auto module = torch::jit::load(FLAGS_model); +#endif + + auto stop = high_resolution_clock::now(); + auto duration = duration_cast(stop - start); + times.push_back(duration.count()); + } + + const double micros = static_cast(timer.MicroSeconds()); + if (FLAGS_report_pep) { + for (auto t : times) { + std::cout << R"(PyTorchObserver {"type": "NET", "unit": "us", )" + << R"("metric": "latency", "value": ")" + << t << R"("})" << std::endl; + } + } + + const double iters = static_cast(FLAGS_iter); + std::cout << "Main run finished. Microseconds per iter: " + << micros / iters + << ". Iters per second: " << 1000.0 * 1000 * iters / micros + << std::endl; + + return 0; +} diff --git a/c10/core/Backend.h b/c10/core/Backend.h index e17a1bc4226c69..59805f4a7ab1ae 100644 --- a/c10/core/Backend.h +++ b/c10/core/Backend.h @@ -32,6 +32,7 @@ enum class Backend { HIP, VE, FPGA, + IPU, XPU, SparseCPU, SparseCUDA, @@ -96,6 +97,8 @@ static inline Backend dispatchKeyToBackend(DispatchKey t) { return Backend::QuantizedCPU; } else if (t == DispatchKey::QuantizedCUDA) { return Backend::QuantizedCUDA; + } else if (t == DispatchKey::IPU || t == DispatchKey::AutogradIPU) { + return Backend::IPU; } else if (t == DispatchKey::XPU || t == DispatchKey::AutogradXPU) { return Backend::XPU; } else if (t == DispatchKey::SparseXPU) { @@ -129,6 +132,8 @@ static inline DispatchKey backendToDispatchKey(Backend b) { return DispatchKey::XLA; case Backend::Lazy: return DispatchKey::Lazy; + case Backend::IPU: + return DispatchKey::IPU; case Backend::XPU: return DispatchKey::XPU; case Backend::SparseXPU: @@ -196,6 +201,8 @@ static inline DeviceType backendToDeviceType(Backend b) { return DeviceType::CPU; case Backend::SparseCsrCUDA: return DeviceType::CUDA; + case Backend::IPU: + return DeviceType::IPU; case Backend::XPU: case Backend::SparseXPU: case Backend::QuantizedXPU: @@ -235,6 +242,8 @@ static inline const char* toString(Backend b) { return "FPGA"; case Backend::XPU: return "XPU"; + case Backend::IPU: + return "IPU"; case Backend::ORT: return "ORT"; case Backend::XLA: diff --git a/c10/core/Device.cpp b/c10/core/Device.cpp index 2531e3942271ad..1e0e4104144dc6 100644 --- a/c10/core/Device.cpp +++ b/c10/core/Device.cpp @@ -20,6 +20,7 @@ DeviceType parse_type(const std::string& device_string) { types = {{ {"cpu", DeviceType::CPU}, {"cuda", DeviceType::CUDA}, + {"ipu", DeviceType::IPU}, {"xpu", DeviceType::XPU}, {"mkldnn", DeviceType::MKLDNN}, {"opengl", DeviceType::OPENGL}, @@ -47,7 +48,7 @@ DeviceType parse_type(const std::string& device_string) { } TORCH_CHECK( false, - "Expected one of cpu, cuda, xpu, mkldnn, opengl, opencl, ideep, hip, ve, ort, mlc, xla, lazy, vulkan, meta, hpu device type at start of device string: ", + "Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, ort, mlc, xla, lazy, vulkan, meta, hpu device type at start of device string: ", device_string); } enum DeviceStringParsingState { START, INDEX_START, INDEX_REST, ERROR }; diff --git a/c10/core/Device.h b/c10/core/Device.h index b935eed6a65659..92ba0fc44d6707 100644 --- a/c10/core/Device.h +++ b/c10/core/Device.h @@ -96,11 +96,21 @@ struct C10_API Device final { return type_ 
== DeviceType::XPU; } + /// Return true if the device is of IPU type. + bool is_ipu() const noexcept { + return type_ == DeviceType::IPU; + } + /// Return true if the device is of HPU type. bool is_hpu() const noexcept { return type_ == DeviceType::HPU; } + /// Return true if the device is of META type. + bool is_meta() const noexcept { + return type_ == DeviceType::Meta; + } + /// Return true if the device is of CPU type. bool is_cpu() const noexcept { return type_ == DeviceType::CPU; diff --git a/c10/core/DeviceType.cpp b/c10/core/DeviceType.cpp index 4635acdb148c22..a076c5a5b0245c 100644 --- a/c10/core/DeviceType.cpp +++ b/c10/core/DeviceType.cpp @@ -43,6 +43,8 @@ std::string DeviceTypeName(DeviceType d, bool lower_case) { return lower_case ? "meta" : "META"; case DeviceType::HPU: return lower_case ? "hpu" : "HPU"; + case DeviceType::IPU: + return lower_case ? "ipu" : "IPU"; default: TORCH_CHECK( false, @@ -84,6 +86,7 @@ bool isValidDeviceType(DeviceType d) { case DeviceType::XPU: case DeviceType::Meta: case DeviceType::HPU: + case DeviceType::IPU: return true; default: return false; diff --git a/c10/core/DeviceType.h b/c10/core/DeviceType.h index c6bd56914d6d18..c2264532555134 100644 --- a/c10/core/DeviceType.h +++ b/c10/core/DeviceType.h @@ -31,11 +31,12 @@ enum class DeviceType : int8_t { HPU = 15, // HPU / HABANA VE = 16, // SX-Aurora / NEC Lazy = 17, // Lazy Tensors + IPU = 18, // Graphcore IPU // NB: If you add more devices: // - Change the implementations of DeviceTypeName and isValidDeviceType // in DeviceType.cpp // - Change the number below - COMPILE_TIME_MAX_DEVICE_TYPES = 18, + COMPILE_TIME_MAX_DEVICE_TYPES = 19, }; constexpr DeviceType kCPU = DeviceType::CPU; @@ -52,18 +53,19 @@ constexpr DeviceType kXPU = DeviceType::XPU; constexpr DeviceType kHPU = DeviceType::HPU; constexpr DeviceType kVE = DeviceType::VE; constexpr DeviceType kLazy = DeviceType::Lazy; +constexpr DeviceType kIPU = DeviceType::IPU; // define explicit int constant constexpr int COMPILE_TIME_MAX_DEVICE_TYPES = static_cast(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES); static_assert( - COMPILE_TIME_MAX_DEVICE_TYPES <= 18, + COMPILE_TIME_MAX_DEVICE_TYPES <= 19, "Hey! You seem to be adding a lot of new DeviceTypes. The intent was " "for this constant to reflect the actual number of DeviceTypes we support " "in PyTorch; it's important that this number is not too large as we " "use this to allocate stack arrays in some places in our code. If you " - "are indeed just adding the 18th device type, feel free to change " + "are indeed just adding the 19th device type, feel free to change " "the check to 32; but if you are adding some sort of extensible device " "types registration, please be aware that you are affecting code that " "this number is small. 
Try auditing uses of this constant."); diff --git a/c10/core/DispatchKey.cpp b/c10/core/DispatchKey.cpp index 6dbcaf88d5db78..14f501a6ddd02f 100644 --- a/c10/core/DispatchKey.cpp +++ b/c10/core/DispatchKey.cpp @@ -1,14 +1,49 @@ #include +#include #include namespace c10 { +const char* toString(BackendComponent t) { + switch (t) { + case BackendComponent::CPUBit: + return "CPUBit"; + case BackendComponent::CUDABit: + return "CUDABit"; + case BackendComponent::HIPBit: + return "HIPBit"; + case BackendComponent::XLABit: + return "XLABit"; + case BackendComponent::LazyBit: + return "LazyBit"; + case BackendComponent::XPUBit: + return "XPUBit"; + case BackendComponent::IPUBit: + return "IPUBit"; + case BackendComponent::MLCBit: + return "MLCBit"; + case BackendComponent::HPUBit: + return "HPUBit"; + case BackendComponent::VEBit: + return "VEBit"; + case BackendComponent::PrivateUse1Bit: + return "PrivateUse1Bit"; + case BackendComponent::PrivateUse2Bit: + return "PrivateUse2Bit"; + case BackendComponent::PrivateUse3Bit: + return "PrivateUse3Bit"; + case BackendComponent::InvalidBit: + return "InvalidBit"; + default: + return "UNKNOWN_BACKEND_BIT"; + } +} + const char* toString(DispatchKey t) { switch (t) { case DispatchKey::Undefined: return "Undefined"; - case DispatchKey::CPU: return "CPU"; case DispatchKey::CUDA: @@ -21,6 +56,8 @@ const char* toString(DispatchKey t) { return "FPGA"; case DispatchKey::XPU: return "XPU"; + case DispatchKey::IPU: + return "IPU"; case DispatchKey::ORT: return "ORT"; case DispatchKey::XLA: @@ -91,6 +128,8 @@ const char* toString(DispatchKey t) { return "Autograd"; case DispatchKey::AutogradCPU: return "AutogradCPU"; + case DispatchKey::AutogradIPU: + return "AutogradIPU"; case DispatchKey::AutogradXPU: return "AutogradXPU"; case DispatchKey::AutogradCUDA: @@ -103,8 +142,6 @@ const char* toString(DispatchKey t) { return "AutogradMLC"; case DispatchKey::AutogradHPU: return "AutogradHPU"; - case DispatchKey::AutogradNestedTensor: - return "AutogradNestedTensor"; case DispatchKey::AutogradPrivateUse1: return "AutogradPrivateUse1"; case DispatchKey::AutogradPrivateUse2: @@ -113,6 +150,8 @@ const char* toString(DispatchKey t) { return "AutogradPrivateUse3"; case DispatchKey::AutogradOther: return "AutogradOther"; + case DispatchKey::AutogradNestedTensor: + return "AutogradNestedTensor"; case DispatchKey::ZeroTensor: return "ZeroTensor"; @@ -170,6 +209,15 @@ const char* toString(DispatchKey t) { case DispatchKey::FuncTorchBatched: return "FuncTorchBatched"; + case DispatchKey::Dense: + return "Dense"; + case DispatchKey::Quantized: + return "Quantized"; + case DispatchKey::Sparse: + return "Sparse"; + case DispatchKey::AutogradFunctionality: + return "AutogradFunctionality"; + default: return "UNKNOWN_TENSOR_TYPE_ID"; } @@ -178,76 +226,37 @@ const char* toString(DispatchKey t) { std::ostream& operator<<(std::ostream& str, DispatchKey rhs) { return str << toString(rhs); } +std::ostream& operator<<(std::ostream& str, BackendComponent rhs) { + return str << toString(rhs); +} -// for a given backend key, return the associated autograd key. -// for non-backend keys, return AutogradOther as a default. -// Note: it's convenient and fast to return a default here rather than (say) -// returning an optional, or throwing. But it makes callers -// responsible for either a) enforcing the invariant that only backend keys -// be passed as arguments, or b) interpreting our return value carefully. 
-// -DispatchKey getAutogradKeyFromBackend(DispatchKey t) { - switch (t) { - case DispatchKey::CPU: - return DispatchKey::AutogradCPU; - case DispatchKey::XPU: - return DispatchKey::AutogradXPU; - case DispatchKey::CUDA: - return DispatchKey::AutogradCUDA; - case DispatchKey::XLA: - return DispatchKey::AutogradXLA; - case DispatchKey::Lazy: - return DispatchKey::AutogradLazy; - case DispatchKey::MLC: - return DispatchKey::AutogradMLC; - case DispatchKey::HPU: - return DispatchKey::AutogradHPU; - case DispatchKey::NestedTensor: - return DispatchKey::AutogradNestedTensor; - case DispatchKey::PrivateUse1: - return DispatchKey::AutogradPrivateUse1; - case DispatchKey::PrivateUse2: - return DispatchKey::AutogradPrivateUse2; - case DispatchKey::PrivateUse3: - return DispatchKey::AutogradPrivateUse3; - default: - return DispatchKey::AutogradOther; - } +DispatchKey getAutogradKeyFromBackend(BackendComponent k) { + // We want this to return an autograd key. We're relying on the fact that + // getAutogradRelatedKeySetFromBackend returns an autograd key + + // ADInplaceOrView, and autograd has higher precedence. The core mapping from + // backend -> autograd key lives in `getAutogradRelatedKeySetFromBackend` + // instead of here for performance. `getAutogradRelatedKeySetFromBackend` is a + // hotpath function, and we want to make sure that it doesn't have to + // construct any DispatchKeySets at runtime. + return getAutogradRelatedKeySetFromBackend(k).highestPriorityTypeId(); } c10::DispatchKey parseDispatchKey(const std::string& k) { static std::unordered_map key_map = { {"Undefined", c10::DispatchKey::Undefined}, - {"CPU", c10::DispatchKey::CPU}, - {"CUDA", c10::DispatchKey::CUDA}, - {"HIP", c10::DispatchKey::HIP}, + {"Dense", c10::DispatchKey::Dense}, {"FPGA", c10::DispatchKey::FPGA}, {"ORT", c10::DispatchKey::ORT}, - {"XLA", c10::DispatchKey::XLA}, - {"MLC", c10::DispatchKey::MLC}, {"Vulkan", c10::DispatchKey::Vulkan}, {"Metal", c10::DispatchKey::Metal}, - {"XPU", c10::DispatchKey::XPU}, - {"HPU", c10::DispatchKey::HPU}, {"VE", c10::DispatchKey::VE}, - {"Lazy", c10::DispatchKey::Lazy}, {"Meta", c10::DispatchKey::Meta}, - {"QuantizedCPU", c10::DispatchKey::QuantizedCPU}, - {"QuantizedCUDA", c10::DispatchKey::QuantizedCUDA}, - {"QuantizedXPU", c10::DispatchKey::QuantizedXPU}, + {"Quantized", c10::DispatchKey::Quantized}, {"CustomRNGKeyId", c10::DispatchKey::CustomRNGKeyId}, {"MkldnnCPU", c10::DispatchKey::MkldnnCPU}, - {"SparseCPU", c10::DispatchKey::SparseCPU}, - {"SparseCUDA", c10::DispatchKey::SparseCUDA}, - {"SparseHIP", c10::DispatchKey::SparseHIP}, - {"SparseXPU", c10::DispatchKey::SparseXPU}, - {"SparseVE", c10::DispatchKey::SparseVE}, + {"Sparse", c10::DispatchKey::Sparse}, {"SparseCsrCPU", c10::DispatchKey::SparseCsrCPU}, {"SparseCsrCUDA", c10::DispatchKey::SparseCsrCUDA}, - {"NestedTensor", c10::DispatchKey::NestedTensor}, - {"PrivateUse1", c10::DispatchKey::PrivateUse1}, - {"PrivateUse2", c10::DispatchKey::PrivateUse2}, - {"PrivateUse3", c10::DispatchKey::PrivateUse3}, {"BackendSelect", c10::DispatchKey::BackendSelect}, {"Python", c10::DispatchKey::Python}, {"PythonTLSSnapshot", c10::DispatchKey::PythonTLSSnapshot}, @@ -259,17 +268,8 @@ c10::DispatchKey parseDispatchKey(const std::string& k) { c10::DispatchKey::FuncTorchDynamicLayerBackMode}, {"ADInplaceOrView", c10::DispatchKey::ADInplaceOrView}, {"AutogradOther", c10::DispatchKey::AutogradOther}, - {"AutogradCPU", c10::DispatchKey::AutogradCPU}, - {"AutogradCUDA", c10::DispatchKey::AutogradCUDA}, - {"AutogradXLA", 
c10::DispatchKey::AutogradXLA}, - {"AutogradLazy", c10::DispatchKey::AutogradLazy}, - {"AutogradXPU", c10::DispatchKey::AutogradXPU}, - {"AutogradMLC", c10::DispatchKey::AutogradMLC}, - {"AutogradHPU", c10::DispatchKey::AutogradHPU}, + {"AutogradFunctionality", c10::DispatchKey::AutogradFunctionality}, {"AutogradNestedTensor", c10::DispatchKey::AutogradNestedTensor}, - {"AutogradPrivateUse1", c10::DispatchKey::AutogradPrivateUse1}, - {"AutogradPrivateUse2", c10::DispatchKey::AutogradPrivateUse2}, - {"AutogradPrivateUse3", c10::DispatchKey::AutogradPrivateUse3}, {"Tracer", c10::DispatchKey::Tracer}, {"AutocastCPU", c10::DispatchKey::AutocastCPU}, {"AutocastCUDA", c10::DispatchKey::AutocastCUDA}, @@ -283,6 +283,43 @@ c10::DispatchKey parseDispatchKey(const std::string& k) { {"TESTING_ONLY_GenericWrapper", c10::DispatchKey::TESTING_ONLY_GenericWrapper}, {"TESTING_ONLY_GenericMode", c10::DispatchKey::TESTING_ONLY_GenericMode}, + + {"CPU", c10::DispatchKey::CPU}, + {"CUDA", c10::DispatchKey::CUDA}, + {"HIP", c10::DispatchKey::HIP}, + {"XLA", c10::DispatchKey::XLA}, + {"MLC", c10::DispatchKey::MLC}, + {"XPU", c10::DispatchKey::XPU}, + {"IPU", c10::DispatchKey::IPU}, + {"HPU", c10::DispatchKey::HPU}, + {"Lazy", c10::DispatchKey::Lazy}, + {"NestedTensor", c10::DispatchKey::NestedTensor}, + {"PrivateUse1", c10::DispatchKey::PrivateUse1}, + {"PrivateUse2", c10::DispatchKey::PrivateUse2}, + {"PrivateUse3", c10::DispatchKey::PrivateUse3}, + + {"QuantizedCPU", c10::DispatchKey::QuantizedCPU}, + {"QuantizedCUDA", c10::DispatchKey::QuantizedCUDA}, + {"QuantizedXPU", c10::DispatchKey::QuantizedXPU}, + + {"SparseCPU", c10::DispatchKey::SparseCPU}, + {"SparseCUDA", c10::DispatchKey::SparseCUDA}, + {"SparseHIP", c10::DispatchKey::SparseHIP}, + {"SparseXPU", c10::DispatchKey::SparseXPU}, + {"SparseVE", c10::DispatchKey::SparseVE}, + + {"AutogradCPU", c10::DispatchKey::AutogradCPU}, + {"AutogradCUDA", c10::DispatchKey::AutogradCUDA}, + {"AutogradXLA", c10::DispatchKey::AutogradXLA}, + {"AutogradLazy", c10::DispatchKey::AutogradLazy}, + {"AutogradIPU", c10::DispatchKey::AutogradIPU}, + {"AutogradXPU", c10::DispatchKey::AutogradXPU}, + {"AutogradMLC", c10::DispatchKey::AutogradMLC}, + {"AutogradHPU", c10::DispatchKey::AutogradHPU}, + {"AutogradPrivateUse1", c10::DispatchKey::AutogradPrivateUse1}, + {"AutogradPrivateUse2", c10::DispatchKey::AutogradPrivateUse2}, + {"AutogradPrivateUse3", c10::DispatchKey::AutogradPrivateUse3}, + {"Autograd", c10::DispatchKey::Autograd}, {"CompositeImplicitAutograd", c10::DispatchKey::CompositeImplicitAutograd}, diff --git a/c10/core/DispatchKey.h b/c10/core/DispatchKey.h index 29315051b4177e..9ea1a36c2bb700 100644 --- a/c10/core/DispatchKey.h +++ b/c10/core/DispatchKey.h @@ -9,20 +9,99 @@ namespace c10 { +// Semantically, each value of BackendComponent identifies a "backend" for our +// dispatch. Some functionalities that we may dispatch to are allowed to +// register different handlers for each backend. The BackendComponent is then +// used to figure out which backend implementation to dispatch to. + +// In implementation terms, the backend component identifies a specific "bit" in +// a DispatchKeySet. The bits in the DispatchKeySet are split between the bottom +// ~12 "BackendComponent" bits, while the remaining upper bits are assigned to +// functionalities. 
When we encounter a functionality bit that is known to be +// customizeable per-backend, then we also look at the lower BackendComponent +// bits and take the highest bit to determine which backend's implementation to +// use. + +enum class BackendComponent : uint8_t { + + // A "backend" is colloquially used to refer to handlers for dispatch + // which actually implement the numerics of an operation in question. + // + // Due to the nature of the enum, these backends are specified in + // an ordered way, but for most backends this order is not semantically + // meaningful (e.g., it's valid to reorder these backends without changing + // semantics). The only situation when backend ordering is meaningful + // is when the backend participates in multiple dispatch with another + // backend; e.g., CPU and CUDA (cuda must have higher priority). + + // These keys don't correspond to individual kernels. + // Instead, they represent the backends that are allowed to override specific + // pieces of functionality: + // - dense kernels (e.g. DispatchKey::CPU) + // - sparse kernels (e.g. DispatchKey::SparseCPU) + // - quantized kernels (e.g. DispatchKey::QuantizedCPU) + // - autograd kernels (e.g. DispatchKey::AutogradCPU) + // We reserve space in the runtime operator table for this full cross product + // of + // [backends in this enum] x [keys below that are explicitly marked as having + // per-backend functionality] + + InvalidBit = 0, + CPUBit, + CUDABit, + HIPBit, + XLABit, + MLCBit, + IPUBit, + XPUBit, + HPUBit, + VEBit, + LazyBit, + PrivateUse1Bit, + PrivateUse2Bit, + PrivateUse3Bit, + // Define an alias to represent end of backend dispatch keys. + // If you add new backend keys after PrivateUse3, please also update it here. + // (But you shouldn't: private use keys should have higher precedence than + // all built-in keys) + EndOfBackendKeys = PrivateUse3Bit, +}; + // Semantically, a dispatch key identifies a possible "level" in our -// dispatch, for which a handler may be registered. Traditional -// backends like CPU and CUDA get dispatch keys; however, so do -// "wrapping" layers like Variable (for autograd handling). +// dispatch, for which a handler may be registered. Each handler corresponds +// to a type of functionality. // // In implementation terms, the dispatch key identifies a specific "bit" in a // DispatchKeySet. Higher bit indexes get handled by dispatching first (because // we "count leading zeros" when we extract the highest priority dispatch // key.) // +// Note [DispatchKey Classification] +// This enum actually contains several types of keys, which are explained +// in more detail further down: +// (1) non-customizable backends (e.g. FPGA) +// (2) non-customizable functionalities (e.g. Functionalize) +// (3) functionalized that are customizable per backend (e.g. Dense, Sparse, +// AutogradFunctionality) (4) per-backend instances of customizable +// functionalities (e.g. CPU, SparseCPU, AutogradCPU) (5) alias keys (e.g. +// CompositeImplicitAutograd) +// +// Of the categories above, it's important to note: +// (a) which keys are assigned individual bits in a DispatchKeySet +// (b) which keys are assigned individual slots in the runtime operator table +// ("Runtime keys") +// +// (1), (2) and (3) all get their own dedicated bits in the DispatchKeySet. +// (1), (2) and (4) all get their own dedicated slots in the runtime operator +// table. + +// See Note [DispatchKeySet Internal Representation] for more details. 
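
To make the key classification above concrete, here is a minimal sketch (illustrative only, not part of the diff; it assumes the `DispatchKeySet` API introduced in this patch, in particular `has()` and `has_backend()`) of how a per-backend runtime key decomposes into its building-block bits:

```cpp
// Illustrative sketch: a runtime key such as DispatchKey::CPU is represented
// in a DispatchKeySet as a functionality bit (Dense) plus a backend bit
// (CPUBit), rather than as its own dedicated bit.
#include <c10/core/DispatchKeySet.h>

using namespace c10;

void classification_example() {
  // Category (4): a runtime, per-backend instance of a customizable functionality.
  DispatchKeySet dense_cpu_ks(DispatchKey::CPU);

  // The set reports both building-block keys (category 1) ...
  bool has_backend_bit = dense_cpu_ks.has_backend(BackendComponent::CPUBit);
  bool has_functionality_bit = dense_cpu_ks.has(DispatchKey::Dense);
  // ... and the runtime key they combine into.
  bool has_runtime_key = dense_cpu_ks.has(DispatchKey::CPU);

  (void)has_backend_bit;
  (void)has_functionality_bit;
  (void)has_runtime_key;
}
```

Because CPU is just Dense + CPUBit in this scheme, adding a new customizable backend costs one backend bit rather than one new bit per functionality.
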
+// // NOTE: Keep the list in sync with `DispatchKey` in tools/codegen/model.py -enum class DispatchKey : uint8_t { +enum class DispatchKey : uint16_t { + // ~~~~~~~~~~~~~~~~~~~~~~~~~~ UNDEFINED ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // - // This is not a "real" tensor id, but it exists to give us a "nullopt" + // This is not a "real" functionality, but it exists to give us a "nullopt" // element we can return for cases when a DispatchKeySet contains no elements. // You can think a more semantically accurate definition of DispatchKey is: // @@ -38,24 +117,31 @@ enum class DispatchKey : uint8_t { // this will get eliminated, but for now it's convenient) CatchAll = Undefined, - // ~~~~~~~~~~~~~~~~~~~~~~~~~~ BACKENDS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // - // A "backend" is colloquially used to refer to handlers for dispatch - // which actually implement the numerics of an operation in question. + // ~~~~~~~~~~~~~~~~~~~~~~~~~~ Functionality Keys ~~~~~~~~~~~~~~~~~~~~~~ // + // Every value in the enum (up to EndOfFunctionalityKeys) + // corresponds to an individual "functionality" that can be dispatched to. + // This is represented in the DispatchKeySet by assigning each of these enum + // values + // to each of the remaining (64 - len(BackendComponent)) bits. // - // Due to the nature of the enum, these backends are specified in - // an ordered way, but for most backends this order is not semantically - // meaningful (e.g., it's valid to reorder these backends without changing - // semantics). The only situation when backend ordering is meaningful - // is when the backend participates in multiple dispatch with another - // backend; e.g., CPU and SparseCPU (sparse must have - // higher priority). + // Most of these functionalities have a single handler assigned to them, + // making them "runtime keys". + // That map to a single slot in the runtime operator table. + // + // A few functionalities are allowed to be customizable per backend. + // See [Note: Per-Backend Functionality Dispatch Keys] for details. + + // See [Note: Per-Backend Functionality Dispatch Keys] + Dense, + + // Below are non-extensible backends. + // These are backends that currently don't have their own overrides for + // Autograd/Sparse/Quantized kernels, + // and we therefore don't waste space in the runtime operator table allocating + // space for them. + // If any of these backends ever need to customize, e.g., Autograd, then we'll + // need to add a DispatchKey::*Bit for them. - // Here are backends which you think of as traditionally specifying - // how to implement operations on some device. - CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp - CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp - HIP, // NB: I think this is not actually used, due to Note [Masquerading as - // CUDA] FPGA, // Xilinx support lives out of tree at // https://gitlab.com/pytorch-complex/vitis_kernels @@ -67,14 +153,8 @@ enum class DispatchKey : uint8_t { // - aten/src/ATen/test/extension_backend_test.cpp ORT, - XLA, // lives out of tree at https://github.com/pytorch/xla - MLC, // lives out of tree at https://github.com/pytorch/MLCompute Vulkan, Metal, - XPU, // For out of tree Intel's heterogeneous computing plug-in - HPU, // For out of tree & closed source integration of HPU / Habana - VE, // For out of tree & closed source integration of SX-Aurora / NEC - Lazy, // For lazy tensor backends // A meta tensor is a tensor without any data associated with it. 
(They // have also colloquially been referred to as tensors on the "null" device). @@ -83,11 +163,8 @@ enum class DispatchKey : uint8_t { // tensor with the output shape and dtype, but wouldn't actually add anything. Meta, - // Here are backends which specify more specialized operators - // based on the dtype of the tensor. - QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp - QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp - QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in + // See [Note: Per-Backend Functionality Dispatch Keys] + Quantized, // This backend is to support custom RNGs; it lets you go // to a different kernel if you pass in a generator that is not a @@ -106,30 +183,28 @@ enum class DispatchKey : uint8_t { // the corresponding dense tensors, and must be handled before them. MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp // NB: not to be confused with MKLDNN, which is Caffe2 only - SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp - SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp - SparseHIP, // TODO: I think this is not actually used, due to Note - // [Masquerading as CUDA] - SparseXPU, // For out of tree Intel's heterogeneous computing plug-in - SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC + + // See [Note: Per-Backend Functionality Dispatch Keys] + Sparse, SparseCsrCPU, SparseCsrCUDA, - NestedTensor, // lives out of tree at https://github.com/pytorch/nestedtensor - - // Here are reserved backends for user-defined backends, see Note [Private use - // DispatchKey] - // To see some example about how to use this, check out ORT - PrivateUse1, - PrivateUse2, - PrivateUse3, + // Note [Non-Customizable Backend Keys] + // Every key above here is considered a "non-customizable backend". + // These are backends that will work correctly with autograd, but + // but currently don't require separate implementations + // for autograd sparse or quantized kernels. + // Any new backends that don't need to be customized should go above here. + // If an existing backend needs to e.g. override autograd, then we can + // consider promoting it into the "BackendComponent" enum + // + // For all intents and purposes from the perspective of DispatchKeySet, + // "non-customizable backend" keys are treated the same way + // as other functionality keys + EndOfNonCustomizableBackends = SparseCsrCUDA, - // Define an alias key to represent end of backend dispatch keys. - // If you add new backend keys after PrivateUse3, please also update it here. - // (But you shouldn't: private use keys should have higher precedence than - // all built-in keys) - EndOfBackendKeys = PrivateUse3, + NestedTensor, // lives out of tree at https://github.com/pytorch/nestedtensor // In some situations, it is not immediately obvious what the correct // backend for function is, because the function in question doesn't @@ -233,20 +308,18 @@ enum class DispatchKey : uint8_t { // AutogradOther key. We can add specific autograd key for those backends // upon request. 
AutogradOther, - AutogradCPU, - AutogradCUDA, - AutogradXLA, - AutogradLazy, - AutogradXPU, - AutogradMLC, - AutogradHPU, - AutogradNestedTensor, // lives out of tree at + + // See [Note: Per-Backend Functionality Dispatch Keys] + AutogradFunctionality, + + // NestedTensor is an example of something that isn't a "real backend" + // (because it mostly consists of redispatching kernels) + // but it would like to override autograd functionality in C++. + // We can handle cases like this by adding an extra functionality key + // exclusively for handling autograd for NestedTensor. + // lives out of tree at // https://github.com/pytorch/nestedtensor - // Here are some reserved pre-autograd keys for user-defined backends, see - // Note [Private use DispatchKey] - AutogradPrivateUse1, - AutogradPrivateUse2, - AutogradPrivateUse3, + AutogradNestedTensor, Tracer, @@ -280,13 +353,16 @@ enum class DispatchKey : uint8_t { // we can consider adding separate keys dedicated to those individual passes. // See Note [Functionalization Pass In Core] for details. Functionalize, - FuncTorchDynamicLayerFrontMode, // See Note [Out-of-tree vmap+grad prototype] // Used by Python key logic to know the set of tls on entry to the dispatcher - // This kernel assumes it is at the very top of the dispatcher. If you add - // a key above, make sure to update the fallback implementation for this. + // This kernel assumes it is the top-most non-functorch-related DispatchKey. + // If you add a key above, make sure to update the fallback implementation for + // this. PythonTLSSnapshot, + // This key should be at the very top of the dispatcher + FuncTorchDynamicLayerFrontMode, // See Note [Out-of-tree vmap+grad prototype] + // TESTING: This is intended to be a generic testing tensor type id. // Don't use it for anything real; its only acceptable use is within a single // process test. Use it by creating a TensorImpl with this DispatchKey, and @@ -304,9 +380,104 @@ enum class DispatchKey : uint8_t { TESTING_ONLY_GenericMode, // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FIN ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // - NumDispatchKeys, // Sentinel, end of runtime keys. + EndOfFunctionalityKeys, // End of functionality keys. + + // ~~~~~~~~~~~~~~ "Dense" Per-Backend Dispatch keys ~~~~~~~~~~~~~~~~~~~~ // + // Here are backends which you think of as traditionally specifying + // how to implement operations on some device. + + // See Note [The Ordering of Per-Backend Dispatch Keys Matters!] 
+ StartOfDenseBackends, + CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp + CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp + HIP, // NB: I think this is not actually used, due to Note [Masquerading as + // CUDA] + XLA, // lives out of tree at https://github.com/pytorch/xla + MLC, // lives out of tree at https://github.com/pytorch/MLCompute + IPU, // lives out of tree at https://github.com/graphcore/poptorch + XPU, // For out of tree Intel's heterogeneous computing plug-in + HPU, // For out of tree & closed source integration of HPU / Habana + VE, // For out of tree & closed source integration of SX-Aurora / NEC + Lazy, // For lazy tensor backends + // Here are reserved backends for user-defined backends, see Note [Private use + // DispatchKey] + // To see some example about how to use this, check out ORT + PrivateUse1, + PrivateUse2, + PrivateUse3, + EndOfDenseBackends = PrivateUse3, + + // ~~~~~~~~~~~~~~ "Quantized" Per-Backend Dispatch keys ~~~~~~~~~~~~~~~~ // + // keys starting with an _ are not currently used, + // but are needed to ensure that every backend is indexed correctly. + + // See Note [The Ordering of Per-Backend Dispatch Keys Matters!] + StartOfQuantizedBackends, + QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp + QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp + _QuantizedHIP, + _QuantizedXLA, + _QuantizedMLC, + _QuantizedIPU, + QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in + _QuantizedHPU, + _QuantizedVE, + _QuantizedLazy, + _QuantizedPrivateUse1, + _QuantizedPrivateUse2, + _QuantizedPrivateUse3, + EndOfQuantizedBackends = _QuantizedPrivateUse3, + + // ~~~~~~~~~~~~~~ "Sparse" Per-Backend Dispatch keys ~~~~~~~~~~~~~~~~~~~ // + // keys starting with an _ are not currently used, + // but are needed to ensure that every backend is indexed correctly. + + // See Note [The Ordering of Per-Backend Dispatch Keys Matters!] + StartOfSparseBackends, + SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp + SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp + SparseHIP, // TODO: I think this is not actually used, due to Note + // [Masquerading as CUDA] + _SparseXLA, + _SparseMLC, + _SparseIPU, + SparseXPU, // For out of tree Intel's heterogeneous computing plug-in + _SparseHPU, + SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC + _SparseLazy, + _SparsePrivateUse1, + _SparsePrivateUse2, + _SparsePrivateUse3, + EndOfSparseBackends = _SparsePrivateUse3, + + // ~~~~~~~~~~~~~~ "Autograd" Per-Backend Dispatch keys ~~~~~~~~~~~~~~~~~ // + // keys starting with an _ are not currently used, + // but are needed to ensure that every backend is indexed correctly. + + // See Note [The Ordering of Per-Backend Dispatch Keys Matters!] + StartOfAutogradBackends, + AutogradCPU, + AutogradCUDA, + _AutogradHIP, + AutogradXLA, + AutogradMLC, + AutogradIPU, + AutogradXPU, + AutogradHPU, + _AutogradVE, + AutogradLazy, + // Here are some reserved pre-autograd keys for user-defined backends, see + // Note [Private use DispatchKey] + AutogradPrivateUse1, + AutogradPrivateUse2, + AutogradPrivateUse3, + EndOfAutogradBackends = AutogradPrivateUse3, + // If we add a new per-backend functionality key that has higher priority + // than Autograd, then this key should be updated. 
+ EndOfRuntimeBackendKeys = EndOfAutogradBackends, // ~~~~~~~~~~~~~~~~~~~~~~ Alias Dispatch Keys ~~~~~~~~~~~~~~~~~~~~~~~~~~ // + // Note [Alias Dispatch Keys] // Alias dispatch keys are synthetic dispatch keys which map to multiple // runtime dispatch keys. Alisa keys have precedence, but they are always // lower precedence than runtime keys. You can register a kernel to an @@ -326,6 +497,7 @@ enum class DispatchKey : uint8_t { // Define an alias key to represent end of alias dispatch keys. // If you add new alias keys after Autograd, please also update it here. + StartOfAliasKeys = Autograd, EndOfAliasKeys = CompositeExplicitAutograd, // // ~~~~~~~~~~~~~~~~~~~~~~~~~ BC ALIASES ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // @@ -365,54 +537,83 @@ enum class DispatchKey : uint8_t { // built-in autograd formulas for operators are not appropriate. static_assert( - static_cast(DispatchKey::NumDispatchKeys) <= 64, - "DispatchKey is used as index into 64-bit bitmask; you must have less than 64 entries"); + (static_cast(BackendComponent::EndOfBackendKeys) + + static_cast(DispatchKey::EndOfFunctionalityKeys)) <= 64, + "The BackendComponent and DispatchKey enums (below EndOfFunctionalityKeys)" + " both map to backend and functionality bits" + " into a 64-bit bitmask; you must have less than 64 total entries between them"); -#if defined(C10_MOBILE_TRIM_DISPATCH_KEYS) -/** - * The method below maps the dispatch key in the enum DispatchKey to an - * integer index in the dispatchTable_ array in OperatorEntry. The array - * is trimmed for mobile to reduce peak memory usage since it's - * unnecessary to reserve additional space for dispatch keys that will - * never be used on mobile. - */ -C10_API constexpr int getDispatchTableIndexForDispatchKey(DispatchKey dk) { - switch (dk) { - case DispatchKey::Undefined: - return 0; - case DispatchKey::CPU: - return 1; - case DispatchKey::QuantizedCPU: - return 2; - case DispatchKey::SparseCPU: - return 3; - case DispatchKey::BackendSelect: - return 4; - case DispatchKey::ADInplaceOrView: - return 5; - case DispatchKey::AutogradOther: - return 6; - case DispatchKey::AutogradCPU: - return 7; - case DispatchKey::NumDispatchKeys: // Sentinel, end of runtime keys. - return 8; - default: - return -1; +// Check if a DispatchKey is an alias mapping to other runtime keys. +constexpr bool isAliasDispatchKey(DispatchKey k) { + return k >= DispatchKey::StartOfAliasKeys && k <= DispatchKey::EndOfAliasKeys; +} + +// [Note: Per-Backend Functionality Dispatch Keys] +// Check if a DispatchKey is a per-backend functionality key +// Any functionalities that can be customized per-backend should be added here. +// These keys correspond to functionalities that can be customized indivually +// per backend. While they only take up one bit in the `DispatchKeySet` bitset, +// they map to (# backends) slots in the operator table. +// Each of these keys also has a separate set of "runtime keys" in the dispatch +// key enum, per backend, which *do* map to the individual operator table slots. +// For example, the "Sparse" key maps to an individual bit in the +// DispatchKeySet, while `SparseCPU`, `SparseCUDA`, etc all map to individual +// slots in the runtime operator table. + +constexpr bool isPerBackendFunctionalityKey(DispatchKey k) { + if (k == DispatchKey::Dense || k == DispatchKey::Quantized || + k == DispatchKey::Sparse || k == DispatchKey::AutogradFunctionality) { + return true; + } else { + return false; } } -#else -/** - * For the server use-case, make this a simple pass-through. 
- */ -C10_API constexpr int getDispatchTableIndexForDispatchKey(DispatchKey dk) { - return static_cast(dk); + +// Note that this includes Undefined in the total count. +// BUT EndOfFunctionalityKeys is its own (placeholder) key. +// e.g. Undefined=0, Dense=1, Sparse=2, EndOfFunctionalityKeys=3. +// In the above example, there are 3 total functionality keys. +constexpr uint8_t num_functionality_keys = + static_cast(DispatchKey::EndOfFunctionalityKeys); + +constexpr uint8_t num_backends = + static_cast(BackendComponent::EndOfBackendKeys); + +// Note [No More Than 16 Backends] +// Search for this note to find places in the code where the "no more than 16 +// backends" invariant is baked in. +static_assert( + static_cast(BackendComponent::EndOfBackendKeys) <= 16, + "BackendComponent currently only supports <= 16 backends. If we really need to extend this, \ +there are a few places where this invariant is baked in"); + +constexpr uint8_t numPerBackendFunctionalityKeys() { + uint8_t count = 0; + for (uint8_t k = 0; k <= num_functionality_keys; ++k) { + if (isPerBackendFunctionalityKey(static_cast(k))) + ++count; + } + return count; } + +#if defined(C10_MOBILE_TRIM_DISPATCH_KEYS) +// See [Note: Trimmed Mobile Dispatch Keys] +constexpr uint16_t num_runtime_entries = 8; +#else +constexpr uint16_t num_runtime_entries = num_functionality_keys + + (numPerBackendFunctionalityKeys() * (num_backends - 1)); #endif +// See Note [No More Than 16 Backends] +constexpr uint16_t full_backend_mask = + (static_cast(1) << num_backends) - 1; + C10_API const char* toString(DispatchKey); +C10_API const char* toString(BackendComponent); C10_API std::ostream& operator<<(std::ostream&, DispatchKey); +C10_API std::ostream& operator<<(std::ostream&, BackendComponent); -C10_API DispatchKey getAutogradKeyFromBackend(DispatchKey t); +C10_API DispatchKey getAutogradKeyFromBackend(BackendComponent k); // Parses a string into a dispatch key. // If the string cannot be correctly parsed, throws an exception. @@ -425,10 +626,86 @@ C10_API c10::DispatchKey parseDispatchKey(const std::string& k); // torch::dispatch(torch::kCPU, ...) is also valid. constexpr DispatchKey kAutograd = DispatchKey::Autograd; -// Check if a DispatchKey is an alias mapping to other runtime keys. -inline bool isAliasDispatchKey(DispatchKey k) { - return k > DispatchKey::NumDispatchKeys && k <= DispatchKey::EndOfAliasKeys; +// See Note [The Ordering of Per-Backend Dispatch Keys Matters!] +// This function relies on the invariant that the dispatch keys between +// StartOfDenseBackends and EndOfRuntimeBackendKeys are ordered by backend +// in the same order as `BackendComponent`. 
+constexpr BackendComponent toBackendComponent(DispatchKey k) { + if (k >= DispatchKey::StartOfDenseBackends && + k <= DispatchKey::EndOfDenseBackends) { + return static_cast( + static_cast(k) - + static_cast(DispatchKey::StartOfDenseBackends)); + } else if ( + k >= DispatchKey::StartOfQuantizedBackends && + k <= DispatchKey::EndOfQuantizedBackends) { + return static_cast( + static_cast(k) - + static_cast(DispatchKey::StartOfQuantizedBackends)); + } else if ( + k >= DispatchKey::StartOfSparseBackends && + k <= DispatchKey::EndOfSparseBackends) { + return static_cast( + static_cast(k) - + static_cast(DispatchKey::StartOfSparseBackends)); + } else if ( + k >= DispatchKey::StartOfAutogradBackends && + k <= DispatchKey::EndOfAutogradBackends) { + return static_cast( + static_cast(k) - + static_cast(DispatchKey::StartOfAutogradBackends)); + } else { + return BackendComponent::InvalidBit; + } } + +constexpr DispatchKey toFunctionalityKey(DispatchKey k) { + if (k <= DispatchKey::EndOfFunctionalityKeys) { + return k; + } else if (k <= DispatchKey::EndOfDenseBackends) { + return DispatchKey::Dense; + } else if (k <= DispatchKey::EndOfQuantizedBackends) { + return DispatchKey::Quantized; + } else if (k <= DispatchKey::EndOfSparseBackends) { + return DispatchKey::Sparse; + } else if (k <= DispatchKey::EndOfAutogradBackends) { + return DispatchKey::AutogradFunctionality; + } else { + return DispatchKey::Undefined; + } +} + +// Given (DispatchKey::Dense, DispatchKey::CUDABit), returns DispatchKey::CUDA +// See Note [The Ordering of Per-Backend Dispatch Keys Matters!] +// This function relies on the invariant that the dispatch keys between +// StartOfDenseBackends and EndOfRuntimeBackendKeys are ordered by backend +// in the same order as `BackendComponent`. +constexpr DispatchKey toRuntimePerBackendFunctionalityKey( + DispatchKey functionality_k, + BackendComponent backend_k) { + if (functionality_k == DispatchKey::Dense) { + return static_cast( + static_cast(DispatchKey::StartOfDenseBackends) + + static_cast(backend_k)); + } + if (functionality_k == DispatchKey::Sparse) { + return static_cast( + static_cast(DispatchKey::StartOfSparseBackends) + + static_cast(backend_k)); + } + if (functionality_k == DispatchKey::Quantized) { + return static_cast( + static_cast(DispatchKey::StartOfQuantizedBackends) + + static_cast(backend_k)); + } + if (functionality_k == DispatchKey::AutogradFunctionality) { + return static_cast( + static_cast(DispatchKey::StartOfAutogradBackends) + + static_cast(backend_k)); + } + return DispatchKey::Undefined; +} + } // namespace c10 namespace torch { diff --git a/c10/core/DispatchKeySet.cpp b/c10/core/DispatchKeySet.cpp index 7f85567f886f6b..d36e43513d4783 100644 --- a/c10/core/DispatchKeySet.cpp +++ b/c10/core/DispatchKeySet.cpp @@ -1,37 +1,30 @@ #include +#include +#include namespace c10 { -// backend_dispatch_keyset should include all runtime backend keys. +// backend_dispatch_keyset includes all dispatch keys that map to backends. // Alias key DispatchKey::CompositeExplicitAutograd maps to -// backend_dispatch_keyset NestedTensor has been explicitly removed due to -// incompatibility with some kernels, such as structured kernels, that use the -// DefaultBackend key. 
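
As a hedged illustration of the ordering invariant described above (assuming the `constexpr` helpers `toBackendComponent`, `toFunctionalityKey`, and `toRuntimePerBackendFunctionalityKey` introduced in this hunk), a runtime per-backend key can be split into and re-assembled from its two building blocks:

```cpp
// Illustrative sketch only; relies on the helpers defined above.
#include <c10/core/DispatchKey.h>

using c10::BackendComponent;
using c10::DispatchKey;

// SparseCUDA is the CUDA "instance" of the per-backend Sparse functionality:
static_assert(
    c10::toBackendComponent(DispatchKey::SparseCUDA) == BackendComponent::CUDABit,
    "");
static_assert(
    c10::toFunctionalityKey(DispatchKey::SparseCUDA) == DispatchKey::Sparse,
    "");
// Recombining the two building blocks yields the runtime key again:
static_assert(
    c10::toRuntimePerBackendFunctionalityKey(
        DispatchKey::Sparse, BackendComponent::CUDABit) == DispatchKey::SparseCUDA,
    "");
```

This round-trip only works because each per-backend block of the DispatchKey enum (Dense, Quantized, Sparse, Autograd) lists its backends in exactly the same order as the BackendComponent enum.
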
-constexpr DispatchKeySet backend_dispatch_keyset = autogradother_backends | - DispatchKeySet({ - DispatchKey::CPU, - DispatchKey::CUDA, - DispatchKey::XLA, - DispatchKey::Lazy, - DispatchKey::XPU, - DispatchKey::PrivateUse1, - DispatchKey::PrivateUse2, - DispatchKey::PrivateUse3, - DispatchKey::MLC, - DispatchKey::HPU, - DispatchKey::ORT, - DispatchKey::Meta, - }); +// backend_dispatch_keyset +constexpr DispatchKeySet backend_dispatch_keyset = + autogradother_backends | DispatchKeySet(DispatchKey::Dense); bool isBackendDispatchKey(DispatchKey t) { return t != DispatchKey::Undefined // See Note [No Alias Keys in DispatchKeySet] - && !isAliasDispatchKey(t) && backend_dispatch_keyset.has(t); + && !isAliasDispatchKey(t) + // Note [NestedTensor Not Included in Backend Keys] + // NestedTensor has been explicitly removed from the "backend keyset" due + // to incompatibility with some kernels, so we don't want it to be + // included in CompositeImplicitAutograd or CompositeExplicitAutograd + // kernels. + && t != DispatchKey::NestedTensor && backend_dispatch_keyset.has(t); } // math_dispatch_keyset contains all keys in backend_dispatch_keyset and // autograd_dispatch_keyset Alias key DispatchKey::CompositeImplicitAutograd -// maps to math_dispatch_keyset. +// maps to [math_dispatch_keyset x full_backend_mask] constexpr DispatchKeySet math_dispatch_keyset = backend_dispatch_keyset | autograd_dispatch_keyset; @@ -39,7 +32,12 @@ DispatchKeySet getRuntimeDispatchKeySet(DispatchKey t) { TORCH_INTERNAL_ASSERT(t != DispatchKey::Undefined); switch (t) { case DispatchKey::Autograd: - return autograd_dispatch_keyset; + // See Note [autograd_dispatch_keyset Does Not Include Backend Bits] + // That's why we OR it with a mask of the backend bits here. + // getRuntimeDispatchKeySet() expects to return a keyset of runtime + // dispatch keys, like AutogradCPU, but that requires having backend bits. 
+ return autograd_dispatch_keyset | + DispatchKeySet(DispatchKeySet::RAW, full_backend_mask); case DispatchKey::CompositeImplicitAutograd: return math_dispatch_keyset; case DispatchKey::CompositeExplicitAutograd: @@ -53,11 +51,13 @@ bool runtimeDispatchKeySetHas(DispatchKey t, DispatchKey k) { TORCH_INTERNAL_ASSERT(t != DispatchKey::Undefined); switch (t) { case DispatchKey::Autograd: - return autograd_dispatch_keyset.has(k); + return autograd_dispatch_keyset.has(toFunctionalityKey(k)); case DispatchKey::CompositeImplicitAutograd: - return math_dispatch_keyset.has(k); + // See Note [NestedTensor Not Included in Backend Keys] + return k != DispatchKey::NestedTensor && math_dispatch_keyset.has(k); case DispatchKey::CompositeExplicitAutograd: - return backend_dispatch_keyset.has(k); + // See Note [NestedTensor Not Included in Backend Keys] + return k != DispatchKey::NestedTensor && backend_dispatch_keyset.has(k); default: return t == k; } @@ -79,8 +79,8 @@ DispatchKeySet getBackendKeySetFromAutograd(DispatchKey t) { return DispatchKeySet(DispatchKey::MLC); case DispatchKey::AutogradHPU: return DispatchKeySet(DispatchKey::HPU); - case DispatchKey::AutogradNestedTensor: - return DispatchKeySet(DispatchKey::NestedTensor); + case DispatchKey::AutogradIPU: + return DispatchKeySet(DispatchKey::IPU); case DispatchKey::AutogradXPU: return DispatchKeySet(DispatchKey::XPU); case DispatchKey::AutogradPrivateUse1: @@ -96,23 +96,6 @@ DispatchKeySet getBackendKeySetFromAutograd(DispatchKey t) { } } -DispatchKeySet getAutocastRelatedKeySetFromBackend(DispatchKey t) { - switch (t) { - case DispatchKey::CPU: - return DispatchKeySet(DispatchKey::AutocastCPU); - case DispatchKey::CUDA: - case DispatchKey::XLA: - return DispatchKeySet(DispatchKey::AutocastCUDA); - default: - return DispatchKeySet(); - } -} - -DispatchKeySet getAutogradRelatedKeySetFromBackend(DispatchKey t) { - return DispatchKeySet( - {DispatchKey::ADInplaceOrView, getAutogradKeyFromBackend(t)}); -} - bool isIncludedInAlias(DispatchKey k, DispatchKey alias) { return k != DispatchKey::Undefined && runtimeDispatchKeySetHas(alias, k); } @@ -129,18 +112,135 @@ std::ostream& operator<<(std::ostream& os, DispatchKeySet ts) { return os; } os << "DispatchKeySet("; - DispatchKey tid; bool first = true; - while ((tid = ts.highestPriorityTypeId()) != DispatchKey::Undefined) { + for (auto k : ts) { if (!first) { os << ", "; } - os << tid; - ts = ts.remove(tid); + os << k; first = false; } os << ")"; return os; } +DispatchKeySet::iterator& DispatchKeySet::iterator::operator++() { + TORCH_INTERNAL_ASSERT(next_functionality_ <= iterator::end_iter_mask_val); + TORCH_INTERNAL_ASSERT(next_backend_ <= num_backends, next_backend_); + + // Create a masked version of the set representation to ignore previous + // keys that we've iterated through. 
+ uint64_t masked_functionality_bits = + llvm::maskTrailingZeros(next_functionality_) & *data_ptr_; + uint64_t masked_backend_bits = + llvm::maskTrailingZeros(next_backend_) & full_backend_mask & + *data_ptr_; + + uint64_t first_functionality_idx = + llvm::findFirstSet(masked_functionality_bits); + uint64_t first_backendcomponent_idx = llvm::findFirstSet(masked_backend_bits); + + // If there are no keys, set to end iterator value + if (first_functionality_idx == std::numeric_limits::max() || + next_functionality_ == iterator::end_iter_mask_val) { + // Set up state to be the same as end() + next_functionality_ = iterator::end_iter_mask_val; + current_dispatchkey_idx_ = iterator::end_iter_key_val; + next_backend_ = 0; + current_backendcomponent_idx_ = iterator::end_iter_key_val; + return *this; + } + + // The +1 is because of DispatchKey::Undefined and + // BackendComponent::InvalidBit + auto new_next_functionality = first_functionality_idx + 1; + auto new_backendcomponent_idx = first_backendcomponent_idx + 1; + // and the -num_backends is because the first bits in the + // keyset are not Dispatch Keys. + auto next_dispatchkey_idx = new_next_functionality - num_backends; + + // If the current functionality bit is a per-backend bit, we need special + // handling + if (isPerBackendFunctionalityKey( + static_cast(next_dispatchkey_idx))) { + // case 1: if the current backend is undefined, then there is no valid + // backend instance of this functionality key so we can skip it. + if (first_backendcomponent_idx == std::numeric_limits::max()) { + // increment the functionality mask so we skip the current functionality + // bit on the next increment. + next_functionality_ = new_next_functionality; + ++(*this); + return *this; + } + + // Otherwise, at this point we know what the current backend and + // functionality bits are. + current_dispatchkey_idx_ = next_dispatchkey_idx; + current_backendcomponent_idx_ = new_backendcomponent_idx; + + // Next, we need to set up the masks for the next increment. + uint64_t next_backendcomponent_bits = + llvm::maskTrailingZeros(first_backendcomponent_idx + 1) & + full_backend_mask & *data_ptr_; + uint64_t next_backendcomponent_idx = + llvm::findFirstSet(next_backendcomponent_bits); + if (next_backendcomponent_idx == std::numeric_limits::max()) { + // case 2: the current backend is valid, but there is not another backend + // in the keyset. In this case, we need to bump the functionality mask and + // reset the backend mask for the next increment + next_functionality_ = new_next_functionality; + next_backend_ = 0; + } else { + // case 3: we have another backend to iterate over. We want to iterate + // over the same functionality bit next time, but a different backend bit. + next_backend_ = first_backendcomponent_idx + 1; + } + } else { + // Functionality bits that aren't per backend are simpler to handle. We can + // ignore the backend bits. + TORCH_INTERNAL_ASSERT(next_backend_ == 0); + current_dispatchkey_idx_ = next_dispatchkey_idx; + next_functionality_ = new_next_functionality; + } + return *this; +} + +std::array +initializeFunctionalityOffsetsAndMasks() { + std::array + offsets_and_masks; + // manualy set the first entry, which corresponds to Undefined. + offsets_and_masks[0] = FunctionalityOffsetAndMask(0, 0); + // loop through every functionality key (aside from Undefined). + for (const auto functionality_idx : c10::irange(1, num_functionality_keys)) { + // functionality_idx should be Dense -> 1, ... 
+ auto prev_offset_and_mask = offsets_and_masks[functionality_idx - 1]; + auto k = static_cast(functionality_idx); + + // If the previous functionality was not per-backend, then we can just + // increment the previous offset. Otherwise, the next offset = + // previous_offset + num_backends. + auto next_offset = prev_offset_and_mask.offset + + (prev_offset_and_mask.mask == 0 ? 1 : num_backends); + // the mask is used in the runtime index calculation to find the offset of + // the backend. For non-per-backend functionalities, this offset should + // always be 0. Otherwise, we need to get the index of the backend (which we + // can do using a backend mask). + auto next_mask = isPerBackendFunctionalityKey(k) ? full_backend_mask : 0; + offsets_and_masks[functionality_idx] = + FunctionalityOffsetAndMask(next_offset, next_mask); + } + // Sanity check that the computed offset index of the last functionality key + // is correct. This assumes that the highest priority functionality key is not + // per backend. + TORCH_INTERNAL_ASSERT( + offsets_and_masks[num_functionality_keys - 1].offset == + (num_runtime_entries - 1), + "num_runtime_entries: ", + num_runtime_entries, + "last_offset: ", + offsets_and_masks[num_functionality_keys - 1].offset); + return offsets_and_masks; +} + } // namespace c10 diff --git a/c10/core/DispatchKeySet.h b/c10/core/DispatchKeySet.h index 79d39652219b51..0e631061411dd0 100644 --- a/c10/core/DispatchKeySet.h +++ b/c10/core/DispatchKeySet.h @@ -1,5 +1,4 @@ #pragma once - #include #include #include @@ -8,29 +7,147 @@ namespace c10 { +struct FunctionalityOffsetAndMask { + // empty constructor shouldn't be used; only needed to initialize + // the array before populating it. + FunctionalityOffsetAndMask() {} + FunctionalityOffsetAndMask(uint16_t offset, uint16_t mask) + : offset(offset), mask(mask) {} + // This needs to big enough to cover the size of the operator table. + uint16_t offset; + // See Note [No More Than 16 Backends] + // This mask needs to be big enough to mask all of the backend bits. + // We probably don't ever want to have more than 16 backend bits, so uint16_t + // should be enough. + uint16_t mask; +}; +static_assert( + c10::num_runtime_entries < 65536, + "The dispatcher currently only supports up to 2^16 runtime entries"); + +C10_API std::array +initializeFunctionalityOffsetsAndMasks(); + +C10_ALWAYS_INLINE static const std:: + array& + offsetsAndMasks() { + static auto offsets_and_masks_ = initializeFunctionalityOffsetsAndMasks(); + return offsets_and_masks_; +} + +// A representation of a set of DispatchKeys. A DispatchKeySet contains both +// "functionality" bits and "backend bits", and every tensor holds its own +// DispatchKeySet. The Dispatcher implements multiple dispatch by grabbing the +// keyset on every input tensor, or’ing them together, and dispatching to a +// specific piece of functionality. The functionality bits are *ordered*. When +// multiple functionality bits are set, we use the highest priority +// functionality. Similarly, multiple backend bits can theoretically be set if +// you call an operator with multiple tensors from difference devices (e.g. CPU +// and CUDA), although support for mixed device dispatch is limited (the only +// kernels that gracefully handle mixed device inputs for now are cuda kernels +// that take in a scalar cpu tensor). + // A representation of a set of DispatchKeys. 
A tensor may have multiple // tensor type ids, e.g., a Variable tensor can also be a CPU tensor; the // DispatchKeySet specifies what type ids apply. The internal representation is // as a 64-bit bit set (this means only 64 tensor type ids are supported). // -// Note that DispatchKeys are ordered; thus, we can ask questions like "what is -// the highest priority DispatchKey in the set"? (The set itself is not -// ordered; two sets with the same ids will always have the ids ordered in the -// same way.) +// As mentioned above, DispatchKeys are ordered; thus, we can ask questions like +// "what is the highest priority DispatchKey in the set"? (The set itself is +// not ordered; two sets with the same ids will always have the ids ordered in +// the same way.) +// +// Note [DispatchKeySet Internal Representation] +// Internally, dispatch keys are packed into 64-bit DispatchKeySet objects +// that get passed around at runtime. +// However, there isn't necessarily a 1-to-1 mapping between bits in the keyset +// and individual dispatch keys. +// +// First: why do we have this distinction, and why not map every dispatch key +// directly to a bit? This is mostly because we have several types of +// functionalities that different backends would like to customize. For example, +// we have: +// - "Dense": CPU, CUDA, XLA, ... (~12 keys) +// - "Sparse": SparseCPU, SparseCUDA, ... +// - "Quantized": QuantizedCPU, QuantizedCUDA, QuantizedXLA, ... +// - "Autograd": AutogradCPU, AutogradCUDA, Autograd XLA, ... +// The problem is that total number of keys grows quadratically with [# +// backends] x [# functionalities], making it very difficult to map each key +// directly to a bit in a bitset without dramatically increasing the size of the +// bitset over time. +// +// The two enums (BackendComponent and DispatchKey) can be divided roughly into +// 5 categories. +// +// (1) "Building block" keys +// (a) backends: jEverything in the BackendComponent enum (e.g. CPUBit, +// CUDABIt) (b) functionalities: (per-backend) functionality-bit DispatchKeys +// (e.g. AutogradFunctionality, Sparse, Dense) +// (2) "Runtime" keys +// (a) "non-customizable backends" (e.g. FPGA) +// (b) "non-customizable functionalities" (e.g. Functionalize) +// (c) "per-backend instances of customizable functionalities" (e.g. CPU, +// SparseCPU, AutogradCPU) +// (3) "Alias" DispatchKeys (see Note [Alias Dispatch Keys]) +// +// (1) Building block keys always correspond to individual bits in a +// DispatchKeySet. They can also be combined in a DispatchKeySet to form actual +// runtime keys. e.g. +// auto dense_cpu_ks = DispatchKeySet({DispatchKey::CPUBit, +// DispatchKey::Dense}); +// // The keyset has the runtime dense-cpu key. +// dense_cpu_ks.has(DispatchKey::CPU); +// // And it contains the building block keys too. +// dense_cpu_ks.has(DispatchKey::CPUBit); +// dense_cpu_ks.has(DispatchKey::Dense); +// +// Not every backend and not every functionality counts as a "building block +// key". This is mostly to give us more levers to pull in the design space. +// Backend keys and functionality keys that count as "building blocks" will +// contribute to a full cross product of functionality that can be overriden. // -// At the moment, there are no nontrivial uses of this set; tensors are always -// singletons. In the near future, this set will represent variable? + tensor -// type id. In the far future, it will be requires grad? + profiling? + -// tracing? + lazy? + tensor type id. 
+// For example, right now we have at least 12 "backend" building blocks (CPU, +// CUDA, XLA, ...) and at least 4 "functionality" building blocks (Dense, +// Sparse, Quantized, AutogradFunctionality, ...). These keys together allow +// every dispatcher operator to be customized in up to 12*4 different ways. Each +// of those requires a slot in the operator table of every dispatcher operator. +// Not every piece of functionality necessarily needs to be customizeable +// per-backend, and not every backend necessarily needs to be able to customize +// every type of functionality. // -// (The difference between variable and requires grad, is that -// there are currently three states a tensor can be: -// 1. Not a variable -// 2. Variable with requires_grad=False -// 3. Variable with requires_grad=True -// Eventually, we want to kill state (1), and only dispatch to autograd -// handling code if one of the inputs requires grad.) // +// (2) Every runtime key corresponds directly to a slot in an operator's runtime +// dispatch table, and you can directly register kernels to a runtime dispatch +// key. +// +// For per-backend functionalities like "Dense" or "AutogradFunctionality", +// you can think of the corresponding runtime dispatch keys as "instances" of +// that functionality, per backend. E.g. "CPU", "CUDA", "XLA", etc. are all +// runtime instances of the "Dense" building block key. + +// (2a) and (2b) are represented identically in the DispatchKeySet logic: +// - backend-agnostic functionalities (e.g. FuncTorchBatched) are NOT +// customizeable per backend. +// In order to do so, we'd need to promote it to a per-backend functionality +// "building block" key. +// - non-customizeable backends (e.g. FPGA) can NOT customize existing +// functionality like Sparse, Autograd, etc. +// In order to do so, we'd need to promote it to a backend "building block" +// key. +// +// In both cases, these keys directly correspond to runtime slots in the +// operator table. +// +// +// (3) "Alias" keys +// See Note [Alias Dispatch Keys] +// +// Final note: for anyone making future changes to the Dispatcher + +// DispatchKeySet internals, there's a closed PR with a basic +// python-implementation of the Dispatcher that might be useful in quickly +// testing out and validating changes. See it at +// https://github.com/pytorch/pytorch/pull/68743 + // An undefined tensor is one with an empty tensor type set. class DispatchKeySet final { public: @@ -41,29 +158,146 @@ class DispatchKeySet final { // NB: default constructor representation as zero is MANDATORY as // use of DispatchKeySet in TLS requires this. constexpr DispatchKeySet() : repr_(0) {} + constexpr DispatchKeySet(Full) - : repr_(std::numeric_limits::max()) {} + : repr_((1ULL << (num_backends + num_functionality_keys - 1)) - 1) {} + constexpr DispatchKeySet(FullAfter, DispatchKey t) // LSB after t are OK, but not t itself. - : repr_((1ULL << (static_cast(t) - 1)) - 1) {} + // "functionalities" have a notion of ordering (e.g. Autograd > Sparse > + // Quantized > Dense). But backends don't really have an ordering. + // Therefore, we're enforcing that FullAfter can only be used on + // "functionality" keys. + : repr_( + (1ULL + << (num_backends + static_cast(toFunctionalityKey(t)) - + 1)) - + 1) {} + // Public version of DispatchKeySet(uint64_t) API; external users // must be explicit when they do this! constexpr DispatchKeySet(Raw, uint64_t x) : repr_(x) {} - explicit constexpr DispatchKeySet(DispatchKey t) - : repr_( - t == DispatchKey::Undefined - ? 
0 - : 1ULL << (static_cast(t) - 1)) {} - explicit constexpr DispatchKeySet(std::initializer_list ks) - : repr_(0) { + + constexpr explicit DispatchKeySet(BackendComponent k) { + if (k == BackendComponent::InvalidBit) { + repr_ = 0; + } else { + repr_ = 1ULL << (static_cast(k) - 1); + } + } + + constexpr explicit DispatchKeySet(DispatchKey k) { + if (k == DispatchKey::Undefined) { + // Case 1: handle Undefined specifically + repr_ = 0; + } else if (k <= DispatchKey::EndOfFunctionalityKeys) { + // Case 2: handle "functionality-only" keys + // These keys have a functionality bit set, but no backend bits + // These can technically be either: + // - valid runtime keys (e.g. DispatchKey::AutogradOther, + // DispatchKey::FuncTorchBatched, etc) + // - "building block" keys that aren't actual runtime keys (e.g. + // DispatchKey::Dense or Sparse) + uint64_t functionality_val = 1ULL + << (num_backends + static_cast(k) - 1); + repr_ = functionality_val; + } else if (k <= DispatchKey::EndOfRuntimeBackendKeys) { + // Case 3: "runtime" keys that have a functionality bit AND a backend bit. + // First compute which bit to flip for the functionality. + auto functionality_k = toFunctionalityKey(k); + // The - 1 is because Undefined is technically a "functionality" that + // doesn't show up in the bitset. So e.g. Dense is technically the second + // functionality, but the lowest functionality bit. + uint64_t functionality_val = 1ULL + << (num_backends + static_cast(functionality_k) - 1); + + // then compute which bit to flip for the backend + // Case 4a: handle the runtime instances of "per-backend functionality" + // keys For example, given DispatchKey::CPU, we should set: + // - the Dense functionality bit + // - the CPUBit backend bit + // first compute which bit to flip for the backend + auto backend_k = toBackendComponent(k); + uint64_t backend_val = backend_k == BackendComponent::InvalidBit + ? 0 + : 1ULL << (static_cast(backend_k) - 1); + repr_ = functionality_val + backend_val; + } else { + // At this point, we should have covered every case except for alias keys. + // Technically it would be possible to add alias dispatch keys to a + // DispatchKeySet, but the semantics are a little confusing and this + // currently isn't needed anywhere. + repr_ = 0; + } + } + + constexpr uint64_t keys_to_repr(std::initializer_list ks) { + uint64_t repr = 0; for (auto k : ks) { - repr_ |= DispatchKeySet(k).repr_; + repr |= DispatchKeySet(k).repr_; } + return repr; } + + constexpr uint64_t backend_bits_to_repr( + std::initializer_list ks) { + uint64_t repr = 0; + for (auto k : ks) { + repr |= DispatchKeySet(k).repr_; + } + return repr; + } + + explicit constexpr DispatchKeySet(std::initializer_list ks) + : repr_(keys_to_repr(ks)) {} + + explicit constexpr DispatchKeySet(std::initializer_list ks) + // Note: for some reason, putting this logic directly in the constructor + // appears to fail to compile on CUDA 10.1. 
+ // See an example internal failure at + // https://www.internalfb.com/intern/skycastle/run/76561193669136035/artifact/actionlog.76561193742069401.stderr + : repr_(backend_bits_to_repr(ks)) {} + // Test if a DispatchKey is in the set - bool inline has(DispatchKey t) const { + inline bool has(DispatchKey t) const { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(t != DispatchKey::Undefined); - return static_cast(repr_ & DispatchKeySet(t).repr_); + return has_all(DispatchKeySet(t)); + } + constexpr bool has_backend(BackendComponent t) const { + return has_all(DispatchKeySet(t)); + } + + // Test if a DispatchKey is in the set + // Given a DispatchKeySet of functionality keys and (potentially) backend + // keys, tests if all of them are in the current set. + constexpr bool has_all(DispatchKeySet ks) const { + return static_cast((repr_ & ks.repr_) == ks.repr_); + } + + // Given a DispatchKeySet of functionality keys and (potentially) backend + // keys, tests if any of them are in the current set. This could technically + // be pretty easily implemented using has(). It is strictly a perf + // optimization though. There are many places in the code base where we want + // to test for multiple functionality keys together. HOWEVER, runtime + // per-backend functionality keys aren't allowed to be used with this + // function, because you can end up with weird results. e.g. + // DispatchKeySet(DispatchKey::AutogradCPU).has_any(DispatchKeySet(DispatchKey::CPU)) + // would return true. + inline bool has_any(DispatchKeySet ks) const { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + // Either there are no backend bits in the input keyset + ((ks.repr_ & full_backend_mask) == 0) || + // or there are no per-backend-functionality bits + // See [Note: Per-Backend Functionality Dispatch Keys] + ((ks & + DispatchKeySet({ + DispatchKey::Dense, + DispatchKey::Quantized, + DispatchKey::Sparse, + DispatchKey::AutogradFunctionality, + }) + .repr_) == 0)); + return static_cast((repr_ & ks.repr_) != 0); } // Test if DispatchKeySet is a superset of ks. bool isSupersetOf(DispatchKeySet ks) const { @@ -74,31 +308,64 @@ class DispatchKeySet final { return DispatchKeySet(repr_ | other.repr_); } // Perform set intersection - DispatchKeySet operator&(DispatchKeySet other) const { + constexpr DispatchKeySet operator&(DispatchKeySet other) const { return DispatchKeySet(repr_ & other.repr_); } - // Compute the set difference self - other + // Compute the set difference self - other, + // but ONLY for the functionality keys. + // Any backend bits set on self will remain unchanged. + // See Note [Removing keys from DispatchKeySet Only Affects Functionality + // Keys] DispatchKeySet operator-(DispatchKeySet other) const { - return DispatchKeySet(repr_ & ~other.repr_); + return DispatchKeySet(repr_ & (full_backend_mask | ~other.repr_)); } + // Compute self ^ other constexpr DispatchKeySet operator^(DispatchKeySet other) const { return DispatchKeySet(repr_ ^ other.repr_); } - // Perform set equality bool operator==(DispatchKeySet other) const { return repr_ == other.repr_; } + bool operator!=(DispatchKeySet other) const { + return repr_ != other.repr_; + } // Add a DispatchKey to the DispatchKey set. Does NOT mutate, // returns the extended DispatchKeySet! C10_NODISCARD DispatchKeySet add(DispatchKey t) const { return *this | DispatchKeySet(t); } - // Remove a DispatchKey from the DispatchKey set. 
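// Illustrative sketch of the has_all / has_any semantics above (assumes the
// c10 headers in this patch): has_all needs every queried bit, has_any needs
// one. Runtime per-backend keys such as DispatchKey::CPU must not be passed
// to has_any, because a backend bit shared with another key (the documented
// AutogradCPU / CPU case) would give a misleading answer.
#include <c10/core/DispatchKeySet.h>
#include <cassert>

void membership_example() {
  c10::DispatchKeySet ks(
      {c10::DispatchKey::CPU, c10::DispatchKey::AutogradCUDA});

  // has_all: every bit of the argument must be present.
  assert(ks.has_all(c10::DispatchKeySet(c10::DispatchKey::Dense)));
  assert(!ks.has_all(c10::DispatchKeySet(
      {c10::DispatchKey::Dense, c10::DispatchKey::Sparse})));

  // has_any: one matching bit suffices. Functionality building blocks are
  // fine to query with, since they carry no backend bits.
  assert(ks.has_any(c10::DispatchKeySet(
      {c10::DispatchKey::Sparse, c10::DispatchKey::AutogradFunctionality})));
}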
This is - // generally not an operation you should be doing (it's - // used to implement operator<<) - C10_NODISCARD constexpr DispatchKeySet remove(DispatchKey t) const { - return DispatchKeySet(repr_ & ~DispatchKeySet(t).repr_); + C10_NODISCARD DispatchKeySet add(DispatchKeySet ks) const { + return *this | ks; + } + + // Remove a DispatchKey from the DispatchKey set. + // This is generally not an operation you should be doing + // (it's used to implement the printing overload, operator<<) + // + // Note [Removing keys from DispatchKeySet Only Affects Functionality Keys] + // Only functionality bits are allowed to be removed from a keyset. + // For now, we're only allowing removal of "functionality bits" from the + // keyset, which is specifically needed by the fallthrough key calculation + // logic. Why is removing backend bits problematic? Consider this example: + // + // DispatchKeySet([DispatchKey.CPU, DispatchKey.AutogradCUDA, + // DispatchKey.CUDA]).remove(DispatchKey.AutogradCUDA) + // DispatchKeySet([DispatchKey.CPU, + // DispatchKey.AutogradCUDA]).remove(DispatchKey.AutogradCUDA) + // + // What do we want to happen? + // Technically, we'd like it to be true that after removal, + // the first keyset still has the CUDA dispatch key while the second doesn't. + // Unfortunately there's no way to represent that, because the two keysets are + // represented the same way internally: functionality bits: Autograd, Dense + // backend bits: CPU, CUDA + // + // Instead, remove(DispatchKey.AutogradCPU) will only remove the "Autograd" + // bit from the bitset. + constexpr DispatchKeySet remove(DispatchKey t) const { + return DispatchKeySet( + repr_ & ~(DispatchKeySet(t).repr_ & ~full_backend_mask)); } // Is the set empty? (AKA undefined tensor) bool empty() const { @@ -107,22 +374,112 @@ class DispatchKeySet final { uint64_t raw_repr() { return repr_; } - // Return the type id in this set with the highest priority (i.e., - // is the largest in the DispatchKey enum). Intuitively, this - // type id is the one that should handle dispatch (assuming there - // aren't any further exclusions or inclusions). + + DispatchKey highestFunctionalityKey() const { + auto functionality_idx = indexOfHighestBit(); + // This means that none of the functionality bits were set. + if (functionality_idx < num_backends) + return DispatchKey::Undefined; + // The first num_backend bits in the keyset don't correspond to real + // dispatch keys. + return static_cast(functionality_idx - num_backends); + } + + // This is similar like toBackendComponent(DispatchKey), but less restrictive. + // toBackendComponent() errors out if the key that it was passed has no + // backend bits, which is useful for error checking. We need a version of that + // here that can also handle "fake" backends like FPGA, because they need to + // map to the AutogradOther key. For those backends, we return + // BackendComponent::InvalidBit. + BackendComponent highestBackendKey() const { + // mask to mask out functionality bits + auto backend_idx = + DispatchKeySet(repr_ & full_backend_mask).indexOfHighestBit(); + // all zeros across the backend bits means that no backend bits are set. + if (backend_idx == 0) + return BackendComponent::InvalidBit; + return static_cast(backend_idx); + } + + // returns the DispatchKey of highest priority in the set. DispatchKey highestPriorityTypeId() const { - // TODO: If I put Undefined as entry 64 and then adjust the - // singleton constructor to shift from the right, we can get rid of the - // subtraction here. 
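// Illustrative sketch of the Note above (assumes the c10 headers in this
// patch): remove() and operator- clear only functionality bits, so the
// backend bits of the original set survive the subtraction.
#include <c10/core/DispatchKeySet.h>
#include <cassert>

void remove_example() {
  c10::DispatchKeySet ks(
      {c10::DispatchKey::CPU, c10::DispatchKey::AutogradCUDA});

  auto after = ks.remove(c10::DispatchKey::AutogradCUDA);
  // The Autograd functionality bit is gone ...
  assert(!after.has(c10::DispatchKey::AutogradCUDA));
  // ... but both backend bits (CPU and CUDA) are still present.
  assert(after.has(c10::DispatchKey::CPU));
  assert(after.has_backend(c10::BackendComponent::CUDABit));
}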
It's modestly more complicated to get right so I - // didn't do it for now. - return static_cast(64 - llvm::countLeadingZeros(repr_)); + auto functionality_k = highestFunctionalityKey(); + if (isPerBackendFunctionalityKey(functionality_k)) { + return toRuntimePerBackendFunctionalityKey( + functionality_k, highestBackendKey()); + } + return functionality_k; + } + + // Returns the index of the most-significant bit in the keyset. + // This is used to as part of the calculation into the operator table to get: + // - the highest "functionality" bit in the keyset. + // - the highest "backend" bit in the keyset. + uint8_t indexOfHighestBit() const { + return 64 - llvm::countLeadingZeros(repr_); } - DispatchKey highestPriorityBackendTypeId() const { - return (*this & - ((1ULL << static_cast(DispatchKey::EndOfBackendKeys)) - 1)) - .highestPriorityTypeId(); +#if defined(C10_MOBILE_TRIM_DISPATCH_KEYS) + // [Note: Trimmed Mobile Dispatch Keys] + /** + * The method below maps the dispatch key in the enum DispatchKey to an + * integer index in the dispatchTable_ array in OperatorEntry. The array + * is trimmed for mobile to reduce peak memory usage since it's + * unnecessary to reserve additional space for dispatch keys that will + * never be used on mobile. + */ + int getDispatchTableIndexForDispatchKeySet() const { + auto dk = highestPriorityTypeId(); + switch (dk) { + case DispatchKey::Undefined: + return 0; + case DispatchKey::CPU: + return 1; + case DispatchKey::QuantizedCPU: + return 2; + case DispatchKey::SparseCPU: + return 3; + case DispatchKey::BackendSelect: + return 4; + case DispatchKey::ADInplaceOrView: + return 5; + case DispatchKey::AutogradOther: + return 6; + case DispatchKey::AutogradCPU: + return 7; + default: + return -1; + } + } +#else + // returns the index in the operator table of highest priority key in the the + // keyset Note that we could in theory implement this using + // highestPriorityTypeId(), but this code is very hotpath and we can do it + // faster without it. + int getDispatchTableIndexForDispatchKeySet() const { + auto functionality_idx = + DispatchKeySet(repr_ >> num_backends).indexOfHighestBit(); + auto offset_and_mask = offsetsAndMasks()[functionality_idx]; + // Mask the functionality bits out first, then right-shift by 1. + // right-shifting by 1 because everything is zero-indexed. + // E.g. 000001 (CPU) should give us an offset of 0, 000010 (CUDA) should + // give us an offset of 1, etc. + auto backend_idx = + DispatchKeySet((repr_ & offset_and_mask.mask) >> 1).indexOfHighestBit(); + return offset_and_mask.offset + backend_idx; + } +#endif + + // returns the "index" of the highest priority backend in the keyset. + // This is pretty similar to getBackendKey(), but: + // - It's hotpath code (part of the runtime bitset calculation) + // - I's returns an integer index, not an enum value + // - Everything is shifted to the right by 1. + // BackendComponent::InvalidBit is technically the lowest enum value, + // but it isn't included in the runtime table. So CPUBit = 1, CUDABit = 2, + // etc. + uint64_t getBackendIndex() const { + return DispatchKeySet((repr_ & full_backend_mask) >> 1).indexOfHighestBit(); } private: @@ -130,42 +487,53 @@ class DispatchKeySet final { uint64_t repr_ = 0; public: - // STL iterator for DispatchKeySet. Iterates through all DispatchKeys in the - // set. The iterator is only invalidated by the destruction of the underlying - // DispatchKeySet as the iterator stores a pointer to the raw representation - // of the DispatchKeySet. 
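// Illustrative sketch of the offset-and-mask indexing used by
// getDispatchTableIndexForDispatchKeySet (all counts, names and the layout
// below are assumptions): per-backend functionalities own a contiguous block
// of operator-table slots, one per backend, and backend-agnostic
// functionalities own a single slot.
#include <cstdint>
#include <cassert>

struct OffsetAndMask {
  int offset;     // first operator-table slot for this functionality
  uint64_t mask;  // backend bits that select a slot within the block
};

// Assumed layout with 3 backends: Dense -> slots 0..2, Sparse -> slots 3..5,
// a backend-agnostic functionality -> single slot 6.
constexpr OffsetAndMask kTable[] = {
    {0, 0b111},  // Dense (per-backend)
    {3, 0b111},  // Sparse (per-backend)
    {6, 0b000},  // backend-agnostic
};

int tableIndex(int functionality_idx, uint64_t backend_bits) {
  OffsetAndMask om = kTable[functionality_idx];
  uint64_t masked = backend_bits & om.mask;
  int backend_idx = 0;  // index of the highest backend bit that survives
  while (masked >>= 1) {
    ++backend_idx;
  }
  return om.offset + backend_idx;
}

int main() {
  assert(tableIndex(/*Dense*/ 0, /*2nd backend*/ 0b010) == 1);
  assert(tableIndex(/*Sparse*/ 1, /*3rd backend*/ 0b100) == 5);
  assert(tableIndex(/*agnostic*/ 2, /*no backends*/ 0) == 6);
}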
+ // STL iterator for DispatchKeySet. Iterates through all runtime DispatchKeys + // in the set. The iterator is only invalidated by the destruction of the + // underlying DispatchKeySet as the iterator stores a pointer to the raw + // representation of the DispatchKeySet. Note: When we encounter a per-backend + // functionality (e.g. Dense or Sparse), we will iterate through EVERY backend + // in the keyset, for that functionality. For example, if the next + // functionality key to iterate over is Autograd, and the backend bits in the + // keyset correspond to [BackendComponent::CPUBit, BackendComponent::CUDABit], + // then the next two keys we return will be DispatchKey::AutogradCPU, + // DispatchKey::AutogradCUDA (CPU first because it has lower precedence than + // CUDA in DispatchKey.h). class iterator { public: using self_type = iterator; using iterator_category = std::input_iterator_tag; using value_type = DispatchKey; using difference_type = ptrdiff_t; - - explicit iterator(const uint64_t* data_ptr, uint8_t i = 0) - : data_ptr_(data_ptr), i_(i) { + // final mask value should mask out the entire keyset + static const uint8_t end_iter_mask_val = + num_backends + num_functionality_keys; + // final key value should be the last DispatchKey + static const uint8_t end_iter_key_val = num_functionality_keys; + + // current_dispatchkey_idx_ will iterate through all functionality bits. + // current_backendcomponent_idx_ will iterate through all backend bits. + explicit iterator( + const uint64_t* data_ptr, + uint8_t next_functionality = num_backends, + uint8_t next_backend = 0) + : data_ptr_(data_ptr), + next_functionality_(next_functionality), + next_backend_(next_backend), + // These are in an invalid state at construction time, and set by the + // first increment call + current_dispatchkey_idx_(end_iter_key_val), + current_backendcomponent_idx_(end_iter_key_val) { // Go to the first key in the set + TORCH_INTERNAL_ASSERT( + next_functionality_ >= num_backends, + "num_backends=", + static_cast(num_backends), + "next_functionality_=", + static_cast(next_functionality_)); ++(*this); } - self_type& operator++() { - TORCH_INTERNAL_ASSERT( - i_ <= static_cast(DispatchKey::NumDispatchKeys)); - - // Create a masked version of the set representation to ignore previous - // keys that we've iterated through. 
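// Illustrative sketch of the iteration behavior described above (assumes the
// c10 headers in this patch): a per-backend functionality expands into one
// runtime key per backend bit, lower-precedence backends first, while
// backend-agnostic keys are produced once.
#include <c10/core/DispatchKeySet.h>
#include <iostream>

void iteration_example() {
  c10::DispatchKeySet ks({c10::DispatchKey::CPU,
                          c10::DispatchKey::CUDA,
                          c10::DispatchKey::FuncTorchBatched});
  // Expected output:
  //   CPU
  //   CUDA
  //   FuncTorchBatched
  for (c10::DispatchKey k : ks) {
    std::cout << c10::toString(k) << "\n";
  }
}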
- uint64_t masked_data = llvm::maskTrailingZeros(i_) & *data_ptr_; - uint64_t firstKeyIndex = llvm::findFirstSet(masked_data); - - // If there are no keys, set to end iterator value - if (firstKeyIndex == std::numeric_limits::max() || - i_ == static_cast(DispatchKey::NumDispatchKeys)) { - i_ = static_cast(DispatchKey::NumDispatchKeys); - return *this; - } - - i_ = static_cast(firstKeyIndex) + 1; - return *this; - } + C10_API self_type& operator++(); self_type operator++(int) { self_type previous_iterator = *this; @@ -174,18 +542,50 @@ class DispatchKeySet final { } bool operator==(const self_type& rhs) const { - return i_ == rhs.i_; + return next_functionality_ == rhs.next_functionality_ && + current_dispatchkey_idx_ == rhs.current_dispatchkey_idx_ && + next_backend_ == rhs.next_backend_ && + current_backendcomponent_idx_ == rhs.current_backendcomponent_idx_; } bool operator!=(const self_type& rhs) const { - return i_ != rhs.i_; + return next_functionality_ != rhs.next_functionality_ || + current_dispatchkey_idx_ != rhs.current_dispatchkey_idx_ || + next_backend_ != rhs.next_backend_ || + current_backendcomponent_idx_ != rhs.current_backendcomponent_idx_; } DispatchKey operator*() const { - return static_cast(i_); + auto functionality_key = + static_cast(current_dispatchkey_idx_); + if (isPerBackendFunctionalityKey(functionality_key)) { + auto next_key = toRuntimePerBackendFunctionalityKey( + functionality_key, + static_cast(current_backendcomponent_idx_)); + // We expect all of the Dense, Sparse, Quantized, and Autograd keys to + // be ordered the same way with respect to their backends + TORCH_INTERNAL_ASSERT( + toBackendComponent(next_key) == + static_cast(current_backendcomponent_idx_), + "Tried to map functionality key ", + toString(functionality_key), + " and backend bit ", + toString( + static_cast(current_backendcomponent_idx_)), + " to a runtime key, but ended up with ", + toString(next_key), + ". This can happen if the order of the backend dispatch keys in DispatchKey.h isn't consistent.", + " Please double check that enum for inconsistencies."); + return next_key; + } else { + return functionality_key; + } } private: const uint64_t* data_ptr_; - uint8_t i_; + uint8_t next_functionality_; + uint8_t next_backend_; + uint8_t current_dispatchkey_idx_; + uint8_t current_backendcomponent_idx_; }; public: @@ -195,31 +595,35 @@ class DispatchKeySet final { return iterator(&repr_); } - // We do not need to iterate beyond NumDispatchKeys so we will treat this as - // the end iterator. NumDispatchKeys will always be strictly less than 64. + // We do not need to iterate beyond EndOfFunctionalityKeys so we will treat + // this as the end iterator. iterator end() const { - return iterator(&repr_, static_cast(DispatchKey::NumDispatchKeys)); + return iterator(&repr_, iterator::end_iter_mask_val); } }; C10_API std::string toString(DispatchKeySet); C10_API std::ostream& operator<<(std::ostream&, DispatchKeySet); -// autograd_dispatch_keyset should include all runtime autograd keys. -// Alias key DispatchKey::Autograd maps to autograd_dispatch_keyset. 
+C10_API inline int getDispatchTableIndexForDispatchKey(DispatchKey k) { + return DispatchKeySet(k).getDispatchTableIndexForDispatchKeySet(); +} + +// Alias key DispatchKey::Autograd maps to +// (autograd_dispatch_keyset x full_backend_mask) // NB: keys in this set also get associated with CompositeImplicitAutograd +// +// Note [autograd_dispatch_keyset Does Not Include Backend Bits] +// We don't want to include any backend bits (BackendComponent::CPUBit, etc) +// directly in autograd_dispatch_keyset. +// Why? keysets like autograd_dispatch_keyset are commonly used to remove +// autograd keys from a DispatchKeySet throughout the code base. However, you +// are only allowed to remove functionality bits from a keyset, not backend +// bits. See Note [Removing keys from DispatchKeySet Only Affects Functionality +// Keys] for details. To be consistent and avoid confusion, we're explicitly +// setting up autograd_dispatch_keyset to not have any backend bits. constexpr DispatchKeySet autograd_dispatch_keyset = DispatchKeySet({ - DispatchKey::AutogradCPU, - DispatchKey::AutogradCUDA, - DispatchKey::AutogradXLA, - DispatchKey::AutogradLazy, - DispatchKey::AutogradNestedTensor, - DispatchKey::AutogradMLC, - DispatchKey::AutogradHPU, - DispatchKey::AutogradXPU, - DispatchKey::AutogradPrivateUse1, - DispatchKey::AutogradPrivateUse2, - DispatchKey::AutogradPrivateUse3, + DispatchKey::AutogradFunctionality, DispatchKey::AutogradOther, }); @@ -242,27 +646,42 @@ constexpr DispatchKeySet default_excluded_set = DispatchKeySet({ constexpr DispatchKeySet autograd_dispatch_keyset_with_ADInplaceOrView = autograd_dispatch_keyset | DispatchKeySet(DispatchKey::ADInplaceOrView); +constexpr DispatchKeySet python_ks = DispatchKeySet({ + DispatchKey::Python, + DispatchKey::PythonTLSSnapshot, +}); + +constexpr DispatchKeySet sparse_ks = DispatchKeySet(DispatchKey::Sparse); + +constexpr DispatchKeySet sparse_csr_ks = + DispatchKeySet({DispatchKey::SparseCsrCPU, DispatchKey::SparseCsrCUDA}); + +constexpr DispatchKeySet mkldnn_ks = DispatchKeySet(DispatchKey::MkldnnCPU); + // backend dispatch keys that map to DispatchKey::AutogradOther // NB: keys in this set also get associated with CompositeImplicitAutograd -constexpr DispatchKeySet autogradother_backends = DispatchKeySet( - {DispatchKey::HIP, - DispatchKey::VE, - DispatchKey::FPGA, - DispatchKey::ORT, - DispatchKey::Vulkan, - DispatchKey::Metal, - DispatchKey::QuantizedCPU, - DispatchKey::QuantizedCUDA, - DispatchKey::CustomRNGKeyId, - DispatchKey::MkldnnCPU, - DispatchKey::SparseCPU, - DispatchKey::SparseCUDA, - DispatchKey::SparseHIP, - DispatchKey::SparseVE, - DispatchKey::SparseXPU, - DispatchKey::SparseCsrCPU, - DispatchKey::SparseCsrCUDA, - DispatchKey::Meta}); +constexpr DispatchKeySet autogradother_backends = + DispatchKeySet( + // HIP and VE aren't in this list: they now have their own backend bits + // which means that they can now have their own Autograd keys. + // Technically, HIP will now redispatch to its own custom AutogradHIP + // slot in the runtime table. + {DispatchKey::FPGA, + DispatchKey::ORT, + DispatchKey::Vulkan, + DispatchKey::Metal, + DispatchKey::SparseCsrCPU, + DispatchKey::SparseCsrCUDA, + DispatchKey::CustomRNGKeyId, + DispatchKey::MkldnnCPU, + DispatchKey::Meta, + // Sparse and Quantized backends also live here. + DispatchKey::Sparse, + DispatchKey::Quantized}) + // Including the backend bits because this keyset is used during op + // registration, which requires looping over all runtime autogradother + // backend keys. 
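// Illustrative sketch of the Note above (assumes the c10 headers in this
// patch): because autograd_dispatch_keyset holds only functionality bits,
// subtracting it strips autograd handling without touching the backend bits.
#include <c10/core/DispatchKeySet.h>
#include <cassert>

void strip_autograd_example() {
  c10::DispatchKeySet ks(
      {c10::DispatchKey::CPU, c10::DispatchKey::AutogradCPU});
  auto no_autograd = ks - c10::autograd_dispatch_keyset;
  assert(no_autograd.has(c10::DispatchKey::CPU));           // backend intact
  assert(!no_autograd.has(c10::DispatchKey::AutogradCPU));  // autograd gone
}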
+ | DispatchKeySet(DispatchKeySet::RAW, full_backend_mask); // The set of dispatch keys that come after autograd // n.b. this relies on the fact that AutogradOther is currently the lowest @@ -292,6 +711,48 @@ constexpr DispatchKeySet after_func_keyset = // away with it by explicitly removing the key here. c10::DispatchKey::ADInplaceOrView); +constexpr DispatchKeySet backend_bitset_mask = + DispatchKeySet(DispatchKeySet::RAW, (1ULL << num_backends) - 1); + +constexpr auto inplace_or_view_ks = + DispatchKeySet(DispatchKey::ADInplaceOrView); +constexpr auto autograd_cpu_ks = DispatchKeySet(DispatchKey::AutogradCPU); +constexpr auto autograd_ipu_ks = DispatchKeySet(DispatchKey::AutogradIPU); +constexpr auto autograd_xpu_ks = DispatchKeySet(DispatchKey::AutogradXPU); +constexpr auto autograd_cuda_ks = DispatchKeySet(DispatchKey::AutogradCUDA); +constexpr auto autograd_xla_ks = DispatchKeySet(DispatchKey::AutogradXLA); +constexpr auto autograd_lazy_ks = DispatchKeySet(DispatchKey::AutogradLazy); +constexpr auto autograd_mlc_ks = DispatchKeySet(DispatchKey::AutogradMLC); +constexpr auto autograd_hpu_ks = DispatchKeySet(DispatchKey::AutogradHPU); +constexpr auto autograd_privateuse1_ks = + DispatchKeySet(DispatchKey::AutogradPrivateUse1); +constexpr auto autograd_privateuse2_ks = + DispatchKeySet(DispatchKey::AutogradPrivateUse2); +constexpr auto autograd_privateuse3_ks = + DispatchKeySet(DispatchKey::AutogradPrivateUse3); +constexpr auto autograd_other_ks = DispatchKeySet(DispatchKey::AutogradOther); + +// This keyset has: +// (1) the functionality bits corresponding to backends (dense, sparse, +// quantized) (2) all of the backend bits set +constexpr DispatchKeySet backend_functionality_keys = + DispatchKeySet({ + DispatchKey::Dense, + DispatchKey::Quantized, + DispatchKey::Sparse, + }) | + DispatchKeySet(DispatchKeySet::RAW, full_backend_mask); + +struct OpTableOffsetAndMask { + uint16_t offset; + uint16_t backend_mask; +}; + +static_assert( + num_backends <= 16, + "Right now we expect the number of backends not to exceed 16. In the (unlikely) event" + " that this changes, the size of OpTableOffsetAndMask::backend_mask needs to be increased too."); + // true if t is a backend dispatch key C10_API bool isBackendDispatchKey(DispatchKey t); @@ -307,10 +768,62 @@ C10_API bool runtimeDispatchKeySetHas(DispatchKey t, DispatchKey k); C10_API DispatchKeySet getBackendKeySetFromAutograd(DispatchKey t); // Returns a DispatchKeySet of autograd related keys mapped to backend. -C10_API DispatchKeySet getAutogradRelatedKeySetFromBackend(DispatchKey t); +// for a given backend key, use the associated autograd key. +// for non-backend keys, use AutogradOther as a default. +// Note: it's convenient and fast to return a default here rather than (say) +// returning an optional, or throwing. But it makes callers +// responsible for either a) enforcing the invariant that only backend keys +// be passed as arguments, or b) interpreting our return value carefully. 
+inline DispatchKeySet getAutogradRelatedKeySetFromBackend(BackendComponent t) { + switch (t) { + case BackendComponent::CPUBit: + return inplace_or_view_ks | autograd_cpu_ks; + case BackendComponent::IPUBit: + return inplace_or_view_ks | autograd_ipu_ks; + case BackendComponent::XPUBit: + return inplace_or_view_ks | autograd_xpu_ks; + case BackendComponent::CUDABit: + return inplace_or_view_ks | autograd_cuda_ks; + case BackendComponent::XLABit: + return inplace_or_view_ks | autograd_xla_ks; + case BackendComponent::LazyBit: + return inplace_or_view_ks | autograd_lazy_ks; + case BackendComponent::MLCBit: + return inplace_or_view_ks | autograd_mlc_ks; + case BackendComponent::HPUBit: + return inplace_or_view_ks | autograd_hpu_ks; + case BackendComponent::PrivateUse1Bit: + return inplace_or_view_ks | autograd_privateuse1_ks; + case BackendComponent::PrivateUse2Bit: + return inplace_or_view_ks | autograd_privateuse2_ks; + case BackendComponent::PrivateUse3Bit: + return inplace_or_view_ks | autograd_privateuse3_ks; + default: + return inplace_or_view_ks | autograd_other_ks; + } +} // Returns a DispatchKeySet of autocast related keys mapped to backend. -C10_API DispatchKeySet getAutocastRelatedKeySetFromBackend(DispatchKey t); +inline DispatchKeySet getAutocastRelatedKeySetFromBackend(BackendComponent t) { + constexpr auto autocast_cpu_ks = DispatchKeySet(DispatchKey::AutocastCPU); + constexpr auto autocast_cuda_ks = DispatchKeySet(DispatchKey::AutocastCUDA); + switch (t) { + case BackendComponent::CPUBit: + return autocast_cpu_ks; + case BackendComponent::CUDABit: + case BackendComponent::XLABit: + return autocast_cuda_ks; + default: + return DispatchKeySet(); + } +} + +// returns the "backend" DispatchKey of highest priority in the set. +// This is basically like highestBackendKey(), except that we have some +// "functionality" bits that correspond to backends (Sparse, Quantized) +inline DispatchKey highestPriorityBackendTypeId(DispatchKeySet ks) { + return (ks & backend_functionality_keys).highestPriorityTypeId(); +} // This API exists because we have a use case for checking // getRuntimeDispatchKeySet(alias).has(DispatchKey::Undefined) @@ -329,7 +842,8 @@ static inline DispatchKey legacyExtractDispatchKey(DispatchKeySet s) { // here. 
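// Illustrative usage sketch (assumes the c10 headers in this patch): how the
// two helpers above are used when assembling a tensor's key set from its
// backend, and how a backend without its own backend bit (e.g. FPGA) falls
// back to AutogradOther.
#include <c10/core/DispatchKeySet.h>
#include <cassert>

void backend_helpers_example() {
  c10::DispatchKeySet ks(c10::DispatchKey::CPU);
  auto backend = ks.highestBackendKey();  // BackendComponent::CPUBit

  auto with_autograd = ks | c10::getAutogradRelatedKeySetFromBackend(backend);
  assert(with_autograd.has(c10::DispatchKey::AutogradCPU));
  assert(with_autograd.has(c10::DispatchKey::ADInplaceOrView));

  assert(c10::getAutocastRelatedKeySetFromBackend(backend)
             .has(c10::DispatchKey::AutocastCPU));

  // FPGA carries no backend bit, so its autograd handling is AutogradOther.
  c10::DispatchKeySet fpga(c10::DispatchKey::FPGA);
  assert(c10::getAutogradRelatedKeySetFromBackend(fpga.highestBackendKey())
             .has(c10::DispatchKey::AutogradOther));
}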
At the moment, autograd keys and ADInplaceOrView key need this // treatment; return (s - autograd_dispatch_keyset_with_ADInplaceOrView - - autocast_dispatch_keyset) + autocast_dispatch_keyset - + DispatchKeySet({DispatchKey::PythonTLSSnapshot, DispatchKey::Python})) .highestPriorityTypeId(); } diff --git a/c10/core/QEngine.h b/c10/core/QEngine.h index ac092193d92136..60c21361f15f0d 100644 --- a/c10/core/QEngine.h +++ b/c10/core/QEngine.h @@ -15,11 +15,13 @@ enum class QEngine : uint8_t { NoQEngine = 0, FBGEMM = 1, QNNPACK = 2, + ONEDNN = 3, }; constexpr auto kNoQEngine = QEngine::NoQEngine; constexpr auto kFBGEMM = QEngine::FBGEMM; constexpr auto kQNNPACK = QEngine::QNNPACK; +constexpr auto kONEDNN = QEngine::ONEDNN; inline std::string toString(QEngine qengine) { switch (qengine) { @@ -29,6 +31,8 @@ inline std::string toString(QEngine qengine) { return "FBGEMM"; case kQNNPACK: return "QNNPACK"; + case kONEDNN: + return "ONEDNN"; default: TORCH_CHECK( false, "Unrecognized Quantized Engine: ", static_cast(qengine)); diff --git a/c10/core/SafePyObject.cpp b/c10/core/SafePyObject.cpp new file mode 100644 index 00000000000000..d8c3da49ffb121 --- /dev/null +++ b/c10/core/SafePyObject.cpp @@ -0,0 +1,11 @@ +#include +#include + +namespace c10 { + +PyObject* SafePyObject::ptr(const c10::impl::PyInterpreter* interpreter) const { + TORCH_INTERNAL_ASSERT(interpreter == pyinterpreter_); + return data_; +} + +} // namespace c10 diff --git a/c10/core/SafePyObject.h b/c10/core/SafePyObject.h new file mode 100644 index 00000000000000..13e32da3dc1dfe --- /dev/null +++ b/c10/core/SafePyObject.h @@ -0,0 +1,45 @@ +#pragma once + +#include +#include +#include + +namespace c10 { + +// This is an safe owning holder for a PyObject, akin to pybind11's +// py::object, with two major differences: +// +// - It is in c10/core; i.e., you can use this type in contexts where +// you do not have a libpython dependency +// +// - It is multi-interpreter safe (ala torchdeploy); when you fetch +// the underlying PyObject* you are required to specify what the current +// interpreter context is and we will check that you match it. +// +// It is INVALID to store a reference to a Tensor object in this way; +// you should just use TensorImpl directly in that case! +struct C10_API SafePyObject { + // Steals a reference to data + SafePyObject(PyObject* data, c10::impl::PyInterpreter* pyinterpreter) + : data_(data), pyinterpreter_(pyinterpreter) {} + + // In principle this could be copyable if we add an incref to PyInterpreter + // but for now it's easier to just disallow it. 
+ SafePyObject(SafePyObject const&) = delete; + SafePyObject& operator=(SafePyObject const&) = delete; + + ~SafePyObject() { + pyinterpreter_->decref(data_, /*is_tensor*/ false); + } + + c10::impl::PyInterpreter* pyinterpreter() const { + return pyinterpreter_; + } + PyObject* ptr(const c10::impl::PyInterpreter*) const; + + private: + PyObject* data_; + c10::impl::PyInterpreter* pyinterpreter_; +}; + +} // namespace c10 diff --git a/c10/core/Scalar.h b/c10/core/Scalar.h index 08bf95e1875dab..66d96f69af7782 100644 --- a/c10/core/Scalar.h +++ b/c10/core/Scalar.h @@ -67,7 +67,7 @@ class C10_API Scalar { } // TODO: Support ComplexHalf accessor - AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(DEFINE_ACCESSOR) + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_ACCESSOR) // also support scalar.to(); // Deleted for unsupported types, but specialized below for supported types @@ -201,7 +201,7 @@ using OptionalScalarRef = c10::OptionalRef; inline T Scalar::to() const { \ return to##name(); \ } -AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(DEFINE_TO) +AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_TO) #undef DEFINE_TO } // namespace c10 diff --git a/c10/core/ScalarType.h b/c10/core/ScalarType.h index d805623efe6c14..16553cf0230ace 100644 --- a/c10/core/ScalarType.h +++ b/c10/core/ScalarType.h @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include @@ -63,6 +64,21 @@ namespace c10 { _(bool, Bool) \ _(at::BFloat16, BFloat16) +#define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(_) \ + _(uint8_t, Byte) \ + _(int8_t, Char) \ + _(int16_t, Short) \ + _(int, Int) \ + _(int64_t, Long) \ + _(at::Half, Half) \ + _(float, Float) \ + _(double, Double) \ + _(c10::complex, ComplexHalf) \ + _(c10::complex, ComplexFloat) \ + _(c10::complex, ComplexDouble) \ + _(bool, Bool) \ + _(at::BFloat16, BFloat16) + enum class ScalarType : int8_t { #define DEFINE_ENUM(_1, n) n, AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_AND_QINTS(DEFINE_ENUM) diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp index fad9dcb6fc3a69..75d0e03255b145 100644 --- a/c10/core/TensorImpl.cpp +++ b/c10/core/TensorImpl.cpp @@ -20,43 +20,6 @@ C10_DEFINE_int64( namespace c10 { -namespace impl { - -static std::string noop_name_fn(const PyInterpreter*) { - return ""; -} - -static void noop_decref_fn(const PyInterpreter*, PyObject*, bool) { - // no-op -} - -static c10::intrusive_ptr noop_detach_fn( - const PyInterpreter*, - const TensorImpl*) { - TORCH_INTERNAL_ASSERT( - 0, - "attempted to detach (shallow_copy_and_detach) Tensor with nontrivial PyObject after corresponding interpreter died"); -} - -static void noop_dispatch_fn( - const PyInterpreter*, - const c10::OperatorHandle& op, - torch::jit::Stack* stack, - const std::shared_ptr& type) { - TORCH_INTERNAL_ASSERT( - 0, - "attempted to dispatch (__torch_dispatch__) an operator on Tensor with nontrivial PyObject after corresponding interpreter died"); -} - -void PyInterpreter::disarm() noexcept { - name_fn_ = &noop_name_fn; - decref_fn_ = &noop_decref_fn; - detach_fn_ = &noop_detach_fn; - dispatch_fn_ = &noop_dispatch_fn; -} - -} // namespace impl - const char* const TensorImpl::err_msg_tensor_metadata_change_not_allowed = "is not allowed on a Tensor created from .data or .detach().\n" "If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)\n" @@ -148,10 +111,7 @@ TensorImpl::TensorImpl( numel_(0), data_type_(data_type), device_opt_(storage_.device()), - key_set_( - key_set.remove(DispatchKey::Python) - 
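// Illustrative sketch of the X-macro pattern used by
// AT_FORALL_SCALAR_TYPES_WITH_COMPLEX above (toy names, not the real list):
// a caller passes a two-argument macro and the list stamps it out once per
// (C++ type, enum name) pair, so adding an entry such as ComplexHalf updates
// every expansion site at once.
#include <cstdint>
#include <string>

#define TOY_FORALL_TYPES(_) \
  _(uint8_t, Byte)          \
  _(float, Float)           \
  _(double, Double)

enum class ToyScalarType : int8_t {
#define TOY_DEFINE_ENUM(cpp_type, name) name,
  TOY_FORALL_TYPES(TOY_DEFINE_ENUM)
#undef TOY_DEFINE_ENUM
};

inline std::string toyToString(ToyScalarType t) {
  switch (t) {
#define TOY_CASE(cpp_type, name) \
  case ToyScalarType::name:      \
    return #name;
    TOY_FORALL_TYPES(TOY_CASE)
#undef TOY_CASE
  }
  return "Unknown";
}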
.remove(DispatchKey::PythonTLSSnapshot)) { // See [Note: Python - // key removal] + key_set_(key_set - c10::python_ks) { // See [Note: Python key removal] init_bitfields(); // Inference tensor doesn't have version counter. if (!is_inference()) { @@ -192,14 +152,12 @@ TensorImpl::TensorImpl( // TODO: be more explicit about the full key set at call sites so we // don't have to keep recomputing it here - DispatchKey k = key_set.highestPriorityBackendTypeId(); + auto k = key_set.highestBackendKey(); key_set = key_set | getAutocastRelatedKeySetFromBackend(k); - key_set = - key_set.remove(DispatchKey::Python) - .remove( - DispatchKey::PythonTLSSnapshot); // See [Note: Python key removal] + // See [Note: Python key removal] + key_set = key_set - c10::python_ks; // Inference tensor doesn't have autograd related keys. if (inference_mode) { @@ -420,7 +378,7 @@ void TensorImpl::throw_storage_access_error() const { bool TensorImpl::is_contiguous_nondefault_policy_impl( at::MemoryFormat memory_format) const { if (has_contiguity_ == - static_cast(HasContiguityPolicy::ContiguityNotSupported)) { + static_cast(CustomizableMethodPolicy::ContiguityNotSupported)) { TORCH_CHECK_NOT_IMPLEMENTED( false, "Tensors of type ", @@ -429,7 +387,7 @@ bool TensorImpl::is_contiguous_nondefault_policy_impl( } else { TORCH_INTERNAL_ASSERT_DEBUG_ONLY( has_contiguity_ == - static_cast(HasContiguityPolicy::CustomBehavior)); + static_cast(CustomizableMethodPolicy::CustomBehavior)); return is_contiguous_custom(memory_format); } } @@ -441,6 +399,22 @@ bool TensorImpl::is_contiguous_custom(at::MemoryFormat memory_format) const { "set_has_contiguity_policy and forget to override is_contiguous_custom?"); } +IntArrayRef TensorImpl::sizes_nondefault_policy_impl() const { + if (sizes_customization_policy_ == + static_cast(CustomizableMethodPolicy::NotSupported)) { + TORCH_CHECK_NOT_IMPLEMENTED( + false, + "Tensors of type ", + tensorimpl_type_name(), + " do not have sizes"); + } else { + TORCH_CHECK_NOT_IMPLEMENTED( + false, + "custom behavior for sizes() is not supported; please add it or file " + "an issue.") + } +} + static void deletePlacementDeleteContext(void* ptr) { delete static_cast(ptr); } @@ -572,6 +546,8 @@ void TensorImpl::copy_tensor_metadata_except_version_counter( dest_impl->is_wrapped_number_ = src_impl->is_wrapped_number_; dest_impl->reserved_ = src_impl->reserved_; dest_impl->set_allow_tensor_metadata_change(allow_tensor_metadata_change); + dest_impl->sizes_customization_policy_ = + src_impl->sizes_customization_policy_; dest_impl->storage_access_should_throw_ = src_impl->storage_access_should_throw_; if (src_impl->named_tensor_meta_ != nullptr) { @@ -606,23 +582,6 @@ void TensorImpl::copy_tensor_metadata( } } -TorchDispatchTypeObject::TorchDispatchTypeObject( - PyObject* type_object, - c10::impl::PyInterpreter* pyinterpreter) - : data_(type_object), pyinterpreter_(pyinterpreter) {} - -TorchDispatchTypeObject::~TorchDispatchTypeObject() { - pyinterpreter_->decref(data_, /*is_tensor*/ false); -} - -c10::impl::PyInterpreter* TorchDispatchTypeObject::pyinterpreter() const { - return pyinterpreter_; -} - -PyObject* TorchDispatchTypeObject::ptr() const { - return data_; -} - namespace impl { namespace { diff --git a/c10/core/TensorImpl.h b/c10/core/TensorImpl.h index 4f6019a5ec3c6b..1fdfad185c86ea 100644 --- a/c10/core/TensorImpl.h +++ b/c10/core/TensorImpl.h @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -16,9 +17,11 @@ #include #include #include +#include #include #include +#include 
#include #include @@ -49,17 +52,9 @@ class TensorBase; namespace c10 { class Scalar; -struct IValue; struct Storage; -class OperatorHandle; } // namespace c10 -namespace torch { -namespace jit { -using Stack = std::vector; -} -} // namespace torch - namespace c10 { /** @@ -168,9 +163,6 @@ struct C10_API AutogradMetaInterface { virtual ~AutogradMetaInterface(); }; -// forward declared -struct TorchDispatchTypeObject; - namespace impl { // Unfortunately, the definition of AutogradMeta lives in a separate @@ -196,137 +188,6 @@ struct C10_API AutogradMetaFactoryRegisterer { } }; -// Note [Python interpreter tag] -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -// We store a PyObject on TensorImpl so that we can efficiently translate -// tensors into the Python representations. However, in some situations -// (torchdeploy) there may be multiple Python interpreters in a single process -// and we must take care not to accidentally mix up PyObjects with the wrong -// interpreters. Thus, we also tag every TensorImpl with the Python interpreter -// it corresponds to. -// -// With torchdeploy, we have these invariants: -// - Any given TensorImpl can be associated with AT MOST one Python -// interpreter. -// We represent the interpreter tag as a memory address to an instance of -// a virtual class that is allocated once per interpreter (this is so that -// we can request the interpreter to perform operations for us, if -// necessary). -// - A given TensorImpl's interpreter tag can only go from uninitialized to -// tagged; once tagged, this is a quiescent state (once tagged to an -// interpreter, ALWAYS tagged to that interpreter) -// - A thread may mutate the PyObject field of a TensorImpl if and only if it -// holds the GIL for the interpreter tagged on the TensorImpl. (If the -// TensorImpl is not tagged, it must first atomically claim its tag before it -// can validly write) - -// The PyInterpreter object itself is a class that contains some function -// pointers for interacting with the interpreter. For now this is just for -// debugging, but if a Tensor can own a PyObject, the interpreter can be used to -// free it. -// -// WARNING: This class has to be written very carefully, because it may be -// possible for a Tensor to have a reference an interpreter corresponding to -// a shared library that has ALREADY BEEN UNLOADED. This makes blindly calling -// virtual methods very dangerous, because the vtable may be garbage at that -// point (on a good day, you might get "pure virtual method called"). -// -// The idea to solve this problem is we always leak PyInterpreters (so they -// always stay live even after dlclose), and disarm the "virtual methods" by -// replacing them with function pointers that just no-op. This can't be done -// with a traditional C++ vtable, so we have to roll our own. -// -// NB: The downside with representing PyInterpreter tags as full objects is that -// it takes an extra word on TensorImpl. If tags were instead just integer -// indices, on 64-bit architectures we could pack the tag and PyObject together -// into a single atomic word. On 32-bit architectures we could simply say that -// only one Python interpreter is supported (erroring if a nontrivial -// interpreter tag is attempted to be set). -// -// The difficulty with this scheme is we need to maintain an out-of-line table -// to get at the PyInterpreters so that we can do virtual method calls on them, -// and registration/deregistration to this table must be done in a thread safe -// manner. 
This can be easily done if the number of possible PyInterpreters is -// small enough (e.g., 8-bit integer) by simply preallocating an array of -// sufficient size to hold all possible interpreters. Surely 128 threads is -// more than enough for anyone! -// -// I didn't decide to do this technique at the moment, because the extra word -// added by the PyInterpreter tag takes us to 24 words, which means that we -// still fit inside three eight word cache lines. If you need to penny pinch -// another word consider doing this! - -struct PyInterpreter; -struct C10_API PyInterpreter { - using name_sig = std::string(const PyInterpreter*); - using decref_sig = void(const PyInterpreter*, PyObject*, bool); - using detach_sig = - c10::intrusive_ptr(const PyInterpreter*, const TensorImpl*); - using dispatch_sig = void( - const PyInterpreter*, - const c10::OperatorHandle&, - torch::jit::Stack* stack, - const std::shared_ptr& type); - - PyInterpreter( - name_sig* name_fn, - decref_sig* decref_fn, - detach_sig* detach, - dispatch_sig* dispatch) - : name_fn_(name_fn), - decref_fn_(decref_fn), - detach_fn_(detach), - dispatch_fn_(dispatch) {} - - name_sig* name_fn_; - decref_sig* decref_fn_; - detach_sig* detach_fn_; - dispatch_sig* dispatch_fn_; - - // UBSAN suppression fixes: "call to function - // (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, - // _object*) through pointer to incorrect function type 'void (*)(const - // c10::impl::PyInterpreter *, _object *)'" See - // https://github.com/google/sanitizers/issues/911 - - // Report the name of this interpreter - __ubsan_ignore_function__ std::string name() const { - return (*name_fn_)(this); - } - - // Run Py_DECREF on a PyObject. We DO NOT assume the GIL is held on call - // See NOTE [PyInterpreter::decref takes an `is_tensor` arg] - __ubsan_ignore_function__ void decref(PyObject* pyobj, bool is_tensor) const { - return (*decref_fn_)(this, pyobj, is_tensor); - } - - // Perform a detach by deferring to the __torch_dispatch__ implementation of - // detach, which will also arrange for the PyObject to get copied in this - // situation - __ubsan_ignore_function__ c10::intrusive_ptr detach( - const TensorImpl* self) const { - return (*detach_fn_)(this, self); - } - - // Invoke the Python boxed fallback dispatch to go back into Python - __ubsan_ignore_function__ void dispatch( - const c10::OperatorHandle& op, - torch::jit::Stack* stack, - const std::shared_ptr& type) const { - return (*dispatch_fn_)(this, op, stack, type); - } - - // Disarm this PyInterpreter, making all of its methods noops. - // Because the function pointers are raw pointers (not atomics), - // a disarm() invocation that is concurrent with active destructors - // is not thread safe and will trigger TSAN. My hope is that this - // situations doesn't ever actually happen; tensor destruction should - // quiesce when a dlclose happens, and any long lived tensors whose - // destructors would be disarmed here only begin the destruction process - // on process shutdown (long after the dlclose has occurred). - void disarm() noexcept; -}; - // PyInterpreterStatus describes what the state of its interpreter tag // is, relative to the thread currently holding the GIL. enum class PyInterpreterStatus { @@ -361,30 +222,6 @@ struct C10_API NamedTensorMetaInterface { }; }; -// NOTE [What is TorchDispatchTypeObject?] -// A TorchDispatchTypeObject represents the type of a Tensor subclass that has -// a __torch_dispatch__ classmethod. 
Concretely, it holds the class as a -// PyObject* and a PyInterpreter* that says which python interpreter the class -// came from. -// -// See NOTE [dispatch_fn's type argument] for more details -struct C10_API TorchDispatchTypeObject { - // Steals a reference to type_object - TorchDispatchTypeObject( - PyObject* type_object, - c10::impl::PyInterpreter* pyinterpreter); - - // Releases the stolen reference to type_object - ~TorchDispatchTypeObject(); - - c10::impl::PyInterpreter* pyinterpreter() const; - PyObject* ptr() const; - - private: - PyObject* data_; - c10::impl::PyInterpreter* pyinterpreter_; -}; - // NOTE [ Version Counter Sharing ] // // Every Tensor has a version counter. Version counters are incremented whenever @@ -699,16 +536,32 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { /** * Return a reference to the sizes of this tensor. This reference remains * valid as long as the tensor is live and not resized. + * + * NOTE: sizes() is only `TENSORIMPL_MAYBE_VIRTUAL` for backward + * compatibility. See `set_sizes_customization_policy` for the + * encouraged customization point. + * + * NOTE: Currently, CustomizableMethodPolicy::CustomBehavior is not + * supported due to a lack of use case, but it can easily be added. */ TENSORIMPL_MAYBE_VIRTUAL IntArrayRef sizes() const #ifdef C10_DISABLE_TENSORIMPL_EXTENSIBILITY { + if (C10_UNLIKELY( + sizes_customization_policy_ != + static_cast(CustomizableMethodPolicy::Default))) { + return sizes_nondefault_policy_impl(); + } return sizes_and_strides_.sizes_arrayref(); } #else ; #endif + private: + IntArrayRef sizes_nondefault_policy_impl() const; + + public: /** * Return a reference to the strides of this tensor. This reference remains * valid as long as the tensor is live and not restrided. @@ -838,103 +691,112 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { bool is_sparse() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::SparseCPU) || - key_set_.has(DispatchKey::SparseCUDA) || - key_set_.has(DispatchKey::SparseHIP) || - key_set_.has(DispatchKey::SparseXPU); + return key_set_.has_all(c10::sparse_ks); } // Whether a tensor is sparse COO or not. Use is_sparse_csr for checking CSR // format. bool is_sparse_csr() const { - return key_set_.has(DispatchKey::SparseCsrCPU) || - key_set_.has(DispatchKey::SparseCsrCUDA); + return key_set_.has_any(c10::sparse_csr_ks); } bool is_quantized() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::QuantizedCPU) || - key_set_.has(DispatchKey::QuantizedCUDA) || - key_set_.has(DispatchKey::QuantizedXPU); + constexpr auto quantized_ks = DispatchKeySet(DispatchKey::Quantized); + return key_set_.has_all(quantized_ks); } bool is_meta() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::Meta); + constexpr auto meta_ks = DispatchKeySet(DispatchKey::Meta); + return key_set_.has_all(meta_ks); } bool is_cpu() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. 
- return key_set_.has(DispatchKey::CPU) || - key_set_.has(DispatchKey::SparseCPU) || - key_set_.has(DispatchKey::SparseCsrCPU) || - key_set_.has(DispatchKey::QuantizedCPU) || - key_set_.has(DispatchKey::MkldnnCPU); + constexpr auto cpu_bits_ks = DispatchKeySet(BackendComponent::CPUBit) | + DispatchKeySet({DispatchKey::SparseCsrCPU, DispatchKey::MkldnnCPU}); + return key_set_.has_any(cpu_bits_ks); } bool is_cuda() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::CUDA) || - key_set_.has(DispatchKey::SparseCUDA) || - key_set_.has(DispatchKey::SparseCsrCUDA) || - key_set_.has(DispatchKey::QuantizedCUDA); + constexpr auto cuda_bits_ks = DispatchKeySet(BackendComponent::CUDABit) | + DispatchKeySet(DispatchKey::SparseCsrCUDA); + return key_set_.has_any(cuda_bits_ks); } bool is_xpu() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::XPU) || - key_set_.has(DispatchKey::SparseXPU) || - key_set_.has(DispatchKey::QuantizedXPU); + constexpr auto xpu_ks = DispatchKeySet(BackendComponent::XPUBit); + return key_set_.has_all(xpu_ks); + } + + bool is_ipu() const { + constexpr auto ipu_ks = DispatchKeySet(BackendComponent::IPUBit); + return key_set_.has_all(ipu_ks); } bool is_xla() const { - return key_set_.has(DispatchKey::XLA); + constexpr auto xla_ks = DispatchKeySet(BackendComponent::XLABit); + return key_set_.has_all(xla_ks); } bool is_hpu() const { - return key_set_.has(DispatchKey::HPU); + constexpr auto hpu_ks = DispatchKeySet(BackendComponent::HPUBit); + return key_set_.has_all(hpu_ks); } bool is_lazy() const { - return key_set_.has(DispatchKey::Lazy); + constexpr auto lazy_ks = DispatchKeySet(BackendComponent::LazyBit); + return key_set_.has_all(lazy_ks); } bool is_hip() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. - return key_set_.has(DispatchKey::HIP) || - key_set_.has(DispatchKey::SparseHIP); + constexpr auto hip_ks = DispatchKeySet(BackendComponent::HIPBit); + return key_set_.has_all(hip_ks); } bool is_ve() const { // NB: This method is not virtual and avoid dispatches for performance // reasons. 
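// Illustrative sketch of the reworked is_cpu() check above (assumes the c10
// headers in this patch): one CPU backend bit now covers dense, sparse and
// quantized CPU tensors, so only the CPU-specific functionality keys that
// are not per-backend need to be listed separately.
#include <c10/core/DispatchKeySet.h>
#include <cassert>

void is_cpu_example() {
  constexpr auto cpu_bits_ks =
      c10::DispatchKeySet(c10::BackendComponent::CPUBit) |
      c10::DispatchKeySet(
          {c10::DispatchKey::SparseCsrCPU, c10::DispatchKey::MkldnnCPU});

  assert(c10::DispatchKeySet(c10::DispatchKey::SparseCPU).has_any(cpu_bits_ks));
  assert(
      c10::DispatchKeySet(c10::DispatchKey::QuantizedCPU).has_any(cpu_bits_ks));
  assert(c10::DispatchKeySet(c10::DispatchKey::MkldnnCPU).has_any(cpu_bits_ks));
  // A CUDA tensor matches neither the backend bit nor the CPU-only keys.
  assert(!c10::DispatchKeySet(c10::DispatchKey::CUDA).has_any(cpu_bits_ks));
}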
- return key_set_.has(DispatchKey::VE) || key_set_.has(DispatchKey::SparseVE); + constexpr auto ve_ks = DispatchKeySet(BackendComponent::VEBit); + return key_set_.has_all(ve_ks); } bool is_mkldnn() const { - return key_set_.has(DispatchKey::MkldnnCPU); + return key_set_.has_all(c10::mkldnn_ks); } bool is_vulkan() const { - return key_set_.has(DispatchKey::Vulkan); + constexpr auto vulkan_ks = DispatchKeySet(DispatchKey::Vulkan); + return key_set_.has_all(vulkan_ks); } bool is_metal() const { - return key_set_.has(DispatchKey::Metal); + constexpr auto metal_ks = DispatchKeySet(DispatchKey::Metal); + return key_set_.has_all(metal_ks); } bool is_mlc() const { - return key_set_.has(DispatchKey::MLC); + constexpr auto mls_ks = DispatchKeySet(DispatchKey::MLC); + return key_set_.has_all(mls_ks); } bool is_ort() const { - return key_set_.has(DispatchKey::ORT); + constexpr auto ort_ks = DispatchKeySet(DispatchKey::ORT); + return key_set_.has_all(ort_ks); + } + + bool is_nested() const { + return key_set_.has(DispatchKey::NestedTensor); } // TODO: remove this once we don't automatically enabled Autograd dispatch @@ -950,8 +812,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // Invariant: // Inference tensor has version_counter_.enabled() == false bool is_inference() { - bool no_ADInplaceOrView = !key_set_.has(c10::DispatchKey::ADInplaceOrView); - bool no_Autograd = (key_set_ & c10::autograd_dispatch_keyset).empty(); + bool no_ADInplaceOrView = !key_set_.has_any(c10::inplace_or_view_ks); + bool no_Autograd = !key_set_.has_any(c10::autograd_dispatch_keyset); TORCH_INTERNAL_ASSERT_DEBUG_ONLY( no_ADInplaceOrView == no_Autograd, "ADInplaceOrView and Autograd keys must be on/off at the same time."); @@ -972,14 +834,22 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { Layout layout() const { // NB: This method is not virtual and avoid dispatches for perf. - if (is_sparse()) { + // strided is also the most common layout type, so we check for + // strided case first. 
+ // This keyset must also be kept in sync with the logic in + // is_sparse() / is_sparse_csr() / is_mkldnn() + constexpr auto sparse_and_sparsecsr_and_mkldnn_ks = + c10::sparse_ks | c10::sparse_csr_ks | c10::mkldnn_ks; + if (!key_set_.has_any(sparse_and_sparsecsr_and_mkldnn_ks)) { + return kStrided; + } else if (is_sparse()) { return kSparse; } else if (is_sparse_csr()) { return kSparseCsr; - } else if (is_mkldnn()) { - return kMkldnn; } else { - return kStrided; + TORCH_INTERNAL_ASSERT( + is_mkldnn(), "There is an error in the layout calculation logic."); + return kMkldnn; } } @@ -1065,7 +935,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * Whether or not the imaginary part of the tensor should be negated */ inline bool is_conj() const { - return key_set_.has(DispatchKey::Conjugate); + constexpr auto conjugate_ks = DispatchKeySet(DispatchKey::Conjugate); + return key_set_.has_all(conjugate_ks); } /** @@ -1085,7 +956,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * Whether or not the tensor is a zerotensor */ inline bool _is_zerotensor() const { - return key_set_.has(DispatchKey::ZeroTensor); + constexpr auto zerotensor_ks = DispatchKeySet(DispatchKey::ZeroTensor); + return key_set_.has_all(zerotensor_ks); } /** @@ -1105,7 +977,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * Whether or not the tensor should be negated */ inline bool is_neg() const { - return key_set_.has(DispatchKey::Negative); + constexpr auto negative_ks = DispatchKeySet(DispatchKey::Negative); + return key_set_.has_all(negative_ks); } /** @@ -1476,16 +1349,14 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { void set_python_dispatch(bool k) { if (k) { - key_set_ = - key_set_.add(DispatchKey::Python).add(DispatchKey::PythonTLSSnapshot); + key_set_ = key_set_.add(c10::python_ks); } else { - key_set_ = key_set_.remove(DispatchKey::Python) - .remove(DispatchKey::PythonTLSSnapshot); + key_set_ = key_set_ - c10::python_ks; } } bool is_python_dispatch() const { - return key_set_.has(DispatchKey::Python); + return key_set_.has_all(c10::python_ks); } /** @@ -1550,13 +1421,22 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { */ inline bool has_compatible_shallow_copy_type(DispatchKeySet from) { auto is_dense = [](DispatchKeySet ts) { - return ts.has(DispatchKey::CPU) || ts.has(DispatchKey::CUDA) || - ts.has(DispatchKey::HIP) || ts.has(DispatchKey::XPU); + constexpr auto dense_backends = DispatchKeySet( + {BackendComponent::CPUBit, + BackendComponent::CUDABit, + BackendComponent::HIPBit, + BackendComponent::XPUBit}); + constexpr auto dense_k = DispatchKeySet(DispatchKey::Dense); + return ts.has_any(dense_k) && ts.has_any(dense_backends); }; auto is_sparse = [](DispatchKeySet ts) { - return ts.has(DispatchKey::SparseCPU) || - ts.has(DispatchKey::SparseCUDA) || ts.has(DispatchKey::SparseHIP) || - ts.has(DispatchKey::SparseXPU); + constexpr auto sparse_backends = DispatchKeySet( + {BackendComponent::CPUBit, + BackendComponent::CUDABit, + BackendComponent::HIPBit, + BackendComponent::XPUBit}); + constexpr auto sparse_k = DispatchKeySet(DispatchKey::Sparse); + return ts.has_any(sparse_k) && ts.has_any(sparse_backends); }; return (key_set_ == from) || (is_dense(key_set_) && is_dense(from)) || (is_sparse(key_set_) && is_sparse(from)); @@ -2246,11 +2126,12 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * Compute the number of elements based on the sizes of a tensor. 
*/ int64_t compute_numel() const { - int64_t n = 1; - for (auto s : sizes()) { - n *= s; - } - return n; +#if C10_HAS_BUILTIN_OVERFLOW() && !defined(C10_MOBILE) + // Use overflow checks if supported by the compiler + return safe_compute_numel(); +#else + return c10::multiply_integers(sizes()); +#endif } /** @@ -2259,14 +2140,15 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * using a sparse layout has multiple dimensions with large sizes. */ int64_t safe_compute_numel() const { - int64_t n = 1; - for (auto s : sizes()) { - TORCH_CHECK( - s == 0 || n <= std::numeric_limits::max() / s, - "numel: integer multiplication overflow"); - n *= s; - } - return n; + uint64_t n = 1; + bool overflows = c10::safe_multiplies_u64(sizes(), &n); + constexpr auto numel_max = std::min( + static_cast(std::numeric_limits::max()), + static_cast(std::numeric_limits::max())); + + overflows |= (n > numel_max); + TORCH_CHECK(!overflows, "numel: integer multiplication overflow"); + return static_cast(n); } /** @@ -2408,24 +2290,33 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { } protected: - // Policy for adjusting the behavior of is_contiguous(). Allows - // subclass customization while still being able to inline - // is_contiguous() in the common case. - enum class HasContiguityPolicy : uint8_t { - // Default behavior: check is_contiguous_ and similar bitflags. + // Policy for adjusting the behavior of customizable methods like + // is_contiguous() and sizes(). Allows subclass customization while + // still being able to inline the methods in the common case. + enum class CustomizableMethodPolicy : uint8_t { + // Default behavior. Default, // Throw a generic error message that this tensor type does not - // support is_contiguous. - ContiguityNotSupported, - // Call virtual is_contiguous_custom method to implement custom - // is_contiguous behavior. + // support the method in question. + NotSupported, + // For backward compatibility. + ContiguityNotSupported = NotSupported, + // Call virtual foo_custom method to implement custom foo + // behavior. CustomBehavior, }; - void set_has_contiguity_policy(HasContiguityPolicy p) { + // For backward compatibility. + using HasContiguityPolicy = CustomizableMethodPolicy; + + void set_has_contiguity_policy(CustomizableMethodPolicy p) { has_contiguity_ = static_cast(p); } + void set_sizes_customization_policy(CustomizableMethodPolicy p) { + sizes_customization_policy_ = static_cast(p); + } + Storage storage_; private: @@ -2536,7 +2427,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // or -std=gnu++2a inline void init_bitfields() { is_contiguous_ = true; - has_contiguity_ = static_cast(HasContiguityPolicy::Default); + has_contiguity_ = static_cast(CustomizableMethodPolicy::Default); is_channels_last_ = false; is_channels_last_contiguous_ = false; @@ -2547,6 +2438,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { allow_tensor_metadata_change_ = true; reserved_ = false; owns_pyobj_ = false; + sizes_customization_policy_ = + static_cast(CustomizableMethodPolicy::Default); storage_access_should_throw_ = false; } @@ -2607,6 +2500,9 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // direction (to make sure the pyobj stays live). bool owns_pyobj_ : 1; + // Customization policy for the sizes() virtual method. + /* CustomizableMethodPolicy */ uint8_t sizes_customization_policy_ : 2; + // The set of DispatchKeys which describe this tensor. 
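// Illustrative sketch of the overflow-checked product behind
// safe_compute_numel() (names are placeholders; the real code goes through
// c10::safe_multiplies_u64 and C10_HAS_BUILTIN_OVERFLOW()). This version
// leans directly on the GCC/Clang __builtin_mul_overflow intrinsic.
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

int64_t checked_numel(const std::vector<int64_t>& sizes) {
  uint64_t n = 1;
  bool overflow = false;
  for (int64_t s : sizes) {
    overflow |= __builtin_mul_overflow(n, static_cast<uint64_t>(s), &n);
  }
  // Also reject products that fit in uint64_t but not in int64_t.
  overflow |= n > static_cast<uint64_t>(std::numeric_limits<int64_t>::max());
  if (overflow) {
    throw std::overflow_error("numel: integer multiplication overflow");
  }
  return static_cast<int64_t>(n);
}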
NB: this // does NOT include Autograd (historically, it did, but // not anymore!) diff --git a/c10/core/TensorOptions.h b/c10/core/TensorOptions.h index f7619db0d60f3c..ea903fdce2d008 100644 --- a/c10/core/TensorOptions.h +++ b/c10/core/TensorOptions.h @@ -643,6 +643,9 @@ inline DispatchKey computeDispatchKey( } return DispatchKey::CUDA; } + case DeviceType::IPU: { + return DispatchKey::IPU; + } case DeviceType::XPU: { if (isQIntType(dtype_)) { return DispatchKey::QuantizedXPU; @@ -780,6 +783,9 @@ inline DeviceType dispatchKeyToDeviceType(DispatchKey dispatch_key) { return DeviceType::Meta; // stuff that people are actively developing + case DispatchKey::IPU: + case DispatchKey::AutogradIPU: + return DeviceType::IPU; case DispatchKey::XPU: case DispatchKey::SparseXPU: case DispatchKey::QuantizedXPU: diff --git a/c10/core/WrapDimMinimal.cpp b/c10/core/WrapDimMinimal.cpp new file mode 100644 index 00000000000000..2dc359fc5d4fdd --- /dev/null +++ b/c10/core/WrapDimMinimal.cpp @@ -0,0 +1,36 @@ +#include + +namespace c10 { +namespace detail { + +int64_t maybe_wrap_dim_slow( + int64_t dim, + int64_t dim_post_expr, + bool wrap_scalar) { + if (dim_post_expr <= 0) { + TORCH_CHECK_INDEX( + wrap_scalar, + "dimension specified as ", + dim, + " but tensor has no dimensions"); + return c10::maybe_wrap_dim(dim, /*dim_post_expr=*/1, /*wrap_scalar=*/false); + } + + int64_t min = -dim_post_expr; + int64_t max = dim_post_expr - 1; + TORCH_CHECK_INDEX( + min <= dim && dim <= max, + "Dimension out of range (expected to be in range of [", + min, + ", ", + max, + "], but got ", + dim, + ")"); + + TORCH_INTERNAL_ASSERT( + false, "should never reach here as dim should be out-of-bounds"); +} + +} // namespace detail +} // namespace c10 diff --git a/c10/core/WrapDimMinimal.h b/c10/core/WrapDimMinimal.h index 01cb1c641a14b3..4a6f375147491a 100644 --- a/c10/core/WrapDimMinimal.h +++ b/c10/core/WrapDimMinimal.h @@ -4,37 +4,22 @@ namespace c10 { +namespace detail { +C10_API int64_t +maybe_wrap_dim_slow(int64_t dim, int64_t dim_post_expr, bool wrap_scalar); +} + static inline int64_t maybe_wrap_dim( int64_t dim, int64_t dim_post_expr, bool wrap_scalar = true) { - if (dim_post_expr <= 0) { - if (!wrap_scalar) { - TORCH_CHECK_INDEX( - false, - "dimension specified as ", - dim, - " but tensor has no dimensions"); - } - dim_post_expr = 1; // this will make range [-1, 0] - } - - int64_t min = -dim_post_expr; - int64_t max = dim_post_expr - 1; - if (dim < min || dim > max) { - TORCH_CHECK_INDEX( - false, - "Dimension out of range (expected to be in range of [", - min, - ", ", - max, - "], but got ", - dim, - ")"); + // Inline the fast paths + if (C10_LIKELY(-dim_post_expr <= dim && dim < dim_post_expr)) { + // Branch-less version of dim + (dim < 0 ? dim_post_expr : 0) + return dim + dim_post_expr * (dim < 0); } - if (dim < 0) - dim += dim_post_expr; - return dim; + // Check edge-cases out-of-line (wrapping scalars and out-of-bounds errors) + return c10::detail::maybe_wrap_dim_slow(dim, dim_post_expr, wrap_scalar); } } // namespace c10 diff --git a/c10/core/impl/FakeGuardImpl.h b/c10/core/impl/FakeGuardImpl.h index 2d47db0fdb1847..c86255220c1c1f 100644 --- a/c10/core/impl/FakeGuardImpl.h +++ b/c10/core/impl/FakeGuardImpl.h @@ -9,7 +9,7 @@ namespace impl { // FakeGuardImpl is hardcoded to have eight devices. Not for // any good reason, just to simplify code. 
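// Illustrative usage sketch of maybe_wrap_dim after the split above (assumes
// the c10 headers in this patch): valid dims hit the branch-free fast path,
// and everything else goes through maybe_wrap_dim_slow out of line.
#include <c10/core/WrapDimMinimal.h>
#include <cassert>

void wrap_dim_example() {
  // A 4-d tensor: valid dims are [-4, 3]; negatives count from the back.
  assert(c10::maybe_wrap_dim(2, 4) == 2);
  assert(c10::maybe_wrap_dim(-1, 4) == 3);
  assert(c10::maybe_wrap_dim(-4, 4) == 0);
  // dim = 4 or dim = -5 would throw an index error in the slow path.
  // A 0-d tensor with wrap_scalar=true is treated like a 1-d tensor:
  assert(c10::maybe_wrap_dim(-1, 0) == 0);
}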
-constexpr size_t kFakeGuardImplMaxDevices = 8; +constexpr DeviceIndex kFakeGuardImplMaxDevices = 8; /** * A fake implementation of DeviceGuardImplInterface suitable for testing. @@ -21,7 +21,7 @@ struct FakeGuardImpl final : public DeviceGuardImplInterface { static constexpr DeviceType static_type = T; // Runtime device type is not used FakeGuardImpl(DeviceType) {} - FakeGuardImpl() {} + FakeGuardImpl() = default; DeviceType type() const override { return T; } diff --git a/c10/core/impl/PyInterpreter.cpp b/c10/core/impl/PyInterpreter.cpp new file mode 100644 index 00000000000000..4367c7b7530e2d --- /dev/null +++ b/c10/core/impl/PyInterpreter.cpp @@ -0,0 +1,41 @@ +#include +#include + +namespace c10 { +namespace impl { + +static std::string noop_name_fn(const PyInterpreter*) { + return ""; +} + +static void noop_decref_fn(const PyInterpreter*, PyObject*, bool) { + // no-op +} + +static c10::intrusive_ptr noop_detach_fn( + const PyInterpreter*, + const TensorImpl*) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to detach (shallow_copy_and_detach) Tensor with nontrivial PyObject after corresponding interpreter died"); +} + +static void noop_dispatch_fn( + const PyInterpreter*, + const c10::OperatorHandle& op, + torch::jit::Stack* stack, + const std::shared_ptr& type) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to dispatch (__torch_dispatch__) an operator on Tensor with nontrivial PyObject after corresponding interpreter died"); +} + +void PyInterpreter::disarm() noexcept { + name_fn_ = &noop_name_fn; + decref_fn_ = &noop_decref_fn; + detach_fn_ = &noop_detach_fn; + dispatch_fn_ = &noop_dispatch_fn; +} + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/PyInterpreter.h b/c10/core/impl/PyInterpreter.h new file mode 100644 index 00000000000000..a78ba2d83e728c --- /dev/null +++ b/c10/core/impl/PyInterpreter.h @@ -0,0 +1,190 @@ +#pragma once + +#include +#include +#include +#include +#include + +// Forward declarations + +namespace c10 { +struct IValue; +class OperatorHandle; +struct TensorImpl; +struct SafePyObject; +} // namespace c10 + +namespace torch { +namespace jit { +using Stack = std::vector; +} +} // namespace torch + +// Actual implementation + +namespace c10 { +namespace impl { + +// Note [Python interpreter tag] +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +// Traditionally, PyTorch is layered such that our Python library +// (libtorch_python) references our pure C++ library (libtorch) as the +// natural order of things. However, sometimes this natural order is +// subverted: C++ objects refer to Python objects (for example, we +// store a PyObject* pointer on TensorImpl so that converting from a +// C++ Tensor to a Python Tensor is just a memory dereference). +// +// These unusual orderings must be treated with care. To start, you need to +// virtualize the destructor so that the PyObject can be decref'ed on +// destruction (because the C++ object itself doesn't know anything about +// Python--remember, layering!). This process itself is fraught, since +// acquiring the GIL could lead to deadlocks if someone is blocking on you +// while holding the GIL. Furthermore, if the C++ objects outlive the +// interpreter (which can happen if you stash them in a static global +// variable defined in libtorch), you may attempt to decref the object when +// the Python interpreter has already been shutdown. +// +// BUT WAIT, IT GETS WORSE. With torchdeploy, there may be multiple Python +// interpreters in a single process. 
If a C++ object is accessible from +// multiple interpreters, we must take care not to accidentally use a +// PyObject from one interpreter with another interpreter. +// +// To prevent these mixups, we introduce a PyInterpreter "tag" (object with +// a vtable), which specifies a specific Python interpreter. +// +// - Any given object can be associated with AT MOST one Python interpreter. +// We represent the interpreter tag as a memory address to an instance of +// a virtual class that is allocated once per interpreter (this is so that +// we can request the interpreter to perform operations for us, if +// necessary). +// +// - It can be recorded with a PyObject (PyInterpreterObject) so that +// we know what interpreter the object is associated with, and we can +// raise an error if you try to use the PyObject from the wrong +// interpreter context. +// +// - It contains a vtable that can be used to perform various Python +// operations from ordinary C++ code that ordinarily wouldn't be accessible +// from libtorch. +// +// A simple use case is when a C++ object must be associated with a PyObject. +// However, for TensorImpl, we lazily allocate a PyObject the first time the +// object passes into Python. The invariants for this situation are more +// subtle: +// +// - A given TensorImpl's interpreter tag can only go from uninitialized to +// tagged; once tagged, this is a quiescent state (once tagged to an +// interpreter, ALWAYS tagged to that interpreter) +// +// - A thread may mutate the PyObject field of a TensorImpl if and only if it +// holds the GIL for the interpreter tagged on the TensorImpl. (If the +// TensorImpl is not tagged, it must first atomically claim its tag before it +// can validly write) +// +// WARNING: This class has to be written very carefully, because it may be +// possible for a Tensor to have a reference to an interpreter corresponding to +// a shared library that has ALREADY BEEN UNLOADED. This makes blindly calling +// virtual methods very dangerous, because the vtable may be garbage at that +// point (on a good day, you might get "pure virtual method called"). +// +// The idea to solve this problem is that we always leak PyInterpreters (so they +// always stay live even after dlclose), and disarm the "virtual methods" by +// replacing them with function pointers that just no-op. This can't be done +// with a traditional C++ vtable, so we have to roll our own. +// +// NB: The downside with representing PyInterpreter tags as full objects is that +// it takes an extra word on TensorImpl. If tags were instead just integer +// indices, on 64-bit architectures we could pack the tag and PyObject together +// into a single atomic word. On 32-bit architectures we could simply say that +// only one Python interpreter is supported (erroring if a nontrivial +// interpreter tag is attempted to be set). +// +// The difficulty with this scheme is we need to maintain an out-of-line table +// to get at the PyInterpreters so that we can do virtual method calls on them, +// and registration/deregistration to this table must be done in a thread safe +// manner. This can be easily done if the number of possible PyInterpreters is +// small enough (e.g., 8-bit integer) by simply preallocating an array of +// sufficient size to hold all possible interpreters. Surely 128 threads is +// more than enough for anyone!
+// +// I didn't decide to do this technique at the moment, because the extra word +// added by the PyInterpreter tag takes us to 24 words, which means that we +// still fit inside three eight word cache lines. If you need to penny pinch +// another word consider doing this! + +struct C10_API PyInterpreter { + // Feel free to add as much random crap here as you need; each of these + // can be thought of as a "C++ to Python" hook. + using name_sig = std::string(const PyInterpreter*); + using decref_sig = void(const PyInterpreter*, PyObject*, bool); + using detach_sig = + c10::intrusive_ptr(const PyInterpreter*, const TensorImpl*); + using dispatch_sig = void( + const PyInterpreter*, + const c10::OperatorHandle&, + torch::jit::Stack* stack, + // This is a Tensor subclass type object + const std::shared_ptr& type); + + PyInterpreter( + name_sig* name_fn, + decref_sig* decref_fn, + detach_sig* detach, + dispatch_sig* dispatch) + : name_fn_(name_fn), + decref_fn_(decref_fn), + detach_fn_(detach), + dispatch_fn_(dispatch) {} + + name_sig* name_fn_; + decref_sig* decref_fn_; + detach_sig* detach_fn_; + dispatch_sig* dispatch_fn_; + + // UBSAN suppression fixes: "call to function + // (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, + // _object*) through pointer to incorrect function type 'void (*)(const + // c10::impl::PyInterpreter *, _object *)'" See + // https://github.com/google/sanitizers/issues/911 + + // Report the name of this interpreter + __ubsan_ignore_function__ std::string name() const { + return (*name_fn_)(this); + } + + // Run Py_DECREF on a PyObject. We DO NOT assume the GIL is held on call + // See NOTE [PyInterpreter::decref takes an `is_tensor` arg] + __ubsan_ignore_function__ void decref(PyObject* pyobj, bool is_tensor) const { + return (*decref_fn_)(this, pyobj, is_tensor); + } + + // Perform a detach by deferring to the __torch_dispatch__ implementation of + // detach, which will also arrange for the PyObject to get copied in this + // situation + __ubsan_ignore_function__ c10::intrusive_ptr detach( + const TensorImpl* self) const { + return (*detach_fn_)(this, self); + } + + // Invoke the Python boxed fallback dispatch to go back into Python + __ubsan_ignore_function__ void dispatch( + const c10::OperatorHandle& op, + torch::jit::Stack* stack, + const std::shared_ptr& type) const { + return (*dispatch_fn_)(this, op, stack, type); + } + + // Disarm this PyInterpreter, making all of its methods noops. + // Because the function pointers are raw pointers (not atomics), + // a disarm() invocation that is concurrent with active destructors + // is not thread safe and will trigger TSAN. My hope is that this + // situations doesn't ever actually happen; tensor destruction should + // quiesce when a dlclose happens, and any long lived tensors whose + // destructors would be disarmed here only begin the destruction process + // on process shutdown (long after the dlclose has occurred). 
+ void disarm() noexcept; +}; + +} // namespace impl +} // namespace c10 diff --git a/c10/cuda/CUDACachingAllocator.cpp b/c10/cuda/CUDACachingAllocator.cpp index c1ac4bd0ed0c88..49e7f3c3d137c5 100644 --- a/c10/cuda/CUDACachingAllocator.cpp +++ b/c10/cuda/CUDACachingAllocator.cpp @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -177,6 +178,8 @@ struct Block { Block* prev; // prev block if split from a larger allocation Block* next; // next block if split from a larger allocation int event_count; // number of outstanding CUDA events + int gc_count; // counter for prioritizing older / less useful blocks for + // garbage collection Block( int device, @@ -193,7 +196,8 @@ struct Block { allocated(0), prev(nullptr), next(nullptr), - event_count(0) {} + event_count(0), + gc_count(0) {} // constructor for search key Block(int device, cudaStream_t stream, size_t size) @@ -206,7 +210,8 @@ struct Block { allocated(0), prev(nullptr), next(nullptr), - event_count(0) {} + event_count(0), + gc_count(0) {} bool is_split() const { return (prev != nullptr) || (next != nullptr); @@ -310,7 +315,7 @@ cudaError_t cudaMallocMaybeCapturing(void** p, size_t size) { if (at::cuda::currentStreamCaptureStatusMayInitCtx() == at::cuda::CaptureStatus::None) { #endif - return cudaMalloc(p, size); + return C10_CUDA_ERROR_HANDLED(cudaMalloc(p, size)); #if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 } else { // It's ok to capture cudaMallocs, as long as we never cudaFree those @@ -318,7 +323,7 @@ cudaError_t cudaMallocMaybeCapturing(void** p, size_t size) { // Capturing cudaMalloc behaves nicely: it gives the graph new VA, // but is ignored (won't leakily allocate new memory) in replays. at::cuda::CUDAStreamCaptureModeGuard g{cudaStreamCaptureModeRelaxed}; - return cudaMalloc(p, size); + return C10_CUDA_ERROR_HANDLED(cudaMalloc(p, size)); } #endif } @@ -330,6 +335,17 @@ class CachingAllocatorConfig { static size_t max_split_size() { return instance().m_max_split_size; } + static double garbage_collection_threshold() { + return instance().m_garbage_collection_threshold; + } + + // This is used to round-up allocation size to nearest power of 2 divisions. 
+ // More description below in function roundup_power2_next_division + // As an example, if we want 4 divisions between consecutive powers of 2, this can be done + // using the env variable: PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4 + static size_t roundup_power2_divisions() { + return instance().m_roundup_power2_divisions; + } private: static CachingAllocatorConfig& instance() { @@ -342,8 +358,12 @@ class CachingAllocatorConfig { } CachingAllocatorConfig() - : m_max_split_size(std::numeric_limits::max()) {} + : m_max_split_size(std::numeric_limits::max()), + m_roundup_power2_divisions(0), + m_garbage_collection_threshold(0) {} size_t m_max_split_size; + size_t m_roundup_power2_divisions; + double m_garbage_collection_threshold; void parseArgs() { const char* val = getenv("PYTORCH_CUDA_ALLOC_CONF"); @@ -373,6 +393,32 @@ class CachingAllocatorConfig { val2 = std::min( val2, (std::numeric_limits::max() / (1024 * 1024))); m_max_split_size = val2 * 1024 * 1024; + } else if (kv[0].compare("roundup_power2_divisions") == 0) { + size_t val2 = stoi(kv[1]); + TORCH_CHECK( + llvm::isPowerOf2_64(val2), + "For roundups, the divisions have to be a power of 2 ", + ""); + m_roundup_power2_divisions = val2; + } else if (kv[0].compare("garbage_collection_threshold") == 0) { + /* + * Perform garbage collection of GPU memory blocks to avoid + * triggering expensive sync-and-reclaim-all operation. Upon setting + * the threshold (e.g., 0.8), the allocator will start reclaiming + * blocks if GPU memory capacity usage exceeds the threshold (i.e., + * 80% of total memory). + * Values 0.0 and 1.0 are not allowed as they are less meaningful. + */ + double val2 = stod(kv[1]); + TORCH_CHECK( + val2 > 0, + "garbage_collection_threshold too small, set it 0.0~1.0", + ""); + TORCH_CHECK( + val2 < 1.0, + "garbage_collection_threshold too big, set it 0.0~1.0", + ""); + m_garbage_collection_threshold = val2; } else { TORCH_CHECK(false, "Unrecognized CachingAllocator option: ", kv[0]); } @@ -469,18 +515,29 @@ class DeviceCachingAllocator { params.stat_types[static_cast(StatType::AGGREGATE)] = true; params.stat_types[static_cast(get_stat_type_for_pool(pool))] = true; + // First, try to get a block from the existing pool. bool block_found = // Search pool get_free_block(params) // Trigger callbacks and retry search - || (trigger_free_memory_callbacks(params) && get_free_block(params)) - // Attempt allocate - || alloc_block(params, false) - // Free enough available cached blocks to satisfy alloc and retry alloc. - || - (release_available_cached_blocks(params) && alloc_block(params, false)) - // Free all non-split cached blocks and retry alloc. - || (release_cached_blocks() && alloc_block(params, true)); + || (trigger_free_memory_callbacks(params) && get_free_block(params)); + + // Can't reuse an existing block; try to get a new one. + if (!block_found) { + // Do garbage collection if the flag is set. + if (C10_UNLIKELY( + CachingAllocatorConfig::garbage_collection_threshold() > 0.0)) { + garbage_collect_cached_blocks(); + } + // Attempt allocate + block_found = alloc_block(params, false) + // Free enough available cached blocks to satisfy alloc and retry + // alloc. + || (release_available_cached_blocks(params) && + alloc_block(params, false)) + // Free all non-split cached blocks and retry alloc.
+ || (release_cached_blocks() && alloc_block(params, true)); + } if (!block_found) { // For any error code other than cudaErrorMemoryAllocation, @@ -699,9 +756,9 @@ class DeviceCachingAllocator { if (*largest == 0) { // make an initial guess if a zero *largest is passed in size_t tmp_bytes; - cudaMemGetInfo( + C10_CUDA_CHECK(cudaMemGetInfo( largest, // Use free memory as an optimistic initial guess of *largest - &tmp_bytes); + &tmp_bytes)); } cache_info_aux(large_blocks, total, largest); cache_info_aux(small_blocks, total, largest); @@ -808,11 +865,43 @@ class DeviceCachingAllocator { return result; } + // This function takes the size and number of divisions arguments and rounds + // up the size argument to the nearest power-of-2 division. + // For example, if we need to round up 1200 and the number of divisions is 4, + // the size 1200 lies between 1024 and 2048 and if we do 4 divisions between + // them, the values are 1024, 1280, 1536, and 1792. So the function will + // return 1280 as the nearest ceiling of the power-of-2 division. + static size_t roundup_power2_next_division(size_t size, size_t divisions) { + if (C10_UNLIKELY(size <= 4 || divisions <= 1)) { + return size; + } + if (llvm::isPowerOf2_64(size)) { + return size; + } + + // Divide the space between these two powers of 2 into equal divisions. + // If the division is zero, return the power-of-2 ceiling. + size_t power2_floor = llvm::PowerOf2Floor(size); + size_t power2_divison = + power2_floor >> (63 - llvm::countLeadingZeros(divisions)); + if (C10_UNLIKELY(power2_divison == 0)) { + return (power2_floor << 1); + } + size_t round_size_floor = size & (~(power2_divison - 1)); + return (round_size_floor == size) ? size + : round_size_floor + power2_divison; + } + static size_t round_size(size_t size) { if (size < kMinBlockSize) { return kMinBlockSize; } else { - return kMinBlockSize * ((size + kMinBlockSize - 1) / kMinBlockSize); + auto divisions = CachingAllocatorConfig::roundup_power2_divisions(); + if (divisions > 0 && size > (kMinBlockSize * divisions)) { + return roundup_power2_next_division(size, divisions); + } else { + return kMinBlockSize * ((size + kMinBlockSize - 1) / kMinBlockSize); + } } } @@ -1037,6 +1126,14 @@ class DeviceCachingAllocator { bool get_free_block(AllocParams& p) { BlockPool& pool = *p.pool; + + if (C10_UNLIKELY( + CachingAllocatorConfig::garbage_collection_threshold() > 0.0)) { + // Track block reuse interval only when garbage collection is enabled. + for (auto& b : pool.blocks) { + ++b->gc_count; + } + } auto it = pool.blocks.lower_bound(&p.search_key); if (it == pool.blocks.end() || (*it)->stream != p.stream()) return false; @@ -1049,6 +1146,7 @@ class DeviceCachingAllocator { ((*it)->size >= p.size() + kLargeBuffer)) return false; p.block = *it; + (*it)->gc_count = 0; // Denote this block has been used pool.blocks.erase(it); return true; } @@ -1062,6 +1160,62 @@ class DeviceCachingAllocator { return freed_memory; } + void garbage_collect_cached_blocks() { + // Free unused cached blocks to reclaim GPU memory. + // Unlike release_cached_blocks(), this does not enforce synchronization and + // therefore should have less overhead. + + size_t gc_threshold = static_cast( + CachingAllocatorConfig::garbage_collection_threshold() * + allowed_memory_maximum); + // No need to trigger GC yet + if (total_allocated_memory <= gc_threshold) { + return; + } + const auto target_size = total_allocated_memory - gc_threshold; + size_t gc_reclaimed = 0; + + // Calculate the total age of the free-able blocks.
We'll use it later to + // get "avg age" threshold. + double total_age = 0.0; + int freeable_block_count = 0; + for (auto& b : large_blocks.blocks) { + if (!b->is_split()) { + total_age += b->gc_count; + ++freeable_block_count; + } + } + // No free-able blocks? + if (freeable_block_count == 0) { + return; + } + + // Repeat GC until we reach reclaim > target size. + bool block_freed = true; + while (gc_reclaimed < target_size && block_freed == true && + freeable_block_count > 0) { + // Free blocks exceeding this age threshold first. + double age_threshold = total_age / freeable_block_count; + // Stop iteration if we can no longer free a block. + block_freed = false; + + // Free blocks of > avg age. Don't stop upon reaching the target_size, + // we don't want this GC to be triggered frequently. + auto it = large_blocks.blocks.begin(); + while (it != large_blocks.blocks.end()) { + Block* block = *it; + ++it; + if (!block->is_split() && block->gc_count >= age_threshold) { + block_freed = true; + gc_reclaimed += block->size; + total_age -= block->gc_count; // Decrement the age + freeable_block_count--; // One less block that can be freed + release_block(block); + } + } + } + } + bool alloc_block(AllocParams& p, bool isRetry) { // Defensively checks for preexisting CUDA error state. C10_CUDA_CHECK(cudaGetLastError()); @@ -1304,7 +1458,7 @@ class DeviceCachingAllocator { cudaEvent_t event = e.first; Block* block = e.second; - cudaError_t err = cudaEventQuery(event); + cudaError_t err = C10_CUDA_ERROR_HANDLED(cudaEventQuery(event)); if (err == cudaErrorNotReady) { // ignore and clear the error if not ready cudaGetLastError(); @@ -1422,9 +1576,9 @@ class THCCachingAllocator { fraction, ". Please set within (0, 1)."); int activated_device; - cudaGetDevice(&activated_device); + C10_CUDA_CHECK(cudaGetDevice(&activated_device)); if (activated_device != device) { - cudaSetDevice(device); + C10_CUDA_CHECK(cudaSetDevice(device)); } device_allocator[device]->setMemoryFraction(fraction); } diff --git a/c10/cuda/CUDACachingAllocator.h b/c10/cuda/CUDACachingAllocator.h index d3a73943f7bbd0..9b1a6ecf159035 100644 --- a/c10/cuda/CUDACachingAllocator.h +++ b/c10/cuda/CUDACachingAllocator.h @@ -102,6 +102,7 @@ struct DeviceStats { // cudaMalloc).. struct BlockInfo { int64_t size = 0; + int32_t gc_counter = 0; bool allocated = false; bool active = false; }; diff --git a/c10/cuda/CUDAException.h b/c10/cuda/CUDAException.h index 77d0d07ac95e86..ca441711cbd679 100644 --- a/c10/cuda/CUDAException.h +++ b/c10/cuda/CUDAException.h @@ -63,6 +63,26 @@ class C10_CUDA_API CUDAError : public c10::Error { } \ } while (0) +// Indicates that a CUDA error is handled in a non-standard way +#define C10_CUDA_ERROR_HANDLED(EXPR) EXPR + +// Intentionally ignore a CUDA error +#define C10_CUDA_IGNORE_ERROR(EXPR) \ + do { \ + cudaError_t __err = EXPR; \ + if (__err != cudaSuccess) { \ + cudaError_t error_unused C10_UNUSED = cudaGetLastError(); \ + (void)error_unused; \ + } \ + } while (0) + +// Clear the last CUDA error +#define C10_CUDA_CLEAR_ERROR() \ + do { \ + cudaError_t error_unused C10_UNUSED = cudaGetLastError(); \ + (void)error_unused; \ + } while (0) + // This should be used directly after every kernel launch to ensure // the launch happened correctly and provide an early, close-to-source // diagnostic if it didn't. 
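The hunks that follow adopt these macros across c10/cuda. As a condensed sketch of the intended division of labor (not from the patch, assuming a CUDA translation unit; the function name is made up): C10_CUDA_ERROR_HANDLED currently just expands to its argument and marks call sites whose error code is handled explicitly, while C10_CUDA_IGNORE_ERROR and C10_CUDA_CLEAR_ERROR swallow and clear CUDA's sticky error state.

    #include <c10/cuda/CUDAException.h>
    #include <cuda_runtime.h>

    int device_count_or_zero() {
      int count = 0;
      // The error code is inspected here rather than thrown via C10_CUDA_CHECK.
      cudaError_t err = C10_CUDA_ERROR_HANDLED(cudaGetDeviceCount(&count));
      if (err != cudaSuccess) {
        // Swallow the failure and clear the sticky error so later calls start clean.
        C10_CUDA_CLEAR_ERROR();
        return 0;
      }
      // Best-effort call whose failure we deliberately ignore.
      C10_CUDA_IGNORE_ERROR(cudaDeviceSynchronize());
      return count;
    }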
diff --git a/c10/cuda/CUDAFunctions.cpp b/c10/cuda/CUDAFunctions.cpp index 255d798d13fb91..9ab61aa1f38125 100644 --- a/c10/cuda/CUDAFunctions.cpp +++ b/c10/cuda/CUDAFunctions.cpp @@ -10,16 +10,13 @@ namespace { // returns -1 on failure int32_t driver_version() { int driver_version = -1; - cudaError_t err = cudaDriverGetVersion(&driver_version); - if (err != cudaSuccess) { - cudaError_t last_err C10_UNUSED = cudaGetLastError(); - } + C10_CUDA_IGNORE_ERROR(cudaDriverGetVersion(&driver_version)); return driver_version; } int device_count_impl(bool fail_if_no_driver) { int count; - auto err = cudaGetDeviceCount(&count); + auto err = C10_CUDA_ERROR_HANDLED(cudaGetDeviceCount(&count)); if (err == cudaSuccess) { return count; } diff --git a/c10/cuda/CUDAStream.h b/c10/cuda/CUDAStream.h index 7bb97e88b991e6..6d17136341c6ec 100644 --- a/c10/cuda/CUDAStream.h +++ b/c10/cuda/CUDAStream.h @@ -111,7 +111,7 @@ class C10_CUDA_API CUDAStream { bool query() const { DeviceGuard guard{stream_.device()}; - cudaError_t err = cudaStreamQuery(stream()); + cudaError_t err = C10_CUDA_ERROR_HANDLED(cudaStreamQuery(stream())); if (err == cudaSuccess) { return true; diff --git a/c10/cuda/impl/CUDAGuardImpl.h b/c10/cuda/impl/CUDAGuardImpl.h index 8f5cfdc259d3bc..583feeec26000a 100644 --- a/c10/cuda/impl/CUDAGuardImpl.h +++ b/c10/cuda/impl/CUDAGuardImpl.h @@ -41,7 +41,7 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { } c10::optional uncheckedGetDevice() const noexcept { int device; - auto err = cudaGetDevice(&device); + const auto err = C10_CUDA_ERROR_HANDLED(cudaGetDevice(&device)); C10_CUDA_CHECK_WARN(err); if (err != cudaSuccess) { return c10::nullopt; @@ -164,7 +164,7 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { if (!event) return true; cudaEvent_t cuda_event = static_cast(event); - const cudaError_t err = cudaEventQuery(cuda_event); + const cudaError_t err = C10_CUDA_ERROR_HANDLED(cudaEventQuery(cuda_event)); if (err != cudaErrorNotReady) { C10_CUDA_CHECK(err); } else { diff --git a/c10/test/core/DispatchKeySet_test.cpp b/c10/test/core/DispatchKeySet_test.cpp index 43b06c110e5bac..db6a2cf721c903 100644 --- a/c10/test/core/DispatchKeySet_test.cpp +++ b/c10/test/core/DispatchKeySet_test.cpp @@ -3,25 +3,163 @@ #include #include +#include using namespace c10; +// This test exists not to be comprehensive, but to more clearly show +// what the semantics of DispatchKeySet are. +TEST(DispatchKeySet, ShowSemantics) { + // the "CPU" dispatch key is an instance of a per-backend-functionality key. + // It corresponds to "dense" functionality, "CPU" backend. + // This means that it gets a dense functionality bit, and a cpu backend bit + // set. + auto undefined_set = DispatchKeySet(); + auto dense_cpu_set = DispatchKeySet(DispatchKey::CPU); + ASSERT_TRUE(dense_cpu_set.has(DispatchKey::Dense)); + ASSERT_TRUE(dense_cpu_set.has_backend(BackendComponent::CPUBit)); + ASSERT_TRUE(dense_cpu_set.has(DispatchKey::CPU)); + + auto dense_lazy_set = DispatchKeySet(DispatchKey::Lazy); + ASSERT_TRUE(dense_lazy_set.has(DispatchKey::Dense)); + ASSERT_TRUE(dense_lazy_set.has_backend(BackendComponent::LazyBit)); + ASSERT_TRUE(dense_lazy_set.has(DispatchKey::Lazy)); + + // You can think of "Dense/Sparse", and "CPUBit/CUDABit", as "building block" + // dispatch keys. You are allowed to directly create keysets out of them! 
+ auto dense_cpu_set_from_building_blocks = DispatchKeySet(DispatchKey::Dense) | + DispatchKeySet(BackendComponent::CPUBit); + ASSERT_TRUE(dense_cpu_set.has(DispatchKey::Dense)); + ASSERT_TRUE(dense_cpu_set.has_backend(BackendComponent::CPUBit)); + ASSERT_TRUE(dense_cpu_set.has(DispatchKey::CPU)); + ASSERT_EQ(dense_cpu_set, dense_cpu_set_from_building_blocks); + + // Similarly, the AutogradCUDA key gets 2 bits in the keyset: + // The "Autograd" functionality bit, and the "CUDA" backend bit + auto autograd_cuda = DispatchKeySet(DispatchKey::AutogradCUDA); + ASSERT_TRUE(autograd_cuda.has(DispatchKey::AutogradFunctionality)); + ASSERT_TRUE(autograd_cuda.has_backend(BackendComponent::CUDABit)); + + // Because DispatchKeySet uses a condensed internal representation, you cannot + // use it to represent the FULL cross product of backends and functionalities + // for example: + auto autograd_dense_cpu_cuda = DispatchKeySet( + {DispatchKey::AutogradFunctionality, + DispatchKey::Dense, + DispatchKey::CUDA, + DispatchKey::CPU}); + auto fpga = DispatchKeySet(DispatchKey::FPGA); + auto fpga_and_cpu = DispatchKeySet({DispatchKey::FPGA, DispatchKey::CPU}); + // this keyset has all of the building block keys: + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::AutogradFunctionality)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::Dense)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has_backend(BackendComponent::CUDABit)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has_backend(BackendComponent::CPUBit)); + + // and it also has the "runtime" keys that correspond to the full + // cross-product of functionality + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::AutogradCPU)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::AutogradCPU)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::CPU)); + ASSERT_TRUE(autograd_dense_cpu_cuda.has(DispatchKey::CUDA)); + + // This means that there's no way to represent a keyset with, say, only + // Autograd CUDA + Dense CPU. Instead, you should think of a keyset as + // inheriting the full set of functionalities + backends of its keys. This + // means that the below keysets are all indistinguishable from each other. + ASSERT_EQ( + autograd_dense_cpu_cuda, + DispatchKeySet( + {DispatchKey::AutogradCUDA, + DispatchKey::AutogradCPU, + DispatchKey::CUDA, + DispatchKey::CPU})); + ASSERT_EQ( + autograd_dense_cpu_cuda, + DispatchKeySet({DispatchKey::AutogradCUDA, DispatchKey::CPU})); + ASSERT_EQ( + autograd_dense_cpu_cuda, + DispatchKeySet({DispatchKey::CUDA, DispatchKey::AutogradCPU})); + + // ~~~~~~~~~~ DispatchKeySet iterators ~~~~~~~~~~~ + + // Iterators allow you to iterate individually through the DispatchKey's in a + // DispatchKeySet + auto empty_set = DispatchKeySet(); + auto t1 = empty_set.begin(); + auto t2 = empty_set.end(); + ASSERT_EQ(*empty_set.begin(), *empty_set.end()); + + // However, only keys that correspond to actual runtime indices of kernels in + // the operator table show up when you iterate through a keyset. i.e. + // DispatchKey::Dense, and BackendComponent::CPUBit won't show up in an + // iterator. 
+ auto dense_cpu_iter = dense_cpu_set.begin(); + ASSERT_EQ(*dense_cpu_iter++, DispatchKey::CPU); + ASSERT_EQ(*dense_cpu_iter, *dense_cpu_set.end()); + + auto autograd_dense_cpu_cuda_iter = autograd_dense_cpu_cuda.begin(); + ASSERT_EQ(*autograd_dense_cpu_cuda_iter++, DispatchKey::CPU); + ASSERT_EQ(*autograd_dense_cpu_cuda_iter++, DispatchKey::CUDA); + ASSERT_EQ(*autograd_dense_cpu_cuda_iter++, DispatchKey::AutogradCPU); + ASSERT_EQ(*autograd_dense_cpu_cuda_iter++, DispatchKey::AutogradCUDA); + ASSERT_EQ(*autograd_dense_cpu_cuda_iter, *autograd_dense_cpu_cuda.end()); + + // But other "functionality bits" that are not defined per-backend DO get + // their own slots in the operator table. + auto mixed_keyset = DispatchKeySet(BackendComponent::CPUBit) | + DispatchKeySet( + {DispatchKey::FPGA, // runtime key + DispatchKey::Functionalize, // runtime key + DispatchKey::Dense}); // NOT a runtime key + auto mixed_iter = mixed_keyset.begin(); + ASSERT_EQ(*mixed_iter++, DispatchKey::CPU); + ASSERT_EQ(*mixed_iter++, DispatchKey::FPGA); + ASSERT_EQ(*mixed_iter++, DispatchKey::Functionalize); + ASSERT_EQ(*mixed_iter, *mixed_keyset.end()); +} + TEST(DispatchKeySet, Empty) { DispatchKeySet empty_set; - for (uint8_t i = 1; i < static_cast(DispatchKey::NumDispatchKeys); + for (uint8_t i = 0; + i <= static_cast(DispatchKey::EndOfRuntimeBackendKeys); i++) { auto tid = static_cast(i); + if (tid == DispatchKey::Undefined) + continue; ASSERT_FALSE(empty_set.has(tid)); } ASSERT_TRUE(empty_set.empty()); DispatchKeySet empty_set2; ASSERT_TRUE(empty_set == empty_set2); - ASSERT_EQ(empty_set.highestPriorityTypeId(), DispatchKey::Undefined); } -TEST(DispatchKeySet, Singleton) { - for (uint8_t i = 1; i < static_cast(DispatchKey::NumDispatchKeys); - i++) { +// This covers all keys that correspond to a single backend bit, e.g. +// BackendComponent::CPUBit. Even though these are NOT runtime keys, we still +// allow adding them directly to a keyset +TEST(DispatchKeySet, SingletonBackendComponent) { + for (const auto i : c10::irange(1, num_backends)) { + auto tid = static_cast(i); + DispatchKeySet sing(tid); + ASSERT_EQ(sing, sing); + ASSERT_EQ(sing, DispatchKeySet().add(tid)); + ASSERT_EQ(sing, sing.add(tid)); + ASSERT_EQ(sing, sing | sing); + ASSERT_FALSE(sing.empty()); + ASSERT_TRUE(sing.has(tid)); + } +} + +// This covers all keys that correspond to a single functionality bit: +// - runtime, not-per-backend functionality keys, e.g. +// DispatchKey::FuncTorchBatched +// - runtime, "fake backend" keys, e.g. DispatchKey::FPGA +// - NOT-runtime, per-backend functionality keys, e.g. DispatchKey::Dense +// Even though it's not a runtime key, we still allow adding it directly to a +// keyset. +// DispatchKey:: +TEST(DispatchKeySet, SingletonFunctionalityKeys) { + for (const auto i : c10::irange(1, num_functionality_keys)) { auto tid = static_cast(i); DispatchKeySet sing(tid); ASSERT_EQ(sing, sing); @@ -30,47 +168,145 @@ TEST(DispatchKeySet, Singleton) { ASSERT_EQ(sing, sing | sing); ASSERT_FALSE(sing.empty()); ASSERT_TRUE(sing.has(tid)); - ASSERT_EQ(sing.highestPriorityTypeId(), tid); ASSERT_EQ(sing.remove(tid), DispatchKeySet()); } } -TEST(DispatchKeySet, Doubleton) { - for (uint8_t i = 1; i < static_cast(DispatchKey::NumDispatchKeys); +// This covers runtime keys that are per-backend, +// and take up more than one bit in a DispatchKeySet. They take up one +// functionality bit + one backend bit. e.g. 
CPU, CUDA, SparseCPU, SparseCUDA, +// AutogradCPU, AutogradCUDA +TEST(DispatchKeySet, SingletonPerBackendFunctionalityKeys) { + for (uint8_t i = static_cast(DispatchKey::StartOfDenseBackends); + i <= static_cast(DispatchKey::EndOfRuntimeBackendKeys); + i++) { + auto tid = static_cast(i); + // Skip these because they aren't real keys. + if (tid == DispatchKey::StartOfDenseBackends || + tid == DispatchKey::StartOfSparseBackends || + tid == DispatchKey::StartOfQuantizedBackends || + tid == DispatchKey::StartOfAutogradBackends) { + continue; + } + DispatchKeySet sing(tid); + ASSERT_EQ(sing, sing); + ASSERT_EQ(sing, DispatchKeySet().add(tid)); + ASSERT_EQ(sing, sing.add(tid)); + ASSERT_EQ(sing, sing | sing); + ASSERT_FALSE(sing.empty()); + ASSERT_TRUE(sing.has(tid)); + + auto functionality_key = toFunctionalityKey(tid); + auto backend_key = toBackendComponent(tid); + // These two sets should be equivalent: + // DispatchKeySet(DispatchKey::CPU) + // DispatchKeySet({DispatchKey::Dense, BackendComponent::CPUBit}) + auto expected_ks = + DispatchKeySet(functionality_key) | DispatchKeySet(backend_key); + ASSERT_EQ(sing, expected_ks); + // These two sets should be equivalent: + // DispatchKeySet(DispatchKey::CPU).remove(DispatchKey::Dense) + // DispatchKeySet(BackendComponent::CPUBit) + expected_ks = DispatchKeySet(toBackendComponent(tid)); + ASSERT_EQ(sing.remove(tid), expected_ks); + } +} + +TEST(DispatchKeySet, DoubletonPerBackend) { + for (uint8_t i = static_cast(DispatchKey::StartOfDenseBackends); + i <= static_cast(DispatchKey::EndOfRuntimeBackendKeys); i++) { for (uint8_t j = i + 1; - j < static_cast(DispatchKey::NumDispatchKeys); + j <= static_cast(DispatchKey::EndOfRuntimeBackendKeys); j++) { ASSERT_LT(i, j); auto tid1 = static_cast(i); auto tid2 = static_cast(j); - auto doub = DispatchKeySet(tid1).add(tid2); - ASSERT_EQ(doub, DispatchKeySet(tid1) | DispatchKeySet(tid2)); - ASSERT_TRUE(doub.has(tid1)); - ASSERT_TRUE(doub.has(tid2)); - ASSERT_EQ(doub.highestPriorityTypeId(), tid2); // relies on i < j + + // Skip these because they aren't real keys. 
+ if (tid1 == DispatchKey::StartOfDenseBackends || + tid1 == DispatchKey::StartOfSparseBackends || + tid1 == DispatchKey::StartOfQuantizedBackends || + tid1 == DispatchKey::StartOfAutogradBackends) + continue; + if (tid2 == DispatchKey::StartOfDenseBackends || + tid2 == DispatchKey::StartOfSparseBackends || + tid2 == DispatchKey::StartOfQuantizedBackends || + tid2 == DispatchKey::StartOfAutogradBackends) + continue; + + auto backend1 = toBackendComponent(tid1); + auto backend2 = toBackendComponent(tid2); + auto functionality1 = toFunctionalityKey(tid1); + auto functionality2 = toFunctionalityKey(tid2); + + auto combined = DispatchKeySet({tid1, tid2}); + // The combined set has the backend bits + ASSERT_TRUE(combined.has_backend(backend1)); + ASSERT_TRUE(combined.has_backend(backend2)); + // and it has the functionality bits + ASSERT_TRUE(combined.has(functionality1)); + ASSERT_TRUE(combined.has(functionality2)); + // and it has the original two runtime keys + ASSERT_TRUE(combined.has(tid1)); + ASSERT_TRUE(combined.has(tid2)); + + // Add all of the keys in the keyset to a real set + std::unordered_set visited_keys; + auto iter = combined.begin(); + while (*iter != *combined.end()) { + visited_keys.insert(*iter); + ++iter; + } + std::unordered_set expected_keys; + expected_keys.insert( + toRuntimePerBackendFunctionalityKey(functionality1, backend1)); + expected_keys.insert( + toRuntimePerBackendFunctionalityKey(functionality1, backend2)); + expected_keys.insert( + toRuntimePerBackendFunctionalityKey(functionality2, backend1)); + expected_keys.insert( + toRuntimePerBackendFunctionalityKey(functionality2, backend2)); + ASSERT_EQ(expected_keys, visited_keys); + + if (backend1 == backend2 || functionality1 == functionality2) { + // We have two runtime keys, with either the same backend or the same + // per-backend functionalities. E.g. {AutogradCUDA, CUDA} or + // {AutogradCPU, AutogradCUDA}. There should be 2 total runtime keys in + // this set. + ASSERT_EQ(2, visited_keys.size()); + } else { + // since i and j are different keys, they should not have the same + // functionality and backend + ASSERT_TRUE(backend1 != backend2 && functionality1 != functionality2); + // We have two runtime keys that have different backends + per-backend + // functionalities. So we should expect the full cross product of + // runtime keys to be in the set. e.g.
if i = AutogradCUDA, and j = CPU, + // then combined = {AutogradCUDA, AutogradCPU, CUDA, CPU} + ASSERT_EQ(4, visited_keys.size()); + } } } } TEST(DispatchKeySet, Full) { DispatchKeySet full(DispatchKeySet::FULL); - for (uint8_t i = 1; i < static_cast(DispatchKey::NumDispatchKeys); - i++) { + for (const auto i : c10::irange(1, num_functionality_keys)) { auto tid = static_cast(i); ASSERT_TRUE(full.has(tid)); } + ASSERT_FALSE(full.has(DispatchKey::EndOfFunctionalityKeys)); } TEST(DispatchKeySet, IteratorBasicOps) { DispatchKeySet empty_set; DispatchKeySet full_set(DispatchKeySet::FULL); - DispatchKeySet mutated_set = empty_set.add(static_cast(1)); + DispatchKeySet mutated_set = empty_set.add(DispatchKey::CPU); // Constructor + Comparison - ASSERT_EQ(*empty_set.begin(), DispatchKey::NumDispatchKeys); - ASSERT_EQ(*empty_set.end(), DispatchKey::NumDispatchKeys); - ASSERT_EQ(*mutated_set.begin(), static_cast(1)); + ASSERT_EQ(*empty_set.begin(), DispatchKey::EndOfFunctionalityKeys); + ASSERT_EQ(*empty_set.end(), DispatchKey::EndOfFunctionalityKeys); + ASSERT_EQ(*mutated_set.begin(), DispatchKey::CPU); ASSERT_TRUE(empty_set.begin() == empty_set.end()); ASSERT_TRUE(full_set.begin() != full_set.end()); @@ -80,6 +316,25 @@ TEST(DispatchKeySet, IteratorBasicOps) { ASSERT_TRUE(full_set.begin() != ++full_set.begin()); } +TEST(DispatchKeySet, getHighestPriorityBackendTypeId) { + // AutogradCPU isn't a backend key so it is ignored + DispatchKeySet dense_cpu({DispatchKey::AutogradCPU, DispatchKey::CPU}); + ASSERT_EQ(DispatchKey::CPU, c10::highestPriorityBackendTypeId(dense_cpu)); + + // Functionalize isn't a backend key so it is ignored + DispatchKeySet sparse_cuda( + {DispatchKey::Functionalize, DispatchKey::SparseCUDA}); + ASSERT_EQ( + DispatchKey::SparseCUDA, c10::highestPriorityBackendTypeId(sparse_cuda)); + + // quantizedCUDA has higher priority than CUDA + DispatchKeySet quantized_cuda( + {DispatchKey::CUDA, DispatchKey::QuantizedCUDA}); + ASSERT_EQ( + DispatchKey::QuantizedCUDA, + c10::highestPriorityBackendTypeId(quantized_cuda)); +} + TEST(DispatchKeySet, IteratorEmpty) { DispatchKeySet empty_set; uint8_t i = 0; @@ -90,16 +345,37 @@ TEST(DispatchKeySet, IteratorEmpty) { ASSERT_EQ(i, 0); } +TEST(DispatchKeySet, IteratorCrossProduct) { + // The iterator should return all runtime keys in the set, + // including the cross product of {backends} x {functionalities} + auto ks = + DispatchKeySet({BackendComponent::CPUBit, BackendComponent::CUDABit}) | + DispatchKeySet( + {DispatchKey::Dense, + DispatchKey::FPGA, + DispatchKey::AutogradFunctionality}); + + auto iter = ks.begin(); + // iterate through dense backends first. + ASSERT_EQ(DispatchKey::CPU, *(iter++)); + ASSERT_EQ(DispatchKey::CUDA, *(iter++)); + // FPGA doesn't have a backend bit, so it isn't included in the cross product. + ASSERT_EQ(DispatchKey::FPGA, *(iter++)); + // iterate through the autograd keys laster. + ASSERT_EQ(DispatchKey::AutogradCPU, *(iter++)); + ASSERT_EQ(DispatchKey::AutogradCUDA, *(iter++)); +} + TEST(DispatchKeySet, IteratorFull) { DispatchKeySet full_set(DispatchKeySet::FULL); uint8_t i = 0; for (const auto& it : full_set) { i++; - ASSERT_TRUE(it == static_cast(i)); - ASSERT_TRUE(it != DispatchKey::NumDispatchKeys); } - ASSERT_EQ(i, static_cast(DispatchKey::NumDispatchKeys) - 1); + // Total # of runtime entries includes an entry for DispatchKey::Undefined, + // which is not included when iterating through the DispatchKeySet. 
+ ASSERT_EQ(i, num_runtime_entries - 1); } TEST(DispatchKeySet, IteratorRangeFull) { @@ -108,41 +384,61 @@ TEST(DispatchKeySet, IteratorRangeFull) { for (DispatchKey dispatch_key : full_set) { i++; - ASSERT_TRUE(dispatch_key == static_cast(i)); } - ASSERT_EQ(i, static_cast(DispatchKey::NumDispatchKeys) - 1); -} - -TEST(DispatchKeySet, SpecificKeys) { - DispatchKeySet keyset({ - static_cast(0), // Undefined should be ignored - static_cast(4), - static_cast(10), - static_cast(15), - }); - std::unordered_set visited_keys; - - for (DispatchKey key : keyset) { - visited_keys.insert(key); - } - - ASSERT_EQ(visited_keys.size(), 3); - ASSERT_TRUE( - visited_keys.find(static_cast(4)) != visited_keys.end()); - ASSERT_TRUE( - visited_keys.find(static_cast(10)) != visited_keys.end()); - ASSERT_TRUE( - visited_keys.find(static_cast(15)) != visited_keys.end()); + // Total # of runtime entries includes an entry for DispatchKey::Undefined, + // which is not included when iterating through the DispatchKeySet. + ASSERT_EQ(i, num_runtime_entries - 1); } TEST(DispatchKeySet, FailAtEndIterator) { DispatchKeySet full_set(DispatchKeySet::FULL); uint64_t raw_repr = full_set.raw_repr(); + // doesn't throw + DispatchKeySet::iterator(&raw_repr, num_backends + num_functionality_keys); // NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto) EXPECT_THROW( DispatchKeySet::iterator( - &raw_repr, static_cast(DispatchKey::NumDispatchKeys) + 1), + &raw_repr, num_backends + num_functionality_keys + 1), c10::Error); } + +TEST(DispatchKeySet, TestKeyOrderingInvariants) { + for (uint8_t i = static_cast(DispatchKey::StartOfDenseBackends); + i <= static_cast(DispatchKey::EndOfRuntimeBackendKeys); + i++) { + auto k = static_cast(i); + // Note [The Ordering of Per-Backend Dispatch Keys Matters!] + // The DispatchKey enum includes all of the runtime keys for + // Dense/Sparse/Quantized/Autograd, (e.g. CPU, CUDA, SparseCPU, SparseCUDA, + // AutogradCPU, AutogradCUDA, etc). And we expect the ordering of those keys + // to be the same as the ordering of the backends in the `BackendComponent` + // enum. This makes several utilities in `DispatchKey.h` and + // `DispatchKeySet.h` significantly easier to implement. The purpose of the + // test is to assert (through CI) that this invariant is maintained. + // + // The only way that we can really check this invariant is by + // comparing the string names of each enum. + // We only really care about the ordering for "real" keys that are actually + // used, which we expect to be able to print properly. This saves us from + // having to enumerate the full set of possible runtime keys in + // DispatchKey::toString(). It also relies on toString() being implemented + // correctly. + auto functionality_str = std::string(toString(k)); + if (functionality_str == "UNKNOWN_TENSOR_TYPE_ID") + continue; + + auto computed_backend_k = toBackendComponent(k); + auto computed_backend_str = std::string(toString(computed_backend_k)); + // Skip, e.g., the "Bit" from "CPUBit" + computed_backend_str = + computed_backend_str.substr(0, computed_backend_str.size() - 3); + + ASSERT_TRUE( + functionality_str.find(computed_backend_str) != std::string::npos) + << "DispatchKey invariant broken! Found a key that is not ordered correctly" + << " with its backend bit. 
key = " << toString(k) << ", " << k + << ", computed backend = " << toString(computed_backend_k); + } +} diff --git a/c10/test/util/Synchronized_test.cpp b/c10/test/util/Synchronized_test.cpp new file mode 100644 index 00000000000000..ce781a10cadb4c --- /dev/null +++ b/c10/test/util/Synchronized_test.cpp @@ -0,0 +1,43 @@ +#include +#include + +#include +#include + +namespace { + +TEST(Synchronized, TestSingleThreadExecution) { + c10::Synchronized iv(0); + const int kMaxValue = 100; + for (int i = 0; i < kMaxValue; ++i) { + auto ret = iv.withLock([](int& iv) { return ++iv; }); + EXPECT_EQ(ret, i + 1); + } + + iv.withLock([kMaxValue](int& iv) { EXPECT_EQ(iv, kMaxValue); }); +} + +TEST(Synchronized, TestMultiThreadedExecution) { + c10::Synchronized iv(0); +#define NUM_LOOP_INCREMENTS 10000 + + auto thread_cb = [&iv]() { + for (int i = 0; i < NUM_LOOP_INCREMENTS; ++i) { + iv.withLock([](int& iv) { ++iv; }); + } + }; + + std::array threads; + for (auto& t : threads) { + t = std::thread(thread_cb); + } + + for (auto& t : threads) { + t.join(); + } + + iv.withLock([](int& iv) { EXPECT_EQ(iv, NUM_LOOP_INCREMENTS * 10); }); +#undef NUM_LOOP_INCREMENTS +} + +} // namespace diff --git a/c10/test/util/ordered_preserving_dict_test.cpp b/c10/test/util/ordered_preserving_dict_test.cpp index 773b2e7a2a35b3..aa1d7f0f986eda 100644 --- a/c10/test/util/ordered_preserving_dict_test.cpp +++ b/c10/test/util/ordered_preserving_dict_test.cpp @@ -48,7 +48,7 @@ dict_int_int test_dict(dict_int_int& dict) { } dict.erase(begin, end); - std::vector order; + std::vector order; for (const auto i : c10::irange(100)) { if (!erase_set.count(i)) { order.push_back(i); @@ -211,12 +211,12 @@ TEST(OrderedPreservingDictTest, test_range_erase) { using HMap = ska_ordered::order_preserving_flat_hash_map; - const std::size_t nb_values = 1000; + const int64_t nb_values = 1000; HMap map; for (const auto i : c10::irange(nb_values)) { map[c10::guts::to_string(i)] = i; auto begin = map.begin(); - for (size_t j = 0; j <= i; ++j, begin++) { + for (int64_t j = 0; j <= i; ++j, begin++) { TORCH_INTERNAL_ASSERT(begin->second == j); } } diff --git a/c10/util/Half.h b/c10/util/Half.h index f74dc89bb0ef7d..a877efe9d2ca30 100644 --- a/c10/util/Half.h +++ b/c10/util/Half.h @@ -392,28 +392,32 @@ struct alignas(2) Half { #endif }; -// This is just a placeholder for whatever complex representation we -// end up deciding to use for half-precision complex numbers. 
+// TODO : move to complex.h template <> struct alignas(4) complex { - using value_type = Half; Half real_; Half imag_; + + // Constructors complex() = default; - Half real() const { + // Half constructor is not constexpr so the following constructor can't + // be constexpr + C10_HOST_DEVICE explicit inline complex(const Half& real, const Half& imag) + : real_(real), imag_(imag) {} + C10_HOST_DEVICE explicit inline complex(const c10::complex& value) + : real_(value.real()), imag_(value.imag()) {} + + // Conversion operator + inline C10_HOST_DEVICE operator c10::complex() const { + return {real_, imag_}; + } + + constexpr C10_HOST_DEVICE Half real() const { return real_; } - Half imag() const { + constexpr C10_HOST_DEVICE Half imag() const { return imag_; } - explicit inline complex(c10::complex value) - : real_(value.real()), imag_(value.imag()) {} - explicit inline complex(c10::complex value) - : real_(static_cast(value.real())), - imag_(static_cast(value.imag())) {} - inline operator c10::complex() const { - return {real_, imag_}; - } }; // In some versions of MSVC, there will be a compiler error when building. diff --git a/c10/util/LeftRight.h b/c10/util/LeftRight.h index 13529f2ea0c780..e45267cb8f7e36 100644 --- a/c10/util/LeftRight.h +++ b/c10/util/LeftRight.h @@ -1,4 +1,5 @@ #include +#include #include #include #include @@ -192,13 +193,9 @@ class LeftRight final { // read-write lock to protect T (data). template class RWSafeLeftRightWrapper final { - using mutexType = std::mutex; - using rLockType = std::unique_lock; - using wLockType = std::unique_lock; - public: template - explicit RWSafeLeftRightWrapper(const Args&... args) : _data{args...} {} + explicit RWSafeLeftRightWrapper(const Args&... args) : data_{args...} {} // RWSafeLeftRightWrapper is not copyable or moveable since LeftRight // is not copyable or moveable. @@ -209,19 +206,17 @@ class RWSafeLeftRightWrapper final { template auto read(F&& readFunc) const -> typename std::result_of::type { - rLockType lock(mutex_); - return readFunc(_data); + return data_.withLock( + [&readFunc](T const& data) { return readFunc(data); }); } template auto write(F&& writeFunc) -> typename std::result_of::type { - wLockType lock(mutex_); - return writeFunc(_data); + return data_.withLock([&writeFunc](T& data) { return writeFunc(data); }); } private: - T _data; - mutable mutexType mutex_; + c10::Synchronized data_; }; } // namespace c10 diff --git a/c10/util/OptionalArrayRef.h b/c10/util/OptionalArrayRef.h new file mode 100644 index 00000000000000..7ca375d7cb785e --- /dev/null +++ b/c10/util/OptionalArrayRef.h @@ -0,0 +1,228 @@ +// This file defines OptionalArrayRef, a class that has almost the same +// exact functionality as c10::optional>, except that its +// converting constructor fixes a dangling pointer issue. +// +// The implicit converting constructor of both c10::optional> and +// std::optional> can cause the underlying ArrayRef to store +// a dangling pointer. OptionalArrayRef prevents this by wrapping +// a c10::optional> and fixing the constructor implementation. +// +// See https://github.com/pytorch/pytorch/issues/63645 for more on this. 
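As an illustrative sketch of the hazard described above (not from the patch; the function name is made up), the converting constructor of c10::optional can silently capture a temporary through ArrayRef's non-owning view:

    #include <c10/util/ArrayRef.h>
    #include <c10/util/Optional.h>
    #include <cstdint>
    #include <vector>

    c10::optional<c10::IntArrayRef> sizes_or_nullopt(bool have_sizes) {
      if (!have_sizes) {
        return c10::nullopt;
      }
      // BUG: the ArrayRef stored in the optional points into a temporary vector
      // that is destroyed at the end of this return statement, so the caller
      // receives a dangling view. OptionalArrayRef exists to make this mistake
      // harder to write.
      return std::vector<int64_t>{1, 2, 3};
    }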
+ +#pragma once + +#include +#include + +namespace c10 { + +template +class OptionalArrayRef final { + public: + // Constructors + + constexpr OptionalArrayRef() noexcept {} + + constexpr OptionalArrayRef(nullopt_t) noexcept {} + + OptionalArrayRef(const OptionalArrayRef& other) = default; + + OptionalArrayRef(OptionalArrayRef&& other) = default; + + constexpr OptionalArrayRef(const optional>& other) noexcept + : wrapped_opt_array_ref(other) {} + + constexpr OptionalArrayRef(optional>&& other) noexcept + : wrapped_opt_array_ref(other) {} + + constexpr OptionalArrayRef(const T& value) noexcept + : wrapped_opt_array_ref(value) {} + + template < + typename U = ArrayRef, + std::enable_if_t< + !std::is_same, OptionalArrayRef>::value && + !std::is_same, in_place_t>::value && + std::is_constructible, U&&>::value && + std::is_convertible>::value && + !std::is_convertible::value, + bool> = false> + constexpr OptionalArrayRef(U&& value) noexcept( + std::is_nothrow_constructible, U&&>::value) + : wrapped_opt_array_ref(value) {} + + template < + typename U = ArrayRef, + std::enable_if_t< + !std::is_same, OptionalArrayRef>::value && + !std::is_same, in_place_t>::value && + std::is_constructible, U&&>::value && + !std::is_convertible>::value, + bool> = false> + constexpr explicit OptionalArrayRef(U&& value) noexcept( + std::is_nothrow_constructible, U&&>::value) + : wrapped_opt_array_ref(value) {} + + template + constexpr explicit OptionalArrayRef(in_place_t ip, Args&&... args) noexcept + : wrapped_opt_array_ref(ip, args...) {} + + template + constexpr explicit OptionalArrayRef( + in_place_t ip, + std::initializer_list il, + Args&&... args) + : wrapped_opt_array_ref(ip, il, args...) {} + + // Destructor + + ~OptionalArrayRef() = default; + + // Assignment + + constexpr OptionalArrayRef& operator=(nullopt_t) noexcept { + wrapped_opt_array_ref = c10::nullopt; + return *this; + } + + OptionalArrayRef& operator=(const OptionalArrayRef& other) = default; + + OptionalArrayRef& operator=(OptionalArrayRef&& other) = default; + + constexpr OptionalArrayRef& operator=( + const optional>& other) noexcept { + wrapped_opt_array_ref = other; + return *this; + } + + constexpr OptionalArrayRef& operator=( + optional>&& other) noexcept { + wrapped_opt_array_ref = other; + return *this; + } + + template > + constexpr std::enable_if_t< + !std::is_same, OptionalArrayRef>::value && + std::is_constructible, U&&>::value && + std::is_assignable&, U&&>::value, + OptionalArrayRef&> + operator=(U&& value) noexcept( + std::is_nothrow_constructible, U&&>::value&& + std::is_nothrow_assignable&, U&&>::value) { + wrapped_opt_array_ref = value; + return *this; + } + + // Observers + + constexpr ArrayRef* operator->() noexcept { + return &wrapped_opt_array_ref.value(); + } + + constexpr const ArrayRef* operator->() const noexcept { + return &wrapped_opt_array_ref.value(); + } + + constexpr ArrayRef& operator*() & noexcept { + return wrapped_opt_array_ref.value(); + } + + constexpr const ArrayRef& operator*() const& noexcept { + return wrapped_opt_array_ref.value(); + } + + constexpr ArrayRef&& operator*() && noexcept { + return std::move(wrapped_opt_array_ref.value()); + } + + constexpr const ArrayRef&& operator*() const&& noexcept { + return std::move(wrapped_opt_array_ref.value()); + } + + constexpr explicit operator bool() const noexcept { + return wrapped_opt_array_ref.has_value(); + } + + constexpr bool has_value() const noexcept { + return wrapped_opt_array_ref.has_value(); + } + + constexpr ArrayRef& value() & { + return 
wrapped_opt_array_ref.value(); + } + + constexpr const ArrayRef& value() const& { + return wrapped_opt_array_ref.value(); + } + + constexpr ArrayRef&& value() && { + return std::move(wrapped_opt_array_ref.value()); + } + + constexpr const ArrayRef&& value() const&& { + return std::move(wrapped_opt_array_ref.value()); + } + + template + constexpr std:: + enable_if_t>::value, ArrayRef> + value_or(U&& default_value) const& { + return wrapped_opt_array_ref.value_or(default_value); + } + + template + constexpr std:: + enable_if_t>::value, ArrayRef> + value_or(U&& default_value) && { + return wrapped_opt_array_ref.value_or(default_value); + } + + // Modifiers + + constexpr void swap(OptionalArrayRef& other) noexcept { + std::swap(wrapped_opt_array_ref, other.wrapped_opt_array_ref); + } + + constexpr void reset() noexcept { + wrapped_opt_array_ref.reset(); + } + + template + constexpr std::enable_if_t< + std::is_constructible, Args&&...>::value, + ArrayRef&> + emplace(Args&&... args) noexcept( + std::is_nothrow_constructible, Args&&...>::value) { + return wrapped_opt_array_ref.emplace(args...); + } + + template + constexpr ArrayRef& emplace( + std::initializer_list il, + Args&&... args) noexcept { + return wrapped_opt_array_ref.emplace(il, args...); + } + + private: + optional> wrapped_opt_array_ref; +}; + +using OptionalIntArrayRef = OptionalArrayRef; + +inline bool operator==( + const OptionalIntArrayRef& a1, + const IntArrayRef& other) { + if (!a1.has_value()) { + return false; + } + return a1.value() == other; +} + +inline bool operator==( + const c10::IntArrayRef& a1, + const c10::OptionalIntArrayRef& a2) { + return a2 == a1; +} + +} // namespace c10 diff --git a/c10/util/Synchronized.h b/c10/util/Synchronized.h index 205ded5a5e1f13..1679d7060fe05c 100644 --- a/c10/util/Synchronized.h +++ b/c10/util/Synchronized.h @@ -42,9 +42,9 @@ class Synchronized final { * provided callback safely. */ template - void withLock(CB cb) { + typename std::result_of::type withLock(CB cb) { std::lock_guard guard(this->mutex_); - cb(this->data_); + return cb(this->data_); } /** @@ -53,9 +53,9 @@ class Synchronized final { * the provided callback safely. 
*/ template - void withLock(CB cb) const { + typename std::result_of::type withLock(CB cb) const { std::lock_guard guard(this->mutex_); - cb(this->data_); + return cb(this->data_); } }; } // end namespace c10 diff --git a/c10/util/TypeCast.h b/c10/util/TypeCast.h index 86c5c9f62231c4..1c6a72bab4926f 100644 --- a/c10/util/TypeCast.h +++ b/c10/util/TypeCast.h @@ -45,7 +45,8 @@ struct static_cast_with_inter_type { C10_HOST_DEVICE __ubsan_ignore_undefined__ static inline dest_t apply( src_t src) { constexpr bool real = needs_real::value; - return static_cast(maybe_real::apply(src)); + auto r = maybe_real::apply(src); + return static_cast(r); } }; @@ -68,6 +69,36 @@ struct static_cast_with_inter_type { } }; +template <> +struct static_cast_with_inter_type, c10::BFloat16> { + C10_HOST_DEVICE __ubsan_ignore_undefined__ static inline c10::complex< + c10::Half> + apply(c10::BFloat16 src) { + return static_cast>(c10::complex{src}); + } +}; + +template <> +struct static_cast_with_inter_type, c10::Half> { + C10_HOST_DEVICE __ubsan_ignore_undefined__ static inline c10::complex< + c10::Half> + apply(c10::Half src) { + return static_cast>(c10::complex{src}); + } +}; + +template <> +struct static_cast_with_inter_type< + c10::complex, + c10::complex> { + C10_HOST_DEVICE __ubsan_ignore_undefined__ static inline c10::complex< + c10::Half> + apply(c10::complex src) { + return static_cast>( + static_cast>(src)); + } +}; + // Dynamic type casting utils: // - fetch_and_cast // - cast_and_store @@ -130,7 +161,7 @@ C10_HOST_DEVICE inline dest_t fetch_and_cast( const ScalarType src_type, const void* ptr) { switch (src_type) { - AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(FETCH_AND_CAST_CASE) + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(FETCH_AND_CAST_CASE) default: ERROR_UNSUPPORTED_CAST } @@ -149,7 +180,7 @@ C10_HOST_DEVICE inline void cast_and_store( void* ptr, src_t value) { switch (dest_type) { - AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF(CAST_AND_STORE_CASE) + AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(CAST_AND_STORE_CASE) default:; } ERROR_UNSUPPORTED_CAST diff --git a/c10/util/accumulate.h b/c10/util/accumulate.h index 086a7977401c52..8d0cc49c8ecbd6 100644 --- a/c10/util/accumulate.h +++ b/c10/util/accumulate.h @@ -82,7 +82,7 @@ template < inline int64_t numelements_from_dim(const int k, const C& dims) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(k >= 0); - if (k > dims.size()) { + if (k > static_cast(dims.size())) { return 1; } else { auto cbegin = dims.cbegin(); diff --git a/c10/util/int128.cpp b/c10/util/int128.cpp index a080e73430b365..f83dba49983363 100644 --- a/c10/util/int128.cpp +++ b/c10/util/int128.cpp @@ -171,7 +171,7 @@ std::ostream& operator<<(std::ostream& o, const uint128& b) { // Add the requisite padding. 
std::streamsize width = o.width(0); - if (width > rep.size()) { + if (width > static_cast(rep.size())) { if ((flags & std::ios::adjustfield) == std::ios::left) { rep.append(width - rep.size(), o.fill()); } else { diff --git a/c10/util/safe_numerics.h b/c10/util/safe_numerics.h new file mode 100644 index 00000000000000..7eb9ed39395d86 --- /dev/null +++ b/c10/util/safe_numerics.h @@ -0,0 +1,74 @@ +#pragma once +#include +#include + +#include +#include +#include + +// GCC has __builtin_mul_overflow from before it supported __has_builtin +#ifdef _MSC_VER +#define C10_HAS_BUILTIN_OVERFLOW() (0) +#include +#include +#else +#define C10_HAS_BUILTIN_OVERFLOW() (1) +#endif + +namespace c10 { + +C10_ALWAYS_INLINE bool add_overflows(uint64_t a, uint64_t b, uint64_t* out) { +#if C10_HAS_BUILTIN_OVERFLOW() + return __builtin_add_overflow(a, b, out); +#else + unsigned long long tmp; + auto carry = _addcarry_u64(0, a, b, &tmp); + *out = tmp; + return carry; +#endif +} + +C10_ALWAYS_INLINE bool mul_overflows(uint64_t a, uint64_t b, uint64_t* out) { +#if C10_HAS_BUILTIN_OVERFLOW() + return __builtin_mul_overflow(a, b, out); +#else + *out = a * b; + // This test isnt exact, but avoids doing integer division + return ( + (c10::llvm::countLeadingZeros(a) + c10::llvm::countLeadingZeros(b)) < 64); +#endif +} + +template +bool safe_multiplies_u64(It first, It last, uint64_t* out) { +#if C10_HAS_BUILTIN_OVERFLOW() + uint64_t prod = 1; + bool overflow = false; + for (; first != last; ++first) { + overflow |= c10::mul_overflows(prod, *first, &prod); + } + *out = prod; + return overflow; +#else + uint64_t prod = 1; + uint64_t prod_log2 = 0; + bool is_zero = false; + for (; first != last; ++first) { + auto x = static_cast(*first); + prod *= x; + // log2(0) isn't valid, so need to track it specially + is_zero |= (x == 0); + prod_log2 += c10::llvm::Log2_64_Ceil(x); + } + *out = prod; + // This test isnt exact, but avoids doing integer division + return !is_zero && (prod_log2 >= 64); +#endif +} + +template +bool safe_multiplies_u64(const Container& c, uint64_t* out) { + return safe_multiplies_u64(c.begin(), c.end(), out); +} + +} // namespace c10 diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt index c636cd18c0a5a4..b44ea8150f6eeb 100644 --- a/caffe2/CMakeLists.txt +++ b/caffe2/CMakeLists.txt @@ -350,6 +350,13 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) "${TORCH_SRC_DIR}/csrc/autograd/generated/ADInplaceOrViewType_0.cpp" "${TORCH_SRC_DIR}/csrc/autograd/generated/ADInplaceOrViewType_1.cpp" ) + if(BUILD_LAZY_TS_BACKEND) + list(APPEND GENERATED_CXX_TORCH + "${TORCH_SRC_DIR}/csrc/lazy/generated/LazyNativeFunctions.cpp" + "${TORCH_SRC_DIR}/csrc/lazy/generated/RegisterAutogradLazy.cpp" + "${TORCH_SRC_DIR}/csrc/lazy/generated/RegisterLazy.cpp" + ) + endif() endif() set(GENERATED_H_TORCH @@ -360,6 +367,8 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) if(NOT INTERN_DISABLE_AUTOGRAD) list(APPEND GENERATED_H_TORCH "${TORCH_SRC_DIR}/csrc/autograd/generated/VariableType.h" + "${TORCH_SRC_DIR}/csrc/lazy/generated/LazyIr.h" + "${TORCH_SRC_DIR}/csrc/lazy/generated/LazyNativeFunctions.h" ) endif() @@ -397,18 +406,31 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) ${GENERATED_TESTING_PYTHON} ) + set(GEN_PER_OPERATOR_FLAG) + if(USE_PER_OPERATOR_HEADERS) + list(APPEND GEN_PER_OPERATOR_FLAG "--per_operator_headers") + endif() + add_custom_command( OUTPUT ${TORCH_GENERATED_CODE} COMMAND "${PYTHON_EXECUTABLE}" tools/setup_helpers/generate_code.py --native-functions-path 
"aten/src/ATen/native/native_functions.yaml" - --nn-path "aten/src" $<$:--disable-autograd> $<$:--selected-op-list-path="${SELECTED_OP_LIST}"> --force_schema_registration + --gen_lazy_ts_backend + ${GEN_PER_OPERATOR_FLAG} DEPENDS "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" + "${TORCH_ROOT}/aten/src/ATen/native/ts_native_functions.yaml" + "${TORCH_ROOT}/torch/csrc/lazy/core/shape_inference.h" + "${TORCH_ROOT}/torch/csrc/lazy/ts_backend/ts_native_functions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.h" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/LazyIr.h" + "${TORCH_ROOT}/aten/src/ATen/templates/RegisterDispatchKey.cpp" "${TOOLS_PATH}/autograd/templates/VariableType.h" "${TOOLS_PATH}/autograd/templates/VariableType.cpp" "${TOOLS_PATH}/autograd/templates/ADInplaceOrViewType.cpp" @@ -436,6 +458,10 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) "${TOOLS_PATH}/autograd/gen_variable_type.py" "${TOOLS_PATH}/autograd/gen_inplace_or_view_type.py" "${TOOLS_PATH}/autograd/load_derivatives.py" + "${TOOLS_PATH}/codegen/gen_backend_stubs.py" + "${TOOLS_PATH}/codegen/gen_lazy_tensor.py" + "${TOOLS_PATH}/codegen/api/lazy.py" + "${TOOLS_PATH}/codegen/dest/lazy_ir.py" WORKING_DIRECTORY "${TORCH_ROOT}") @@ -475,7 +501,9 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) set(CMAKE_POSITION_INDEPENDENT_CODE TRUE) else() append_filelist("libtorch_cmake_sources" LIBTORCH_CMAKE_SRCS) - + if(BUILD_LAZY_TS_BACKEND) + append_filelist("lazy_tensor_ts_sources" LIBTORCH_CMAKE_SRCS) + endif() if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU") # TODO: Delete this line once https://github.com/pytorch/pytorch/pull/55889 lands set_source_files_properties(../torch/csrc/jit/serialization/export.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) @@ -904,15 +932,26 @@ elseif(USE_CUDA) if(BUILD_LAZY_CUDA_LINALG) add_library(torch_cuda_linalg ${ATen_CUDA_LINALG_SRCS}) target_compile_definitions(torch_cuda_linalg PRIVATE USE_CUDA BUILD_LAZY_CUDA_LINALG) + # Library order is important during static linking + # `torch::magma` should be mentioned before other CUDA + # to transitively include all symbols present in torch_cuda/torch_cpu + if(USE_MAGMA) + target_link_libraries(torch_cuda_linalg PRIVATE torch::magma) + # CUDAHooks reports version of MAGMA PyTorch was compiled against, i.e. needs to be able to include magma headers + get_target_property(HOOKS_INCLUDE_DIRECTORIES torch_cuda INCLUDE_DIRECTORIES) + if(NOT "${MAGMA_INCLUDE_DIR}" IN_LIST HOOKS_INCLUDE_DIRECTORIES) + set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/cuda/detail/CUDAHooks.cpp PROPERTIES INCLUDE_DIRECTORIES "${MAGMA_INCLUDE_DIR}") + endif() + endif() target_link_libraries(torch_cuda_linalg PRIVATE torch_cpu torch_cuda ${CUDA_cusolver_LIBRARY} ) - if(USE_MAGMA) - target_link_libraries(torch_cuda_linalg PRIVATE torch::magma) - # CUDAHooks reports version of MAGMA PyTorch was compiled against, i.e. needs to be able to include magma headers - set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/cuda/detail/CUDAHooks.cpp PROPERTIES INCLUDE_DIRECTORIES "${MAGMA_INCLUDE_DIR}") + # NS: TODO, is this really necessary? 
+ if(USE_MAGMA AND CAFFE2_STATIC_LINK_CUDA) + target_link_libraries(torch_cuda_linalg PRIVATE + "${CUDA_TOOLKIT_ROOT_DIR}/lib64/libculibos.a" dl) endif() set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp PROPERTIES COMPILE_FLAGS "-DBUILD_LAZY_CUDA_LINALG") install(TARGETS torch_cuda_linalg DESTINATION "${TORCH_INSTALL_LIB_DIR}") @@ -930,59 +969,7 @@ elseif(USE_CUDA) endif() if(USE_CUDA OR USE_ROCM) - if(BUILD_SPLIT_CUDA) - set(TORCHLIB_FLAVOR torch_cuda_cu) # chose torch_cuda_cu here since JIT is in torch_cuda_cpp - elseif(USE_CUDA) - set(TORCHLIB_FLAVOR torch_cuda) - elseif(USE_ROCM) - set(TORCHLIB_FLAVOR torch_hip) - endif() - - # The list of NVFUSER runtime files - list(APPEND NVFUSER_RUNTIME_FILES - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_reduction.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_sync_atomic.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_sync_default.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/broadcast.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/fp16_support.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/bf16_support.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_broadcast.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_reduction.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_sync.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/helpers.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/index_utils.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/random_numbers.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/tensor.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/welford.cu - ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/warp.cu - ${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh - ${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/cuda/detail/UnpackRaw.cuh - ) - - file(MAKE_DIRECTORY "${CMAKE_BINARY_DIR}/include/nvfuser_resources") - - # "stringify" NVFUSER runtime sources - # (generate C++ header files embedding the original input as a string literal) - set(NVFUSER_STRINGIFY_TOOL "${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/tools/stringify_file.py") - foreach(src ${NVFUSER_RUNTIME_FILES}) - get_filename_component(filename ${src} NAME_WE) - set(dst "${CMAKE_BINARY_DIR}/include/nvfuser_resources/${filename}.h") - add_custom_command( - COMMENT "Stringify NVFUSER runtime source file" - OUTPUT ${dst} - DEPENDS ${src} - COMMAND ${PYTHON_EXECUTABLE} ${NVFUSER_STRINGIFY_TOOL} -i ${src} -o ${dst} - ) - add_custom_target(nvfuser_rt_${filename} DEPENDS ${dst}) - add_dependencies(${TORCHLIB_FLAVOR} nvfuser_rt_${filename}) - - # also generate the resource headers during the configuration step - # (so tools like clang-tidy can run w/o requiring a real build) - execute_process(COMMAND - ${PYTHON_EXECUTABLE} ${NVFUSER_STRINGIFY_TOOL} -i ${src} -o ${dst}) - endforeach() - - target_include_directories(${TORCHLIB_FLAVOR} PRIVATE "${CMAKE_BINARY_DIR}/include") + include(${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/nvfuser.cmake) endif() if(NOT MSVC AND USE_XNNPACK) @@ -1077,7 +1064,7 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/QuantizedLinear.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/RNN.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) 
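Returning to the new c10/util/safe_numerics.h helpers added earlier in this diff: on compilers that provide __builtin_mul_overflow, the product of a list of sizes is accumulated with per-step overflow detection, while the fallback path only approximates the check with leading-zero counts. A self-contained sketch of the builtin path (assuming GCC or Clang; the function names are simplified from the c10 versions):

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>

// Builtin-based overflow helper, assuming GCC or Clang; simplified from
// the c10::mul_overflows / c10::safe_multiplies_u64 helpers above.
bool mul_overflows(uint64_t a, uint64_t b, uint64_t* out) {
  return __builtin_mul_overflow(a, b, out);
}

// Returns true if the running product of xs does not fit in 64 bits.
bool safe_multiplies_u64(std::initializer_list<uint64_t> xs, uint64_t* out) {
  uint64_t prod = 1;
  bool overflow = false;
  for (uint64_t x : xs) {
    overflow |= mul_overflows(prod, x, &prod);
  }
  *out = prod;
  return overflow;
}

int main() {
  uint64_t n = 0;
  std::cout << safe_multiplies_u64({1ull << 32, 1ull << 31}, &n) << "\n";  // 0: 2^63 fits
  std::cout << safe_multiplies_u64({1ull << 32, 1ull << 33}, &n) << "\n";  // 1: 2^65 overflows
}
```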
set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) - set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) + set_source_files_properties(${CMAKE_CURRENT_SOURCE_DIR}/../aten/src/ATen/native/quantized/qlinear_unpack.cpp PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations) endif() if(USE_TBB) @@ -1107,7 +1094,7 @@ endif() install(DIRECTORY "${TORCH_SRC_DIR}/csrc" DESTINATION ${TORCH_INSTALL_INCLUDE_DIR}/torch - FILES_MATCHING PATTERN "*.h") + FILES_MATCHING PATTERN "*.h" PATTERN "*.hpp") install(DIRECTORY "${TORCH_SRC_DIR}/csrc/distributed/c10d" DESTINATION ${TORCH_INSTALL_INCLUDE_DIR} FILES_MATCHING PATTERN "*.h" PATTERN "*.hpp") @@ -1315,8 +1302,14 @@ if(USE_DISTRIBUTED) else() if(BUILD_SPLIT_CUDA) target_compile_definitions(torch_cuda_cpp PUBLIC USE_C10D_NCCL) + if(USE_NCCL_WITH_UCC) + target_compile_definitions(torch_cuda_cpp PUBLIC USE_NCCL_WITH_UCC) + endif() else() target_compile_definitions(torch_cuda PUBLIC USE_C10D_NCCL) + if(USE_NCCL_WITH_UCC) + target_compile_definitions(torch_cuda PUBLIC USE_NCCL_WITH_UCC) + endif() endif() endif() endif() diff --git a/caffe2/contrib/aten/aten_op_template.h b/caffe2/contrib/aten/aten_op_template.h index a5d1ea40e27a8b..97c64631921ad8 100644 --- a/caffe2/contrib/aten/aten_op_template.h +++ b/caffe2/contrib/aten/aten_op_template.h @@ -179,8 +179,9 @@ class ATenOp : public Operator { std::vector attrs; for (const auto i : c10::irange(operator_def.arg_size())) { auto & attr = operator_def.arg(i); - if(attr.name() == "operator" || attr.name() == "type" ) + if(attr.name() == "operator" || attr.name() == "type" || attr.name() == "overload_name" ) { continue; + } attrs.push_back(attr.name()); } std::sort(attrs.begin(), attrs.end()); diff --git a/caffe2/core/blob_test.cc b/caffe2/core/blob_test.cc index 2249c3bcbf2ab3..a7e3a8d27e23ac 100644 --- a/caffe2/core/blob_test.cc +++ b/caffe2/core/blob_test.cc @@ -1264,7 +1264,7 @@ void TestDataType( std::string dataTypeName) { LOG(INFO) << dataTypeName; FLAGS_caffe2_serialize_using_bytes_as_holder = true; - size_t numEl = 1000; + int numEl = 1000; // Proto with int32 auto protoInt32 = CreateProtoWithInt32Data(dataType, numEl, false); caffe2::Blob blobInt32; diff --git a/caffe2/core/export_caffe2_op_to_c10.h b/caffe2/core/export_caffe2_op_to_c10.h index 66ffdf21a1085c..82da29a44f4b4d 100644 --- a/caffe2/core/export_caffe2_op_to_c10.h +++ b/caffe2/core/export_caffe2_op_to_c10.h @@ -4,12 +4,13 @@ #if defined(EXPOSE_C2_OPS) || \ !defined(CAFFE2_IS_XPLAT_BUILD) && !defined(C10_MOBILE) +#include #include #include #include -#include #include #include +#include #include #include @@ -113,7 +114,9 @@ void call_caffe2_op_from_c10( _call_caffe2_op_from_c10(stack, Schema(), &_call_caffe2_op); } -inline FunctionSchema make_function_schema_for_c10(const char* schema_str) { +inline FunctionSchema make_function_schema_for_c10( + const char* schema_str, + c10::optional optional_alias_analysis_kind) { #if !defined(EXPOSE_C2_OPS) && \ (defined(CAFFE2_IS_XPLAT_BUILD) || defined(C10_MOBILE)) throw std::logic_error( @@ -127,13 +130,17 @@ inline FunctionSchema make_function_schema_for_c10(const char* schema_str) { nullopt, IValue()); - return FunctionSchema( + auto schema = FunctionSchema( parsed_schema.name(), parsed_schema.overload_name(), std::move(arguments), parsed_schema.returns(), 
parsed_schema.is_vararg(), parsed_schema.is_varret()); + if (optional_alias_analysis_kind) { + schema.setAliasAnalysis(*optional_alias_analysis_kind); + } + return schema; #endif } @@ -169,7 +176,7 @@ inline FunctionSchema make_function_schema_for_c10(const char* schema_str) { * caffe2. * - all operators must call C10_DECLARE_EXPORT_CAFFE2_OP_TO_C10 and * C10_EXPORT_CAFFE2_OP_TO_C10_CPU . - * - calling C10_EXPORT_CAFFE2_OP_TO_C10_CUDA is optional and can be omitted i f + * - calling C10_EXPORT_CAFFE2_OP_TO_C10_CUDA is optional and can be omitted if * you don't want to expose the operator for CUDA operations. * - caffe2 arguments must come after caffe2 inputs, in other words, any tensor * inputs must precede any non-tensor inputs. @@ -178,73 +185,85 @@ inline FunctionSchema make_function_schema_for_c10(const char* schema_str) { * - If your operator has a variable number of input tensors, make the first (!) * input an input of type TensorList. There must be no other tensor inputs. */ -#define C10_DECLARE_EXPORT_CAFFE2_OP_TO_C10(OperatorName) \ - namespace caffe2 { \ - namespace _c10_ops { \ +#define C10_DECLARE_EXPORT_CAFFE2_OP_TO_C10(OperatorName) \ + namespace caffe2 { \ + namespace _c10_ops { \ TORCH_API const FunctionSchema& schema_##OperatorName(); \ - } \ + } \ } -#define C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY(OperatorName, OperatorSchema) \ - /* Register the op schema with the c10 dispatcher */ \ - namespace caffe2 { \ - namespace _c10_ops { \ - C10_EXPORT const FunctionSchema& schema_##OperatorName() { \ - static const FunctionSchema schema = \ - ::caffe2::detail::make_function_schema_for_c10(OperatorSchema); \ - return schema; \ - } \ - TORCH_LIBRARY_FRAGMENT(_caffe2, m) { \ - m.def(::caffe2::detail::make_function_schema_for_c10(OperatorSchema)); \ - } \ - } \ +#define C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( \ + OperatorName, OperatorSchema, OptionalAliasAnalysisKind) \ + /* Register the op schema with the c10 dispatcher */ \ + namespace caffe2 { \ + namespace _c10_ops { \ + C10_EXPORT const FunctionSchema& schema_##OperatorName() { \ + static const FunctionSchema schema = \ + ::caffe2::detail::make_function_schema_for_c10( \ + OperatorSchema, OptionalAliasAnalysisKind); \ + return schema; \ + } \ + TORCH_LIBRARY_FRAGMENT(_caffe2, m) { \ + m.def(::caffe2::detail::make_function_schema_for_c10( \ + OperatorSchema, OptionalAliasAnalysisKind)); \ + } \ + } \ } #define C10_EXPORT_CAFFE2_OP_TO_C10_CPU_KERNEL_ONLY( \ OperatorName, OperatorClass) \ /* Register call_caffe2_op_from_c10 as a kernel with the c10 dispatcher */ \ - TORCH_LIBRARY_IMPL(_caffe2, CPU, m) { \ - m.impl("_caffe2::" #OperatorName, \ - torch::CppFunction::makeFromBoxedFunction< \ - ::caffe2::detail::call_caffe2_op_from_c10< \ - ::caffe2::_c10_ops::schema_##OperatorName, \ - OperatorClass>>()); \ - } + TORCH_LIBRARY_IMPL(_caffe2, CPU, m) { \ + m.impl( \ + "_caffe2::" #OperatorName, \ + torch::CppFunction::makeFromBoxedFunction< \ + ::caffe2::detail::call_caffe2_op_from_c10< \ + ::caffe2::_c10_ops::schema_##OperatorName, \ + OperatorClass>>()); \ + } + +#define C10_EXPORT_CAFFE2_OP_TO_C10_CPU( \ + OperatorName, OperatorSchema, OperatorClass) \ + C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( \ + OperatorName, OperatorSchema, c10::nullopt) \ + C10_EXPORT_CAFFE2_OP_TO_C10_CPU_KERNEL_ONLY(OperatorName, OperatorClass) -#define C10_EXPORT_CAFFE2_OP_TO_C10_CPU( \ - OperatorName, OperatorSchema, OperatorClass) \ - C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY(OperatorName, OperatorSchema) \ +#define 
C10_EXPORT_CAFFE2_OP_TO_C10_CPU_WITH_ALIAS_ANALYSIS( \ + OperatorName, OperatorSchema, OperatorClass, OptionalAliasAnalysisKind) \ + C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( \ + OperatorName, OperatorSchema, OptionalAliasAnalysisKind) \ C10_EXPORT_CAFFE2_OP_TO_C10_CPU_KERNEL_ONLY(OperatorName, OperatorClass) #define C10_EXPORT_CAFFE2_OP_TO_C10_CUDA(OperatorName, OperatorClass) \ /* Register call_caffe2_op_from_c10 as a kernel with the c10 dispatcher */ \ - TORCH_LIBRARY_IMPL(_caffe2, CUDA, m) { \ - m.impl("_caffe2::" #OperatorName, \ - torch::CppFunction::makeFromBoxedFunction< \ - ::caffe2::detail::call_caffe2_op_from_c10< \ - ::caffe2::_c10_ops::schema_##OperatorName, \ - OperatorClass>>()); \ - } - + TORCH_LIBRARY_IMPL(_caffe2, CUDA, m) { \ + m.impl( \ + "_caffe2::" #OperatorName, \ + torch::CppFunction::makeFromBoxedFunction< \ + ::caffe2::detail::call_caffe2_op_from_c10< \ + ::caffe2::_c10_ops::schema_##OperatorName, \ + OperatorClass>>()); \ + } // You should never manually call the C10_EXPORT_CAFFE2_OP_TO_C10_HIP macro . // The C10_EXPORT_CAFFE2_OP_TO_C10_CUDA macro from above will be automatically // rewritten to C10_EXPORT_CAFFE2_OP_TO_C10_HIP by hipify . #define C10_EXPORT_CAFFE2_OP_TO_C10_HIP(OperatorName, OperatorClass) \ /* Register call_caffe2_op_from_c10 as a kernel with the c10 dispatcher */ \ - TORCH_LIBRARY_IMPL(_caffe2, HIP, m) { \ - m.impl("_caffe2::" #OperatorName, \ - torch::CppFunction::makeFromBoxedFunction< \ - ::caffe2::detail::call_caffe2_op_from_c10< \ - ::caffe2::_c10_ops::schema_##OperatorName, \ - OperatorClass>>()); \ - } - + TORCH_LIBRARY_IMPL(_caffe2, HIP, m) { \ + m.impl( \ + "_caffe2::" #OperatorName, \ + torch::CppFunction::makeFromBoxedFunction< \ + ::caffe2::detail::call_caffe2_op_from_c10< \ + ::caffe2::_c10_ops::schema_##OperatorName, \ + OperatorClass>>()); \ + } #else // Don't use c10 dispatcher on mobile because of binary size #define C10_DECLARE_EXPORT_CAFFE2_OP_TO_C10(OperatorName) -#define C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY(OperatorName, OperatorSchema) +#define C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( \ + OperatorName, OperatorSchema, OptionalAliasAnalysisKind) #define C10_EXPORT_CAFFE2_OP_TO_C10_CPU_KERNEL_ONLY(OperatorName, OperatorClass) #define C10_EXPORT_CAFFE2_OP_TO_C10_CPU( \ OperatorName, OperatorSchema, OperatorClass) diff --git a/caffe2/core/qtensor.h b/caffe2/core/qtensor.h index a34da6918bcd2f..f94863a09782ac 100644 --- a/caffe2/core/qtensor.h +++ b/caffe2/core/qtensor.h @@ -60,8 +60,7 @@ class C10_EXPORT QTensor { void Resize(at::ArrayRef dim_source) { if (dims_ != dim_source) { const auto source_size = c10::multiply_integers(dim_source); - // NOLINTNEXTLINE(clang-diagnostic-sign-compare) - if ((source_size * (precision_ + signed_)) > capacity_) { + if (static_cast(source_size * (precision_ + signed_)) > capacity_) { data_ptr_.clear(); capacity_ = 0; } diff --git a/caffe2/core/serialization_test.cc b/caffe2/core/serialization_test.cc index 1912802d2ac8fd..902a3e01e6773c 100644 --- a/caffe2/core/serialization_test.cc +++ b/caffe2/core/serialization_test.cc @@ -69,7 +69,7 @@ TEST(TensorSerialization, TestUnknownDType) { auto* blobTensor = BlobGetMutableTensor(&blob, CPU); blobTensor->Resize(kTestTensorSize, 1); auto *tensorData = blobTensor->mutable_data(); - for (int n = 0; n < kTestTensorSize; ++n) { + for (unsigned n = 0; n < kTestTensorSize; ++n) { tensorData[n] = n; } auto data = SerializeBlob(blob, "test_blob"); @@ -85,7 +85,7 @@ TEST(TensorSerialization, TestUnknownDType) { EXPECT_EQ(kTestTensorSize, tensor.numel()); 
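The export_caffe2_op_to_c10.h macro rewrite above boils down to one behavioral change: make_function_schema_for_c10 now accepts an optional alias-analysis kind and applies it only when one is supplied, with the plain CPU export macro passing c10::nullopt. A hypothetical standalone sketch of that pattern; FunctionSchema and AliasAnalysisKind here are simplified stand-ins, not the real torch types:

```cpp
#include <iostream>
#include <optional>
#include <string>

// Simplified stand-ins, used only to show the shape of the change:
// the schema factory threads an optional alias-analysis kind through.
enum class AliasAnalysisKind { FROM_SCHEMA, CONSERVATIVE };

struct FunctionSchema {
  std::string schema_str;
  std::optional<AliasAnalysisKind> alias_analysis;
  void setAliasAnalysis(AliasAnalysisKind k) { alias_analysis = k; }
};

FunctionSchema make_schema(const std::string& schema_str,
                           std::optional<AliasAnalysisKind> kind) {
  FunctionSchema schema{schema_str, std::nullopt};
  if (kind) {
    // Only override the alias analysis when the caller supplied a kind;
    // the plain CPU export macro passes nullopt and keeps the default.
    schema.setAliasAnalysis(*kind);
  }
  return schema;
}

int main() {
  auto s = make_schema("_caffe2::CopyGPUToCPU(Tensor input) -> Tensor", std::nullopt);
  std::cout << s.alias_analysis.has_value() << "\n";  // 0: default kept
}
```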
EXPECT_EQ(TypeMeta::Make(), tensor.dtype()); const auto* tensor_data = tensor.template data(); - for (int i = 0; i < kTestTensorSize; ++i) { + for (unsigned i = 0; i < kTestTensorSize; ++i) { EXPECT_EQ(static_cast(i), tensor_data[i]); } diff --git a/caffe2/core/transform_test.cc b/caffe2/core/transform_test.cc index adb7ecae050be6..0dc6ba92c7f9e9 100644 --- a/caffe2/core/transform_test.cc +++ b/caffe2/core/transform_test.cc @@ -55,7 +55,7 @@ class DummyTransform : public Transform { return false; } // which index are we trying to append the new node to? - int pattern_idx = subgraph.size(); + auto pattern_idx = subgraph.size(); // type doesn't match if (g.node(idx).op.type() != pattern_chain[pattern_idx]) { return false; diff --git a/caffe2/operators/copy_op.cc b/caffe2/operators/copy_op.cc index f2323bbaf06f7e..c0efef07eeb6a6 100644 --- a/caffe2/operators/copy_op.cc +++ b/caffe2/operators/copy_op.cc @@ -200,8 +200,10 @@ REGISTER_GRADIENT(CopyCPUToGPU, GetCPUToGPUGradient); C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( CopyGPUToCPU, - "_caffe2::CopyGPUToCPU(Tensor input) -> Tensor"); + "_caffe2::CopyGPUToCPU(Tensor input) -> Tensor", + /*optional_alias_analysis_kind=*/c10::nullopt); C10_EXPORT_CAFFE2_OP_TO_C10_SCHEMA_ONLY( CopyCPUToGPU, - "_caffe2::CopyCPUToGPU(Tensor input) -> Tensor"); + "_caffe2::CopyCPUToGPU(Tensor input) -> Tensor", + /*optional_alias_analysis_kind=*/c10::nullopt); diff --git a/caffe2/operators/generate_proposals_op_util_nms.h b/caffe2/operators/generate_proposals_op_util_nms.h index 09b10c8e192aa4..a74d04f217fdb6 100644 --- a/caffe2/operators/generate_proposals_op_util_nms.h +++ b/caffe2/operators/generate_proposals_op_util_nms.h @@ -50,8 +50,7 @@ std::vector nms_cpu_upright( std::vector keep; while (order.size() > 0) { // exit if already enough proposals - // NOLINTNEXTLINE(clang-diagnostic-sign-compare) - if (topN >= 0 && keep.size() >= topN) { + if (topN >= 0 && keep.size() >= static_cast(topN)) { break; } @@ -127,7 +126,7 @@ std::vector soft_nms_cpu_upright( EArrXi pending = AsEArrXt(indices); while (pending.size() > 0) { // Exit if already enough proposals - if (topN >= 0 && keep.size() >= topN) { + if (topN >= 0 && keep.size() >= static_cast(topN)) { break; } @@ -560,8 +559,7 @@ std::vector nms_cpu_rotated( std::vector keep; while (order.size() > 0) { // exit if already enough proposals - // NOLINTNEXTLINE(clang-diagnostic-sign-compare) - if (topN >= 0 && keep.size() >= topN) { + if (topN >= 0 && keep.size() >= static_cast(topN)) { break; } @@ -626,7 +624,7 @@ std::vector soft_nms_cpu_rotated( EArrXi pending = AsEArrXt(indices); while (pending.size() > 0) { // Exit if already enough proposals - if (topN >= 0 && keep.size() >= topN) { + if (topN >= 0 && keep.size() >= static_cast(topN)) { break; } diff --git a/caffe2/operators/quantized/int8_test.cc b/caffe2/operators/quantized/int8_test.cc index b6d9719d522303..9b14d3eaec1dae 100644 --- a/caffe2/operators/quantized/int8_test.cc +++ b/caffe2/operators/quantized/int8_test.cc @@ -341,8 +341,8 @@ TEST(Int8, SumRelu) { } void setq(int8::Int8TensorCPU* dst, const std::vector& vs) { - CHECK_EQ(vs.size(), dst->t.numel()); - for (auto i = 0; i < vs.size(); ++i) { + CHECK_EQ(vs.size(), static_cast(dst->t.numel())); + for (auto i = 0U; i < vs.size(); ++i) { uint8_t vq = std::max( std::numeric_limits::min(), std::min( @@ -354,8 +354,8 @@ void setq(int8::Int8TensorCPU* dst, const std::vector& vs) { } void biassetq(int8::Int8TensorCPU* dst, const std::vector& vs) { - CHECK_EQ(vs.size(), dst->t.numel()); - for (auto i = 0; i < 
vs.size(); ++i) { + CHECK_EQ(vs.size(), static_cast(dst->t.numel())); + for (auto i = 0U; i < vs.size(); ++i) { int32_t vq = std::max( std::numeric_limits::min(), std::min( diff --git a/caffe2/operators/text_file_reader_utils.h b/caffe2/operators/text_file_reader_utils.h index 01b4743a91c145..a4f2d6189860e7 100644 --- a/caffe2/operators/text_file_reader_utils.h +++ b/caffe2/operators/text_file_reader_utils.h @@ -56,7 +56,7 @@ struct TORCH_API CharRange { struct TORCH_API StringProvider { virtual void operator()(CharRange&) = 0; virtual void reset() = 0; - virtual ~StringProvider() {} + virtual ~StringProvider() = default; }; class TORCH_API BufferedTokenizer { @@ -99,7 +99,7 @@ class TORCH_API BufferedTokenizer { StringProvider* provider_; Tokenizer tokenizer_; TokenizedString tokenized_; - int tokenIndex_; + unsigned tokenIndex_; int numPasses_; int pass_{0}; }; diff --git a/caffe2/opt/bound_shape_inference_test.cc b/caffe2/opt/bound_shape_inference_test.cc index 867142746d82ad..8224281124e1f4 100644 --- a/caffe2/opt/bound_shape_inference_test.cc +++ b/caffe2/opt/bound_shape_inference_test.cc @@ -45,7 +45,7 @@ void verifyShapeInfo( EXPECT_EQ(shape_info.getDimType(), t); const auto& shape = shape_info.shape; ASSERT_EQ(shape.dims_size(), dims.size()); - for (int i = 0; i < dims.size(); ++i) { + for (unsigned i = 0; i < dims.size(); ++i) { EXPECT_EQ(dims[i], shape.dims(i)); } EXPECT_EQ(shape.data_type(), dtype); diff --git a/caffe2/perfkernels/adagrad_avx2.cc b/caffe2/perfkernels/adagrad_avx2.cc index 0039afa942f1de..08c9fd00d9a089 100644 --- a/caffe2/perfkernels/adagrad_avx2.cc +++ b/caffe2/perfkernels/adagrad_avx2.cc @@ -18,7 +18,7 @@ void adagrad_update__avx2_fma( float decay, float lr, float weight_decay = 0.f) { - constexpr size_t kSize = 8; + constexpr int kSize = 8; auto i = 0; for (; i + kSize <= N; i += kSize) { __m256 gi = _mm256_loadu_ps(g + i); diff --git a/caffe2/python/memonger.py b/caffe2/python/memonger.py index 6225781bc429a9..178ebd8cd30248 100644 --- a/caffe2/python/memonger.py +++ b/caffe2/python/memonger.py @@ -798,15 +798,29 @@ def canonical_name(blob): op.output[i] = canonical_name(output) - def apply_recurrent_blob_assignments(op, blob_assignments, canonical_name): log.debug("Applying assignments to recurrent op: {}".format(op.type)) + + # Apply on alias_dst + alias_dst_args = [a for a in op.arg if a.name.endswith("alias_dst")] + for alias_dst in alias_dst_args: + for i, blob in enumerate(alias_dst.strings): + alias_dst.strings[i] = canonical_name(blob.decode()).encode() + + # Apply on link_external + link_external_args = [a for a in op.arg if a.name.endswith("link_external")] + for link_external in link_external_args: + for i, blob in enumerate(link_external.strings): + link_external.strings[i] = canonical_name(blob.decode()).encode() + + # Recurse into step nets step_args = [a for a in op.arg if a.name.endswith("step_net")] for step_arg in step_args: apply_assignments(step_arg.n, blob_assignments) for i, einp in enumerate(step_arg.n.external_input): if einp in blob_assignments: step_arg.n.external_input[i] = canonical_name(einp) + # Store renamings for blob, renamed in viewitems(blob_assignments): if blob in list(op.input) + list(op.output): diff --git a/caffe2/python/pybind_state.cc b/caffe2/python/pybind_state.cc index ad04cab82d5aa0..ccaa0afb6ac91e 100644 --- a/caffe2/python/pybind_state.cc +++ b/caffe2/python/pybind_state.cc @@ -300,7 +300,7 @@ class GetPythonGradient : public GradientMakerBase { } if (gradOutputIndices.size() > 0) { // 
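A recurring theme in this part of the diff is replacing int loop indices and size() comparisons with unsigned types or explicit static_casts to silence -Wsign-compare. The guard-then-cast form used in the NMS code matters because of how a negative sentinel converts; a small self-contained illustration:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> keep(2);  // 2 proposals kept so far
  int topN = -1;             // negative sentinel meaning "no limit"

  // size() is unsigned, so comparing against topN converts it: -1 becomes
  // SIZE_MAX and "keep.size() >= topN" silently reads as 2 >= SIZE_MAX.
  std::cout << (keep.size() >= static_cast<size_t>(topN)) << "\n";  // 0

  // The fixed code guards the sign first and only then casts, so the
  // comparison is only performed for meaningful (non-negative) limits.
  bool enough = topN >= 0 && keep.size() >= static_cast<size_t>(topN);
  std::cout << enough << "\n";  // 0: no limit, so never "enough"
}
```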
NOLINTNEXTLINE(modernize-loop-convert) - for (int i = 0; i < gradOutputIndices.size(); ++i) { + for (unsigned i = 0; i < gradOutputIndices.size(); ++i) { int GO_i = gradOutputIndices[i]; gradientInputs.push_back(GO(GO_i)); } @@ -312,7 +312,7 @@ class GetPythonGradient : public GradientMakerBase { std::vector gradientOutputs; if (gradInputIndices.size() > 0) { // NOLINTNEXTLINE(modernize-loop-convert) - for (int i = 0; i < gradInputIndices.size(); ++i) { + for (unsigned i = 0; i < gradInputIndices.size(); ++i) { int GI_i = gradInputIndices[i]; gradientOutputs.push_back(GI(GI_i)); } @@ -877,7 +877,7 @@ void addObjectMethods(py::module& m) { std::vector tensors_data; #ifdef USE_NUMPY // NOLINTNEXTLINE(modernize-loop-convert) - for (auto i = 0; i < inputs.size(); ++i) { + for (auto i = 0U; i < inputs.size(); ++i) { auto input = inputs[i]; CAFFE_ENFORCE( PyArray_Check(input.ptr()), @@ -988,7 +988,7 @@ void addObjectMethods(py::module& m) { std::vector tensors_data; #ifdef USE_NUMPY // NOLINTNEXTLINE(modernize-loop-convert) - for (auto i = 0; i < inputs.size(); ++i) { + for (auto i = 0U; i < inputs.size(); ++i) { auto input = inputs[i]; CAFFE_ENFORCE( PyArray_Check(input.ptr()), @@ -1201,7 +1201,7 @@ void addGlobalMethods(py::module& m) { }); m.def("nearby_opnames", [](const std::string& name) { std::vector alternatives; - int editTolerance = 3; + unsigned editTolerance = 3; // NOLINTNEXTLINE(performance-for-range-copy) for (auto it : caffe2::CPUOperatorRegistry()->Keys()) { if (editDistance(it, name, editTolerance) < editTolerance + 1) { diff --git a/caffe2/serialize/inline_container.cc b/caffe2/serialize/inline_container.cc index 9f0e9ce6194ef9..92632fc7928b32 100644 --- a/caffe2/serialize/inline_container.cc +++ b/caffe2/serialize/inline_container.cc @@ -129,22 +129,27 @@ void PyTorchStreamReader::init() { } std::string version(static_cast(version_ptr.get()), version_size); version_ = caffe2::stoull(version); - AT_ASSERTM( - // NOLINTNEXTLINE(clang-diagnostic-sign-compare) - version_ >= kMinSupportedFileFormatVersion, - "Attempted to read a PyTorch file with version ", - c10::to_string(version_), - ", but the minimum supported version for reading is ", - c10::to_string(kMinSupportedFileFormatVersion), - ". Your PyTorch script module file is too old. Please re-export it again."); - AT_ASSERTM( - // NOLINTNEXTLINE(clang-diagnostic-sign-compare) - version_ <= kMaxSupportedFileFormatVersion, - "Attempted to read a PyTorch file with version ", - version_, - ", but the maximum supported version for reading is ", - kMaxSupportedFileFormatVersion, - ". Your PyTorch installation may be too old."); + // NOLINTNEXTLINE(clang-diagnostic-sign-compare) + if (version_ < kMinSupportedFileFormatVersion) { + CAFFE_THROW( + "Attempted to read a PyTorch file with version ", + c10::to_string(version_), + ", but the minimum supported version for reading is ", + c10::to_string(kMinSupportedFileFormatVersion), + ". Your PyTorch script module file is too old. Please regenerate it", + " with latest version of PyTorch to mitigate this issue."); + } + + // NOLINTNEXTLINE(clang-diagnostic-sign-compare) + if (version_ > kMaxSupportedFileFormatVersion) { + CAFFE_THROW( + "Attempted to read a PyTorch file with version ", + version_, + ", but the maximum supported version for reading is ", + kMaxSupportedFileFormatVersion, + ". 
The version of your PyTorch installation may be too old, ", + "please upgrade PyTorch to latest version to mitigate this issue."); + } } void PyTorchStreamReader::valid(const char* what, const char* info) { diff --git a/caffe2/serialize/inline_container_test.cc b/caffe2/serialize/inline_container_test.cc index 5ceb7274b771f2..18f75dddfaa5f5 100644 --- a/caffe2/serialize/inline_container_test.cc +++ b/caffe2/serialize/inline_container_test.cc @@ -5,6 +5,7 @@ #include #include "caffe2/serialize/inline_container.h" +#include "c10/util/irange.h" namespace caffe2 { namespace serialize { @@ -22,14 +23,14 @@ TEST(PyTorchStreamWriterAndReader, SaveAndLoad) { // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init,cppcoreguidelines-avoid-magic-numbers) std::array data1; - for (int i = 0; i < data1.size(); ++i) { + for (auto i: c10::irange( data1.size())) { data1[i] = data1.size() - i; } writer.writeRecord("key1", data1.data(), data1.size()); // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init,cppcoreguidelines-avoid-magic-numbers) std::array data2; - for (int i = 0; i < data2.size(); ++i) { + for (auto i: c10::irange(data2.size())) { data2[i] = data2.size() - i; } writer.writeRecord("key2", data2.data(), data2.size()); @@ -83,14 +84,14 @@ TEST(PytorchStreamWriterAndReader, GetNonexistentRecordThrows) { // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init,cppcoreguidelines-avoid-magic-numbers) std::array data1; - for (int i = 0; i < data1.size(); ++i) { + for (auto i: c10::irange(data1.size())) { data1[i] = data1.size() - i; } writer.writeRecord("key1", data1.data(), data1.size()); // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init,cppcoreguidelines-avoid-magic-numbers) std::array data2; - for (int i = 0; i < data2.size(); ++i) { + for (auto i: c10::irange(data2.size())) { data2[i] = data2.size() - i; } writer.writeRecord("key2", data2.data(), data2.size()); diff --git a/caffe2/serialize/versions.h b/caffe2/serialize/versions.h index 40d2cd0145fd71..78a91c64fe84fd 100644 --- a/caffe2/serialize/versions.h +++ b/caffe2/serialize/versions.h @@ -117,11 +117,16 @@ constexpr uint64_t kMinProducedFileFormatVersion = 0x3L; // {the_pointer_value_the_tensor.storage}, for example: // `140245072983168.storage` Forward-compatibility change. // 0x6L: Implicit opereator versioning using number of specified argument. -// Refer to the summary of https://github.com/pytorch/pytorch/pull/56845 for details. -// 0x7L: Enable support for operators with default arguments plus out arguments. -// Refer. See https://github.com/pytorch/pytorch/pull/63651 for details -// 0x8L: Emit promoted operators as instructions. -// See https://github.com/pytorch/pytorch/pull/71662 for details +// Refer to the summary of https://github.com/pytorch/pytorch/pull/56845 for +// details. +// 0x7L: Enable support for operators with default arguments plus out +// arguments. Refer. See https://github.com/pytorch/pytorch/pull/63651 for +// details. +// 0x8L: Emit promoted operators as instructions. See +// https://github.com/pytorch/pytorch/pull/71662 for details. +// 0x9L: Change serialization format from pickle to format This version is to +// serve migration. v8 pickle and v9 flatbuffer are the same. Refer to the +// summary of https://github.com/pytorch/pytorch/pull/75201 for more details. 
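The inline_container.cc change above swaps AT_ASSERTM-based version checks for explicit if/CAFFE_THROW branches with actionable messages. A rough standalone sketch of the same gate using std::runtime_error; the min/max constants below are placeholders, not PyTorch's actual file-format bounds:

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>

// Placeholder bounds; the real values live in caffe2/serialize/versions.h.
constexpr uint64_t kMinSupportedFileFormatVersion = 0x1;
constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6;

// Sketch of the explicit gate that replaces the old assertion macros.
void check_version(uint64_t version) {
  if (version < kMinSupportedFileFormatVersion) {
    std::ostringstream msg;
    msg << "File format version " << version << " is older than the minimum supported "
        << kMinSupportedFileFormatVersion << "; please re-export the module.";
    throw std::runtime_error(msg.str());
  }
  if (version > kMaxSupportedFileFormatVersion) {
    std::ostringstream msg;
    msg << "File format version " << version << " is newer than the maximum supported "
        << kMaxSupportedFileFormatVersion << "; please upgrade PyTorch.";
    throw std::runtime_error(msg.str());
  }
}

int main() {
  try {
    check_version(0x9);  // too new for the placeholder bounds above
  } catch (const std::exception& e) {
    std::cout << e.what() << "\n";
  }
}
```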
constexpr uint64_t kProducedBytecodeVersion = 0x8L; // static_assert( @@ -134,8 +139,8 @@ constexpr uint64_t kProducedBytecodeVersion = 0x8L; // kMinSupportedBytecodeVersion <= model_version <= kMaxSupportedBytecodeVersion // (in loader), we should support this model_version. For example, we provide a // wrapper to handle an updated operator. -constexpr uint64_t kMinSupportedBytecodeVersion = 0x3L; -constexpr uint64_t kMaxSupportedBytecodeVersion = 0x8L; +constexpr uint64_t kMinSupportedBytecodeVersion = 0x4L; +constexpr uint64_t kMaxSupportedBytecodeVersion = 0x9L; } // namespace serialize } // namespace caffe2 diff --git a/caffe2/share/contrib/depthwise/depthwise3x3_conv_op_test.cc b/caffe2/share/contrib/depthwise/depthwise3x3_conv_op_test.cc index 879f0d25068b7a..0f7e90e55b53ff 100644 --- a/caffe2/share/contrib/depthwise/depthwise3x3_conv_op_test.cc +++ b/caffe2/share/contrib/depthwise/depthwise3x3_conv_op_test.cc @@ -199,7 +199,7 @@ void runConv( } // unnamed namespace -constexpr size_t kIters = 20; +constexpr int kIters = 20; TEST(DEPTHWISE3x3, Conv) { for (int i = 0; i < kIters; ++i) { diff --git a/caffe2/share/contrib/nnpack/nnpack_test.cc b/caffe2/share/contrib/nnpack/nnpack_test.cc index 398be235f7f13f..fe653c4d91abd0 100644 --- a/caffe2/share/contrib/nnpack/nnpack_test.cc +++ b/caffe2/share/contrib/nnpack/nnpack_test.cc @@ -236,7 +236,7 @@ void runConv( } // unnamed namespace -constexpr size_t kIters = 20; +constexpr int kIters = 20; TEST(NNPACK, Conv_3x3s1) { for (int i = 0; i < kIters; ++i) { diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake index a818c21eb5ea4f..f8d1ae74eaebad 100644 --- a/cmake/Dependencies.cmake +++ b/cmake/Dependencies.cmake @@ -816,6 +816,10 @@ if(USE_FBGEMM) set_property(TARGET fbgemm_avx2 PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET fbgemm_avx512 PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET fbgemm PROPERTY POSITION_INDEPENDENT_CODE ON) + if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 13.0.0) + # See https://github.com/pytorch/pytorch/issues/74352 + target_compile_options(asmjit PRIVATE -Wno-deprecated-copy -Wno-unused-but-set-variable) + endif() endif() if(USE_FBGEMM) @@ -1936,6 +1940,32 @@ if(USE_KINETO) message(STATUS " CUDA_cupti_LIBRARY = ${CUDA_cupti_LIBRARY}") message(STATUS "Found CUPTI") set(LIBKINETO_NOCUPTI OFF CACHE STRING "" FORCE) + + # I've only tested this sanity check on Linux; if someone + # runs into this bug on another platform feel free to + # generalize it accordingly + if(NOT USE_CUPTI_SO AND UNIX) + include(CheckCXXSourceRuns) + # rt is handled by the CMAKE_REQUIRED_LIBRARIES set above + if(NOT APPLE) + set(CMAKE_REQUIRED_LIBRARIES ${CMAKE_REQUIRED_LIBRARIES} "dl") + endif() + set(CMAKE_REQUIRED_LINK_OPTIONS "-Wl,--whole-archive,${CUPTI_LIBRARY_PATH},--no-whole-archive") + check_cxx_source_runs("#include + int main() { + try { + throw std::runtime_error(\"error\"); + } catch (...) { + return 0; + } + return 1; + }" EXCEPTIONS_WORK) + set(CMAKE_REQUIRED_LINK_OPTIONS "") + if(NOT EXCEPTIONS_WORK) + message(FATAL_ERROR "Detected that statically linking against CUPTI causes exceptions to stop working. See https://github.com/pytorch/pytorch/issues/57744 for more details. 
Perhaps try: USE_CUPTI_SO=1 python setup.py develop --cmake") + endif() + endif() + else() message(STATUS "Could not find CUPTI library, using CPU-only Kineto build") set(LIBKINETO_NOCUPTI ON CACHE STRING "" FORCE) diff --git a/cmake/Modules/FindMKL.cmake b/cmake/Modules/FindMKL.cmake index b79a87466252c3..01594a5b66e056 100644 --- a/cmake/Modules/FindMKL.cmake +++ b/cmake/Modules/FindMKL.cmake @@ -168,6 +168,26 @@ IF (EXISTS ${INTEL_OMP_DIR}) ENDIF() ENDIF() +MACRO(GET_MKL_LIB_NAMES LIBRARIES INTERFACE MKL64) + cmake_parse_arguments("" "" "THREAD" "" ${ARGN}) + SET(${LIBRARIES} mkl_${INTERFACE}${MKL64} mkl_core) + IF(_THREAD) + LIST(INSERT ${LIBRARIES} 1 ${_THREAD}) + IF(UNIX AND ${USE_STATIC_MKL}) + # The thread library defines symbols required by the other MKL libraries so also add it last + LIST(APPEND ${LIBRARIES} ${_THREAD}) + ENDIF() + ENDIF() + IF(${USE_STATIC_MKL}) + IF(UNIX) + list(TRANSFORM ${LIBRARIES} PREPEND "lib") + list(TRANSFORM ${LIBRARIES} APPEND ".a") + ELSE() + message(WARNING "Ignoring USE_STATIC_MKL") + ENDIF() + ENDIF() +ENDMACRO() + # Try linking multiple libs MACRO(CHECK_ALL_LIBRARIES LIBRARIES OPENMP_TYPE OPENMP_LIBRARY _name _list _flags) # This macro checks for the existence of the combination of libraries given by _list. @@ -304,8 +324,9 @@ IF (NOT "${MKL_THREADING}" STREQUAL "SEQ") FOREACH(mkl64 ${mkl64s} "") FOREACH(mklthread ${mklthreads}) IF (NOT MKL_LIBRARIES) + GET_MKL_LIB_NAMES(mkl_lib_names "${mkliface}" "${mkl64}" THREAD "${mklthread}") CHECK_ALL_LIBRARIES(MKL_LIBRARIES MKL_OPENMP_TYPE MKL_OPENMP_LIBRARY cblas_sgemm - "mkl_${mkliface}${mkl64};${mklthread};mkl_core;${mklrtl};${mkl_pthread};${mkl_m};${mkl_dl}" "") + "${mkl_lib_names};${mklrtl};${mkl_pthread};${mkl_m};${mkl_dl}" "") ENDIF (NOT MKL_LIBRARIES) ENDFOREACH(mklthread) ENDFOREACH(mkl64) @@ -317,8 +338,9 @@ ENDIF (NOT "${MKL_THREADING}" STREQUAL "SEQ") FOREACH(mkliface ${mklifaces}) FOREACH(mkl64 ${mkl64s} "") IF (NOT MKL_LIBRARIES) + GET_MKL_LIB_NAMES(mkl_lib_names "${mkliface}" "${mkl64}" THREAD "mkl_sequential") CHECK_ALL_LIBRARIES(MKL_LIBRARIES MKL_OPENMP_TYPE MKL_OPENMP_LIBRARY cblas_sgemm - "mkl_${mkliface}${mkl64};mkl_sequential;mkl_core;${mkl_m};${mkl_dl}" "") + "${mkl_lib_names};${mkl_m};${mkl_dl}" "") IF (MKL_LIBRARIES) SET(mklseq "_sequential") ENDIF (MKL_LIBRARIES) @@ -331,8 +353,9 @@ FOREACH(mklrtl ${mklrtls} "") FOREACH(mkliface ${mklifaces}) FOREACH(mkl64 ${mkl64s} "") IF (NOT MKL_LIBRARIES) + GET_MKL_LIB_NAMES(mkl_lib_names "${mkliface}" "${mkl64}" THREAD "${mklthread}") CHECK_ALL_LIBRARIES(MKL_LIBRARIES MKL_OPENMP_TYPE MKL_OPENMP_LIBRARY cblas_sgemm - "mkl_${mkliface}${mkl64};${mklthread};mkl_core;${mklrtl};pthread;${mkl_m};${mkl_dl}" "") + "${mkl_lib_names};${mklrtl};pthread;${mkl_m};${mkl_dl}" "") ENDIF (NOT MKL_LIBRARIES) ENDFOREACH(mkl64) ENDFOREACH(mkliface) @@ -341,6 +364,9 @@ ENDFOREACH(mklrtl) # Check for older versions IF (NOT MKL_LIBRARIES) SET(MKL_VERSION 900) + if (USE_STATIC_MKL) + message(WARNING "Ignoring USE_STATIC_MKL") + endif() CHECK_ALL_LIBRARIES(MKL_LIBRARIES MKL_OPENMP_TYPE MKL_OPENMP_LIBRARY cblas_sgemm "mkl;guide;pthread;m" "") ENDIF (NOT MKL_LIBRARIES) diff --git a/cmake/Summary.cmake b/cmake/Summary.cmake index 9203e72b3bda3d..cd0b330ab0e53c 100644 --- a/cmake/Summary.cmake +++ b/cmake/Summary.cmake @@ -148,6 +148,7 @@ function(caffe2_print_configuration_summary) message(STATUS " USE_NCCL : ${USE_NCCL}") if(${USE_NCCL}) message(STATUS " USE_SYSTEM_NCCL : ${USE_SYSTEM_NCCL}") + message(STATUS " USE_NCCL_WITH_UCC : ${USE_NCCL_WITH_UCC}") endif() 
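For reference, the configure-time probe that the Dependencies.cmake change a little above compiles with check_cxx_source_runs is essentially the following program: it exits 0 when a thrown exception is caught normally and 1 when exception handling has been broken by statically linked CUPTI, which in turn triggers the FATAL_ERROR with the USE_CUPTI_SO=1 hint:

```cpp
#include <stdexcept>

// Mirrors the check_cxx_source_runs() snippet: exit 0 when a thrown
// exception is caught normally, exit 1 when exception handling is broken.
int main() {
  try {
    throw std::runtime_error("error");
  } catch (...) {
    return 0;  // exceptions work; configuration proceeds
  }
  return 1;    // only reached if the catch handler never runs
}
```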
message(STATUS " USE_NNPACK : ${USE_NNPACK}") message(STATUS " USE_NUMPY : ${USE_NUMPY}") @@ -191,4 +192,5 @@ function(caffe2_print_configuration_summary) message(STATUS " Private Dependencies : ${Caffe2_DEPENDENCY_LIBS}") # coreml message(STATUS " USE_COREML_DELEGATE : ${USE_COREML_DELEGATE}") + message(STATUS " BUILD_LAZY_TS_BACKEND : ${BUILD_LAZY_TS_BACKEND}") endfunction() diff --git a/docs/Makefile b/docs/Makefile index 28d910a89b4986..b9719df7ade5c3 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -15,6 +15,10 @@ help: figures: @$(PYCMD) source/scripts/build_activation_images.py + @$(PYCMD) source/scripts/build_quantization_configs.py + +onnx_supported_aten_ops: + @$(PYCMD) source/scripts/build_onnx_supported_aten_op_csv_table.py docset: html doc2dash --name $(SPHINXPROJ) --icon $(SOURCEDIR)/_static/img/pytorch-logo-flame.png --enable-js --online-redirect-url https://pytorch.org/docs/ --force $(BUILDDIR)/html/ @@ -30,13 +34,13 @@ html-stable: # See conf.py for more details. RELEASE=1 make html -.PHONY: help Makefile docset +.PHONY: help Makefile docset onnx_supported_aten_ops # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). -%: Makefile figures +%: Makefile figures onnx_supported_aten_ops @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) clean: @echo "Removing everything under 'build' and 'source/generated'.." - @rm -rf $(BUILDDIR)/html/ $(BUILDDIR)/doctrees $(SOURCEDIR)/generated + @rm -rf $(BUILDDIR)/html/ $(BUILDDIR)/doctrees $(SOURCEDIR)/generated $(BUILDDIR)/auto_gen_aten_op_list.csv diff --git a/docs/cpp/requirements.txt b/docs/cpp/requirements.txt index f5d49d2ebe910d..ca3eb7da6846bf 100644 --- a/docs/cpp/requirements.txt +++ b/docs/cpp/requirements.txt @@ -1,4 +1,5 @@ sphinx==3.1.2 +Jinja2==3.0.* breathe==4.25.0 exhale==0.2.3 docutils==0.16 diff --git a/docs/cpp/source/Doxyfile b/docs/cpp/source/Doxyfile index 7785239d1539eb..a17d742a461efa 100644 --- a/docs/cpp/source/Doxyfile +++ b/docs/cpp/source/Doxyfile @@ -44,12 +44,14 @@ INPUT = ../../../aten/src/ATen/ATen.h \ ../../../aten/src/ATen/Scalar.h \ ../../../aten/src/ATen/TensorOptions.h \ ../../../aten/src/ATen/core/Tensor.h \ + ../../../aten/src/ATen/native/TensorShape.h \ ../../../build/aten/src/ATen/Functions.h \ ../../../build/aten/src/ATen/core/TensorBody.h \ ../../../c10/core/Device.h \ ../../../c10/core/DeviceType.h \ ../../../c10/util/Half.h \ ../../../c10/util/ArrayRef.h \ + ../../../c10/util/OptionalArrayRef.h \ ../../../c10/util/Exception.h \ ../../../c10/util/Optional.h \ ../../../c10/cuda/CUDAGuard.h \ diff --git a/docs/cpp/source/check-doxygen.sh b/docs/cpp/source/check-doxygen.sh index 6ff6832cd056c4..28c7e5b81ace98 100755 --- a/docs/cpp/source/check-doxygen.sh +++ b/docs/cpp/source/check-doxygen.sh @@ -19,8 +19,7 @@ cp torch/_utils_internal.py tools/shared python -m tools.codegen.gen python tools/setup_helpers/generate_code.py \ - --native-functions-path aten/src/ATen/native/native_functions.yaml \ - --nn-path aten/src + --native-functions-path aten/src/ATen/native/native_functions.yaml popd diff --git a/docs/requirements.txt b/docs/requirements.txt index 34ec6078225bdf..57bee508f61b40 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,4 +1,5 @@ sphinx==3.5.4 +Jinja2==3.0.* docutils==0.16 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme sphinxcontrib.katex @@ -7,3 +8,4 @@ tensorboard # required to build torch.distributed.elastic.rendezvous.etcd* 
docs python-etcd>=0.4.5 sphinx_copybutton +sphinx-panels diff --git a/docs/source/amp.rst b/docs/source/amp.rst index 1f70f2c6982e63..e5d2a10585627d 100644 --- a/docs/source/amp.rst +++ b/docs/source/amp.rst @@ -1,22 +1,33 @@ .. role:: hidden :class: hidden-section -Automatic Mixed Precision package - torch.cuda.amp -================================================== +Automatic Mixed Precision package - torch.amp +============================================= -.. automodule:: torch.cuda.amp -.. currentmodule:: torch.cuda.amp +.. Both modules below are missing doc entry. Adding them here for now. +.. This does not add anything to the rendered page +.. py:module:: torch.cpu +.. py:module:: torch.cpu.amp +.. py:module:: torch.cuda.amp + +.. automodule:: torch.amp +.. currentmodule:: torch.amp -:class:`torch.cuda.amp` and :class:`torch` provide convenience methods for mixed precision, +:class:`torch.amp` provides convenience methods for mixed precision, where some operations use the ``torch.float32`` (``float``) datatype and other operations -use ``torch.float16`` (``half``). Some ops, like linear layers and convolutions, -are much faster in ``float16``. Other ops, like reductions, often require the dynamic +use lower precision floating point datatype (``lower_precision_fp``): ``torch.float16`` (``half``) or ``torch.bfloat16``. Some ops, like linear layers and convolutions, +are much faster in ``lower_precision_fp``. Other ops, like reductions, often require the dynamic range of ``float32``. Mixed precision tries to match each op to its appropriate datatype. -Ordinarily, "automatic mixed precision training" uses :class:`torch.autocast` and -:class:`torch.cuda.amp.GradScaler` together, as shown in the :ref:`Automatic Mixed Precision examples` -and `Automatic Mixed Precision recipe `_. -However, :class:`torch.autocast` and :class:`GradScaler` are modular, and may be used separately if desired. +Ordinarily, "automatic mixed precision training" with datatype of ``torch.float16`` uses :class:`torch.autocast` and +:class:`torch.cuda.amp.GradScaler` together, as shown in the :ref:`CUDA Automatic Mixed Precision examples` +and `CUDA Automatic Mixed Precision recipe `_. +However, :class:`torch.autocast` and :class:`torch.cuda.amp.GradScaler` are modular, and may be used separately if desired. + +For CUDA and CPU, APIs are also provided seperately: + +* ``torch.autocast("cuda", args...)`` is equivalent to ``torch.cuda.amp.autocast(args...)``. +* ``torch.autocast("cpu", args...)`` is equivalent to ``torch.cpu.amp.autocast(args...)``. For CPU, only lower precision floating point datatype of ``torch.bfloat16`` is supported for now. .. contents:: :local: @@ -38,6 +49,11 @@ Autocasting .. autofunction:: custom_bwd +.. currentmodule:: torch.cpu.amp + +.. autoclass:: autocast + :members: + .. _gradient-scaling: Gradient Scaling @@ -56,6 +72,8 @@ so they don't flush to zero. Each parameter's gradient (``.grad`` attribute) should be unscaled before the optimizer updates the parameters, so the scale factor does not interfere with the learning rate. +.. currentmodule:: torch.cuda.amp + .. autoclass:: GradScaler :members: @@ -68,8 +86,6 @@ Autocast Op Reference Op Eligibility -------------- -Only CUDA ops are eligible for autocasting. - Ops that run in ``float64`` or non-floating-point dtypes are not eligible, and will run in these types whether or not autocast is enabled. @@ -84,8 +100,10 @@ regions. 
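One sentence in the amp.rst rewrite above says that reductions often need the dynamic range of float32. A small standalone illustration of how a low-precision accumulator goes wrong in a long reduction, using float vs. double as a stand-in for lower_precision_fp vs. float32 (the exact cutoffs differ for float16 and bfloat16, but it is the same kind of failure):

```cpp
#include <iostream>

// Accumulating many small terms in a low-precision type stalls once the
// running sum is large relative to each term: here the float accumulator
// saturates at 2^24 while the double accumulator keeps counting.
int main() {
  float  sum_lo = 0.0f;
  double sum_hi = 0.0;
  for (int i = 0; i < 20'000'000; ++i) {
    sum_lo += 1.0f;  // stops increasing once 1.0f falls below float's ulp here
    sum_hi += 1.0;
  }
  std::cout << sum_lo << " vs " << sum_hi << "\n";
  // prints 1.67772e+07 vs 2e+07: the float accumulator stalled at 16777216
}
```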
Ops called with an explicit ``dtype=...`` argument are not eligible, and will produce output that respects the ``dtype`` argument. -Op-Specific Behavior --------------------- +.. _autocast-cuda-op-reference: + +CUDA Op-Specific Behavior +------------------------- The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a :class:`torch.nn.Module`, as a function, or as a :class:`torch.Tensor` method. If functions are exposed in multiple namespaces, @@ -99,8 +117,8 @@ If an op is unlisted, we assume it's numerically stable in ``float16``. If you believe an unlisted op is numerically unstable in ``float16``, please file an issue. -Ops that can autocast to ``float16`` -"""""""""""""""""""""""""""""""""""" +CUDA Ops that can autocast to ``float16`` +""""""""""""""""""""""""""""""""""""""""" ``__matmul__``, ``addbmm``, @@ -126,8 +144,8 @@ Ops that can autocast to ``float16`` ``prelu``, ``RNNCell`` -Ops that can autocast to ``float32`` -"""""""""""""""""""""""""""""""""""" +CUDA Ops that can autocast to ``float32`` +""""""""""""""""""""""""""""""""""""""""" ``__pow__``, ``__rdiv__``, @@ -181,8 +199,8 @@ Ops that can autocast to ``float32`` ``tan``, ``triplet_margin_loss`` -Ops that promote to the widest input type -""""""""""""""""""""""""""""""""""""""""" +CUDA Ops that promote to the widest input type +"""""""""""""""""""""""""""""""""""""""""""""" These ops don't require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are ``float16``, the op runs in ``float16``. If any of the inputs is ``float32``, @@ -216,3 +234,191 @@ Many models use a sigmoid layer right before the binary cross entropy layer. In this case, combine the two layers using :func:`torch.nn.functional.binary_cross_entropy_with_logits` or :mod:`torch.nn.BCEWithLogitsLoss`. ``binary_cross_entropy_with_logits`` and ``BCEWithLogits`` are safe to autocast. + +.. _autocast-cpu-op-reference: + +CPU Op-Specific Behavior +------------------------ +The following lists describe the behavior of eligible ops in autocast-enabled regions. +These ops always go through autocasting whether they are invoked as part of a :class:`torch.nn.Module`, +as a function, or as a :class:`torch.Tensor` method. If functions are exposed in multiple namespaces, +they go through autocasting regardless of the namespace. + +Ops not listed below do not go through autocasting. They run in the type +defined by their inputs. However, autocasting may still change the type +in which unlisted ops run if they're downstream from autocasted ops. + +If an op is unlisted, we assume it's numerically stable in ``bfloat16``. +If you believe an unlisted op is numerically unstable in ``bfloat16``, +please file an issue. 
+ +CPU Ops that can autocast to ``bfloat16`` +""""""""""""""""""""""""""""""""""""""""" + +``conv1d``, +``conv2d``, +``conv3d``, +``bmm``, +``mm``, +``baddbmm``, +``addmm``, +``addbmm``, +``linear``, +``_convolution`` + +CPU Ops that can autocast to ``float32`` +"""""""""""""""""""""""""""""""""""""""" + +``conv_transpose1d``, +``conv_transpose2d``, +``conv_transpose3d``, +``batch_norm``, +``dropout``, +``avg_pool1d``, +``avg_pool2d``, +``avg_pool3d``, +``gelu``, +``upsample_nearest1d``, +``_upsample_nearest_exact1d``, +``upsample_nearest2d``, +``_upsample_nearest_exact2d``, +``upsample_nearest3d``, +``_upsample_nearest_exact3d``, +``upsample_linear1d``, +``upsample_bilinear2d``, +``upsample_trilinear3d``, +``binary_cross_entropy``, +``binary_cross_entropy_with_logits``, +``instance_norm``, +``grid_sampler``, +``polar``, +``multinomial``, +``poisson``, +``fmod``, +``prod``, +``quantile``, +``nanquantile``, +``stft``, +``cdist``, +``cross``, +``cumprod``, +``cumsum``, +``diag``, +``diagflat``, +``histc``, +``logcumsumexp``, +``searchsorted``, +``trace``, +``tril``, +``triu``, +``vander``, +``view_as_complex``, +``cholesky``, +``cholesky_inverse``, +``cholesky_solve``, +``dot``, +``inverse``, +``lu_solve``, +``matrix_rank``, +``orgqr``, +``inverse``, +``ormqr``, +``pinverse``, +``vdot``, +``im2col``, +``col2im``, +``max_pool3d``, +``max_unpool2d``, +``max_unpool3d``, +``adaptive_avg_pool3d``, +``reflection_pad1d``, +``reflection_pad2d``, +``replication_pad1d``, +``replication_pad2d``, +``replication_pad3d``, +``elu``, +``hardshrink``, +``hardsigmoid``, +``hardswish``, +``log_sigmoid``, +``prelu``, +``selu``, +``celu``, +``softplus``, +``softshrink``, +``group_norm``, +``smooth_l1_loss``, +``mse_loss``, +``ctc_loss``, +``kl_div``, +``multilabel_margin_loss``, +``fft_fft``, +``fft_ifft``, +``fft_fft2``, +``fft_ifft2``, +``fft_fftn``, +``fft_ifftn``, +``fft_rfft``, +``fft_irfft``, +``fft_rfft2``, +``fft_irfft2``, +``fft_rfftn``, +``fft_irfftn``, +``fft_hfft``, +``fft_ihfft``, +``conv_tbc``, +``linalg_matrix_norm``, +``linalg_cond``, +``linalg_matrix_rank``, +``linalg_solve``, +``linalg_cholesky``, +``linalg_svdvals``, +``linalg_eigvals``, +``linalg_eigvalsh``, +``linalg_inv``, +``linalg_householder_product``, +``linalg_tensorinv``, +``linalg_tensorsolve``, +``fake_quantize_per_tensor_affine``, +``glu``, +``cummax``, +``cummin``, +``eig``, +``geqrf``, +``lstsq``, +``_lu_with_info``, +``lu_unpack``, +``qr``, +``solve``, +``svd``, +``symeig``, +``triangular_solve``, +``fractional_max_pool2d``, +``fractional_max_pool3d``, +``adaptive_max_pool1d``, +``adaptive_max_pool2d``, +``adaptive_max_pool3d``, +``multilabel_margin_loss_forward``, +``linalg_qr``, +``linalg_cholesky_ex``, +``linalg_svd``, +``linalg_eig``, +``linalg_eigh``, +``linalg_lstsq``, +``linalg_inv_ex`` + +CPU Ops that promote to the widest input type +""""""""""""""""""""""""""""""""""""""""""""" +These ops don't require a particular dtype for stability, but take multiple inputs +and require that the inputs' dtypes match. If all of the inputs are +``bfloat16``, the op runs in ``bfloat16``. If any of the inputs is ``float32``, +autocast casts all inputs to ``float32`` and runs the op in ``float32``. + +``cat``, +``stack``, +``index_copy`` + +Some ops not listed here (e.g., binary ops like ``add``) natively promote +inputs without autocasting's intervention. If inputs are a mixture of ``bfloat16`` +and ``float32``, these ops run in ``float32`` and produce ``float32`` output, +regardless of whether autocast is enabled. 
diff --git a/docs/source/backends.rst b/docs/source/backends.rst index 45d6fdf2add2a8..2b49e4c9341692 100644 --- a/docs/source/backends.rst +++ b/docs/source/backends.rst @@ -3,6 +3,7 @@ torch.backends ============== +.. automodule:: torch.backends `torch.backends` controls the behavior of various backends that PyTorch supports. @@ -17,6 +18,7 @@ These backends include: torch.backends.cuda ^^^^^^^^^^^^^^^^^^^ +.. automodule:: torch.backends.cuda .. autofunction:: torch.backends.cuda.is_built @@ -50,6 +52,7 @@ torch.backends.cuda torch.backends.cudnn ^^^^^^^^^^^^^^^^^^^^ +.. automodule:: torch.backends.cudnn .. autofunction:: torch.backends.cudnn.version @@ -78,17 +81,26 @@ torch.backends.cudnn torch.backends.mkl ^^^^^^^^^^^^^^^^^^ +.. automodule:: torch.backends.mkl .. autofunction:: torch.backends.mkl.is_available torch.backends.mkldnn ^^^^^^^^^^^^^^^^^^^^^ +.. automodule:: torch.backends.mkldnn .. autofunction:: torch.backends.mkldnn.is_available torch.backends.openmp ^^^^^^^^^^^^^^^^^^^^^ +.. automodule:: torch.backends.openmp .. autofunction:: torch.backends.openmp.is_available + +.. Docs for other backends need to be added here. +.. Automodules are just here to ensure checks run but they don't actually +.. add anything to the rendered page for now. +.. py:module:: torch.backends.quantized +.. py:module:: torch.backends.xnnpack diff --git a/docs/source/benchmark_utils.rst b/docs/source/benchmark_utils.rst index c211dcb7b58003..c93fbfd66c3d9a 100644 --- a/docs/source/benchmark_utils.rst +++ b/docs/source/benchmark_utils.rst @@ -18,3 +18,10 @@ Benchmark Utils - torch.utils.benchmark .. autoclass:: FunctionCounts :members: + +.. These are missing documentation. Adding them here until a better place +.. is made in this file. +.. py:module:: torch.utils.benchmark.examples +.. py:module:: torch.utils.benchmark.op_fuzzers +.. py:module:: torch.utils.benchmark.utils +.. py:module:: torch.utils.benchmark.utils.valgrind_wrapper diff --git a/docs/source/bottleneck.rst b/docs/source/bottleneck.rst index d6ce122234fb11..3fa1c99b506171 100644 --- a/docs/source/bottleneck.rst +++ b/docs/source/bottleneck.rst @@ -1,6 +1,7 @@ torch.utils.bottleneck ====================== +.. automodule:: torch.utils.bottleneck .. currentmodule:: torch.utils.bottleneck `torch.utils.bottleneck` is a tool that can be used as an initial step for diff --git a/docs/source/conf.py b/docs/source/conf.py index de66776b85cbae..d36deda65a19ab 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -57,12 +57,16 @@ 'sphinxcontrib.katex', 'sphinx.ext.autosectionlabel', 'sphinx_copybutton', + 'sphinx_panels' ] # build the templated autosummary files autosummary_generate = True numpydoc_show_class_members = False +# Theme has bootstrap already +panels_add_bootstrap_css = False + # autosectionlabel throws warnings if section names are duplicated. # The following tells autosectionlabel to not throw a warning for # duplicated section names that are in different documents. @@ -82,6 +86,8 @@ # TODO: document these and remove them from here. 
coverage_ignore_functions = [ + # torch + "typename", # torch.autograd "register_py_tensor_class_for_device", "variable", @@ -125,9 +131,41 @@ "execWrapper", # torch.onnx "unregister_custom_op_symbolic", + # torch.ao.quantization + "default_eval_fn", + # torch.ao.quantization.fx.backend_config + "validate_backend_config_dict", + # torch.backends + "disable_global_flags", + "flags_frozen", + # torch.distributed.algorithms.ddp_comm_hooks + "register_ddp_comm_hook", + # torch.nn + "factory_kwargs", + # torch.nn.parallel + "DistributedDataParallelCPU", + # torch.utils + "set_module", + # torch.utils.model_dump + "burn_in_info", + "get_info_and_burn_skeleton", + "get_inline_skeleton", + "get_model_info", + "get_storage_info", + "hierarchical_pickle", ] coverage_ignore_classes = [ + # torch + "FatalError", + "QUInt2x4Storage", + "Size", + "Storage", + "Stream", + "Tensor", + "finfo", + "iinfo", + "qscheme", # torch.cuda "BFloat16Storage", "BFloat16Tensor", @@ -193,109 +231,25 @@ # torch.onnx "CheckerError", "ExportTypes", + # torch.backends + "ContextProp", + "PropModule", + # torch.backends.cuda + "cuBLASModule", + "cuFFTPlanCache", + "cuFFTPlanCacheAttrContextProp", + "cuFFTPlanCacheManager", + # torch.distributed.algorithms.ddp_comm_hooks + "DDPCommHookType", + # torch.jit.mobile + "LiteScriptModule", + # torch.nn.quantized.modules + "DeQuantize", + "Quantize", + # torch.utils.backcompat + "Warning", ] -# List of modules that do not have automodule/py:module in the doc yet -# We should NOT add anything to this list, see the CI failure message -# on how to solve missing automodule issues -coverage_missing_automodule = [ - "torch", - "torch.ao", - "torch.ao.nn", - "torch.ao.nn.sparse", - "torch.ao.nn.sparse.quantized", - "torch.ao.nn.sparse.quantized.dynamic", - "torch.ao.ns", - "torch.ao.ns.fx", - "torch.ao.quantization", - "torch.ao.quantization.fx", - "torch.ao.quantization.fx.backend_config", - "torch.ao.sparsity", - "torch.ao.sparsity.experimental", - "torch.ao.sparsity.experimental.pruner", - "torch.ao.sparsity.scheduler", - "torch.ao.sparsity.sparsifier", - "torch.backends", - "torch.backends.cuda", - "torch.backends.cudnn", - "torch.backends.mkl", - "torch.backends.mkldnn", - "torch.backends.openmp", - "torch.backends.quantized", - "torch.backends.xnnpack", - "torch.contrib", - "torch.cpu", - "torch.cpu.amp", - "torch.distributed.algorithms", - "torch.distributed.algorithms.ddp_comm_hooks", - "torch.distributed.algorithms.model_averaging", - "torch.distributed.elastic", - "torch.distributed.elastic.utils", - "torch.distributed.elastic.utils.data", - "torch.distributed.launcher", - "torch.distributed.nn", - "torch.distributed.nn.api", - "torch.distributed.nn.jit", - "torch.distributed.nn.jit.templates", - "torch.distributed.pipeline", - "torch.distributed.pipeline.sync", - "torch.distributed.pipeline.sync.skip", - "torch.fft", - "torch.for_onnx", - "torch.fx.experimental", - "torch.fx.experimental.unification", - "torch.fx.experimental.unification.multipledispatch", - "torch.fx.passes", - "torch.jit.mobile", - "torch.nn", - "torch.nn.backends", - "torch.nn.intrinsic", - "torch.nn.intrinsic.modules", - "torch.nn.intrinsic.qat", - "torch.nn.intrinsic.qat.modules", - "torch.nn.intrinsic.quantized", - "torch.nn.intrinsic.quantized.dynamic", - "torch.nn.intrinsic.quantized.dynamic.modules", - "torch.nn.intrinsic.quantized.modules", - "torch.nn.modules", - "torch.nn.parallel", - "torch.nn.qat", - "torch.nn.qat.modules", - "torch.nn.qat.dynamic", - "torch.nn.qat.dynamic.modules", - 
"torch.nn.quantizable", - "torch.nn.quantizable.modules", - "torch.nn.quantized", - "torch.nn.quantized.dynamic", - "torch.nn.quantized.dynamic.modules", - "torch.nn.quantized.modules", - "torch.nn.utils", - "torch.package", - "torch.package.analyze", - "torch.quantization", - "torch.quantization.fx", - "torch.sparse", - "torch.special", - "torch.utils", - "torch.utils.backcompat", - "torch.utils.benchmark.examples", - "torch.utils.benchmark.op_fuzzers", - "torch.utils.benchmark.utils", - "torch.utils.benchmark.utils.valgrind_wrapper", - "torch.utils.bottleneck", - "torch.utils.data.communication", - "torch.utils.data.datapipes", - "torch.utils.data.datapipes.dataframe", - "torch.utils.data.datapipes.iter", - "torch.utils.data.datapipes.map", - "torch.utils.data.datapipes.utils", - "torch.utils.ffi", - "torch.utils.hipify", - "torch.utils.model_dump", - "torch.utils.tensorboard", -] - - # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # @@ -413,6 +367,11 @@ def coverage_post_process(app, exception): if not isinstance(app.builder, CoverageBuilder): return + if not torch.distributed.is_available(): + raise RuntimeError("The coverage tool cannot run with a version " + "of PyTorch that was built with USE_DISTRIBUTED=0 " + "as this module's API changes.") + # These are all the modules that have "automodule" in an rst file # These modules are the ones for which coverage is checked # Here, we make sure that no module is missing from that list @@ -439,26 +398,16 @@ def is_not_internal(modname): if modname not in modules: missing.add(modname) - expected = set(coverage_missing_automodule) - output = [] - unexpected_missing = missing - expected - if unexpected_missing: - mods = ", ".join(unexpected_missing) + if missing: + mods = ", ".join(missing) output.append(f"\nYou added the following module(s) to the PyTorch namespace '{mods}' " "but they have no corresponding entry in a doc .rst file. You should " "either make sure that the .rst file that contains the module's documentation " "properly contains either '.. automodule:: mod_name' (if you do not want " - "the paragraph added by the automodule, you can simply use py:module) or " - "make the module private (by appending an '_' at the beginning of its name.") - - unexpected_not_missing = expected - missing - if unexpected_not_missing: - mods = ", ".join(unexpected_not_missing) - output.append(f"\nThank you for adding the missing .rst entries for '{mods}', please update " - "the 'coverage_missing_automodule' in 'torch/docs/source/conf.py' to remove " - "the module(s) you fixed and make sure we do not regress on this in the future.") + "the paragraph added by the automodule, you can simply use '.. py:module:: mod_name') " + " or make the module private (by appending an '_' at the beginning of its name).") # The output file is hard-coded by the coverage tool # Our CI is setup to fail if any line is added to this file diff --git a/docs/source/__config__.rst b/docs/source/config_mod.rst similarity index 100% rename from docs/source/__config__.rst rename to docs/source/config_mod.rst diff --git a/docs/source/data.rst b/docs/source/data.rst index 322de88e27d939..646f41436caf61 100644 --- a/docs/source/data.rst +++ b/docs/source/data.rst @@ -432,3 +432,15 @@ Example:: .. autoclass:: torch.utils.data.WeightedRandomSampler .. autoclass:: torch.utils.data.BatchSampler .. autoclass:: torch.utils.data.distributed.DistributedSampler + + +.. This module is experimental and should be private, adding it here for now +.. 
py:module:: torch.utils.data.communication + +.. These modules are documented as part of torch/data listing them here for +.. now until we have a clearer fix +.. py:module:: torch.utils.data.datapipes +.. py:module:: torch.utils.data.datapipes.dataframe +.. py:module:: torch.utils.data.datapipes.iter +.. py:module:: torch.utils.data.datapipes.map +.. py:module:: torch.utils.data.datapipes.utils diff --git a/docs/source/distributed.rst b/docs/source/distributed.rst index 6c956c68422258..0eb143ca49a5a4 100644 --- a/docs/source/distributed.rst +++ b/docs/source/distributed.rst @@ -123,14 +123,24 @@ It is imperative that all processes specify the same number of interfaces in thi Other NCCL environment variables """""""""""""""""""""""""""""""" -NCCL has also provided a number of environment variables for fine-tuning purposes. - -Commonly used ones include the following for debugging purposes: - -- ``export NCCL_DEBUG=INFO`` -- ``export NCCL_DEBUG_SUBSYS=ALL`` - -For the full list of NCCL environment variables, please refer to +**Debugging** - in case of NCCL failure, you can set ``NCCL_DEBUG=INFO`` to print an explicit +warning message as well as basic NCCL initialization information. + +You may also use ``NCCL_DEBUG_SUBSYS`` to get more details about a specific +aspect of NCCL. For example, ``NCCL_DEBUG_SUBSYS=COLL`` would print logs of +collective calls, which may be helpful when debugging hangs, especially those +caused by collective type or message size mismatch. In case of topology +detection failure, it would be helpful to set ``NCCL_DEBUG_SUBSYS=GRAPH`` +to inspect the detailed detection result and save as reference if further help +from NCCL team is needed. + +**Performance tuning** - NCCL performs automatic tuning based on its topology detection to save users' +tuning effort. On some socket-based systems, users may still try tuning +``NCCL_SOCKET_NTHREADS`` and ``NCCL_NSOCKS_PERTHREAD`` to increase socket +network bandwidth. These two environment variables have been pre-tuned by NCCL +for some cloud providers, such as AWS or GCP. + +For a full list of NCCL environment variables, please refer to `NVIDIA NCCL's official documentation `_ @@ -808,3 +818,21 @@ following matrix shows how the log level can be adjusted via the combination of +-------------------------+-----------------------------+------------------------+ | ``INFO`` | ``DETAIL`` | Trace (a.k.a. All) | +-------------------------+-----------------------------+------------------------+ + + +.. Distributed modules that are missing specific entries. +.. Adding them here for tracking purposes until they are more permanently fixed. +.. py:module:: torch.distributed.algorithms +.. py:module:: torch.distributed.algorithms.ddp_comm_hooks +.. py:module:: torch.distributed.algorithms.model_averaging +.. py:module:: torch.distributed.elastic +.. py:module:: torch.distributed.elastic.utils +.. py:module:: torch.distributed.elastic.utils.data +.. py:module:: torch.distributed.launcher +.. py:module:: torch.distributed.nn +.. py:module:: torch.distributed.nn.api +.. py:module:: torch.distributed.nn.jit +.. py:module:: torch.distributed.nn.jit.templates +.. py:module:: torch.distributed.pipeline +.. py:module:: torch.distributed.pipeline.sync +.. py:module:: torch.distributed.pipeline.sync.skip diff --git a/docs/source/fft.rst b/docs/source/fft.rst index 05f6215af513d5..5406b6610a602b 100644 --- a/docs/source/fft.rst +++ b/docs/source/fft.rst @@ -7,8 +7,6 @@ torch.fft Discrete Fourier transforms and related functions. .. 
automodule:: torch.fft - :noindex: - .. currentmodule:: torch.fft Fast Fourier Transforms diff --git a/docs/source/fx.rst b/docs/source/fx.rst index 65689930743da9..de1e1b88f93e21 100644 --- a/docs/source/fx.rst +++ b/docs/source/fx.rst @@ -1109,3 +1109,12 @@ API Reference :members: .. autofunction:: torch.fx.replace_pattern + + +.. The experimental and passes submodules are missing docs. +.. Adding it here for coverage but this doesn't add anything to the +.. rendered doc. +.. py:module:: torch.fx.passes +.. py:module:: torch.fx.experimental +.. py:module:: torch.fx.experimental.unification +.. py:module:: torch.fx.experimental.unification.multipledispatch diff --git a/docs/source/index.rst b/docs/source/index.rst index 24aa75476b044e..e64f7425c56d2a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -54,9 +54,9 @@ Features described in this documentation are classified by release status: tensors tensor_attributes tensor_view + torch.amp torch.autograd cuda - torch.cuda.amp torch.backends torch.distributed torch.distributed.algorithms.join @@ -100,7 +100,7 @@ Features described in this documentation are classified by release status: type_info named_tensor name_inference - torch.__config__ <__config__> + torch.__config__ .. toctree:: :maxdepth: 1 diff --git a/docs/source/jit.rst b/docs/source/jit.rst index 23426fb3d9ea00..d2d55215aa3f10 100644 --- a/docs/source/jit.rst +++ b/docs/source/jit.rst @@ -878,3 +878,7 @@ References jit_python_reference jit_unsupported + +.. This package is missing doc. Adding it here for coverage +.. This does not add anything to the rendered page. +.. py:module:: torch.jit.mobile diff --git a/docs/source/nn.rst b/docs/source/nn.rst index 6eca9d4b16b6a4..0e9d161c014bc1 100644 --- a/docs/source/nn.rst +++ b/docs/source/nn.rst @@ -3,6 +3,8 @@ torch.nn =================================== +.. automodule:: torch.nn +.. automodule:: torch.nn.modules These are the basic building blocks for graphs: @@ -331,6 +333,8 @@ Shuffle Layers DataParallel Layers (multi-GPU, distributed) -------------------------------------------- +.. automodule:: torch.nn.parallel +.. currentmodule:: torch .. autosummary:: :toctree: generated @@ -342,6 +346,7 @@ DataParallel Layers (multi-GPU, distributed) Utilities --------- +.. automodule:: torch.nn.utils From the ``torch.nn.utils`` module @@ -453,3 +458,7 @@ Lazy Modules Initialization :template: classtemplate.rst nn.modules.lazy.LazyModuleMixin + + +.. This module is kept only for backward compatibility +.. py:module:: torch.nn.backends diff --git a/docs/source/notes/amp_examples.rst b/docs/source/notes/amp_examples.rst index 90cda473cb2926..b6bcc38bc0f300 100644 --- a/docs/source/notes/amp_examples.rst +++ b/docs/source/notes/amp_examples.rst @@ -1,7 +1,7 @@ .. _amp-examples: -Automatic Mixed Precision examples -================================== +CUDA Automatic Mixed Precision examples +======================================= .. currentmodule:: torch.cuda.amp diff --git a/docs/source/notes/autograd.rst b/docs/source/notes/autograd.rst index af8922ddfce4b8..216bb8cfb2510a 100644 --- a/docs/source/notes/autograd.rst +++ b/docs/source/notes/autograd.rst @@ -222,7 +222,7 @@ Evaluation Mode (``nn.Module.eval()``) Evaluation mode is not actually a mechanism to locally disable gradient computation. It is included here anyway because it is sometimes confused to be such a mechanism. 
-Functionally, ``module.eval()`` (or equivalently ``module.train()``) are completely +Functionally, ``module.eval()`` (or equivalently ``module.train(False)``) are completely orthogonal to no-grad mode and inference mode. How ``model.eval()`` affects your model depends entirely on the specific modules used in your model and whether they define any training-mode specific behavior. diff --git a/docs/source/notes/cuda.rst b/docs/source/notes/cuda.rst index b2901a6fe33658..59eb7d4c72b69f 100644 --- a/docs/source/notes/cuda.rst +++ b/docs/source/notes/cuda.rst @@ -364,6 +364,26 @@ Available options: :meth:`~torch.cuda.memory_summary` methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to 'out of memory' and showing a large amount of inactive split blocks. +* ``roundup_power2_divisions`` helps with rounding the requested allocation + size to nearest power-2 division and making better use of the blocks. In + the current CUDACachingAllocator, the sizes are rounded up in multiple + of blocks size of 512, so this works fine for smaller sizes. However, this + can be inefficient for large near-by allocations as each will go to different + size of blocks and re-use of those blocks are minimized. This might create + lots of unused blocks and will waste GPU memory capacity. This option enables + the rounding of allocation size to nearest power-2 division. For example, if + we need to round-up size of 1200 and if number of divisions is 4, + the size 1200 lies between 1024 and 2048 and if we do 4 divisions between + them, the values are 1024, 1280, 1536, and 1792. So, allocation size of 1200 + will be rounded to 1280 as the nearest ceiling of power-2 division. +* ``garbage_collection_threshold`` helps actively reclaiming unused GPU memory to + avoid triggering expensive sync-and-reclaim-all operation (release_cached_blocks), + which can be unfavorable to latency-critical GPU applications (e.g., servers). + Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming + GPU memory blocks if the GPU memory capacity usage exceeds the threshold (i.e., + 80% of the total memory allocated to the GPU application). The algorithm prefers + to free old & unused blocks first to avoid freeing blocks that are actively being + reused. The threshold value should be between greater than 0.0 and less than 1.0. .. _cufft-plan-cache: diff --git a/docs/source/onnx.rst b/docs/source/onnx.rst index 78458c1d71053e..5ed8d2aebd0bf0 100644 --- a/docs/source/onnx.rst +++ b/docs/source/onnx.rst @@ -130,9 +130,9 @@ a :class:`torch.nn.Module`. If the passed-in model is not already a ``ScriptModu of different sizes. To use scripting: * Use :func:`torch.jit.script` to produce a ``ScriptModule``. - * Call ``torch.onnx.export()`` with the ``ScriptModule`` as the model, and set the - ``example_outputs`` arg. This is required so that the types and shapes of the outputs can be - captured without executing the model. + * Call ``torch.onnx.export()`` with the ``ScriptModule`` as the model. The ``args`` are still required, + but they will be used internally only to produce example outputs, so that the types and shapes of the + outputs can be captured. No tracing will be performed. See `Introduction to TorchScript `_ and `TorchScript `_ for more details, including how to compose tracing and scripting to suit the @@ -332,10 +332,20 @@ The process for adding a symbolic function depends on the type of operator. 
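The two caching-allocator options documented above are configured through the same environment variable as the other options in that section. A minimal sketch, assuming the usual comma-separated ``key:value`` syntax of ``PYTORCH_CUDA_ALLOC_CONF`` and purely illustrative values::

    import os

    # Must be set before the first CUDA allocation is made.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
        "garbage_collection_threshold:0.8,roundup_power2_divisions:4"
    )

    import torch

    # Subsequent allocations use the rounded power-of-two size divisions, and the
    # allocator starts reclaiming unused blocks once usage crosses 80% of capacity.
    x = torch.randn(1024, 1024, device="cuda")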
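Similarly, the scripting-based ONNX export flow described above no longer takes an ``example_outputs`` argument; ``args`` is only run to capture output types and shapes. A hedged sketch with a hypothetical module (whether a given control-flow pattern exports cleanly still depends on the opset and exporter version)::

    import torch

    class TinyModel(torch.nn.Module):
        def forward(self, x):
            # data-dependent control flow is the usual reason to script rather than trace
            if x.sum() > 0:
                return x.relu()
            return -x

    scripted = torch.jit.script(TinyModel())
    # args are not traced; they are used internally only to produce example outputs
    torch.onnx.export(scripted, (torch.randn(2, 3),), "tiny.onnx", opset_version=11)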
ATen operators ^^^^^^^^^^^^^^ - `ATen `_ is PyTorch’s built-in tensor library. If the operator is an ATen operator (shows up in the TorchScript graph with the prefix -``aten::``): +``aten::``), make sure it is not supported already. + +List of supported operators +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Visit the auto generated :doc:`list of supported ATen operators <../onnx_supported_aten_ops>` +for details on which operator are supported in each ``opset_version``. + +Adding support for an operator +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If the operator is not in the list above: * Define the symbolic function in ``torch/onnx/symbolic_opset.py``, for example `torch/onnx/symbolic_opset9.py `_. @@ -598,6 +608,11 @@ Functions .. autofunction:: register_custom_op_symbolic .. autofunction:: select_model_mode_for_export .. autofunction:: is_in_onnx_export +.. autofunction:: is_onnx_log_enabled +.. autofunction:: enable_log +.. autofunction:: disable_log +.. autofunction:: set_log_stream +.. autofunction:: log Classes ------- diff --git a/docs/source/onnx_supported_aten_ops.rst b/docs/source/onnx_supported_aten_ops.rst new file mode 100644 index 00000000000000..d6bf535e2e7ec3 --- /dev/null +++ b/docs/source/onnx_supported_aten_ops.rst @@ -0,0 +1,14 @@ +:orphan: + +ONNX supported ATen operators +============================= + +This file is automatically generated during the documentation build +by cross referencing ONNX operator symbolics with Torch JIT operators via +``docs/source/scripts/build_onnx_supported_aten_op_csv_table.py``. +Do not modify directly and instead `rebuild the docs `_. + +.. csv-table:: Supported ATen operators + :file: ../build/auto_gen_aten_op_list.csv + :widths: 30, 70 + :header-rows: 1 diff --git a/docs/source/package.rst b/docs/source/package.rst index c7881f1961406f..b72112ffed31fb 100644 --- a/docs/source/package.rst +++ b/docs/source/package.rst @@ -1,3 +1,6 @@ +.. automodule:: torch.package +.. py:module:: torch.package.analyze + .. currentmodule:: torch.package torch.package diff --git a/docs/source/quantization-backend-configuration.rst b/docs/source/quantization-backend-configuration.rst new file mode 100644 index 00000000000000..07fd875fa9b34a --- /dev/null +++ b/docs/source/quantization-backend-configuration.rst @@ -0,0 +1,20 @@ +Quantization Backend Configuration +---------------------------------- + +FX Graph Mode Quantization allows the user to configure various +quantization behaviors of an op in order to match the expectation +of their backend. + +In the future, this document will contain a detailed spec of +these configurations. + + +Default values for native configurations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Below is the output of the configuration for quantization of ops +in fbgemm and qnnpack (PyTorch's default quantized backends). + +Results: + +.. literalinclude:: scripts/quantization_backend_configs/default_backend_config.txt diff --git a/docs/source/quantization-support.rst b/docs/source/quantization-support.rst index 78c5ea247c482b..da6649a2fee3d7 100644 --- a/docs/source/quantization-support.rst +++ b/docs/source/quantization-support.rst @@ -217,6 +217,8 @@ to configure quantization settings for individual ops. torch.nn.intrinsic ~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.intrinsic +.. automodule:: torch.nn.intrinsic.modules This module implements the combined (fused) modules conv + relu which can then be quantized. @@ -243,6 +245,9 @@ then be quantized. torch.nn.intrinsic.qat ~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.intrinsic.qat +.. 
automodule:: torch.nn.intrinsic.qat.modules + This module implements the versions of those fused operations needed for quantization aware training. @@ -268,6 +273,9 @@ quantization aware training. torch.nn.intrinsic.quantized ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.intrinsic.quantized +.. automodule:: torch.nn.intrinsic.quantized.modules + This module implements the quantized implementations of fused operations like conv + relu. No BatchNorm variants as it's usually folded into convolution @@ -289,6 +297,8 @@ for inference. torch.nn.intrinsic.quantized.dynamic ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.intrinsic.quantized.dynamic +.. automodule:: torch.nn.intrinsic.quantized.dynamic.modules This module implements the quantized dynamic implementations of fused operations like linear + relu. @@ -304,6 +314,8 @@ like linear + relu. torch.nn.qat ~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.qat +.. automodule:: torch.nn.qat.modules This module implements versions of the key nn modules **Conv2d()** and **Linear()** which run in FP32 but with rounding applied to simulate the @@ -322,6 +334,8 @@ effect of INT8 quantization. torch.nn.qat.dynamic ~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.qat.dynamic +.. automodule:: torch.nn.qat.dynamic.modules This module implements versions of the key nn modules such as **Linear()** which run in FP32 but with rounding applied to simulate the effect of INT8 @@ -338,6 +352,8 @@ quantization and will be dynamically quantized during inference. torch.nn.quantized ~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.quantized +.. automodule:: torch.nn.quantized.modules This module implements the quantized versions of the nn layers such as ~`torch.nn.Conv2d` and `torch.nn.ReLU`. @@ -376,6 +392,7 @@ This module implements the quantized versions of the nn layers such as torch.nn.quantized.functional ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.quantized.functional This module implements the quantized versions of the functional layers such as ~`torch.nn.functional.conv2d` and `torch.nn.functional.relu`. Note: @@ -413,6 +430,8 @@ This module implements the quantized versions of the functional layers such as torch.nn.quantized.dynamic ~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. automodule:: torch.nn.quantized.dynamic +.. automodule:: torch.nn.quantized.dynamic.modules Dynamically quantized :class:`~torch.nn.Linear`, :class:`~torch.nn.LSTM`, :class:`~torch.nn.LSTMCell`, :class:`~torch.nn.GRUCell`, and @@ -492,3 +511,8 @@ the `custom operator mechanism ` contains documentation +on how to configure the quantization workflows for various backends. + +.. toctree:: + :hidden: + + quantization-backend-configuration + Quantized Tensors --------------------------------------- @@ -883,3 +897,22 @@ Numerical Debugging (prototype) Eager mode numeric suite * :ref:`torch_ao_ns_numeric_suite_fx` FX numeric suite + + +.. torch.ao is missing documentation. Since part of it is mentioned here, adding them here for now. +.. They are here for tracking purposes until they are more permanently fixed. +.. py:module:: torch.ao +.. py:module:: torch.ao.nn +.. py:module:: torch.ao.nn.sparse +.. py:module:: torch.ao.nn.sparse.quantized +.. py:module:: torch.ao.nn.sparse.quantized.dynamic +.. py:module:: torch.ao.ns +.. py:module:: torch.ao.ns.fx +.. py:module:: torch.ao.quantization +.. py:module:: torch.ao.quantization.fx +.. py:module:: torch.ao.quantization.fx.backend_config +.. py:module:: torch.ao.sparsity +.. 
py:module:: torch.ao.sparsity.experimental +.. py:module:: torch.ao.sparsity.experimental.pruner +.. py:module:: torch.ao.sparsity.scheduler +.. py:module:: torch.ao.sparsity.sparsifier diff --git a/docs/source/scripts/build_onnx_supported_aten_op_csv_table.py b/docs/source/scripts/build_onnx_supported_aten_op_csv_table.py new file mode 100644 index 00000000000000..7d12a441c4409b --- /dev/null +++ b/docs/source/scripts/build_onnx_supported_aten_op_csv_table.py @@ -0,0 +1,21 @@ +""" +This script generates a CSV table with all ATen operators +supported by `torch.onnx.export`. The generated table is included by +docs/source/onnx_supported_aten_list.rst. +""" + +import os +from torch.onnx import onnx_supported_ops + +# Constants +BUILD_DIR = 'build' +AUTO_GEN_ATEN_OPS_CSV_FILE = 'auto_gen_aten_op_list.csv' + +os.makedirs(BUILD_DIR, exist_ok=True) + +aten_list = onnx_supported_ops.onnx_supported_ops() + +with open(os.path.join(BUILD_DIR, AUTO_GEN_ATEN_OPS_CSV_FILE), 'w') as f: + f.write('Operator,opset_version(s)\n') + for name, opset_version in aten_list: + f.write(f'"``{name}``","{opset_version}"\n') diff --git a/docs/source/scripts/build_quantization_configs.py b/docs/source/scripts/build_quantization_configs.py new file mode 100644 index 00000000000000..7e9a011e12ba3e --- /dev/null +++ b/docs/source/scripts/build_quantization_configs.py @@ -0,0 +1,23 @@ +""" +This script will generate default values of quantization configs. +These are for use in the documentation. +""" + +from torch.ao.quantization.fx.backend_config import get_native_backend_config_dict +import os.path +from pprint import pprint + + +# Create a directory for the images, if it doesn't exist +QUANTIZATION_BACKEND_CONFIG_IMAGE_PATH = os.path.join( + os.path.realpath(os.path.join(__file__, "..")), + "quantization_backend_configs" +) + +if not os.path.exists(QUANTIZATION_BACKEND_CONFIG_IMAGE_PATH): + os.mkdir(QUANTIZATION_BACKEND_CONFIG_IMAGE_PATH) + +output_path = os.path.join(QUANTIZATION_BACKEND_CONFIG_IMAGE_PATH, "default_backend_config.txt") + +with open(output_path, "w") as f: + pprint(get_native_backend_config_dict(), stream=f) diff --git a/docs/source/sparse.rst b/docs/source/sparse.rst index 178e4cb186030a..564df4ef432311 100644 --- a/docs/source/sparse.rst +++ b/docs/source/sparse.rst @@ -1,3 +1,5 @@ +.. automodule:: torch.sparse + .. currentmodule:: torch .. _sparse-docs: diff --git a/docs/source/special.rst b/docs/source/special.rst index 1aa24242fad9a3..42acd2148a6a9b 100644 --- a/docs/source/special.rst +++ b/docs/source/special.rst @@ -7,8 +7,6 @@ torch.special The torch.special module, modeled after SciPy's `special `_ module. .. automodule:: torch.special - :noindex: - .. currentmodule:: torch.special Functions @@ -39,6 +37,7 @@ Functions .. autofunction:: multigammaln .. autofunction:: ndtr .. autofunction:: ndtri +.. autofunction:: log_ndtr .. autofunction:: round .. autofunction:: sinc .. autofunction:: softmax diff --git a/docs/source/storage.rst b/docs/source/storage.rst index 3aeec082b607b9..747acf11ed36b8 100644 --- a/docs/source/storage.rst +++ b/docs/source/storage.rst @@ -1,87 +1,96 @@ torch.Storage =================================== -A :class:`torch.Storage` is a contiguous, one-dimensional array of a single -data type. +A :class:`torch._TypedStorage` is a contiguous, one-dimensional array of +elements of a particular :class:`torch.dtype`. It can be given any +:class:`torch.dtype`, and the internal data will be interpretted appropriately. 
-Every :class:`torch.Tensor` has a corresponding storage of the same data type. +Every strided :class:`torch.Tensor` contains a :class:`torch._TypedStorage`, +which stores all of the data that the :class:`torch.Tensor` views. -.. autoclass:: torch.DoubleStorage +For backward compatibility, there are also :class:`torch.Storage` classes +(like :class:`torch.FloatStorage`, :class:`torch.IntStorage`, etc). These +classes are not actually instantiated, and calling their constructors creates +a :class:`torch._TypedStorage` with the appropriate :class:`torch.dtype`. +:class:`torch.Storage` classes have all of the same class methods that +:class:`torch._TypedStorage` has. + +Also for backward compatibility, :class:`torch.Storage` is an alias for the +storage class that corresponds with the default data type +(:func:`torch.get_default_dtype()`). For instance, if the default data type is +:attr:`torch.float`, :class:`torch.Storage` resolves to +:class:`torch.FloatStorage`. + + +.. autoclass:: torch._TypedStorage :members: :undoc-members: :inherited-members: +.. autoclass:: torch.DoubleStorage + :members: + :undoc-members: + .. autoclass:: torch.FloatStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.HalfStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.LongStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.IntStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.ShortStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.CharStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.ByteStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.BoolStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.BFloat16Storage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.ComplexDoubleStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.ComplexFloatStorage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.QUInt8Storage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.QInt8Storage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.QInt32Storage :members: :undoc-members: - :inherited-members: .. autoclass:: torch.QUInt4x2Storage :members: :undoc-members: - :inherited-members: + +.. autoclass:: torch.QUInt2x4Storage + :members: + :undoc-members: diff --git a/docs/source/tensorboard.rst b/docs/source/tensorboard.rst index d3205e3ba58925..8cd13836928819 100644 --- a/docs/source/tensorboard.rst +++ b/docs/source/tensorboard.rst @@ -1,5 +1,6 @@ torch.utils.tensorboard =================================== +.. automodule:: torch.utils.tensorboard Before going further, more details on TensorBoard can be found at https://www.tensorflow.org/tensorboard/ diff --git a/docs/source/tensors.rst b/docs/source/tensors.rst index fe9467dd4a694a..161a17f4a6da41 100644 --- a/docs/source/tensors.rst +++ b/docs/source/tensors.rst @@ -593,6 +593,7 @@ Tensor class reference Tensor.scatter_ Tensor.scatter_add_ Tensor.scatter_add + Tensor.scatter_reduce_ Tensor.scatter_reduce Tensor.select Tensor.select_scatter diff --git a/docs/source/torch.rst b/docs/source/torch.rst index e09675af82a1a3..e4062b6096f0ee 100644 --- a/docs/source/torch.rst +++ b/docs/source/torch.rst @@ -1,13 +1,6 @@ torch ===== -The torch package contains data structures for multi-dimensional -tensors and defines mathematical operations over these tensors. 
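To make the backward-compatibility story in the ``torch.Storage`` changes above concrete, a small sketch (printed attributes follow the new documentation; exact class names may differ between releases)::

    import torch

    s = torch.FloatStorage(4)    # legacy constructor, backed by the typed-storage machinery
    print(type(s), s.dtype)      # dtype is expected to be torch.float32

    t = torch.tensor([1.0, 2.0, 3.0])
    print(t.storage().dtype, len(t.storage()))   # torch.float32 3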
-Additionally, it provides many utilities for efficient serializing of -Tensors and arbitrary types, and other useful utilities. - -It has a CUDA counterpart, that enables you to run your tensor computations -on an NVIDIA GPU with compute capability >= 3.0 - +.. automodule:: torch .. currentmodule:: torch Tensors @@ -615,3 +608,18 @@ Utilities is_warn_always_enabled vmap _assert + + +.. Empty submodules added only for tracking. +.. py:module:: torch.contrib +.. py:module:: torch.utils.backcompat + +.. This submodule is split manually without a top level page. +.. py:module:: torch.utils + +.. This module is only used internally for ROCm builds. +.. py:module:: torch.utils.hipify + +.. This module needs to be documented. Adding here in the meantime +.. for tracking purposes +.. py:module:: torch.utils.model_dump diff --git a/ios/LibTorch-Lite.podspec b/ios/LibTorch-Lite.podspec index f3ccaa43e93220..d2d9264e0a622d 100644 --- a/ios/LibTorch-Lite.podspec +++ b/ios/LibTorch-Lite.podspec @@ -1,6 +1,6 @@ Pod::Spec.new do |s| s.name = 'LibTorch-Lite' - s.version = '1.10.0' + s.version = '1.11.0' s.authors = 'PyTorch Team' s.license = { :type => 'BSD' } s.homepage = 'https://github.com/pytorch/pytorch' diff --git a/ios/LibTorch.podspec b/ios/LibTorch.podspec index 22aaafac9d12c4..77bc0537e89edc 100644 --- a/ios/LibTorch.podspec +++ b/ios/LibTorch.podspec @@ -1,6 +1,6 @@ Pod::Spec.new do |s| s.name = 'LibTorch' - s.version = '1.10.0' + s.version = '1.11.0' s.authors = 'PyTorch Team' s.license = { :type => 'BSD' } s.homepage = 'https://github.com/pytorch/pytorch' diff --git a/ios/TestApp/TestApp/Base.lproj/Main.storyboard b/ios/TestApp/TestApp/Base.lproj/Main.storyboard index ad8e8f7c874cf1..86c53ddccf2244 100644 --- a/ios/TestApp/TestApp/Base.lproj/Main.storyboard +++ b/ios/TestApp/TestApp/Base.lproj/Main.storyboard @@ -1,38 +1,22 @@ - + - + - - + - - - - - - - - - - - - - - - @@ -59,12 +43,4 @@ - - - - - - - - diff --git a/ios/TestApp/TestApp/ViewController.mm b/ios/TestApp/TestApp/ViewController.mm index 38404ddac3b9f6..d8ecacda3c830b 100644 --- a/ios/TestApp/TestApp/ViewController.mm +++ b/ios/TestApp/TestApp/ViewController.mm @@ -4,4 +4,9 @@ @interface ViewController () @end @implementation ViewController + +- (void)viewDidLoad { + [super viewDidLoad]; +} + @end diff --git a/ios/TestApp/TestAppTests/TestLiteInterpreter.mm b/ios/TestApp/TestAppTests/TestLiteInterpreter.mm index f35642a148e3b3..37c8692b9980ae 100644 --- a/ios/TestApp/TestAppTests/TestLiteInterpreter.mm +++ b/ios/TestApp/TestAppTests/TestLiteInterpreter.mm @@ -11,8 +11,8 @@ @interface TestAppTests : XCTestCase @implementation TestAppTests { } -- (void)testLiteInterpreter { - NSString* modelPath = [[NSBundle bundleForClass:[self class]] pathForResource:@"model_lite" +- (void)testCoreML { + NSString* modelPath = [[NSBundle bundleForClass:[self class]] pathForResource:@"model_coreml" ofType:@"ptl"]; auto module = torch::jit::_load_for_mobile(modelPath.UTF8String); c10::InferenceMode mode; @@ -21,14 +21,173 @@ - (void)testLiteInterpreter { XCTAssertTrue(outputTensor.numel() == 1000); } -- (void)testCoreML { - NSString* modelPath = [[NSBundle bundleForClass:[self class]] pathForResource:@"model_coreml" +- (void)testModel:(NSString*)filename { + // model generated using the current pytorch revision + [self runModel:[NSString stringWithFormat:@"%@_temp", filename]]; + // model generated using older pyotrch revision + [self runModel:filename]; +} + +- (void)runModel:(NSString*)filename { + NSString* modelPath = [[NSBundle 
bundleForClass:[self class]] pathForResource:filename ofType:@"ptl"]; - auto module = torch::jit::_load_for_mobile(modelPath.UTF8String); + XCTAssertNotNil(modelPath); c10::InferenceMode mode; - auto input = torch::ones({1, 3, 224, 224}, at::kFloat); - auto outputTensor = module.forward({input}).toTensor(); - XCTAssertTrue(outputTensor.numel() == 1000); + auto module = torch::jit::_load_for_mobile(modelPath.UTF8String); + auto has_bundled_input = module.find_method("get_all_bundled_inputs"); + if (has_bundled_input) { + c10::IValue bundled_inputs = module.run_method("get_all_bundled_inputs"); + c10::List all_inputs = bundled_inputs.toList(); + std::vector> inputs; + for (at::IValue input : all_inputs) { + inputs.push_back(input.toTupleRef().elements()); + } + // run with the first bundled input + XCTAssertNoThrow(module.forward(inputs[0])); + } else { + XCTAssertNoThrow(module.forward({})); + } +} + +// TODO remove this once updated test script +- (void)testLiteInterpreter { + XCTAssertTrue(true); +} + +- (void)testMobileNetV2 { + [self testModel:@"mobilenet_v2"]; +} + +- (void)testPointwiseOps { + [self testModel:@"pointwise_ops"]; +} + +- (void)testReductionOps { + [self testModel:@"reduction_ops"]; +} + +- (void)testComparisonOps { + [self testModel:@"comparison_ops"]; +} + +- (void)testOtherMathOps { + [self testModel:@"other_math_ops"]; +} + +- (void)testSpectralOps { + [self testModel:@"spectral_ops"]; +} + +- (void)testBlasLapackOps { + [self testModel:@"blas_lapack_ops"]; +} + +- (void)testSamplingOps { + [self testModel:@"sampling_ops"]; +} + +- (void)testTensorOps { + [self testModel:@"tensor_general_ops"]; +} + +- (void)testTensorCreationOps { + [self testModel:@"tensor_creation_ops"]; +} + +- (void)testTensorIndexingOps { + [self testModel:@"tensor_indexing_ops"]; +} + +- (void)testTensorTypingOps { + [self testModel:@"tensor_typing_ops"]; +} + +- (void)testTensorViewOps { + [self testModel:@"tensor_view_ops"]; +} + +- (void)testConvolutionOps { + [self testModel:@"convolution_ops"]; +} + +- (void)testPoolingOps { + [self testModel:@"pooling_ops"]; +} + +- (void)testPaddingOps { + [self testModel:@"padding_ops"]; +} + +- (void)testActivationOps { + [self testModel:@"activation_ops"]; +} + +- (void)testNormalizationOps { + [self testModel:@"normalization_ops"]; +} + +- (void)testRecurrentOps { + [self testModel:@"recurrent_ops"]; +} + +- (void)testTransformerOps { + [self testModel:@"transformer_ops"]; +} + +- (void)testLinearOps { + [self testModel:@"linear_ops"]; +} + +- (void)testDropoutOps { + [self testModel:@"dropout_ops"]; +} + +- (void)testSparseOps { + [self testModel:@"sparse_ops"]; +} + +- (void)testDistanceFunctionOps { + [self testModel:@"distance_function_ops"]; +} + +- (void)testLossFunctionOps { + [self testModel:@"loss_function_ops"]; +} + +- (void)testVisionFunctionOps { + [self testModel:@"vision_function_ops"]; +} + +- (void)testShuffleOps { + [self testModel:@"shuffle_ops"]; +} + +- (void)testNNUtilsOps { + [self testModel:@"nn_utils_ops"]; +} + +- (void)testQuantOps { + [self testModel:@"general_quant_ops"]; +} + +- (void)testDynamicQuantOps { + [self testModel:@"dynamic_quant_ops"]; +} + +- (void)testStaticQuantOps { + [self testModel:@"static_quant_ops"]; +} + +- (void)testFusedQuantOps { + [self testModel:@"fused_quant_ops"]; +} + +- (void)testTorchScriptBuiltinQuantOps { + [self testModel:@"torchscript_builtin_ops"]; +} + +- (void)testTorchScriptCollectionQuantOps { + [self testModel:@"torchscript_collection_ops"]; } @end diff --git 
a/ios/TestApp/models/activation_ops.ptl b/ios/TestApp/models/activation_ops.ptl new file mode 100644 index 00000000000000..44673efd446e98 Binary files /dev/null and b/ios/TestApp/models/activation_ops.ptl differ diff --git a/ios/TestApp/models/android_api_module.ptl b/ios/TestApp/models/android_api_module.ptl new file mode 100644 index 00000000000000..df62dd86208811 Binary files /dev/null and b/ios/TestApp/models/android_api_module.ptl differ diff --git a/ios/TestApp/models/blas_lapack_ops.ptl b/ios/TestApp/models/blas_lapack_ops.ptl new file mode 100644 index 00000000000000..fea933ee644fd4 Binary files /dev/null and b/ios/TestApp/models/blas_lapack_ops.ptl differ diff --git a/ios/TestApp/models/comparison_ops.ptl b/ios/TestApp/models/comparison_ops.ptl new file mode 100644 index 00000000000000..01b1c153e7515a Binary files /dev/null and b/ios/TestApp/models/comparison_ops.ptl differ diff --git a/ios/TestApp/models/convolution_ops.ptl b/ios/TestApp/models/convolution_ops.ptl new file mode 100644 index 00000000000000..de776834eb7704 Binary files /dev/null and b/ios/TestApp/models/convolution_ops.ptl differ diff --git a/ios/TestApp/models/distance_function_ops.ptl b/ios/TestApp/models/distance_function_ops.ptl new file mode 100644 index 00000000000000..cc4d994f440a4d Binary files /dev/null and b/ios/TestApp/models/distance_function_ops.ptl differ diff --git a/ios/TestApp/models/dropout_ops.ptl b/ios/TestApp/models/dropout_ops.ptl new file mode 100644 index 00000000000000..422c2f60e6be25 Binary files /dev/null and b/ios/TestApp/models/dropout_ops.ptl differ diff --git a/ios/TestApp/models/dynamic_quant_ops.ptl b/ios/TestApp/models/dynamic_quant_ops.ptl new file mode 100644 index 00000000000000..573dee91f07b20 Binary files /dev/null and b/ios/TestApp/models/dynamic_quant_ops.ptl differ diff --git a/ios/TestApp/models/fused_quant_ops.ptl b/ios/TestApp/models/fused_quant_ops.ptl new file mode 100644 index 00000000000000..d24e3d8d4caa3f Binary files /dev/null and b/ios/TestApp/models/fused_quant_ops.ptl differ diff --git a/ios/TestApp/models/general_quant_ops.ptl b/ios/TestApp/models/general_quant_ops.ptl new file mode 100644 index 00000000000000..5254d33b4794d9 Binary files /dev/null and b/ios/TestApp/models/general_quant_ops.ptl differ diff --git a/ios/TestApp/models/linear_ops.ptl b/ios/TestApp/models/linear_ops.ptl new file mode 100644 index 00000000000000..36915823843cf9 Binary files /dev/null and b/ios/TestApp/models/linear_ops.ptl differ diff --git a/ios/TestApp/models/loss_function_ops.ptl b/ios/TestApp/models/loss_function_ops.ptl new file mode 100644 index 00000000000000..4c0592e5485afa Binary files /dev/null and b/ios/TestApp/models/loss_function_ops.ptl differ diff --git a/ios/TestApp/models/mobilenet_v2.ptl b/ios/TestApp/models/mobilenet_v2.ptl new file mode 100644 index 00000000000000..b034aaf8c8020e Binary files /dev/null and b/ios/TestApp/models/mobilenet_v2.ptl differ diff --git a/ios/TestApp/models/model_coreml.ptl b/ios/TestApp/models/model_coreml.ptl new file mode 100644 index 00000000000000..1f2271b365f3c0 Binary files /dev/null and b/ios/TestApp/models/model_coreml.ptl differ diff --git a/ios/TestApp/models/model_lite.ptl b/ios/TestApp/models/model_lite.ptl new file mode 100644 index 00000000000000..9aef3bd6b54663 Binary files /dev/null and b/ios/TestApp/models/model_lite.ptl differ diff --git a/ios/TestApp/models/nn_utils_ops.ptl b/ios/TestApp/models/nn_utils_ops.ptl new file mode 100644 index 00000000000000..726b200a67d161 Binary files /dev/null and 
b/ios/TestApp/models/nn_utils_ops.ptl differ diff --git a/ios/TestApp/models/normalization_ops.ptl b/ios/TestApp/models/normalization_ops.ptl new file mode 100644 index 00000000000000..1846009a3b7239 Binary files /dev/null and b/ios/TestApp/models/normalization_ops.ptl differ diff --git a/ios/TestApp/models/other_math_ops.ptl b/ios/TestApp/models/other_math_ops.ptl new file mode 100644 index 00000000000000..7209c3b3bd1fdd Binary files /dev/null and b/ios/TestApp/models/other_math_ops.ptl differ diff --git a/ios/TestApp/models/padding_ops.ptl b/ios/TestApp/models/padding_ops.ptl new file mode 100644 index 00000000000000..4af0418f11a679 Binary files /dev/null and b/ios/TestApp/models/padding_ops.ptl differ diff --git a/ios/TestApp/models/pointwise_ops.ptl b/ios/TestApp/models/pointwise_ops.ptl new file mode 100644 index 00000000000000..948ed4832660ae Binary files /dev/null and b/ios/TestApp/models/pointwise_ops.ptl differ diff --git a/ios/TestApp/models/pooling_ops.ptl b/ios/TestApp/models/pooling_ops.ptl new file mode 100644 index 00000000000000..4b98f1971ee54c Binary files /dev/null and b/ios/TestApp/models/pooling_ops.ptl differ diff --git a/ios/TestApp/models/recurrent_ops.ptl b/ios/TestApp/models/recurrent_ops.ptl new file mode 100644 index 00000000000000..10804040be8479 Binary files /dev/null and b/ios/TestApp/models/recurrent_ops.ptl differ diff --git a/ios/TestApp/models/reduction_ops.ptl b/ios/TestApp/models/reduction_ops.ptl new file mode 100644 index 00000000000000..13771302c66802 Binary files /dev/null and b/ios/TestApp/models/reduction_ops.ptl differ diff --git a/ios/TestApp/models/sampling_ops.ptl b/ios/TestApp/models/sampling_ops.ptl new file mode 100644 index 00000000000000..416be7cb127953 Binary files /dev/null and b/ios/TestApp/models/sampling_ops.ptl differ diff --git a/ios/TestApp/models/shuffle_ops.ptl b/ios/TestApp/models/shuffle_ops.ptl new file mode 100644 index 00000000000000..5e5520118764ef Binary files /dev/null and b/ios/TestApp/models/shuffle_ops.ptl differ diff --git a/ios/TestApp/models/sparse_ops.ptl b/ios/TestApp/models/sparse_ops.ptl new file mode 100644 index 00000000000000..a16f68f8f95ff8 Binary files /dev/null and b/ios/TestApp/models/sparse_ops.ptl differ diff --git a/ios/TestApp/models/spectral_ops.ptl b/ios/TestApp/models/spectral_ops.ptl new file mode 100644 index 00000000000000..9828dd2ba9013a Binary files /dev/null and b/ios/TestApp/models/spectral_ops.ptl differ diff --git a/ios/TestApp/models/static_quant_ops.ptl b/ios/TestApp/models/static_quant_ops.ptl new file mode 100644 index 00000000000000..f0f0a09b832db2 Binary files /dev/null and b/ios/TestApp/models/static_quant_ops.ptl differ diff --git a/ios/TestApp/models/tensor_creation_ops.ptl b/ios/TestApp/models/tensor_creation_ops.ptl new file mode 100644 index 00000000000000..d897b43cd36ca9 Binary files /dev/null and b/ios/TestApp/models/tensor_creation_ops.ptl differ diff --git a/ios/TestApp/models/tensor_general_ops.ptl b/ios/TestApp/models/tensor_general_ops.ptl new file mode 100644 index 00000000000000..6f2855ea83eaa5 Binary files /dev/null and b/ios/TestApp/models/tensor_general_ops.ptl differ diff --git a/ios/TestApp/models/tensor_indexing_ops.ptl b/ios/TestApp/models/tensor_indexing_ops.ptl new file mode 100644 index 00000000000000..ac9cb8c4b94add Binary files /dev/null and b/ios/TestApp/models/tensor_indexing_ops.ptl differ diff --git a/ios/TestApp/models/tensor_typing_ops.ptl b/ios/TestApp/models/tensor_typing_ops.ptl new file mode 100644 index 00000000000000..3e2f4d8cc68922 Binary files 
/dev/null and b/ios/TestApp/models/tensor_typing_ops.ptl differ diff --git a/ios/TestApp/models/tensor_view_ops.ptl b/ios/TestApp/models/tensor_view_ops.ptl new file mode 100644 index 00000000000000..5e2dc829484265 Binary files /dev/null and b/ios/TestApp/models/tensor_view_ops.ptl differ diff --git a/ios/TestApp/models/torchscript_builtin_ops.ptl b/ios/TestApp/models/torchscript_builtin_ops.ptl new file mode 100644 index 00000000000000..2d2532df2fd257 Binary files /dev/null and b/ios/TestApp/models/torchscript_builtin_ops.ptl differ diff --git a/ios/TestApp/models/torchscript_collection_ops.ptl b/ios/TestApp/models/torchscript_collection_ops.ptl new file mode 100644 index 00000000000000..ce434b3b4210d5 Binary files /dev/null and b/ios/TestApp/models/torchscript_collection_ops.ptl differ diff --git a/ios/TestApp/models/transformer_ops.ptl b/ios/TestApp/models/transformer_ops.ptl new file mode 100644 index 00000000000000..4546569cd7fd99 Binary files /dev/null and b/ios/TestApp/models/transformer_ops.ptl differ diff --git a/ios/TestApp/models/vision_function_ops.ptl b/ios/TestApp/models/vision_function_ops.ptl new file mode 100644 index 00000000000000..e1f8c39c78abd9 Binary files /dev/null and b/ios/TestApp/models/vision_function_ops.ptl differ diff --git a/modules/observers/perf_observer.cc b/modules/observers/perf_observer.cc index bdee55daf1792e..cfd6130f7255e3 100644 --- a/modules/observers/perf_observer.cc +++ b/modules/observers/perf_observer.cc @@ -195,7 +195,7 @@ void PerfNetObserver::Start() { int skipIters = ObserverConfig::getSkipIters(); int sampleRate = visitCount > 0 ? netFollowupSampleRate : netInitSampleRate; // NOLINTNEXTLINE(clang-analyzer-security.insecureAPI.rand) - if (skipIters <= numRuns_ && sampleRate > 0 && rand() % sampleRate == 0) { + if (skipIters <= static_cast(numRuns_) && sampleRate > 0 && rand() % sampleRate == 0) { visitCount++; if (visitCount == netFollowupSampleCount) { visitCount = 0; @@ -238,9 +238,9 @@ void PerfNetObserver::Stop() { if (logType_ == PerfNetObserver::OPERATOR_DELAY) { const auto& operators = subject_->GetOperators(); - for (int idx = 0; idx < operators.size(); ++idx) { + for (unsigned idx = 0; idx < operators.size(); ++idx) { const auto* op = operators[idx]; - auto name = getObserverName(op, idx); + auto name = getObserverName(op, static_cast(idx)); PerformanceInformation p; const PerfOperatorObserver* opObserver = static_cast(observerMap_[op]); diff --git a/mypy.ini b/mypy.ini index a3ec144806e48e..61442c1a7d697a 100644 --- a/mypy.ini +++ b/mypy.ini @@ -41,7 +41,7 @@ files = # # `exclude` is a regex, not a list of paths like `files` (sigh) # -exclude = torch/include/|torch/csrc/|torch/distributed/elastic/agent/server/api.py|torch/testing/_internal +exclude = torch/include/|torch/csrc/|torch/distributed/elastic/agent/server/api.py|torch/testing/_internal|torch/distributed/fsdp/fully_sharded_data_parallel.py # Minimum version supported - variable annotations were introduced # in Python 3.7 diff --git a/mypy_plugins/check_mypy_version.py b/mypy_plugins/check_mypy_version.py index 02a02a60b9501d..a34b8683c989e0 100644 --- a/mypy_plugins/check_mypy_version.py +++ b/mypy_plugins/check_mypy_version.py @@ -9,7 +9,7 @@ def get_correct_mypy_version(): # there's probably a more elegant way to do this match, = re.finditer( r'mypy==(\d+(?:\.\d+)*)', - Path('.circleci/docker/common/install_conda.sh').read_text(), + Path('.circleci/docker/requirements-ci.txt').read_text(), ) version, = match.groups() return version diff --git a/related_commits 
b/related_commits index 203ce97c0eb4be..32d7bc42104d2c 100644 --- a/related_commits +++ b/related_commits @@ -1,4 +1,4 @@ ubuntu|pytorch|apex|master|none|https://github.com/ROCmSoftwarePlatform/apex centos|pytorch|apex|master|none|https://github.com/ROCmSoftwarePlatform/apex -ubuntu|pytorch|torchvision|main|d8654bb0d84fd2ba8b42cd58d881523821a6214c|https://github.com/pytorch/vision -centos|pytorch|torchvision|main|d8654bb0d84fd2ba8b42cd58d881523821a6214c|https://github.com/pytorch/vision +ubuntu|pytorch|torchvision|main|f5afae50bc8e99b873e2345bcda2dedfc863a737|https://github.com/pytorch/vision +centos|pytorch|torchvision|main|f5afae50bc8e99b873e2345bcda2dedfc863a737|https://github.com/pytorch/vision diff --git a/scripts/jit/log_extract.py b/scripts/jit/log_extract.py index de9f983745c542..61e3172fe0b360 100644 --- a/scripts/jit/log_extract.py +++ b/scripts/jit/log_extract.py @@ -1,132 +1,45 @@ -from contextlib import contextmanager -from torch.testing import make_tensor -from typing import Any, List, Tuple import argparse -import torch +import functools +import traceback +from torch.utils.jit.log_extract import extract_ir, load_graph_and_inputs, run_baseline_no_fusion, run_nnc, run_nvfuser +from typing import List, Tuple, Callable, Optional ''' Usage: 1. Run your script and pipe into a log file PYTORCH_JIT_LOG_LEVEL=">>graph_fuser" python3 my_test.py &> log.txt 2. Run log_extract: - log_extract.py log.txt --nvfuser + log_extract.py log.txt --nvfuser --nnc-dynamic --nnc-static You can also extract the list of extracted IR: log_extract.py log.txt --output + +Passing in --graphs 0 2 will only run graphs 0 and 2 ''' -def extract_ir(filename: str) -> List[str]: - BEGIN = "" - END = "" - pfx = None - current = "" - graphs = [] - with open(filename, "r") as f: - split_strs = f.read().split(BEGIN) - for i, split_str in enumerate(split_strs): - if i == 0: - continue - end_loc = split_str.find(END) - if end_loc == -1: - continue - s = split_str[:end_loc] - pfx = split_strs[i - 1].splitlines()[-1] - lines = [x[len(pfx):] for x in s.splitlines(keepends=True)] - graphs.append(''.join(lines)) - - return graphs - - -def make_tensor_from_type(inp_type: torch._C.TensorType): - if inp_type.requires_grad() is not False: - raise NotImplementedError("Tensors with requires_grad are not implemented") - return make_tensor( - inp_type.sizes(), - dtype=inp_type.dtype(), - device=inp_type.device()) - - -def load_graph_and_inputs(ir: str) -> Tuple[Any, List[Any]]: - graph = torch._C.parse_ir(ir) - graph.makeMultiOutputIntoTuple() - inputs = [] - for inp in graph.inputs(): - if isinstance(inp.type(), torch._C.FloatType): - inputs.append(.5) - elif isinstance(inp.type(), torch._C.IntType): - inputs.append(2) - elif isinstance(inp.type(), torch._C.TensorType): - inputs.append(make_tensor_from_type(inp.type())) - else: - raise NotImplementedError(f"A default value is not implemented for type {inp.type()}") - - func = torch._C._create_function_from_graph("forward", graph) - torch._C._jit_pass_erase_shape_information(func.graph) - return (func, inputs) - - -# TODO add support for timing on CPU -def run_test(ir, inputs, *, warmup_runs=10, test_runs=20) -> float: - graph, _ = load_graph_and_inputs(ir) - for _ in range(warmup_runs): - graph(*inputs) - - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) - torch.cuda.synchronize() - start_event.record() - torch.cuda.synchronize() - for i in range(test_runs): - graph(*inputs) - torch.cuda.synchronize() - end_event.record() - 
torch.cuda.synchronize() - return start_event.elapsed_time(end_event) / test_runs - - -@contextmanager -def no_fuser(*args, **kwargs): - old_cpu_fuse = torch._C._jit_can_fuse_on_cpu() - old_gpu_fuse = torch._C._jit_can_fuse_on_gpu() - old_texpr_fuser_state = torch._C._jit_texpr_fuser_enabled() - old_nvfuser_state = torch._C._jit_nvfuser_enabled() - - torch._C._jit_override_can_fuse_on_cpu(False) - torch._C._jit_override_can_fuse_on_gpu(False) - torch._C._jit_set_texpr_fuser_enabled(False) - torch._C._jit_set_nvfuser_enabled(False) - - try: - yield - finally: - torch._C._jit_override_can_fuse_on_cpu(old_cpu_fuse) - torch._C._jit_override_can_fuse_on_gpu(old_gpu_fuse) - torch._C._jit_set_texpr_fuser_enabled(old_texpr_fuser_state) - torch._C._jit_set_nvfuser_enabled(old_nvfuser_state) - - -def run_baseline_no_fusion(ir, inputs) -> float: - with no_fuser(): - return run_test(ir, inputs) - - -def run_nnc(ir, inputs) -> float: - with torch.jit.fuser("fuser1"): - return run_test(ir, inputs) - - -def run_nvfuser(ir, inputs) -> float: - with torch.jit.fuser("fuser2"): - return run_test(ir, inputs) - - -def test_nvfuser(graphs: List[str], baseline_fn, nvfuser_fn): + +def test_runners(graphs: List[str], runners: List[Tuple[str, Callable]], graph_set: Optional[List[int]]): for i, ir in enumerate(graphs): _, inputs = load_graph_and_inputs(ir) - baseline = baseline_fn(ir, inputs) - nvfuser = nvfuser_fn(ir, inputs) - improvement = (baseline / nvfuser - 1) * 100 - print(f" Graph {i}; baseline: {baseline:.2f} ms; nvfuser: {nvfuser:.2f} ms; improvement: {improvement:.2f}%") + if graph_set and i not in graph_set: + continue + + print(f"Running Graph {i}") + prev_result = None + prev_runner_name = None + for runner in runners: + runner_name, runner_fn = runner + try: + result = runner_fn(ir, inputs) + if prev_result: + improvement = (prev_result / result - 1) * 100 + print(f"{runner_name} : {result:.6f} ms improvement over {prev_runner_name}: improvement: {improvement:.2f}%") + else: + print(f"{runner_name} : {result:.6f} ms") + prev_result = result + prev_runner_name = runner_name + except RuntimeError: + print(f" Graph {i} failed for {runner_name} :", traceback.format_exc()) def run(): @@ -134,30 +47,56 @@ def run(): description="Extracts torchscript IR from log files and, optionally, benchmarks it or outputs the IR" ) parser.add_argument("filename", help="Filename of log file") - parser.add_argument("--nvfuser", dest="nvfuser", action="store_true", help="benchmark nvfuser against no fusion") - parser.add_argument("--no-nvfuser", dest="nvfuser", action="store_false", help="DON'T benchmark nvfuser against no fusion") + parser.add_argument("--nvfuser", dest="nvfuser", action="store_true", help="benchmark nvfuser") + parser.add_argument("--no-nvfuser", dest="nvfuser", action="store_false", help="DON'T benchmark nvfuser") parser.set_defaults(nvfuser=False) - parser.add_argument("--nvfuser-nnc", dest="nvfuser_nnc", action="store_true", help="benchmark nvfuser against nnc") - parser.add_argument("--no-nvfuser-nnc", dest="nvfuser_nnc", action="store_false", help="DON'T benchmark nvfuser against nnc") - parser.set_defaults(nvfuser_nnc=False) + parser.add_argument("--nnc-static", dest="nnc_static", action="store_true", help="benchmark nnc static") + parser.add_argument("--no-nnc-static", dest="nnc_static", action="store_false", help="DON'T benchmark nnc static") + parser.set_defaults(nnc_static=False) + + parser.add_argument("--nnc-dynamic", dest="nnc_dynamic", action="store_true", help="nnc with dynamic shapes") + 
parser.add_argument( + "--no-nnc-dynamic", + dest="nnc_dynamic", + action="store_false", + help="DONT't benchmark nnc with dynamic shapes") + parser.set_defaults(nnc_dynamic=False) + + + parser.add_argument("--baseline", dest="baseline", action="store_true", help="benchmark baseline") + parser.add_argument("--no-baseline", dest="baseline", action="store_false", help="DON'T benchmark baseline") + parser.set_defaults(baseline=False) + parser.add_argument("--output", dest="output", action="store_true", help="Output graph IR") parser.add_argument("--no-output", dest="output", action="store_false", help="DON'T output graph IR") parser.set_defaults(output=False) + parser.add_argument('--graphs', nargs="+", type=int, help="Run only specified graph indices") + + args = parser.parse_args() graphs = extract_ir(args.filename) + graph_set = args.graphs + graph_set = graph_set if graph_set else None + + options = [] + if args.baseline: + options.append(("Baseline no fusion", run_baseline_no_fusion)) + if args.nnc_dynamic: + options.append(("NNC Dynamic", functools.partial(run_nnc, dynamic=True))) + if args.nnc_static: + options.append(("NNC Static", functools.partial(run_nnc, dynamic=False))) if args.nvfuser: - print("NVFuser vs no fusion:") - test_nvfuser(graphs, run_baseline_no_fusion, run_nvfuser) + options.append(("NVFuser", run_nvfuser)) - if args.nvfuser_nnc: - print("NVFuser vs NNC:") - test_nvfuser(graphs, run_nnc, run_nvfuser) + test_runners(graphs, options, graph_set) if args.output: quoted = [] - for ir in graphs: + for i, ir in enumerate(graphs): + if graph_set and i not in graph_set: + continue quoted.append("\"\"\"" + ir + "\"\"\"") print("[" + ", ".join(quoted) + "]") diff --git a/scripts/onnx/test.sh b/scripts/onnx/test.sh index 3b39f600587668..dbeb6b2b27f5f3 100755 --- a/scripts/onnx/test.sh +++ b/scripts/onnx/test.sh @@ -69,7 +69,7 @@ if [[ "$BUILD_ENVIRONMENT" == *ort_test1* || "${SHARD_NUMBER}" == "1" ]]; then pytest "${args[@]}" \ "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset7" \ "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset8" \ - "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime" \ + "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset9" \ "$top_dir/test/onnx/test_custom_ops.py" \ "$top_dir/test/onnx/test_models_onnxruntime.py" \ "$top_dir/test/onnx/test_utility_funs.py" \ diff --git a/scripts/release/cut-release-branch.sh b/scripts/release/cut-release-branch.sh new file mode 100644 index 00000000000000..468dbfb184d941 --- /dev/null +++ b/scripts/release/cut-release-branch.sh @@ -0,0 +1,49 @@ +#!/usr/bin/env bash + +: ' +So you are looking to cut a release branch? Well you came +to the right script. + +This script can be used to cut any branch on any repository + +For `pytorch/pytorch` usage would be like: +> DRY_RUN=disabled cut-release-branch.sh + +For `pytorch/builder` or domains usage would be like: +> DRY_RUN=disabled GIT_BRANCH_TO_CUT_FROM=main RELEASE_VERSION=1.11 cut-release-branch.sh +' + +set -eou pipefail + +GIT_TOP_DIR=$(git rev-parse --show-toplevel) +GIT_REMOTE=${GIT_REMOTE:-origin} +GIT_BRANCH_TO_CUT_FROM=${GIT_BRANCH_TO_CUT_FROM:-viable/strict} + +# should output something like 1.11 +RELEASE_VERSION=${RELEASE_VERSION:-$(cut -d'.' 
-f1-2 "${GIT_TOP_DIR}/version.txt")} + +DRY_RUN_FLAG="--dry-run" +if [[ ${DRY_RUN:-enabled} == "disabled" ]]; then + DRY_RUN_FLAG="" +fi + + +( + set -x + git fetch --all + git checkout "${GIT_REMOTE}/${GIT_BRANCH_TO_CUT_FROM}" +) + +for branch in "release/${RELEASE_VERSION}" "orig/release/${RELEASE_VERSION}"; do + if git rev-parse --verify "${branch}" >/dev/null 2>/dev/null; then + echo "+ Branch ${branch} already exists, skipping..." + continue + else + ( + set -x + git checkout "${GIT_REMOTE}/${GIT_BRANCH_TO_CUT_FROM}" + git checkout -b "${branch}" + git push "${GIT_REMOTE}" "${branch}" + ) + fi +done diff --git a/setup.py b/setup.py index 8024cb53b63cb0..dee5b369dc5ad1 100644 --- a/setup.py +++ b/setup.py @@ -50,6 +50,9 @@ # MKLDNN_CPU_RUNTIME # MKL-DNN threading mode: TBB or OMP (default) # +# USE_STATIC_MKL +# Prefer to link with MKL statically - Unix only +# # USE_NNPACK=0 # disables NNPACK build # @@ -821,7 +824,16 @@ def make_relative_rpath_args(path): include_dirs=[], library_dirs=library_dirs, extra_link_args=extra_link_args + main_link_args + make_relative_rpath_args('lib')) + C_flatbuffer = Extension("torch._C_flatbuffer", + libraries=main_libraries, + sources=["torch/csrc/stub_with_flatbuffer.c"], + language='c', + extra_compile_args=main_compile_args + extra_compile_args, + include_dirs=[], + library_dirs=library_dirs, + extra_link_args=extra_link_args + main_link_args + make_relative_rpath_args('lib')) extensions.append(C) + extensions.append(C_flatbuffer) if not IS_WINDOWS: DL = Extension("torch._dl", @@ -929,6 +941,7 @@ def print_box(msg): 'bin/*', 'test/*', '_C/*.pyi', + '_C_flatbuffer/*.pyi', 'cuda/*.pyi', 'optim/*.pyi', 'autograd/*.pyi', @@ -936,6 +949,7 @@ def print_box(msg): 'nn/*.pyi', 'nn/modules/*.pyi', 'nn/parallel/*.pyi', + 'utils/data/*.pyi', 'lib/*.so*', 'lib/*.dylib*', 'lib/*.dll', @@ -1015,7 +1029,8 @@ def print_box(msg): 'include/torch/csrc/autograd/utils/*.h', 'include/torch/csrc/cuda/*.h', 'include/torch/csrc/deploy/*.h', - 'include/torch/csrc/deploy/interpreter/interpreter_impl.h', + 'include/torch/csrc/deploy/interpreter/*.h', + 'include/torch/csrc/deploy/interpreter/*.hpp', 'include/torch/csrc/distributed/c10d/exception.h', 'include/torch/csrc/jit/*.h', 'include/torch/csrc/jit/backends/*.h', @@ -1036,6 +1051,7 @@ def print_box(msg): 'include/torch/csrc/profiler/*.h', 'include/torch/csrc/utils/*.h', 'include/torch/csrc/tensor/*.h', + 'include/torch/csrc/lazy/backend/*.h', 'include/torch/csrc/lazy/core/*.h', 'include/pybind11/*.h', 'include/pybind11/detail/*.h', diff --git a/test/ao/sparsity/test_composability.py b/test/ao/sparsity/test_composability.py new file mode 100644 index 00000000000000..b44c885507740e --- /dev/null +++ b/test/ao/sparsity/test_composability.py @@ -0,0 +1,304 @@ +# -*- coding: utf-8 -*- +# Owner(s): ["module: unknown"] + + +import logging + +import torch +import torch.ao.quantization as tq +from torch import nn +from torch.ao import sparsity +from torch.testing._internal.common_utils import TestCase + +logging.basicConfig( + format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO +) + +sparse_defaults = { + "sparsity_level": 0.8, + "sparse_block_shape": (1, 4), + "zeros_per_block": 4, +} + +# This series of tests are to check the composability goals for sparsity and quantization. 
Namely +# that performing quantization and sparsity model manipulations in various orderings +# does not cause problems +class TestComposability(TestCase): + def _get_model_and_sparsifier_and_sparse_config(self, qconfig=None): + model = nn.Sequential( + nn.Linear(4, 4), # 0 + nn.ReLU(), + nn.Linear(4, 4), # 2 + nn.ReLU(), + tq.QuantStub(), + nn.Linear(4, 4), # 5 + nn.ReLU(), + tq.DeQuantStub(), + ) + if qconfig is None: + model[4].qconfig = tq.get_default_qconfig("fbgemm") + model[5].qconfig = tq.get_default_qconfig("fbgemm") + else: + model[4].qconfig = qconfig + model[5].qconfig = qconfig + + sparsifier = sparsity.WeightNormSparsifier(**sparse_defaults) + + sparse_config = [ + { + "module": model[5], + "sparsity_level": 0.7, + "sparse_block_shape": (1, 4), + "zeros_per_block": 4, + }, + model[0], + ] + return model, sparsifier, sparse_config + + def _squash_mask_calibrate_and_convert(self, model, sparsifier, input): + sparsifier.step() + sparsifier.squash_mask() + model(input) + tq.convert(model, inplace=True) + + def _calculate_sparsity(self, tensor): + return ((tensor == 0).sum() / tensor.numel()).item() + + # This test checks whether performing quantization prepare before sparse prepare + # causes any issues and verifies that the correct observers are inserted and that + # the quantized model works as expected + def test_q_prep_before_s_prep(self): + ( + mod, + sparsifier, + sparse_config, + ) = self._get_model_and_sparsifier_and_sparse_config() + + tq.prepare(mod, inplace=True) + sparsifier.prepare(mod, config=sparse_config) + + # check that correct modules had parametrizations added + self.assertTrue(hasattr(mod[0], "parametrizations")) + self.assertTrue(hasattr(mod[5], "parametrizations")) + # check that correct observers were inserted + self.assertTrue(hasattr(mod[5], "activation_post_process")) + + self._squash_mask_calibrate_and_convert( + mod, sparsifier, torch.randn(1, 4, 4, 4) + ) + + # check that final module is the expected quantized module and that the model runs + self.assertTrue(isinstance(mod[5], torch.nn.quantized.Linear)) + self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4])) + + # This test checks whether performing sparsity prepare before quantization prepare + # causes any issues. In particular, previous quantization flow was unable to match + # the post sparse prepare module names (adding parametrizations changes the module class names) + # which would result in those parametrized modules not being quantized. This test verifies that + # the fix for this was successful. 
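+    # Roughly, the ordering exercised below is (all names defined in this file;
+    # `mod` is the small nn.Sequential built by the helper above):
+    #
+    #     sparsifier.prepare(mod, config=sparse_config)   # add weight parametrizations
+    #     tq.prepare(mod, inplace=True)                    # insert observers
+    #     sparsifier.step(); sparsifier.squash_mask()      # apply and fold the sparse mask
+    #     mod(torch.randn(1, 4, 4, 4))                     # calibrate observers
+    #     tq.convert(mod, inplace=True)                    # swap in quantized modules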
+    def test_s_prep_before_q_prep(self):
+        (
+            mod,
+            sparsifier,
+            sparse_config,
+        ) = self._get_model_and_sparsifier_and_sparse_config()
+
+        sparsifier.prepare(mod, config=sparse_config)
+        tq.prepare(mod, inplace=True)
+
+        # check that correct modules had parametrizations added and
+        # that none were lost during prepare
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+
+        self._squash_mask_calibrate_and_convert(
+            mod, sparsifier, torch.randn(1, 4, 4, 4)
+        )
+
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.quantized.Linear))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+    # if the sparsified modules have not undergone the final squash mask operation, it's possible
+    # that the problem outlined in test_s_prep_before_q_prep would occur. This test verifies
+    # both that the fix to the convert flow avoids this issue and that the resulting quantized
+    # module uses the sparse version of the weight value.
+    def test_convert_without_squash_mask(self):
+        (
+            mod,
+            sparsifier,
+            sparse_config,
+        ) = self._get_model_and_sparsifier_and_sparse_config()
+
+        sparsifier.prepare(mod, config=sparse_config)
+        tq.prepare(mod, inplace=True)
+
+        # check that correct modules had parametrizations added and
+        # that none were lost during prepare
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+        sparsifier.step()
+        sparsity_level = self._calculate_sparsity(mod[5].weight)
+        mod(torch.randn(1, 4, 4, 4))
+        tq.convert(mod, inplace=True)
+
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.quantized.Linear))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+        # check that module was actually sparsified
+        cur_sparsity = self._calculate_sparsity(mod[5]._weight_bias()[0])
+        self.assertGreaterAlmostEqual(cur_sparsity, sparsity_level)
+        self.assertGreaterAlmostEqual(
+            sparsity_level, sparse_config[0]["sparsity_level"]
+        )
+        self.assertGreaterAlmostEqual(cur_sparsity, sparse_config[0]["sparsity_level"])
+
+    # This tests whether performing sparse prepare before fusion causes any issues. The
+    # worry was that the link created between the sparsifier and the modules that need to
+    # be sparsified would be broken.
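+    # For reference, the fusion step used below folds the Linear/ReLU pair at indices
+    # 5 and 6 into a single module, roughly:
+    #
+    #     tq.fuse_modules(mod, [["5", "6"]], inplace=True)
+    #     # mod[5] is now a fused Linear+ReLU; the original Linear is reachable as mod[5][0]
+    #
+    # so the references held by the sparsifier / sparse config must still resolve to the
+    # wrapped Linear afterwards.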
+    def test_s_prep_before_fusion(self):
+        (
+            mod,
+            sparsifier,
+            sparse_config,
+        ) = self._get_model_and_sparsifier_and_sparse_config()
+        sparsifier.prepare(mod, config=sparse_config)
+        tq.fuse_modules(mod, [["5", "6"]], inplace=True)
+        mod[5].qconfig = tq.get_default_qconfig("fbgemm")
+        tq.prepare(mod, inplace=True)
+
+        # check that correct modules had parametrizations added and
+        # that none were lost during prepare or fusion
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5][0], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+        self._squash_mask_calibrate_and_convert(
+            mod, sparsifier, torch.randn(1, 4, 4, 4)
+        )
+
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.intrinsic.quantized.LinearReLU))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+    # This tests whether performing fusion before sparse prepare causes any issues. The
+    # main worry was that the links to the modules in the sparse config would be broken by fusion.
+    def test_fusion_before_s_prep(self):
+        (
+            mod,
+            sparsifier,
+            sparse_config,
+        ) = self._get_model_and_sparsifier_and_sparse_config()
+        tq.fuse_modules(mod, [["5", "6"]], inplace=True)
+        sparsifier.prepare(mod, config=sparse_config)
+        mod[5].qconfig = tq.get_default_qconfig("fbgemm")
+        tq.prepare(mod, inplace=True)
+
+        # check that correct modules had parametrizations added and
+        # that none were lost during prepare
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5][0], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+        sparsifier.step()
+        sparsity_level = self._calculate_sparsity(mod[5][0].weight)
+        mod(torch.randn(1, 4, 4, 4))
+        tq.convert(mod, inplace=True)
+
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.intrinsic.quantized.LinearReLU))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+        # check that module was actually sparsified
+        cur_sparsity = self._calculate_sparsity(mod[5]._weight_bias()[0])
+        self.assertGreaterAlmostEqual(cur_sparsity, sparsity_level)
+        self.assertGreaterAlmostEqual(
+            sparsity_level, sparse_config[0]["sparsity_level"]
+        )
+        self.assertGreaterAlmostEqual(cur_sparsity, sparse_config[0]["sparsity_level"])
+
+    # This tests whether performing sparse prepare before qat prepare causes issues.
+    # The primary worries were that qat_prep wouldn't recognize the parametrized
+    # modules and that the convert step for qat would remove the parametrizations
+    # from the modules.
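+    # Roughly, the QAT ordering exercised below is:
+    #
+    #     sparsifier.prepare(mod, config=sparse_config)   # parametrize weights
+    #     tq.prepare_qat(mod, inplace=True)               # swap in qat modules (torch.nn.qat.Linear)
+    #     ...calibrate...
+    #     tq.convert(mod, inplace=True)                   # produce torch.nn.quantized.Linear
+    #
+    # and the parametrizations must survive both the qat module swap and the convert step.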
+    def test_s_prep_before_qat_prep(self):
+        (
+            mod,
+            sparsifier,
+            sparse_config,
+        ) = self._get_model_and_sparsifier_and_sparse_config(
+            tq.get_default_qat_qconfig("fbgemm")
+        )
+        sparsifier.prepare(mod, config=sparse_config)
+        tq.prepare_qat(mod, inplace=True)
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+        self.assertTrue(isinstance(mod[5], torch.nn.qat.Linear))
+        self._squash_mask_calibrate_and_convert(
+            mod, sparsifier, torch.randn(1, 4, 4, 4)
+        )
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.quantized.Linear))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+        # check that module was actually sparsified
+        cur_sparsity = self._calculate_sparsity(mod[5]._weight_bias()[0])
+        self.assertGreaterAlmostEqual(cur_sparsity, sparse_config[0]["sparsity_level"])
+
+    # This tests whether performing qat prepare before sparse prepare causes issues.
+    def test_qat_prep_before_s_prep(self):
+        mod, sparsifier, _ = self._get_model_and_sparsifier_and_sparse_config(
+            tq.get_default_qat_qconfig("fbgemm")
+        )
+        tq.prepare_qat(mod, inplace=True)
+
+        # need to set up sparse_config on new modules
+        sparse_config = [
+            {
+                "module": mod[5],
+                "sparsity_level": 0.7,
+                "sparse_block_shape": (1, 4),
+                "zeros_per_block": 4,
+            },
+            mod[0],
+        ]
+        sparsifier.prepare(mod, config=sparse_config)
+
+        # check that correct modules had parametrizations added and
+        # that none were lost during qat prepare
+        self.assertTrue(hasattr(mod[0], "parametrizations"))
+        self.assertTrue(hasattr(mod[5], "parametrizations"))
+
+        # check that correct observers were inserted and that matching
+        # occurred successfully
+        self.assertTrue(hasattr(mod[5], "activation_post_process"))
+        self.assertTrue(isinstance(mod[5], torch.nn.qat.Linear))
+
+        self._squash_mask_calibrate_and_convert(
+            mod, sparsifier, torch.randn(1, 4, 4, 4)
+        )
+
+        # check that final module is the expected quantized module and that the model runs
+        self.assertTrue(isinstance(mod[5], torch.nn.quantized.Linear))
+        self.assertEqual(mod(torch.randn(1, 4, 4, 4)).shape, torch.Size([1, 4, 4, 4]))
+
+        # check that module was actually sparsified
+        cur_sparsity = self._calculate_sparsity(mod[5]._weight_bias()[0])
+        self.assertGreaterAlmostEqual(cur_sparsity, sparse_config[0]["sparsity_level"])
diff --git a/test/ao/sparsity/test_kernels.py b/test/ao/sparsity/test_kernels.py
index 8deec46b4188c1..04a93434599974 100644
--- a/test/ao/sparsity/test_kernels.py
+++ b/test/ao/sparsity/test_kernels.py
@@ -22,6 +22,7 @@
     override_qengines,
     qengine_is_qnnpack,
     qengine_is_fbgemm,
+    qengine_is_onednn,
 )
 
 # TODO: Once more test files are created, move the contents to a ao folder.
@@ -48,6 +49,9 @@ def test_sparse_qlinear(self):
         # to other higher priority works.
if qengine_is_qnnpack() and not (row_block_size == 1 and col_block_size == 4): return + # ONEDNN does not support this yet + if qengine_is_onednn(): + return dense_prepack = torch.ops.quantized.linear_prepack dense_qlinear = torch.ops.quantized.linear @@ -215,6 +219,10 @@ def test_sparse_qlinear(self): Y_hat = sqmodel(X_fp32) self.assertEqual(Y_ref, Y_hat) + # ONEDNN does not support this yet + elif qengine_is_onednn(): + return + row_block_size, col_block_size = sqmodel.linear._packed_params._weight_bias()[2:] assert row_block_size == 1 and col_block_size == 4 diff --git a/test/autograd/test_functional.py b/test/autograd/test_functional.py new file mode 100644 index 00000000000000..3b21be748d8d4e --- /dev/null +++ b/test/autograd/test_functional.py @@ -0,0 +1,1409 @@ +# Owner(s): ["module: autograd"] + +import types +import unittest +import warnings + +import torch +import torch.autograd.functional as autogradF + +from torch.testing._internal.common_cuda import TEST_CUDA +from torch.testing._internal.common_utils import ( + TestCase, run_tests, subtest, gradcheck, gradgradcheck, parametrize, instantiate_parametrized_tests) +from torch.testing._internal.logging_tensor import LoggingTensor + +# Utilities for parametrizing the tensor constructors used in autograd tests +# +# TODO: maybe move somewhere so other tests can also use +# +# NB: Not all factory functions included. A complete(?) list can be found here: +# https://pytorch.org/cppdocs/notes/tensor_creation.html +base_ctors_dict = { + "ones": torch.ones, + "zeros": torch.zeros, + "randn": torch.randn, + "rand": torch.rand, + "tensor": torch.tensor, +} +base_ctors = types.SimpleNamespace(**base_ctors_dict) + +def wrap_with_logging_tensor(ctor): + return lambda *args, **kwargs: LoggingTensor(ctor(*args, **kwargs)) + +logging_tensor_ctors_dict = {k: wrap_with_logging_tensor(ctor) for (k, ctor) in base_ctors_dict.items()} +logging_tensor_ctors = types.SimpleNamespace(**logging_tensor_ctors_dict) + +base_and_logging_tensor = parametrize("ctors", [subtest(base_ctors, name="base_tensor"), + subtest(logging_tensor_ctors, name="logging_tensor")]) + +FIXME_base_and_xfail_logging_tensor = parametrize("ctors", [subtest(base_ctors, name="base_tensor"), + subtest(logging_tensor_ctors, name="logging_tensor", + decorators=[unittest.expectedFailure])]) + +# NB: This is equivalent to having both @parmetrize("vectorized", [True, False]) and +# FIXME_base_and_xfail_logging_tensor, except the non-vectorized logging_tensor case is +# actually expected to succeed +FIXME_xfail_vectorized_logging_tensor = ( + parametrize("vectorize,ctors", [subtest((True, base_ctors), name="vectorized_base_tensor"), + subtest((False, base_ctors), name="base_tensor"), + subtest((True, logging_tensor_ctors), name="vectorized_logging_tensor", + decorators=[unittest.expectedFailure]), + subtest((False, logging_tensor_ctors), name="logging_tensor")])) + + +class TestAutogradFunctional(TestCase): + def _assert_same_struct(self, res, base): + # base and res should be Tensors or tuple of Tensors with the same size + if isinstance(base, torch.Tensor): + self.assertTrue(isinstance(res, torch.Tensor)) + self.assertEqual(base.size(), res.size()) + elif isinstance(base, tuple): + self.assertTrue(isinstance(res, tuple)) + self.assertEqual(len(base), len(res)) + for el_base, el_res in zip(base, res): + self.assertTrue(isinstance(el_base, torch.Tensor)) + self.assertTrue(isinstance(el_res, torch.Tensor)) + self.assertEqual(el_base.size(), el_res.size()) + else: + # Wrong base + raise 
RuntimeError("The base given to `_assert_same_struct` doesn't have" + " the right structure.") + + def _assert_interleaved_struct(self, res, base1, base2): + # base1 and base2 can be Tensors or tuples of Tensors. + # If they are tuples, res should be a tuple as well. + # The indexing works as follows for base1, base2 being + # - tuple, tuple: res[i][j][k][l] = (base1[i][k], base2[j][l]) + # - tuple, Tensor: res[i][k][l] = (base1[i][k], base2[l]) + # - Tensor, tuple: res[i][j][l] = (base1[i], base2[j][l]) + # - Tensor, Tensor: res[k][l] = (base1[k], base2[l]) + if isinstance(base1, torch.Tensor) and isinstance(base2, torch.Tensor): + self.assertTrue(isinstance(res, torch.Tensor)) + self.assertEqual(res.size(), base1.size() + base2.size()) + elif isinstance(base1, tuple) and isinstance(base2, torch.Tensor): + self.assertTrue(isinstance(res, tuple)) + self.assertEqual(len(res), len(base1)) + for el_res, el_base1 in zip(res, base1): + self.assertTrue(isinstance(el_res, torch.Tensor)) + self.assertTrue(isinstance(el_base1, torch.Tensor)) + self.assertEqual(el_res.size(), el_base1.size() + base2.size()) + elif isinstance(base1, torch.Tensor) and isinstance(base2, tuple): + self.assertTrue(isinstance(res, tuple)) + self.assertEqual(len(res), len(base2)) + for el_res, el_base2 in zip(res, base2): + self.assertTrue(isinstance(el_res, torch.Tensor)) + self.assertTrue(isinstance(el_base2, torch.Tensor)) + self.assertEqual(el_res.size(), base1.size() + el_base2.size()) + elif isinstance(base1, tuple) and isinstance(base2, tuple): + self.assertTrue(isinstance(res, tuple)) + self.assertEqual(len(res), len(base1)) + for el_res, el_base1 in zip(res, base1): + self.assertTrue(isinstance(el_res, tuple)) + self.assertEqual(len(res), len(base2)) + for el_el_res, el_base2 in zip(el_res, base2): + self.assertTrue(isinstance(el_el_res, torch.Tensor)) + self.assertTrue(isinstance(el_base2, torch.Tensor)) + self.assertEqual(el_el_res.size(), el_base1.size() + el_base2.size()) + else: + # Wrong bases + raise RuntimeError("The bases given to `_assert_interleaved_struct` don't have" + " the right structure.") + + @base_and_logging_tensor + def test_vjp_err_check(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3) + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + inp = ctors.rand(4) + v = ctors.ones(3) + with self.assertRaisesRegex(TypeError, "The inputs given to vjp must be either a Tensor"): + res = autogradF.vjp(foo, (inp, 2), v) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to vjp must"): + res = autogradF.vjp(bar, inp, v) + + with self.assertRaisesRegex(RuntimeError, "The vector v can only be None if the user-provided function returns"): + res = autogradF.vjp(foo, inp) + + with self.assertRaisesRegex(RuntimeError, "The given v should contain a single Tensor."): + res = autogradF.vjp(foo, inp, (torch.ones_like(inp), torch.ones_like(inp))) + + with self.assertRaisesRegex(RuntimeError, "v has invalid size: should be torch.Size"): + res = autogradF.vjp(foo, inp, v[:2]) + + res = autogradF.vjp(foo, inp, v)[1] + self._assert_same_struct(res, inp) + + @base_and_logging_tensor + def test_vjp_err_check_strict(self, ctors): + def foo(a): + return a.detach() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone() + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): 
+ res = autogradF.vjp(foo, inp, v, strict=True) + res = autogradF.vjp(foo, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): + res = autogradF.vjp(bar, inp, v, strict=True) + res = autogradF.vjp(bar, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + # The Jacobian does not depend on the input + def foo(a): + return a.clone() + + inp.requires_grad_() + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): + res = autogradF.vjp(foo, inp, v, create_graph=True, strict=True) + res = autogradF.vjp(foo, inp, v, create_graph=True, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1], v) + + @base_and_logging_tensor + def test_vjp_no_grad(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(4, 4) + v = ctors.ones(4) + with torch.no_grad(): + res = autogradF.vjp(reducer, inputs, v) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + inputs.requires_grad_() + v.requires_grad_() + with torch.no_grad(): + res = autogradF.vjp(reducer, inputs, v, create_graph=True) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + @base_and_logging_tensor + def test_vjp_output(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(4, 4) + v = ctors.ones(4) + res = autogradF.vjp(reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + + def adder(x, y): + return 2 * x + 3 * y + + inputs = (ctors.rand(2), ctors.rand(2)) + v = ctors.ones(2) + out, vjp_val = autogradF.vjp(adder, inputs, v) + self._assert_same_struct(vjp_val, inputs) + self.assertIsNone(out.grad_fn) + self.assertIsNone(vjp_val[0].grad_fn) + self.assertIsNone(vjp_val[1].grad_fn) + + def adder(x, y): + return 2 * x + 3 * y, x + y + + inputs = (ctors.rand(2), ctors.rand(2)) + v = (ctors.tensor([1., 0.]), ctors.tensor([1., 0.])) + out, vjp_val = autogradF.vjp(adder, inputs, v) + self._assert_same_struct(vjp_val, inputs) + self.assertIsNone(out[0].grad_fn) + self.assertIsNone(out[1].grad_fn) + self.assertIsNone(vjp_val[0].grad_fn) + self.assertIsNone(vjp_val[1].grad_fn) + + @base_and_logging_tensor + def test_vjp_scalar(self, ctors): + def reducer(x): + return x.sum() + inputs = ctors.rand(4, 4) + v = ctors.ones([]) + res = autogradF.vjp(reducer, inputs, v) + self._assert_same_struct(res[0], v) + self._assert_same_struct(res[1], inputs) + + res = autogradF.vjp(reducer, inputs) + self._assert_same_struct(res[0], v) + self._assert_same_struct(res[1], inputs) + + def expander(x): + return x.unsqueeze(0).repeat(4) + inputs = ctors.rand([]) + v = ctors.ones(4) + res = autogradF.vjp(expander, inputs, v) + self._assert_same_struct(res[0], v) + self._assert_same_struct(res[1], inputs) + + @FIXME_base_and_xfail_logging_tensor + def test_vjp_create_graph(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(2, 2, dtype=torch.double) + v = ctors.ones(2, dtype=torch.double) + + inputs.requires_grad_() + v.requires_grad_() + res = autogradF.vjp(reducer, inputs, v, create_graph=True) + self._assert_same_struct(res[1], inputs) + self.assertIsNotNone(res[0].grad_fn) + 
self.assertIsNotNone(res[1].grad_fn) + + gradcheck(lambda inp, v: autogradF.vjp(reducer, inputs, v, create_graph=True), (inputs, v)) + gradgradcheck(lambda inp, v: autogradF.vjp(reducer, inputs, v, create_graph=True), (inputs, v)) + + def adder(x, y): + return 2 * x + 3 * y, x * y + + inputs = (ctors.rand(2, dtype=torch.double, requires_grad=True), + ctors.rand(2, dtype=torch.double, requires_grad=True)) + v = (ctors.tensor([1., 0.], dtype=torch.double, requires_grad=True), + ctors.tensor([1., 0.], dtype=torch.double, requires_grad=True)) + + gradcheck(lambda *args: autogradF.vjp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) + gradgradcheck(lambda *args: autogradF.vjp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) + + def foo(*args): + x, y = args[:2] + v = args[2:] + + x = x.cos() + val, grad = autogradF.vjp(adder, (x, y), v, create_graph=True) + + return val[0].exp() + val[1].exp() + grad[0].exp() + grad[1].exp() + x.exp() + y.exp() + + gradcheck(foo, inputs + v) + gradgradcheck(foo, inputs + v) + + @base_and_logging_tensor + def test_jvp_err_check(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3) + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(TypeError, "The inputs given to jvp must be either a Tensor"): + res = autogradF.jvp(foo, (inp, 2), v) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to jvp must"): + res = autogradF.jvp(bar, inp, v) + + with self.assertRaisesRegex(RuntimeError, "The vector v can only be None if the input to the user-provided function"): + res = autogradF.jvp(foo, inp) + + with self.assertRaisesRegex(RuntimeError, "The given v should contain a single Tensor."): + res = autogradF.jvp(foo, inp, (v, v)) + + with self.assertRaisesRegex(RuntimeError, "v has invalid size: should be torch.Size"): + res = autogradF.jvp(foo, inp, v[:2]) + + res = autogradF.jvp(foo, inp, v)[1] + self._assert_same_struct(res, foo(inp)) + + @base_and_logging_tensor + def test_jvp_err_check_strict(self, ctors): + def foo(a): + return a.detach() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone() + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): + res = autogradF.jvp(foo, inp, v, strict=True) + res = autogradF.jvp(foo, inp, v, strict=False) + self._assert_same_struct(res[1], res[0]) + self.assertEqual(res[1].abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): + res = autogradF.jvp(bar, inp, v, strict=True) + res = autogradF.jvp(bar, inp, v, strict=False) + self._assert_same_struct(res[1], res[0]) + self.assertEqual(res[1].abs().sum(), 0.) 
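+        # (with strict=False the error above is suppressed and the jvp is returned
+        # as zeros, which is why its absolute sum is checked against 0.)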
+ + # The Jacobian does not depend on the input + def foo(a): + return a.clone() + + inp.requires_grad_() + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): + res = autogradF.jvp(foo, inp, v, create_graph=True, strict=True) + res = autogradF.jvp(foo, inp, v, create_graph=True, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1], v) + + @base_and_logging_tensor + def test_jvp_no_grad(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + with torch.no_grad(): + res = autogradF.jvp(reducer, inputs, v) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + inputs.requires_grad_() + v.requires_grad_() + with torch.no_grad(): + res = autogradF.jvp(reducer, inputs, v, create_graph=True) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + @base_and_logging_tensor + def test_jvp_output(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.jvp(reducer, inputs, v) + self._assert_same_struct(res[1], res[0]) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + + def adder(x, y): + return 2 * x + 3 * y + + inputs = (ctors.rand(2), ctors.rand(2)) + v = (ctors.ones(2), ctors.ones(2)) + out, jvp_val = autogradF.jvp(adder, inputs, v) + self._assert_same_struct(jvp_val, out) + self.assertIsNone(out.grad_fn) + self.assertIsNone(jvp_val[0].grad_fn) + self.assertIsNone(jvp_val[1].grad_fn) + + def adder(x, y): + return 2 * x + 3 * y, x + y + + inputs = (ctors.rand(2), ctors.rand(2)) + v = (ctors.tensor([1., 0.]), ctors.tensor([1., 0.])) + out, jvp_val = autogradF.jvp(adder, inputs, v) + self._assert_same_struct(jvp_val, out) + self.assertIsNone(out[0].grad_fn) + self.assertIsNone(out[1].grad_fn) + self.assertIsNone(jvp_val[0].grad_fn) + self.assertIsNone(jvp_val[1].grad_fn) + + @base_and_logging_tensor + def test_jvp_scalar(self, ctors): + def reducer(x): + return x.sum() + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.jvp(reducer, inputs, v) + self._assert_same_struct(res[0], ctors.zeros([])) + self._assert_same_struct(res[1], res[0]) + + def expander(x): + return x.unsqueeze(0).repeat(4) + inputs = ctors.rand([]) + v = ctors.ones([]) + res = autogradF.jvp(expander, inputs, v) + self._assert_same_struct(res[0], ctors.zeros(4)) + self._assert_same_struct(res[1], res[0]) + + res = autogradF.jvp(expander, inputs) + self._assert_same_struct(res[0], ctors.zeros(4)) + self._assert_same_struct(res[1], res[0]) + + @FIXME_base_and_xfail_logging_tensor + def test_jvp_create_graph(self, ctors): + def reducer(x): + return x.sum(dim=1) + inputs = ctors.rand(2, 2, dtype=torch.double) + v = ctors.ones(2, 2, dtype=torch.double) + + inputs.requires_grad_() + v.requires_grad_() + res = autogradF.jvp(reducer, inputs, v, create_graph=True) + self._assert_same_struct(res[1], res[0]) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + + gradcheck(lambda inp, v: autogradF.jvp(reducer, inp, v, create_graph=True), (inputs, v)) + gradgradcheck(lambda inp, v: autogradF.jvp(reducer, inp, v, create_graph=True), (inputs, v)) + + def adder(x, y): + return 2 * x + 3 * y, x * y + + inputs = (ctors.rand(2, dtype=torch.double, requires_grad=True), + ctors.rand(2, dtype=torch.double, requires_grad=True)) + v = 
(ctors.tensor([1., 0.], dtype=torch.double, requires_grad=True), + ctors.tensor([1., 0.], dtype=torch.double, requires_grad=True)) + + gradcheck(lambda *args: autogradF.jvp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) + gradgradcheck(lambda *args: autogradF.jvp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) + + def foo(*args): + x, y = args[:2] + v = args[2:] + + x = x.cos() + val, grad = autogradF.jvp(adder, (x, y), v, create_graph=True) + + return val[0].exp() + val[1].exp() + grad[0].exp() + grad[1].exp() + x.exp() + y.exp() + + gradcheck(foo, inputs + v) + gradgradcheck(foo, inputs + v) + + def _test_construct_standard_basis_for(self, inputs): + numels = tuple(tensor.numel() for tensor in inputs) + results = autogradF._construct_standard_basis_for(inputs, numels) + for result, inp in zip(results, inputs): + self.assertEqual(result.dtype, inp.dtype) + self.assertEqual(result.device, inp.device) + results = torch.cat([result.to(device='cpu', dtype=torch.float) + for result in results], dim=1) + expected = torch.eye(results[0].shape[0], dtype=torch.float) + self.assertEqual(results, expected) + + @base_and_logging_tensor + def test_construct_standard_basis_for(self, ctors): + test_cases = [ + (ctors.randn(2, 3),), + (ctors.randn(1),), + (ctors.randn([]),), + (ctors.randn(1), ctors.randn([]), ctors.randn([])), + (ctors.randn(2), ctors.randn(3), ctors.randn([])), + (ctors.randn(2), ctors.randn([]), ctors.randn(3)), + (ctors.randn(2, 3), ctors.randn(3), ctors.randn(3, 4, 2)), + (ctors.randn(2, dtype=torch.float64), ctors.randn(3, dtype=torch.float32)), + ] + + for inputs in test_cases: + self._test_construct_standard_basis_for(inputs) + + @unittest.skipIf(not TEST_CUDA, "test requires CUDA") + @base_and_logging_tensor + def test_construct_standard_basis_for_cuda(self, ctors): + test_cases = [ + (ctors.randn(2), ctors.randn(3, device='cuda')), + (ctors.randn(3, device='cuda'), ctors.randn(2)), + ] + + for inputs in test_cases: + self._test_construct_standard_basis_for(inputs) + + def _test_vectorize_raises_no_warnings(self, api, ctors): + # vmap is an experimental prototype. When someone calls torch.vmap, + # it raises a python warning. This test checks that + # autogradF.{jacobian, hessian} don't raise that experimental prototype + # warning; it is not nice for a public-facing API to raise a warning + # no matter how it is called. 
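+        # (warnings.catch_warnings(record=True) below collects every warning raised
+        # inside the block into `wa`, so `len(wa) == 0` is the "no experimental vmap
+        # warning leaked to the user" assertion.)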
+ def foo(a): + return (a ** 2).sum() + + x = ctors.randn(3) + with warnings.catch_warnings(record=True) as wa: + result = api(foo, x, vectorize=True) + self.assertEqual(len(wa), 0) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_raises_no_warnings(self, ctors): + return self._test_vectorize_raises_no_warnings(autogradF.jacobian, ctors) + + @FIXME_base_and_xfail_logging_tensor + def test_hessian_vectorize_raises_no_warnings(self, ctors): + return self._test_vectorize_raises_no_warnings(autogradF.hessian, ctors) + + @FIXME_xfail_vectorized_logging_tensor + def test_jacobian_err_check(self, vectorize, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3) + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + inp = ctors.rand(4) + with self.assertRaisesRegex(TypeError, "The inputs given to jacobian must be either a Tensor"): + res = autogradF.jacobian(foo, (inp, 2), vectorize=vectorize) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to jacobian must"): + res = autogradF.jacobian(bar, inp, vectorize=vectorize) + + res = autogradF.jacobian(foo, inp, vectorize=vectorize) + self._assert_interleaved_struct(res, foo(inp), inp) + + def foo(a, b): + return b, 3 * a.narrow(0, 0, 3) + + inp = (ctors.rand(4), ctors.rand(5)) + + res = autogradF.jacobian(foo, inp, vectorize=vectorize) + self._assert_interleaved_struct(res, foo(*inp), inp) + + @base_and_logging_tensor + def test_jacobian_err_check_strict(self, ctors): + def foo(a): + return a.detach() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone() + + inp = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): + res = autogradF.jacobian(foo, inp, strict=True) + res = autogradF.jacobian(foo, inp, strict=False) + self._assert_interleaved_struct(res, foo(inp), inp) + self.assertEqual(res.abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function is independent of input 0."): + res = autogradF.jacobian(bar, inp, strict=True) + res = autogradF.jacobian(bar, inp, strict=False) + self._assert_interleaved_struct(res, foo(inp), inp) + self.assertEqual(res.abs().sum(), 0.) 
+ + # The Jacobian does not depend on the input + def foo(a): + return a.clone() + + inp.requires_grad_() + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): + res = autogradF.jacobian(foo, inp, create_graph=True, strict=True) + res = autogradF.jacobian(foo, inp, create_graph=True, strict=False) + self._assert_interleaved_struct(res, inp, inp) + self.assertEqual(res, torch.eye(4)) + + @base_and_logging_tensor + def test_jacobian_err_check_strict_vectorize(self, ctors): + def foo(x): + return x + + inp = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "not supported together"): + res = autogradF.jacobian(foo, inp, strict=True, vectorize=True) + + @base_and_logging_tensor + def test_jacobian_no_grad(self, ctors): + def exp_reducer(x): + return x.exp().sum(dim=1) + + inputs = ctors.rand(4, 4) + with torch.no_grad(): + res = autogradF.jacobian(exp_reducer, inputs) + self.assertIsNone(res.grad_fn) + self.assertNotEqual(res, ctors.zeros(4, 4)) + + with torch.no_grad(): + res = autogradF.jacobian(exp_reducer, inputs, create_graph=True) + self.assertIsNotNone(res.grad_fn) + self.assertNotEqual(res, ctors.zeros(4, 4)) + + @FIXME_xfail_vectorized_logging_tensor + def test_jacobian_output(self, vectorize, ctors): + def exp_reducer(x): + return x.exp().sum(dim=1) + + inputs = ctors.rand(4, 4) + res = autogradF.jacobian(exp_reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, exp_reducer(inputs), inputs) + self.assertIsNone(res.grad_fn) + + def identity(x): + return x.clone() + + inputs = ctors.rand(4) + res = autogradF.jacobian(identity, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, identity(inputs), inputs) + self.assertIsNone(res.grad_fn) + self.assertEqual(res, torch.eye(4)) + + def add_exp_reducer(x, y): + return (x + y.exp()).sum(dim=1) + + inputs = (ctors.rand(4, 4), ctors.rand(4, 4)) + res = autogradF.jacobian(add_exp_reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, add_exp_reducer(*inputs), inputs) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + + @FIXME_xfail_vectorized_logging_tensor + def test_jacobian_scalar(self, vectorize, ctors): + def reducer(x): + return x.sum() + inputs = ctors.rand(4, 4) + res = autogradF.jacobian(reducer, inputs, vectorize=vectorize) + self._assert_same_struct(res, inputs) + + def expander(x): + return x.unsqueeze(0).repeat(4) + inputs = ctors.rand([]) + res = autogradF.jacobian(expander, inputs, vectorize=vectorize) + self._assert_same_struct(res, ctors.zeros(4)) + + @parametrize("vectorize", [True, False]) + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_create_graph(self, vectorize, ctors): + def exp_reducer(x): + return x.exp().sum(dim=1) + + inputs = ctors.rand(4, 4, dtype=torch.double, requires_grad=True) + res = autogradF.jacobian(exp_reducer, inputs, create_graph=True, vectorize=vectorize) + self._assert_interleaved_struct(res, exp_reducer(inputs), inputs) + self.assertIsNotNone(res.grad_fn) + + gradcheck(lambda inp: autogradF.jacobian(exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + gradgradcheck(lambda inp: autogradF.jacobian(exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + + def add_exp_reducer(x, y): + return (x + y).exp().sum(dim=1) + + inputs = (ctors.rand(4, 4, dtype=torch.double, requires_grad=True), + ctors.rand(4, 4, dtype=torch.double, requires_grad=True)) + res = autogradF.jacobian(add_exp_reducer, inputs, create_graph=True, 
vectorize=vectorize) + self._assert_interleaved_struct(res, add_exp_reducer(*inputs), inputs) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + + gradcheck(lambda *inp: autogradF.jacobian(add_exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + gradgradcheck(lambda *inp: autogradF.jacobian(add_exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + + def foo(x, y): + x = x.cos() + val, jac = autogradF.jacobian(add_exp_reducer, (x, y), create_graph=True, vectorize=vectorize) + + res = val[0].exp().sum() + val[1].exp().sum() + jac[0].exp().sum() + res = res + jac[1].exp().sum() + x.exp().sum() + y.exp().sum() + return res + + gradcheck(foo, inputs) + gradgradcheck(foo, inputs) + + def _check_jacobian_vectorize_correctness(self, f, inputs, test_forward_ad=True): + expected = autogradF.jacobian(f, inputs, vectorize=False) + result_backward_mode = autogradF.jacobian(f, inputs, vectorize=True) + self.assertEqual(result_backward_mode, expected) + + if test_forward_ad: + result_forward_mode = autogradF.jacobian(f, inputs, strategy="forward-mode", vectorize=True) + self.assertEqual(result_forward_mode, expected) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_simple(self, ctors): + def f(x): + return 3 * x ** 2 + + x = ctors.randn(2, 3, 5) + self._check_jacobian_vectorize_correctness(f, x) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_multi_input(self, ctors): + def f(x, y): + return (x.cos() * x) @ y.sin() + + x = ctors.randn(2, 3) + y = ctors.randn(3, 5) + self._check_jacobian_vectorize_correctness(f, (x, y)) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_multi_input_multi_output(self, ctors): + def f(x, y): + return (x * x) @ y, x @ (x.sum(1) * y), y.sum() + + x = ctors.randn(5, 3) + y = ctors.randn(3, 5) + self._check_jacobian_vectorize_correctness(f, (x, y)) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_unrelated_outputs(self, ctors): + def f(x, y): + return x, y, x, y + + x = ctors.randn(2) + y = ctors.randn(3) + self._check_jacobian_vectorize_correctness(f, (x, y)) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_zero_dim(self, ctors): + # zero-dim output + def f(x, y): + return x.sum(), y.sum(), x * y + + x = ctors.randn(3) + y = ctors.randn(3) + self._check_jacobian_vectorize_correctness(f, (x, y)) + + # zero-dim input + def g(x): + return torch.stack([x, x, x]) + + x = ctors.randn([]) + self._check_jacobian_vectorize_correctness(g, x) + + # Mixed zero-dim input / zero-dim output + def h(x, y): + return y.sum(), x * y + + x = ctors.randn([]) + y = ctors.randn(1) + self._check_jacobian_vectorize_correctness(h, (x, y)) + + @unittest.skipIf(not TEST_CUDA, "test requires CUDA") + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_different_devices(self, ctors): + def f(x, y): + return x * y, (x * y).cuda() + + x = ctors.randn(3) + y = ctors.randn(3) + self._check_jacobian_vectorize_correctness(f, (x, y)) + + @FIXME_base_and_xfail_logging_tensor + def test_jacobian_vectorize_correctness_different_dtype(self, ctors): + def f(x, y): + return (x * y).float(), (x * y).double() + + x = ctors.randn(3) + y = ctors.randn(3) + # The Jacobian computed using forward AD has the dtype of the output + # but the Jacobian computed with reverse AD has dtype of input + self._check_jacobian_vectorize_correctness(f, (x, y), test_forward_ad=False) + + 
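+    # Rough sketch of the pattern shared by the *_vectorize_correctness helpers in
+    # this class: the vectorized (vmap-based) path must agree with the non-vectorized
+    # reference path, e.g. for the Hessian:
+    #
+    #     expected = autogradF.hessian(f, inputs, vectorize=False)
+    #     result = autogradF.hessian(f, inputs, vectorize=True)
+    #     torch.testing.assert_close(result, expected)  # illustrative; the tests use assertEqual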
def _check_hessian_vectorize_correctness(self, f, inputs): + expected = autogradF.hessian(f, inputs, vectorize=False) + result = autogradF.hessian(f, inputs, vectorize=True) + self.assertEqual(result, expected) + + result_forward_mode = autogradF.hessian(f, inputs, outer_jacobian_strategy="forward-mode", vectorize=True) + self.assertEqual(result_forward_mode, expected) + + @FIXME_base_and_xfail_logging_tensor + def test_hessian_vectorize_correctness_simple(self, ctors): + def f(x): + return (3 * x ** 2).sum() + + x = ctors.randn(2, 3, 5) + self._check_hessian_vectorize_correctness(f, x) + + @FIXME_base_and_xfail_logging_tensor + def test_hessian_vectorize_correctness_multi_input(self, ctors): + def f(x, y, z): + return ((x.relu() * x) @ y.sin() @ z).sum() + + x = ctors.randn(2, 3) + y = ctors.randn(3, 5) + z = ctors.randn(5, 5) + self._check_hessian_vectorize_correctness(f, (x, y, z)) + + @FIXME_base_and_xfail_logging_tensor + def test_hessian_vectorize_correctness_unrelated_outputs(self, ctors): + # output unrelated to one input + def f(x, y): + return (x ** 2).sum() + + x = ctors.randn(2) + y = ctors.randn(3) + self._check_hessian_vectorize_correctness(f, (x, y)) + + # output unrelated to all inputs + def f(x, y): + return ctors.ones([]) + + x = ctors.randn(2) + y = ctors.randn(3) + self._check_hessian_vectorize_correctness(f, (x, y)) + + @FIXME_xfail_vectorized_logging_tensor + def test_hessian_err_check(self, vectorize, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + def bar2(a): + return 3 * a.narrow(0, 0, 3) + + def bar3(a): + return 3 * a.narrow(0, 0, 3), 3 * a.narrow(0, 0, 3) + + inp = ctors.rand(4) + with self.assertRaisesRegex(TypeError, "The inputs given to hessian must be either a Tensor"): + res = autogradF.hessian(foo, (inp, 2), vectorize=vectorize) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to hessian must"): + res = autogradF.hessian(bar, inp, vectorize=vectorize) + + err_msg_out = "The Tensor returned by the function given to hessian should contain a single element" + with self.assertRaisesRegex(RuntimeError, err_msg_out): + res = autogradF.hessian(bar2, inp, vectorize=vectorize) + + with self.assertRaisesRegex(RuntimeError, "The function given to hessian should return a single Tensor"): + res = autogradF.hessian(bar3, inp, vectorize=vectorize) + + res = autogradF.hessian(foo, inp, vectorize=vectorize) + self._assert_interleaved_struct(res, inp, inp) + + def foo(a, b): + return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() + + inp = (ctors.rand(4), ctors.rand(5)) + + res = autogradF.hessian(foo, inp, vectorize=vectorize) + self._assert_interleaved_struct(res, inp, inp) + + @base_and_logging_tensor + def test_hessian_err_check_strict(self, ctors): + def foo(a): + return a.detach().sum() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone().sum() + + def bar2(a): + # A Linear function for which the jacobian is independent of the input + return (3 * a).sum() + + inp = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): + res = autogradF.hessian(foo, inp, strict=True) + res = autogradF.hessian(foo, inp, strict=False) + self._assert_interleaved_struct(res, inp, inp) + self.assertEqual(res.abs().sum(), 0.) 
+ + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0"): + res = autogradF.hessian(bar, inp, strict=True) + res = autogradF.hessian(bar, inp, strict=False) + self._assert_interleaved_struct(res, inp, inp) + self.assertEqual(res.abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): + res = autogradF.hessian(bar2, inp, strict=True) + res = autogradF.hessian(bar2, inp, strict=False) + self._assert_interleaved_struct(res, inp, inp) + self.assertEqual(res.abs().sum(), 0.) + + @base_and_logging_tensor + def test_hessian_err_check_strict_vectorize(self, ctors): + def foo(x): + return (x ** 3).sum() + + inp = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "not supported together"): + res = autogradF.hessian(foo, inp, strict=True, vectorize=True) + + @base_and_logging_tensor + def test_hessian_no_grad(self, ctors): + def pow_reducer(x): + return x.pow(3).sum() + + inputs = ctors.rand(2, 2) + with torch.no_grad(): + res = autogradF.hessian(pow_reducer, inputs) + self.assertIsNone(res[0][0].grad_fn) + self.assertIsNone(res[0][1].grad_fn) + self.assertIsNone(res[1][0].grad_fn) + self.assertIsNone(res[1][1].grad_fn) + self.assertNotEqual(res, ctors.zeros(2, 2, 2)) + + with torch.no_grad(): + res = autogradF.hessian(pow_reducer, inputs, create_graph=True) + self.assertIsNotNone(res[0][0].grad_fn) + self.assertIsNotNone(res[0][1].grad_fn) + self.assertIsNotNone(res[1][0].grad_fn) + self.assertIsNotNone(res[1][1].grad_fn) + self.assertNotEqual(res, ctors.zeros(2, 2, 2)) + + @FIXME_xfail_vectorized_logging_tensor + def test_hessian_output(self, vectorize, ctors): + def pow_reducer(x): + return x.pow(3).sum() + + inputs = ctors.rand(2, 2) + res = autogradF.hessian(pow_reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + self.assertIsNone(res.grad_fn) + + def add_pow_reducer(x, y): + return (x + y).pow(3).sum() + + inputs = (ctors.rand(2, 2), ctors.rand(2, 2)) + res = autogradF.hessian(add_pow_reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + self.assertIsNone(res[0][0].grad_fn) + self.assertIsNone(res[0][1].grad_fn) + self.assertIsNone(res[1][0].grad_fn) + self.assertIsNone(res[1][1].grad_fn) + + @parametrize("vectorize", [True, False]) + @base_and_logging_tensor + def test_hessian_scalar(self, vectorize, ctors): + def reducer(x): + return x.sum() + inputs = ctors.rand(4, 4) + res = autogradF.hessian(reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + + inputs = ctors.rand([]) + res = autogradF.hessian(reducer, inputs, vectorize=vectorize) + self._assert_same_struct(res, inputs) + + def bad_reducer(x): + return x.sum().view(1, 1, 1) + inputs = ctors.rand(4, 4) + res = autogradF.hessian(bad_reducer, inputs, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + + @parametrize("vectorize", [True, False]) + @FIXME_base_and_xfail_logging_tensor + def test_hessian_create_graph(self, vectorize, ctors): + def pow_reducer(x): + return x.pow(3).sum() + + inputs = ctors.rand(2, 2, dtype=torch.double, requires_grad=True) + res = autogradF.hessian(pow_reducer, inputs, create_graph=True, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + self.assertIsNotNone(res.grad_fn) + + gradcheck(lambda inp: autogradF.hessian(pow_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + 
gradgradcheck(lambda inp: autogradF.hessian(pow_reducer, inp, create_graph=True, vectorize=vectorize), inputs) + + def add_pow_reducer(x, y): + return (x + y).pow(3).sum() + + inputs = (ctors.rand(2, 2, dtype=torch.double, requires_grad=True), + ctors.rand(2, 2, dtype=torch.double, requires_grad=True)) + res = autogradF.hessian(add_pow_reducer, inputs, create_graph=True, vectorize=vectorize) + self._assert_interleaved_struct(res, inputs, inputs) + self.assertIsNotNone(res[0][0].grad_fn) + self.assertIsNotNone(res[0][1].grad_fn) + self.assertIsNotNone(res[1][0].grad_fn) + self.assertIsNotNone(res[1][1].grad_fn) + + def flatten(inp): + return tuple(el_lvl2 for el_lvl1 in inp for el_lvl2 in el_lvl1) + + gradcheck(lambda *inp: flatten(autogradF.hessian(add_pow_reducer, inp, create_graph=True, vectorize=vectorize)), inputs) + gradgradcheck(lambda *inp: flatten(autogradF.hessian(add_pow_reducer, inp, create_graph=True, vectorize=vectorize)), inputs) + + def foo(x, y): + x = x.cos() + val, hess = autogradF.hessian(add_pow_reducer, (x, y), create_graph=True, vectorize=vectorize) + + res = val[0].cos().sum() + val[1].cos().sum() + hess[0].cos().sum() + res = res + hess[1].cos().sum() + x.cos().sum() + y.cos().sum() + return res + + gradcheck(foo, inputs) + gradgradcheck(foo, inputs) + + @base_and_logging_tensor + def test_vhp_err_check(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + def bar2(a): + return 3 * a.narrow(0, 0, 3) + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(TypeError, "The inputs given to vhp must be either a Tensor"): + res = autogradF.vhp(foo, (inp, 2), v) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to vhp must"): + res = autogradF.vhp(bar, inp, v) + + err_msg_out = "The Tensor returned by the function given to vhp should contain a single element" + with self.assertRaisesRegex(RuntimeError, err_msg_out): + res = autogradF.vhp(bar2, inp, v) + + with self.assertRaisesRegex(RuntimeError, "v has invalid size:"): + res = autogradF.vhp(foo, inp, ctors.rand(5)) + + with self.assertRaisesRegex(TypeError, "The v given to vhp must be either a Tensor or a tuple of Tensors"): + res = autogradF.vhp(foo, inp, (v, 2)) + + res = autogradF.vhp(foo, inp, v) + self._assert_same_struct(res[1], inp) + + def foo(a, b): + return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() + + inp = (ctors.rand(4), ctors.rand(5)) + v = (ctors.rand(4), ctors.rand(5)) + + res = autogradF.vhp(foo, inp, v) + self._assert_same_struct(res[1], inp) + + @base_and_logging_tensor + def test_vhp_err_check_strict(self, ctors): + def foo(a): + return a.detach().sum() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone().sum() + + def bar2(a): + # A Linear function for which the jacobian is independent of the input + return (3 * a).sum() + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): + res = autogradF.vhp(foo, inp, v, strict=True) + res = autogradF.vhp(foo, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) 
+ + with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): + res = autogradF.vhp(bar, inp, v, strict=True) + res = autogradF.vhp(bar, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): + res = autogradF.vhp(bar2, inp, v, strict=True) + res = autogradF.vhp(bar2, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + @base_and_logging_tensor + def test_vhp_no_grad(self, ctors): + def reducer(x): + return x.exp().sum() + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + with torch.no_grad(): + res = autogradF.vhp(reducer, inputs, v) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + with torch.no_grad(): + res = autogradF.vhp(reducer, inputs, v, create_graph=True) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + @base_and_logging_tensor + def test_vhp_output(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.vhp(foo, inputs, v) + self._assert_same_struct(res[1], inputs) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + + def bar(a, b): + return (a + 3 * b.narrow(0, 0, 3)).exp().sum() + + inputs = (ctors.rand(3), ctors.rand(4)) + v = (ctors.ones(3), ctors.ones(4)) + out, vhp_val = autogradF.vhp(bar, inputs, v) + self._assert_same_struct(vhp_val, inputs) + self.assertIsNone(out.grad_fn) + self.assertIsNone(vhp_val[0].grad_fn) + self.assertIsNone(vhp_val[1].grad_fn) + + @base_and_logging_tensor + def test_vhp_scalar(self, ctors): + def reducer(x): + return x.sum() + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.vhp(reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + inputs = ctors.rand([]) + v = ctors.rand([]) + res = autogradF.vhp(reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + res = autogradF.vhp(reducer, inputs) + self._assert_same_struct(res[1], inputs) + + def bad_reducer(x): + return x.sum().view(1, 1, 1) + inputs = ctors.rand(4, 4) + v = ctors.rand(4, 4) + res = autogradF.vhp(bad_reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + @FIXME_base_and_xfail_logging_tensor + def test_vhp_create_graph(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + inputs = ctors.rand(4, 4, dtype=torch.double, requires_grad=True) + v = ctors.ones(4, 4, dtype=torch.double, requires_grad=True) + res = autogradF.vhp(foo, inputs, v, create_graph=True) + self._assert_same_struct(res[1], inputs) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + + gradcheck(lambda inp, v: autogradF.vhp(foo, inp, v, create_graph=True), (inputs, v)) + gradgradcheck(lambda inp, v: autogradF.vhp(foo, inp, v, create_graph=True), (inputs, v)) + + def bar(a, b): + return (a + 3 * b.narrow(0, 0, 3)).exp().sum() + + inputs = (ctors.rand(3, dtype=torch.double, requires_grad=True), + ctors.rand(4, dtype=torch.double, requires_grad=True)) + v = (ctors.ones(3, dtype=torch.double, requires_grad=True), + ctors.ones(4, dtype=torch.double, requires_grad=True)) + out, vhp_val = autogradF.vhp(bar, inputs, v, create_graph=True) + self._assert_same_struct(vhp_val, inputs) + 
self.assertIsNotNone(out.grad_fn) + self.assertIsNotNone(vhp_val[0].grad_fn) + self.assertIsNotNone(vhp_val[1].grad_fn) + + gradcheck(lambda *args: autogradF.vhp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) + gradgradcheck(lambda *args: autogradF.vhp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) + + def foo(*args): + x, y = args[:2] + v = args[2:] + + x = x.cos() + val, grad = autogradF.vhp(bar, (x, y), v, create_graph=True) + + return val.cos() + grad[0].cos().sum() + grad[1].cos() + x.cos().sum() + y.cos() + + gradcheck(foo, inputs + v) + gradgradcheck(foo, inputs + v) + + @base_and_logging_tensor + def test_hvp_err_check(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + def bar(a): + return 3 * a.narrow(0, 0, 3), "bar" + + def bar2(a): + return 3 * a.narrow(0, 0, 3) + + inp = ctors.rand(4) + v = ctors.rand(4) + res = autogradF.hvp(foo, inp, v) + with self.assertRaisesRegex(TypeError, "The inputs given to hvp must be either a Tensor"): + res = autogradF.hvp(foo, (inp, 2), v) + + with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to hvp must"): + res = autogradF.hvp(bar, inp, v) + + err_msg_out = "The Tensor returned by the function given to hvp should contain a single element" + with self.assertRaisesRegex(RuntimeError, err_msg_out): + res = autogradF.hvp(bar2, inp, v) + + with self.assertRaisesRegex(RuntimeError, "v has invalid size:"): + res = autogradF.hvp(foo, inp, ctors.rand(5)) + + with self.assertRaisesRegex(TypeError, "The v given to hvp must be either a Tensor or a tuple of Tensors"): + res = autogradF.hvp(foo, inp, (v, 2)) + + res = autogradF.hvp(foo, inp, v) + self._assert_same_struct(res[1], inp) + + def foo(a, b): + return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() + + inp = (ctors.rand(4), ctors.rand(5)) + v = (ctors.rand(4), ctors.rand(5)) + + res = autogradF.hvp(foo, inp, v) + self._assert_same_struct(res[1], inp) + + @base_and_logging_tensor + def test_hvp_err_check_strict(self, ctors): + def foo(a): + return a.detach().sum() + + def bar(a): + # Make a non-leaf Tensor that requires_grad but that is not connected to the input + return a.long().float().requires_grad_().clone().sum() + + def bar2(a): + # A Linear function for which the jacobian is independent of the input + return (3 * a).sum() + + inp = ctors.rand(4) + v = ctors.rand(4) + with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): + res = autogradF.hvp(foo, inp, v, strict=True) + res = autogradF.hvp(foo, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): + res = autogradF.hvp(bar, inp, v, strict=True) + res = autogradF.hvp(bar, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) + + with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): + res = autogradF.hvp(bar2, inp, v, strict=True) + res = autogradF.hvp(bar2, inp, v, strict=False) + self._assert_same_struct(res[1], inp) + self.assertEqual(res[1].abs().sum(), 0.) 
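# Editorial sketch: the hvp/vhp quantities exercised above relate to the full Hessian
# in the obvious way -- for a scalar-valued f, hvp(f, x, v) is H @ v and vhp(f, x, v)
# is v @ H, which is exactly what test_hessian_match_vhp_hvp below verifies. A minimal
# standalone check, assuming only the public torch.autograd.functional API:
import torch
from torch.autograd import functional as autogradF

def f(x):
    return (x ** 3).sum()

x = torch.rand(4, dtype=torch.double)
v = torch.rand(4, dtype=torch.double)
hess = autogradF.hessian(f, x)
_, hvp_val = autogradF.hvp(f, x, v)
_, vhp_val = autogradF.vhp(f, x, v)
assert torch.allclose(hvp_val, hess @ v)
assert torch.allclose(vhp_val, v @ hess)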
+ + @base_and_logging_tensor + def test_hvp_no_grad(self, ctors): + def reducer(x): + return x.exp().sum() + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + with torch.no_grad(): + res = autogradF.hvp(reducer, inputs, v) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + with torch.no_grad(): + res = autogradF.hvp(reducer, inputs, v, create_graph=True) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + self.assertNotEqual(res[1], ctors.zeros(4, 4)) + + @base_and_logging_tensor + def test_hvp_output(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.hvp(foo, inputs, v) + self._assert_same_struct(res[1], inputs) + self.assertIsNone(res[0].grad_fn) + self.assertIsNone(res[1].grad_fn) + + def bar(a, b): + return (a + 3 * b.narrow(0, 0, 3)).exp().sum() + + inputs = (ctors.rand(3), ctors.rand(4)) + v = (ctors.ones(3), ctors.ones(4)) + out, hvp_val = autogradF.hvp(bar, inputs, v) + self._assert_same_struct(hvp_val, inputs) + self.assertIsNone(out.grad_fn) + self.assertIsNone(hvp_val[0].grad_fn) + self.assertIsNone(hvp_val[1].grad_fn) + + @base_and_logging_tensor + def test_hvp_scalar(self, ctors): + def reducer(x): + return x.exp().sum() + inputs = ctors.rand(4, 4) + v = ctors.ones(4, 4) + res = autogradF.hvp(reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + inputs = ctors.rand([]) + v = ctors.rand([]) + res = autogradF.hvp(reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + res = autogradF.hvp(reducer, inputs) + self._assert_same_struct(res[1], inputs) + + def bad_reducer(x): + return x.exp().sum().view(1, 1, 1) + inputs = ctors.rand(4, 4) + v = ctors.rand(4, 4) + res = autogradF.hvp(bad_reducer, inputs, v) + self._assert_same_struct(res[1], inputs) + + @FIXME_base_and_xfail_logging_tensor + def test_hvp_create_graph(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + inputs = ctors.rand(4, 4, dtype=torch.double, requires_grad=True) + v = ctors.ones(4, 4, dtype=torch.double, requires_grad=True) + res = autogradF.hvp(foo, inputs, v, create_graph=True) + self._assert_same_struct(res[1], inputs) + self.assertIsNotNone(res[0].grad_fn) + self.assertIsNotNone(res[1].grad_fn) + + gradcheck(lambda inp, v: autogradF.hvp(foo, inp, v, create_graph=True), (inputs, v)) + gradgradcheck(lambda inp, v: autogradF.hvp(foo, inp, v, create_graph=True), (inputs, v)) + + def bar(a, b): + return (a + 3 * b.narrow(0, 0, 3)).exp().sum() + + inputs = (ctors.rand(3, dtype=torch.double, requires_grad=True), + ctors.rand(4, dtype=torch.double, requires_grad=True)) + v = (ctors.ones(3, dtype=torch.double, requires_grad=True), + ctors.ones(4, dtype=torch.double, requires_grad=True)) + out, hvp_val = autogradF.hvp(bar, inputs, v, create_graph=True) + self._assert_same_struct(hvp_val, inputs) + self.assertIsNotNone(out.grad_fn) + self.assertIsNotNone(hvp_val[0].grad_fn) + self.assertIsNotNone(hvp_val[1].grad_fn) + + gradcheck(lambda *args: autogradF.hvp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) + gradgradcheck(lambda *args: autogradF.hvp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) + + def foo(*args): + x, y = args[:2] + v = args[2:] + + x = x.cos() + val, grad = autogradF.hvp(bar, (x, y), v, create_graph=True) + + return val.cos() + grad[0].cos().sum() + grad[1].cos() + x.cos().sum() + y.cos() + + gradcheck(foo, inputs + v) + gradgradcheck(foo, 
inputs + v) + + @base_and_logging_tensor + def test_jacobian_match_vjp_jvp(self, ctors): + def foo(x): + return x ** 3 + x.sum() + + inputs = ctors.rand(4) + v = ctors.rand(4) + + jac = autogradF.jacobian(foo, inputs) + jvp = autogradF.jvp(foo, inputs, v)[1] + vjp = autogradF.vjp(foo, inputs, v)[1] + + self.assertEqual(jvp, torch.mm(jac, v.unsqueeze(1)).squeeze(1)) + self.assertEqual(vjp, torch.mm(v.unsqueeze(0), jac).squeeze(0)) + + @base_and_logging_tensor + def test_hessian_match_vhp_hvp(self, ctors): + def foo(a): + return 3 * a.narrow(0, 0, 3).exp().sum() + + inputs = ctors.rand(4) + v = ctors.rand(4) + + hes = autogradF.hessian(foo, inputs) + hvp = autogradF.hvp(foo, inputs, v)[1] + vhp = autogradF.vhp(foo, inputs, v)[1] + + self.assertEqual(hvp, torch.mm(hes, v.unsqueeze(1)).squeeze(1)) + self.assertEqual(vhp, torch.mm(v.unsqueeze(0), hes).squeeze(0)) + +instantiate_parametrized_tests(TestAutogradFunctional) + +if __name__ == '__main__': + run_tests() diff --git a/test/benchmark_utils/test_benchmark_utils.py b/test/benchmark_utils/test_benchmark_utils.py index a98c0ac97b4c92..a1e2adaacfa913 100644 --- a/test/benchmark_utils/test_benchmark_utils.py +++ b/test/benchmark_utils/test_benchmark_utils.py @@ -170,6 +170,7 @@ def test_timer(self): @slowTest @unittest.skipIf(IS_SANDCASTLE, "C++ timing is OSS only.") + @unittest.skipIf(True, "Failing on clang, see 74398") def test_timer_tiny_fast_snippet(self): timer = benchmark_utils.Timer( 'auto x = 1;(void)x;', @@ -181,6 +182,7 @@ def test_timer_tiny_fast_snippet(self): @slowTest @unittest.skipIf(IS_SANDCASTLE, "C++ timing is OSS only.") + @unittest.skipIf(True, "Failing on clang, see 74398") def test_cpp_timer(self): timer = benchmark_utils.Timer( """ @@ -547,6 +549,7 @@ def add_one(x): @slowTest @unittest.skipIf(IS_WINDOWS, "Valgrind is not supported on Windows.") @unittest.skipIf(IS_SANDCASTLE, "Valgrind is OSS only.") + @unittest.skipIf(True, "Failing on clang, see 74398") def test_collect_cpp_callgrind(self): timer = benchmark_utils.Timer( "x += 1;", diff --git a/test/cpp/api/dataloader.cpp b/test/cpp/api/dataloader.cpp index c0622ba41cbd16..9b71b721b3db93 100644 --- a/test/cpp/api/dataloader.cpp +++ b/test/cpp/api/dataloader.cpp @@ -1982,7 +1982,7 @@ TEST(DataLoaderTest, ChunkDatasetSave) { for (const auto epoch_index : c10::irange(epoch_count)) { (void)epoch_index; // Suppress unused variable warning - int iteration_count = 0; + unsigned iteration_count = 0; for (auto iterator = data_loader->begin(); iterator != data_loader->end(); ++iterator, ++iteration_count) { if ((iteration_count + 1) % save_interval == 0) { @@ -2316,7 +2316,7 @@ TEST(DataLoaderTest, CustomPreprocessPolicy) { ++iterator) { auto batch_result = *iterator; if (batch_result.size() > chunk_size * cross_chunk_shuffle_count) { - for (int i = 0; i < batch_result.size(); i += chunk_size) { + for (unsigned i = 0; i < batch_result.size(); i += chunk_size) { ASSERT_TRUE(std::is_sorted( batch_result.begin() + i, batch_result.begin() + i + chunk_size)); diff --git a/test/cpp/api/init.cpp b/test/cpp/api/init.cpp index 9e2ed422e28beb..222d4f1171c4d1 100644 --- a/test/cpp/api/init.cpp +++ b/test/cpp/api/init.cpp @@ -19,7 +19,7 @@ void check_exact_values( auto layerParameters = parameters[i]; auto expectedLayerParameters = expected_parameters[i]; - if (layerParameters.size(0) != expectedLayerParameters.size()) { + if (static_cast(layerParameters.size(0)) != expectedLayerParameters.size()) { std::cout << "layer #" << i << " layerParameters size: " << layerParameters.size(0) << " 
!= " diff --git a/test/cpp/api/misc.cpp b/test/cpp/api/misc.cpp index a8d6320e9533d5..734cea27e5cca7 100644 --- a/test/cpp/api/misc.cpp +++ b/test/cpp/api/misc.cpp @@ -90,3 +90,14 @@ TEST(UtilsTest, AmbiguousOperatorDefaults) { at::_test_ambiguous_defaults(tmp, 1, 1); at::_test_ambiguous_defaults(tmp, 2, "2"); } + +int64_t get_first_element(c10::OptionalIntArrayRef arr) { + return arr.value()[0]; +} + +TEST(OptionalArrayRefTest, DanglingPointerFix) { + // Ensure that the converting constructor of `OptionalArrayRef` does not + // create a dangling pointer when given a single value + ASSERT_TRUE(get_first_element(300) == 300); + ASSERT_TRUE(get_first_element({400}) == 400); +} diff --git a/test/cpp/api/nn_utils.cpp b/test/cpp/api/nn_utils.cpp index 451c72e9d7762a..be371b1ae6d49a 100644 --- a/test/cpp/api/nn_utils.cpp +++ b/test/cpp/api/nn_utils.cpp @@ -615,7 +615,7 @@ TEST_F(NNUtilsTest, PackPaddedSequence) { } int64_t offset = 0; std::vector tensors_to_be_cat; - for (int64_t i = 1; i < sorted_lengths.size() + 1; i++) { + for (int64_t i = 1; i < static_cast(sorted_lengths.size() + 1); i++) { int64_t l = sorted_lengths.at(i-1); tensors_to_be_cat.emplace_back(pad(i * 100 + torch::arange(1., 5 * l + 1).view({l, 1, 5}), max_length)); } diff --git a/test/cpp/api/parameterdict.cpp b/test/cpp/api/parameterdict.cpp index 5f2eab5d6b289e..21dd1b31d5a88c 100644 --- a/test/cpp/api/parameterdict.cpp +++ b/test/cpp/api/parameterdict.cpp @@ -105,7 +105,7 @@ TEST_F(ParameterDictTest, Values) { auto dict = torch::nn::ParameterDict(params); std::vector values = dict->values(); std::vector true_values{ta, tb, tc}; - for (auto i = 0; i < values.size(); i += 1) { + for (auto i = 0U; i < values.size(); i += 1) { ASSERT_TRUE(torch::all(torch::eq(values[i], true_values[i])).item()); } } diff --git a/test/cpp/api/serialize.cpp b/test/cpp/api/serialize.cpp index b422662aa3623f..ecad2348674b79 100644 --- a/test/cpp/api/serialize.cpp +++ b/test/cpp/api/serialize.cpp @@ -129,7 +129,7 @@ void test_serialize_optimizer(DerivedOptimizerOptions options, bool only_has_glo // optim3_2 and optim1 should have param_groups and state of size 1 and state_size respectively ASSERT_TRUE(optim3_2_param_groups.size() == 1); // state_size = 2 for all optimizers except LBFGS as LBFGS only maintains one global state - int state_size = only_has_global_state ? 1 : 2; + unsigned state_size = only_has_global_state ? 1 : 2; ASSERT_TRUE(optim3_2_state.size() == state_size); // optim3_2 and optim1 should have param_groups and state of same size @@ -355,6 +355,7 @@ TEST(SerializeTest, ErrorOnMissingKey) { // We want the errors to contain hierarchy information, too. 
ASSERT_THROWS_WITH( torch::load(model2, stream), "No such serialized tensor 'a.b.x'"); + stream.seekg(0, stream.beg); ASSERT_THROWS_WITH( torch::load(model3, stream), "No such serialized submodule: 'a.x'"); } diff --git a/test/cpp/jit/CMakeLists.txt b/test/cpp/jit/CMakeLists.txt index 7e591925f19443..0c36d22c8dd956 100644 --- a/test/cpp/jit/CMakeLists.txt +++ b/test/cpp/jit/CMakeLists.txt @@ -95,8 +95,11 @@ set(JIT_TEST_SRCS ) if(USE_CUDA) - list(APPEND JIT_TEST_SRCS ${JIT_TEST_ROOT}/test_gpu.cpp) - list(APPEND JIT_TEST_SRCS ${JIT_TEST_ROOT}/test_gpu_shift.cpp) + list(APPEND JIT_TEST_SRCS ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/test/test_gpu.cpp) + list(APPEND JIT_TEST_SRCS ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/test/test_gpu_fused_reduction.cpp) + list(APPEND JIT_TEST_SRCS ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/test/test_gpu_shift.cpp) + list(APPEND JIT_TEST_SRCS ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/test/test_gpu_tensorcore.cpp) + list(APPEND JIT_TEST_SRCS ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/test/test_gpu_view.cpp) endif() add_executable(test_jit diff --git a/test/cpp/jit/source_range_test.cpp b/test/cpp/jit/source_range_test.cpp deleted file mode 100644 index 244db9c0085cd3..00000000000000 --- a/test/cpp/jit/source_range_test.cpp +++ /dev/null @@ -1,32 +0,0 @@ -#include -#include - -using namespace ::testing; -using namespace ::torch::jit; - -TEST(SourceRangeTest, test_find) { - std::vector> strings; - strings.push_back(std::make_shared("hello world")); - strings.push_back(std::make_shared("nihaoma")); - - std::vector pieces{*strings[0], *strings[1]}; - - StringCordView view(pieces, strings); - - auto x = view.find("rldni", 0); - EXPECT_EQ(x, 8); -} - -TEST(SourceRangeTest, test_substr) { - std::vector> strings; - strings.push_back(std::make_shared("hello world")); - strings.push_back(std::make_shared("nihaoma")); - - std::vector pieces{*strings[0], *strings[1]}; - - StringCordView view(pieces, strings); - - auto x = view.substr(4, 10).str(); - EXPECT_EQ(x, view.str().substr(4, 10)); - EXPECT_EQ(view.substr(0, view.size()).str(), view.str()); -} diff --git a/test/cpp/jit/test_autodiff.cpp b/test/cpp/jit/test_autodiff.cpp index e8bfefe642630d..6a087adb63c851 100644 --- a/test/cpp/jit/test_autodiff.cpp +++ b/test/cpp/jit/test_autodiff.cpp @@ -289,14 +289,11 @@ class AutodiffRemoveUnusedGradientsTest : public ::testing::Test { void SetUp() override { prev_exec = getExecutorMode(); getExecutorMode() = true; - prev_profiling = getProfilingMode(); - getProfilingMode() = true; prev_inline_autodiff = getAutodiffSubgraphInlining(); debugSetAutodiffSubgraphInlining(false); } void TearDown() override { getExecutorMode() = prev_exec; - getProfilingMode() = prev_profiling; debugSetAutodiffSubgraphInlining(prev_inline_autodiff); } diff --git a/test/cpp/jit/test_backend.cpp b/test/cpp/jit/test_backend.cpp index 2b5de4a146e89a..978daa08d94ddb 100644 --- a/test/cpp/jit/test_backend.cpp +++ b/test/cpp/jit/test_backend.cpp @@ -143,38 +143,6 @@ TEST(BackendTest, TestCompiler) { AT_ASSERT(mres.toTensor().equal(ref.toTensor())); } -TEST(BackendTest, TestCompilerWithStringTable) { - setShouldUseFormatWithStringTable(true); - Module m("m"); - m.define(R"( - def forward(self, x, h): - return x + h - )"); - - std::vector inputs; - inputs.emplace_back(2.0 * torch::ones({})); - inputs.emplace_back(1.0 * torch::ones({})); - auto ref = m.forward(inputs); - - c10::Dict compile_spec(StringType::get(), AnyType::get()); - c10::Dict fake_dict(StringType::get(), AnyType::get()); - fake_dict.insert("", ""); - 
compile_spec.insert("forward", fake_dict); - auto any_dict_ty = DictType::create(StringType::get(), AnyType::get()); - // lowered module - auto lm = torch::jit::detail::codegen_backend_module( - "backend_with_compiler_demo", m, compile_spec, any_dict_ty); - auto res = lm.forward(inputs); - AT_ASSERT(res.toTensor().equal(ref.toTensor())); - - std::stringstream ss; - lm._save_for_mobile(ss); - auto mlm = _load_for_mobile(ss); - auto mres = mlm.forward(inputs); - setShouldUseFormatWithStringTable(false); - AT_ASSERT(mres.toTensor().equal(ref.toTensor())); -} - TEST(BackendTest, TestComposite) { c10::Dict compile_spec(StringType::get(), AnyType::get()); c10::Dict fake_dict(StringType::get(), AnyType::get()); @@ -308,6 +276,7 @@ TEST(BackendTest, TestConsistencyOfCompositeWithSetStates) { c._save_for_mobile(ss); auto mc = _load_for_mobile(ss); auto res_mobile = mc.forward(inputs); + ss.seekg(0, ss.beg); // check if the methods names are always the same // by reloading the script module and saving it back as mobile @@ -415,56 +384,6 @@ Traceback of TorchScript (most recent call last): ASSERT_THROWS_WITH_MESSAGE(mlm.forward(inputs), error_pattern); } -TEST(BackendTestDebugInfo, TestCompilerWithStringTable) { - setShouldUseFormatWithStringTable(true); - Module m("m"); - m.define(R"( - def forward(self, x, h): - return x + h - )"); - - std::vector inputs; - inputs.emplace_back(torch::rand({2, 4})); - inputs.emplace_back(torch::rand({13, 9})); - - c10::Dict compile_spec(StringType::get(), AnyType::get()); - c10::Dict fake_dict(StringType::get(), AnyType::get()); - fake_dict.insert("", ""); - compile_spec.insert("forward", fake_dict); - auto any_dict_ty = DictType::create(StringType::get(), AnyType::get()); - // lowered module - auto lm = torch::jit::detail::codegen_backend_module( - "backend_with_compiler_demo", m, compile_spec, any_dict_ty); - - std::stringstream ss; - lm._save_for_mobile(ss, ExtraFilesMap(), true); - auto mlm = _load_for_mobile(ss); - std::string error_pattern = R"( - Module hierarchy:top(m)::.__loweredModule__(m)::forward.aten::add -Traceback of TorchScript (most recent call last): - File "", line 3, in - - def forward(self, x: Tensor, h: Tensor): - return self.__loweredModule__.forward(x, h) - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE - - File "", line 5, in forward - typed_inputs: List[Any] = [x, h, ] - if self.__backend.is_available() : - _0, = self.__backend.execute(self.__handles["forward"], typed_inputs) - ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE - assert isinstance(_0, Tensor) - return _0 - File "", line 3, in - - def forward(self, x, h): - return x + h - ~~~~~ <--- HERE - )"; - setShouldUseFormatWithStringTable(false); - ASSERT_THROWS_WITH_MESSAGE(mlm.forward(inputs), error_pattern); -} - TEST(BackendTestDebugInfo, TestExceptionStackForCompilerWithModuleHierarchy) { Module a("A"); a.define(R"( diff --git a/test/cpp/jit/test_flatbuffer.cpp b/test/cpp/jit/test_flatbuffer.cpp index 76c34389488aa7..0abb84c1268ea6 100644 --- a/test/cpp/jit/test_flatbuffer.cpp +++ b/test/cpp/jit/test_flatbuffer.cpp @@ -23,6 +23,7 @@ #include #include +#include #include #include // Tests go in torch::jit @@ -137,6 +138,22 @@ TEST(FlatbufferTest, MethodInvocation) { // NOLINT (use =delete in gtest) } } +#if defined(ENABLE_FLATBUFFER) && !defined(FB_XPLAT_BUILD) +TEST(FlatbufferTest, FlatbufferBackPortTest) { + Module m("m"); + m.define(R"( + def forward(self, input: Tensor, scale:float): + return torch.upsample_nearest2d(input, [1, 1], float(scale), float(scale)) + )"); + std::stringstream ss; + 
m._save_for_mobile(ss, {}, false, true); + + std::stringstream oss; + bool backPortSuccess = _backport_for_mobile(ss, oss, 5); + ASSERT_TRUE(backPortSuccess); +} +#endif // defined(ENABLE_FLATBUFFER) && !defined(FB_XPLAT_BUILD) + TEST(FlatbufferTest, ExtraFiles) { const auto script = R"JIT( def forward(self): @@ -153,16 +170,30 @@ TEST(FlatbufferTest, ExtraFiles) { extra_files["metadata.json"] = "abc"; extra_files["mobile_info.json"] = "{\"key\": 23}"; + std::unordered_map loaded_extra_files; +#if defined ENABLE_FLATBUFFER + std::stringstream ss; + module->_save_for_mobile(ss, extra_files, true, /*use_flatbuffer=*/true); + + loaded_extra_files["metadata.json"] = ""; + auto mobile_module = _load_for_mobile(ss, c10::nullopt, loaded_extra_files); + + ASSERT_EQ(loaded_extra_files["metadata.json"], "abc"); + ASSERT_EQ(loaded_extra_files["mobile_info.json"], "{\"key\": 23}"); + + // load it twice using the same stream + auto mobile_module2 = _load_for_mobile(ss, c10::nullopt, loaded_extra_files); +#else CompilationOptions options; mobile::Module bc = jitModuleToMobile(*module, options); auto buff = save_mobile_module_to_bytes(bc, extra_files); - std::unordered_map loaded_extra_files; loaded_extra_files["metadata.json"] = ""; auto* flatbuffer_module = mobile::serialization::GetMutableModule(buff.data()); parseExtraFiles(flatbuffer_module, loaded_extra_files); +#endif ASSERT_EQ(loaded_extra_files["metadata.json"], "abc"); ASSERT_EQ(loaded_extra_files["mobile_info.json"], "{\"key\": 23}"); @@ -235,6 +266,23 @@ TEST(FlatbufferTest, Inline) { AT_ASSERT(output.toTensor().item() == 7.0); } +#if defined ENABLE_FLATBUFFER +TEST(FlatbufferTest, GetByteCodeVersion) { + Module m("m"); + m.define(R"( + def forward(self, input: Tensor): + return input + 1 + )"); + std::stringstream ss; + m._save_for_mobile(ss, {}, false, /*use_flatbuffer=*/true); + auto version = _get_model_bytecode_version(ss); + AT_ASSERT(version == caffe2::serialize::kProducedBytecodeVersion); + ss.seekg(0, ss.beg); + auto version_again = _get_model_bytecode_version(ss); + AT_ASSERT(version == version_again); +} +#endif + TEST(FlatbufferTest, Tuple) { Module m("m"); m.define(R"JIT( @@ -1135,5 +1183,110 @@ TEST(FlatbufferTest, OperatorTest2) { // NOLINT (use =delete in gtest) } } +Module jitModuleFromBuffer(void* data) { + auto* flatbuffer_module = mobile::serialization::GetMutableModule(data); + FlatbufferLoader loader; + mobile::Module mobilem = loader.parseModule(flatbuffer_module); + ExtraFilesMap files; + std::vector constants; + loader.extractJitSourceAndConstants(&files, &constants); + return jitModuleFromSourceAndConstants( + mobilem._ivalue(), files, constants, 8); +} + +#if defined(ENABLE_FLATBUFFER) +TEST(TestSourceFlatbuffer, UpsampleNearest2d) { + Module m("m"); + m.define(R"( + def forward(self, input: Tensor, scale:float): + return torch.upsample_nearest2d(input, [1, 1], float(scale), float(scale)) + )"); + + std::vector inputs; + inputs.emplace_back(torch::rand({1, 3, 128, 128})); + inputs.emplace_back(at::Scalar(2.0)); + auto ref = m.forward(inputs); + + std::stringstream ss; + m._save_for_mobile(ss, {}, false, /*use_fatbuffer=*/true); + auto mm = _load_for_mobile(ss); + auto m2 = load(ss); + + auto res = m2.forward(inputs); + auto resm = mm.forward(inputs); + + auto resd = res.toTensor(); + auto refd = ref.toTensor(); + auto resmd = resm.toTensor(); + ASSERT_TRUE(resd.equal(refd)); + ASSERT_TRUE(resmd.equal(refd)); +} +#endif + +TEST(TestSourceFlatbuffer, CheckAttrAccess) { + Module m("m"); + 
m.register_attribute("mobile_optimized", BoolType::get(), true); + auto data = save_jit_module_to_bytes(m); + Module m2 = jitModuleFromBuffer(data.data()); + bool mobile_optimized = m2.attr("mobile_optimized", false).toBool(); + AT_ASSERT(mobile_optimized); + mobile::Module m3 = parse_mobile_module(data.data(), data.size()); + mobile_optimized = m3.attr("mobile_optimized", false).toBool(); + AT_ASSERT(mobile_optimized); +} + +TEST(TestSourceFlatbuffer, + MethodInvocation) { // NOLINT (use =delete in gtest) + const std::vector test_programs{ + // test invoking a method with default parameter + R"( + def test_func(self, x, b : int = 4): + return self.foo + x + b + )", + // inner method call with default parameter (gets inlined) + R"( + def add_with_default_arg(self, x, b : int = 4): + return self.foo + x + b + def test_func(self, x): + return self.add_with_default_arg(x) # invoke method w/ default arg + )", + // simple method call + R"( + def test_func(self, x): + b = 4 + return self.foo + x + b + )", + }; + for (const auto& test_program : test_programs) { + Module m("m"); + m.register_parameter("foo", torch::ones({}), false); + m.define(test_program); + + const int fortyTwo = 42; // (keep linter happy) + auto minput = fortyTwo * torch::ones({}); + auto ref = m.run_method("test_func", minput); + + auto data = save_jit_module_to_bytes(m); + Module m2 = jitModuleFromBuffer(data.data()); + const auto& test_func = m2.get_method("test_func"); + IValue res; + for (int i = 0; i < 3; ++i) { + res = test_func({minput}); + } + auto resd = res.toTensor().item(); + auto refd = ref.toTensor().item(); + AT_ASSERT(resd == refd); + + mobile::Module m3 = parse_mobile_module(data.data(), data.size()); + const auto& test_func3 = m3.get_method("test_func"); + for (int i = 0; i < 3; ++i) { + res = test_func3({minput}); + } + resd = res.toTensor().item(); + refd = ref.toTensor().item(); + AT_ASSERT(resd == refd); + } +} + } // namespace jit } // namespace torch diff --git a/test/cpp/jit/test_graph_iterator.cpp b/test/cpp/jit/test_graph_iterator.cpp index 75edac875b190f..00d1f9a6a28c88 100644 --- a/test/cpp/jit/test_graph_iterator.cpp +++ b/test/cpp/jit/test_graph_iterator.cpp @@ -62,7 +62,7 @@ void assert_ordering( ASSERT_EQ(expected.size(), actual.size()) << "Got " << actual.size() << " elements (" << actual << ")" << " expected " << expected.size() << " elements (" << expected << ")"; - for (int i = 0; i < expected.size(); i++) { + for (unsigned i = 0; i < expected.size(); i++) { ASSERT_EQ(expected[i], actual[i]) << "Difference at index " << i << " in " << actual << " (expected " << actual << ")"; diff --git a/test/cpp/jit/test_lite_interpreter.cpp b/test/cpp/jit/test_lite_interpreter.cpp index 5e00eafa7382cb..a07cc8af5aa707 100644 --- a/test/cpp/jit/test_lite_interpreter.cpp +++ b/test/cpp/jit/test_lite_interpreter.cpp @@ -599,7 +599,7 @@ void runAndCheckTorchScriptModel( std::stringstream& input_model_stream, const std::vector& input_data, const std::vector& expect_result_list, - const int64_t expect_version) { + const uint64_t expect_version) { auto actual_version = _get_model_bytecode_version(input_model_stream); AT_ASSERT(actual_version == expect_version); @@ -616,7 +616,7 @@ void runAndCheckBytecodeModel( std::stringstream& input_model_stream, const std::vector& input_data, const std::vector& expect_result_list, - const int64_t expect_version) { + const uint64_t expect_version) { auto actual_version = _get_model_bytecode_version(input_model_stream); AT_ASSERT(actual_version == expect_version); @@ -634,13 
+634,14 @@ void backportAllVersionCheck( std::stringstream& test_model_file_stream, std::vector& input_data, std::vector& expect_result_list, - const int64_t expect_from_version) { + const uint64_t expect_from_version) { auto from_version = _get_model_bytecode_version(test_model_file_stream); AT_ASSERT(from_version == expect_from_version); + AT_ASSERT(from_version > 0); // Backport script_module_v5.ptl to an older version constexpr int64_t minimum_to_version = 4; - int64_t current_to_version = from_version - 1; + auto current_to_version = from_version - 1; // Verify all candidate to_version work as expected. All backport to version // larger than minimum_to_version should success. @@ -656,12 +657,14 @@ void backportAllVersionCheck( // Check backport model version auto backport_version = _get_model_bytecode_version(oss); + backport_version = _get_model_bytecode_version(oss); AT_ASSERT(backport_version == current_to_version); // Load and run the backport model, then compare the result with expect // result runAndCheckBytecodeModel( oss, input_data, expect_result_list, current_to_version); + oss.seekg(0, oss.beg); runAndCheckTorchScriptModel( oss, input_data, expect_result_list, current_to_version); @@ -715,7 +718,15 @@ TEST(LiteInterpreterTest, BackPortByteCodeModelAllVersions) { torch::jit::Module module_freeze = freeze(module); std::stringstream input_model_stream; +#if defined(ENABLE_FLATBUFFER) + module_freeze._save_for_mobile( + input_model_stream, + /*extra_files=*/{}, + /*save_mobile_debug_info=*/false, + /*use_flatbuffer=*/true); +#else module_freeze._save_for_mobile(input_model_stream); +#endif std::vector input_data = std::vector({torch::ones({1, 1, 28, 28})}); std::vector expect_result_list; @@ -991,7 +1002,6 @@ TEST(LiteInterpreterTest, ExtraFiles) { module->_save_for_mobile(oss, extra_files); std::istringstream iss(oss.str()); - caffe2::serialize::IStreamAdapter adapter{&iss}; std::unordered_map loaded_extra_files; loaded_extra_files["metadata.json"] = ""; torch::jit::_load_for_mobile(iss, torch::kCPU, loaded_extra_files); @@ -1006,7 +1016,7 @@ TEST(LiteInterpreterTest, ExtraFiles) { loaded_extra_files[file_name.substr(6)] = ""; } } - + iss.seekg(0, iss.beg); torch::jit::_load_for_mobile(iss, torch::kCPU, loaded_extra_files); ASSERT_EQ(loaded_extra_files["metadata.json"], "abc"); ASSERT_EQ(loaded_extra_files["mobile_info.json"], "{\"key\": 23}"); @@ -1186,7 +1196,6 @@ TEST(RunTimeTest, ParseOperator) { function.get()); parseOperators( std::move(*c10::ivalue::Tuple::create(operators)).elements(), - model_version, 1, function.get()); const size_t rsize = 5; @@ -1569,7 +1578,6 @@ TEST(RunTimeTest, RuntimeCall) { foo.get()); parseOperators( std::move(*c10::ivalue::Tuple::create(operatorsFoo)).elements(), - model_version, 1, foo.get()); parseConstants( @@ -1586,7 +1594,6 @@ TEST(RunTimeTest, RuntimeCall) { call.get()); parseOperators( std::move(*c10::ivalue::Tuple::create(operatorsCall)).elements(), - model_version, 1, call.get()); parseConstants( @@ -2090,10 +2097,7 @@ TEST(LiteInterpreterUpgraderTest, Upgrader) { if (byteCodeFunctionWithOperator.function.get_code().operators_.empty()) { for (const auto& op : byteCodeFunctionWithOperator.operators) { byteCodeFunctionWithOperator.function.append_operator( - op.name, - op.overload_name, - op.num_specified_args, - caffe2::serialize::kMaxSupportedFileFormatVersion); + op.name, op.overload_name, op.num_specified_args); } } upgrader_functions.push_back(byteCodeFunctionWithOperator.function); diff --git a/test/cpp/jit/test_lite_trainer.cpp 
b/test/cpp/jit/test_lite_trainer.cpp index cf3040f4fba46c..ede1c3a8355b48 100644 --- a/test/cpp/jit/test_lite_trainer.cpp +++ b/test/cpp/jit/test_lite_trainer.cpp @@ -158,6 +158,139 @@ TEST(MobileTest, SaveLoadParametersEmpty) { AT_ASSERT(mobile_params.size() == 0); } +TEST(MobileTest, SaveParametersDefaultsToZip) { + // Save some empty parameters. + std::map empty_parameters; + std::stringstream ss_data; + _save_parameters(empty_parameters, ss_data); + + // Verify that parameters were serialized to a ZIP container. + EXPECT_GE(ss_data.str().size(), 4); + EXPECT_EQ(ss_data.str()[0], 'P'); + EXPECT_EQ(ss_data.str()[1], 'K'); + EXPECT_EQ(ss_data.str()[2], '\x03'); + EXPECT_EQ(ss_data.str()[3], '\x04'); +} + +#if defined(ENABLE_FLATBUFFER) +TEST(MobileTest, SaveParametersCanUseFlatbuffer) { + // Save some empty parameters using flatbuffer. + std::map empty_parameters; + std::stringstream ss_data; + _save_parameters(empty_parameters, ss_data, /*use_flatbuffer=*/true); + + // Verify that parameters were serialized to a flatbuffer. The flatbuffer + // magic bytes should be at offsets 4..7. The first four bytes contain an + // offset to the actual flatbuffer data. + EXPECT_GE(ss_data.str().size(), 8); + EXPECT_EQ(ss_data.str()[4], 'P'); + EXPECT_EQ(ss_data.str()[5], 'T'); + EXPECT_EQ(ss_data.str()[6], 'M'); + EXPECT_EQ(ss_data.str()[7], 'F'); +} +#else // !defined(ENABLE_FLATBUFFER) +TEST(MobileTest, SaveParametersThrowsWithoutFlatbufferSupport) { + // Some empty parameters to try saving. + std::map empty_parameters; + std::stringstream ss_data; + + // Save using flatbuffers should fail when support isn't compiled in. Make + // sure we get the exception that explicitly mentions the lack of flatbuffer + // support. + try { + _save_parameters(empty_parameters, ss_data, /*use_flatbuffer=*/true); + FAIL() << "_save_parameters should have thrown"; + } catch (const ::c10::Error& e) { + static const std::string kExpectedSubstring = + "build hasn't enabled flatbuffer"; + EXPECT_TRUE( + std::string(e.msg()).find(kExpectedSubstring) != std::string::npos) + << "Exception message does not contain expected substring \"" + << kExpectedSubstring << "\": actual message \"" << e.msg() << "\""; + } catch (...) { + FAIL() << "Unexpected exception type"; + } +} +#endif // !defined(ENABLE_FLATBUFFER) + +#if defined(ENABLE_FLATBUFFER) +TEST(MobileTest, SaveLoadParametersUsingFlatbuffers) { + // Create some simple parameters to save. + std::map input_params; + input_params["four_by_ones"] = 4 * torch::ones({}); + input_params["three_by_ones"] = 3 * torch::ones({}); + + // Serialize them using flatbuffers. + std::stringstream data; + _save_parameters(input_params, data, /*use_flatbuffer=*/true); + + // The flatbuffer magic bytes should be at offsets 4..7. + EXPECT_EQ(data.str()[4], 'P'); + EXPECT_EQ(data.str()[5], 'T'); + EXPECT_EQ(data.str()[6], 'M'); + EXPECT_EQ(data.str()[7], 'F'); + + // Read them back and check that they survived the trip. + auto output_params = _load_parameters(data); + EXPECT_EQ(output_params.size(), 2); + { + auto four_by_ones = 4 * torch::ones({}); + EXPECT_EQ( + output_params["four_by_ones"].item(), four_by_ones.item()); + } + { + auto three_by_ones = 3 * torch::ones({}); + EXPECT_EQ( + output_params["three_by_ones"].item(), three_by_ones.item()); + } +} +#else // !defined(ENABLE_FLATBUFFER) +TEST(MobileTest, LoadParametersFailsWithoutFlatbufferSupport) { + // Create some data that looks like a flatbuffer header. 
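// Editorial sketch of the container sniffing these parameter tests rely on: ZIP
// archives begin with the local-file-header magic "PK\x03\x04", while the flatbuffer
// mobile format carries its "PTMF" file identifier at byte offsets 4..7 (the first
// four bytes hold an offset to the actual flatbuffer data, as the tests above note).
// The helpers below are illustrative only, not PyTorch APIs.
#include <string>

inline bool looks_like_zip(const std::string& data) {
  return data.size() >= 4 && data.compare(0, 4, "PK\x03\x04") == 0;
}

inline bool looks_like_flatbuffer(const std::string& data) {
  return data.size() >= 8 && data.compare(4, 4, "PTMF") == 0;
}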
+ std::stringstream data; + data << "abcd" + << "PTMF" // Flatbuffer magic + << "ijkl"; + + // Loading the "flatbuffer" data should fail. Make sure we see the expected + // exception, not just any exception; since this isn't properly-formed + // flatbuffer data, any attempt to parse it might throw a different error type + // or message, but we don't expect anyone to try parsing it. + try { + _load_parameters(data); + FAIL() << "_load_parameters should have thrown"; + } catch (const ::c10::Error& e) { + static const std::string kExpectedSubstring = + "build hasn't enabled flatbuffer"; + EXPECT_TRUE( + std::string(e.msg()).find(kExpectedSubstring) != std::string::npos) + << "Exception message does not contain expected substring \"" + << kExpectedSubstring << "\": actual message \"" << e.msg() << "\""; + } catch (...) { + FAIL() << "Unexpected exception type"; + } +} +#endif // !defined(ENABLE_FLATBUFFER) + +TEST(MobileTest, LoadParametersUnexpectedFormatShouldThrow) { + // Manually create some data that doesn't look like a ZIP or Flatbuffer file. + // Make sure it's longer than 8 bytes, since getFileFormat() needs that much + // data to detect the type. + std::stringstream bad_data; + bad_data << "abcd" + << "efgh" + << "ijkl"; + + // Loading parameters from it should throw an exception. + EXPECT_ANY_THROW(_load_parameters(bad_data)); +} + +TEST(MobileTest, LoadParametersEmptyDataShouldThrow) { + // Loading parameters from an empty data stream should throw an exception. + std::stringstream empty; + EXPECT_ANY_THROW(_load_parameters(empty)); +} + TEST(LiteTrainerTest, SGD) { Module m("m"); m.register_parameter("foo", torch::ones({1}, at::requires_grad()), false); diff --git a/test/cpp/jit/test_misc.cpp b/test/cpp/jit/test_misc.cpp index 9ccbb77ff71531..244ea96bf3e99f 100644 --- a/test/cpp/jit/test_misc.cpp +++ b/test/cpp/jit/test_misc.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include @@ -47,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1381,6 +1383,29 @@ TEST(ThreadLocalDebugInfoTest, Basic) { } } +TEST(TestSymInt, NarrowCopyWithSymbolicInt) { + static const size_t LENGTH = 5; + auto a = at::randn({10}, at::kCPU); + c10::SymInt si(LENGTH); + auto b = a.narrow_copy(0, 0, si); + auto c = a.narrow(0, 0, LENGTH); + ASSERT_TRUE(torch::allclose(b, c)); +} + +TEST(TestSymInt, NarrowCopy) { + static const size_t LENGTH = 5; + auto a = at::randn({10}, at::kCPU); + auto b = a.narrow_copy(0, 0, LENGTH); + auto c = a.narrow(0, 0, LENGTH); + ASSERT_TRUE(torch::allclose(b, c)); +} + +TEST(TestSymInt, AddSymbolicInt) { + c10::SymInt a(5); + c10::SymInt b(3); + ASSERT_TRUE((a + b).expect_int() == 8); +} + TEST(FallbackGraphsTest, Basic) { static const auto nestGraphIntoFallbackGraph = [](const std::shared_ptr& graph) { @@ -2913,6 +2938,74 @@ graph(%x.1 : Tensor): testing::FileCheck().check_not("aten::relu(")->run(*graph); } +TEST(TestFunctionExecutor, SimpleExecutorTest) { + auto graph = std::make_shared(); + parseIR( + R"IR( +graph(%x.1 : Tensor): + %2 : int = prim::Constant[value=1]() + %x.3 : Tensor = aten::add(%x.1, %2, %2) + %y : Tensor = aten::relu(%x.3) + return (%y))IR", + &*graph); + { + auto func = torch::make_unique( + "name", graph, [](GraphFunction&) {}, ExecutorExecutionMode::PROFILING); + auto a = at::rand({2, 2, 2}, TensorOptions(kCPU).dtype(at::kFloat)); + Stack stack = {a}; + func->run(stack); + auto g = lastExecutedOptimizedGraph(); + testing::FileCheck() + .check("prim::profile") + ->check("aten::add") + 
->check("aten::relu") + ->run(*g); + } + { + auto func = torch::make_unique( + "name", graph, [](GraphFunction&) {}, ExecutorExecutionMode::SIMPLE); + auto a = at::rand({2, 2, 2}, TensorOptions(kCPU).dtype(at::kFloat)); + Stack stack = {a}; + func->run(stack); + auto g = func->getDebugState().graph; + testing::FileCheck() + .check_not("prim::profile") + ->check("aten::add") + ->check("aten::relu") + ->run(*g); + } +} + +TEST(TestFunctionExecutor, RunDecompositionTest) { + GraphFunction* func; + std::once_flag flag1; + for (bool unbiased : {true, false}) { + std::call_once(flag1, [&]() { + // NB: take reference to schema here, `auto schema =` will not work + auto& schema = getOperatorForLiteral( + "aten::var(Tensor self, bool unbiased=True) -> Tensor") + ->schema(); + auto maybe_func = GetDecompositionFunction(schema); + TORCH_INTERNAL_ASSERT(maybe_func); + func = *maybe_func; + }); + auto input = at::rand({4, 4}); + Stack stack = {input, unbiased}; + func->run(stack); + at::Tensor out = pop(stack).toTensor(); + ASSERT_TRUE(at::allclose(out, input.var(unbiased))); + } +} + +TEST(TestShapeGraphLinting, Basic) { + auto schemas = RegisteredShapeComputeSchemas(); + for (const auto& schema : schemas) { + auto g = shapeComputeGraphForSchema(*schema); + TORCH_INTERNAL_ASSERT(g); + LintShapeComputeGraph(schema, *g); + } +} + // TODO: move to test_kernel when global settings are explicit // fusion parameters class Composed : public ::testing::Test { diff --git a/test/cpp/jit/test_save_load.cpp b/test/cpp/jit/test_save_load.cpp index 88bff7ea93e885..c7d631baa7a232 100644 --- a/test/cpp/jit/test_save_load.cpp +++ b/test/cpp/jit/test_save_load.cpp @@ -3,7 +3,9 @@ #include #include +#include #include +#include #include #include #include @@ -13,6 +15,20 @@ namespace torch { namespace jit { +namespace { + +Module roundtripThroughMobile(const Module& m) { + ExtraFilesMap files; + std::vector constants; + jitModuleToPythonCodeAndConstants(m, &files, &constants); + CompilationOptions options; + mobile::Module mobilem = jitModuleToMobile(m, options); + return jitModuleFromSourceAndConstants( + mobilem._ivalue(), files, constants, 8); +} + +} // namespace + TEST(SerializationTest, ExtraFilesHookPreference) { // Tests that an extra file written explicitly has precedence over // extra files written by a hook @@ -149,5 +165,78 @@ TEST(SerializationTest, TestJitStream_CUDA) { // Check if both the output tensors are equal ASSERT_TRUE(op.equal(c)); } + +TEST(TestSourceRoundTrip, UpsampleNearest2d) { + Module m("m"); + m.define(R"( + def forward(self, input: Tensor, scale:float): + return torch.upsample_nearest2d(input, [1, 1], float(scale), float(scale)) + )"); + + std::vector inputs; + inputs.emplace_back(torch::rand({1, 3, 128, 128})); + inputs.emplace_back(at::Scalar(2.0)); + auto ref = m.forward(inputs); + + Module m2 = roundtripThroughMobile(m); + auto res = m2.forward(inputs); + + auto resd = res.toTensor(); + auto refd = ref.toTensor(); + ASSERT_TRUE(resd.equal(refd)); +} + +TEST(TestSourceRoundTrip, CheckAttrAccess) { + Module m("m"); + m.register_attribute("mobile_optimized", BoolType::get(), true); + Module m2 = roundtripThroughMobile(m); + bool mobile_optimized = m2.attr("mobile_optimized", false).toBool(); + AT_ASSERT(mobile_optimized); +} + +TEST(TestSourceRoundTrip, + MethodInvocation) { // NOLINT (use =delete in gtest) + const std::vector test_programs{ + // test invoking a method with default parameter + R"( + def test_func(self, x, b : int = 4): + return self.foo + x + b + )", + // inner method call 
with default parameter (gets inlined) + R"( + def add_with_default_arg(self, x, b : int = 4): + return self.foo + x + b + def test_func(self, x): + return self.add_with_default_arg(x) # invoke method w/ default arg + )", + // simple method call + R"( + def test_func(self, x): + b = 4 + return self.foo + x + b + )", + }; + for (const auto& test_program : test_programs) { + Module m("m"); + m.register_parameter("foo", torch::ones({}), false); + m.define(test_program); + + const int fortyTwo = 42; // (keep linter happy) + auto minput = fortyTwo * torch::ones({}); + auto ref = m.run_method("test_func", minput); + + Module m2 = roundtripThroughMobile(m); + const auto& test_func = m2.get_method("test_func"); + IValue res; + for (int i = 0; i < 3; ++i) { + res = test_func({minput}); + } + + auto resd = res.toTensor().item(); + auto refd = ref.toTensor().item(); + AT_ASSERT(resd == refd); + } +} + } // namespace jit } // namespace torch diff --git a/test/cpp/jit/test_shape_analysis.cpp b/test/cpp/jit/test_shape_analysis.cpp index baf9f16e6e79dd..b3157c09e8bee4 100644 --- a/test/cpp/jit/test_shape_analysis.cpp +++ b/test/cpp/jit/test_shape_analysis.cpp @@ -30,7 +30,6 @@ Node* findNode(std::shared_ptr& g, Symbol k) { } TORCH_INTERNAL_ASSERT(false, "Couldn't find node"); } - } // namespace TEST(ShapeAnalysisTest, DynamicShapesFusion) { @@ -292,5 +291,66 @@ TEST(ShapeAnalysisTest, MovingConstantOutOfFusionGroups) { ->run(*g); } +namespace { + +void assertShapeEqual( + c10::optional>& actual, + std::vector> expected) { + ASSERT_TRUE(actual.has_value()); + ASSERT_EQ(actual->size(), 1); + auto a_canonical = CanonicalizedSymbolicShape(actual->at(0)); + + auto symb_expected = c10::SymbolicShape(expected); + auto b_canonical = CanonicalizedSymbolicShape(symb_expected); + ASSERT_EQ(a_canonical, b_canonical); +} + +} // namespace + +TEST(ShapeAnalysisTest, SymbolicShapeAPI) { + // Figure out how to fetch a function schema + + // Ask someone else how to create a function schema / operator in C++ + std::shared_ptr op = getOperatorForLiteral( + "aten::sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"); + const FunctionSchema* schema = &(op->schema()); + + c10::IValue const_size_1 = std::vector{64, 56, 56}; + c10::IValue const_size_2 = std::vector{1, 56, 56}; + + // Check vector initializer list syntax + c10::optional sym_dim = c10::nullopt; + c10::SymbolicShape ss_concrete = + std::vector>{1, 56, 56}; + c10::SymbolicShape ss1 = std::vector>{sym_dim, 56, 56}; + c10::SymbolicShape ss2 = + std::vector>{64, sym_dim, sym_dim}; + c10::SymbolicShape ss3 = + std::vector>{sym_dim, sym_dim, sym_dim, sym_dim}; + + auto res = calculateSymbolicShapesOnOp( + schema, std::vector{const_size_1, const_size_1}); + assertShapeEqual(res, {64, 56, 56}); + + res = calculateSymbolicShapesOnOp( + schema, std::vector{const_size_1, const_size_2}); + assertShapeEqual(res, {64, 56, 56}); + + res = calculateSymbolicShapesOnOp( + schema, std::vector{const_size_1, ss1}); + assertShapeEqual(res, {64, 56, 56}); + + res = calculateSymbolicShapesOnOp( + schema, std::vector{const_size_2, ss1}); + assertShapeEqual(res, {sym_dim, 56, 56}); + + res = calculateSymbolicShapesOnOp( + schema, std::vector{ss_concrete, ss2}); + assertShapeEqual(res, {64, 56, 56}); + + res = calculateSymbolicShapesOnOp(schema, std::vector{ss2, ss3}); + assertShapeEqual(res, {sym_dim, 64, sym_dim, sym_dim}); +} + } // namespace jit } // namespace torch diff --git a/test/cpp/jit/test_utils.h b/test/cpp/jit/test_utils.h index 3d2ff4b159ca24..89a8959c424ff5 
100644 --- a/test/cpp/jit/test_utils.h +++ b/test/cpp/jit/test_utils.h @@ -17,39 +17,31 @@ static inline void trim(std::string& s) { [](unsigned char ch) { return !std::isspace(ch); }) .base(), s.end()); - for (int64_t i = 0; i < s.size(); ++i) { - if (s[i] == '\n') { + for (size_t i = 0; i < s.size(); ++i) { + while (i < s.size() && s[i] == '\n') { s.erase(i, 1); - i--; } } - for (int64_t i = 0; i < s.size(); ++i) { + for (size_t i = 0; i < s.size(); ++i) { if (s[i] == ' ') { - for (int64_t j = i + 1; j < s.size(); j++) { - if (s[j] == ' ') { - s.erase(j, 1); - j--; - } else { - break; - } + while (i + 1 < s.size() && s[i + 1] == ' ') { + s.erase(i + 1, 1); } } } } } // namespace -#define ASSERT_THROWS_WITH_MESSAGE(statement, substring) \ - try { \ - (void)statement; \ - FAIL(); \ - } catch (const std::exception& e) { \ - std::string substring_s(substring); \ - trim(substring_s); \ - auto exception_string = std::string(e.what()); \ - trim(exception_string); \ - ASSERT_NE(exception_string.find(substring_s), std::string::npos) \ - << " Error was: \n" \ - << exception_string; \ +#define ASSERT_THROWS_WITH_MESSAGE(statement, substring) \ + try { \ + (void)statement; \ + FAIL(); \ + } catch (const std::exception& e) { \ + std::string substring_s(substring); \ + trim(substring_s); \ + auto exception_string = std::string(e.what()); \ + trim(exception_string); \ + ASSERT_NE(exception_string.find(substring_s), std::string::npos); \ } namespace torch { diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl index be67cecf970508..ddee6be4c35afb 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl index e5663224ac7603..cb36f9aeba8bc6 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl index 8698001427a93f..443074fe7130cd 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl index c52d92b29f44cc..ac8b1b918de7c5 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl index 749614fa53097d..323aa42dde4ec2 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl index 
b20c456058be63..6d06dea6b5896d 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl index f33f3a8cf8de35..4fd551d073aebf 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl index ac7cc7479e7988..9680713a83e280 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl index 0b70614b09366b..0381636677b52b 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl differ diff --git a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl index 5f6ae1a90b1e7b..21792d35b8924f 100644 Binary files a/test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl and b/test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl differ diff --git a/test/cpp/lazy/CMakeLists.txt b/test/cpp/lazy/CMakeLists.txt index ede4308816cfeb..9360247a4d3126 100644 --- a/test/cpp/lazy/CMakeLists.txt +++ b/test/cpp/lazy/CMakeLists.txt @@ -9,9 +9,15 @@ set(LAZY_TEST_SRCS ${LAZY_TEST_ROOT}/test_misc.cpp ${LAZY_TEST_ROOT}/test_permutation_util.cpp ${LAZY_TEST_ROOT}/test_shape.cpp - ${LAZY_TEST_ROOT}/test_tensor_impl.cpp + ${LAZY_TEST_ROOT}/test_symbolic_shape.cpp ${LAZY_TEST_ROOT}/test_util.cpp ) +if(BUILD_LAZY_TS_BACKEND) + list(APPEND LAZY_TEST_SRCS + ${LAZY_TEST_ROOT}/test_lazy_ops.cpp + ${LAZY_TEST_ROOT}/test_lazy_ops_util.cpp + ) +endif() add_executable(test_lazy ${TORCH_ROOT}/test/cpp/common/main.cpp diff --git a/test/cpp/lazy/test_backend_device.cpp b/test/cpp/lazy/test_backend_device.cpp index b75f0512d38787..f8ce49b9e287dd 100644 --- a/test/cpp/lazy/test_backend_device.cpp +++ b/test/cpp/lazy/test_backend_device.cpp @@ -74,9 +74,13 @@ TEST(BackendDeviceTest, FromAten) { auto device = c10::Device(c10::kCPU); EXPECT_THROW(atenDeviceToBackendDevice(device), c10::Error); - // TODO(alanwaketan): Update the following test once we have TorchScript backend upstreamed. device = c10::Device(c10::kLazy); +#ifndef FBCODE_CAFFE2 + auto backend_device = atenDeviceToBackendDevice(device); +#else + // Lazy Tensor is disabled in FBCODE until addressing non-virtual methods (e.g. 
sizes) in TensorImpl EXPECT_THROW(atenDeviceToBackendDevice(device), c10::Error); +#endif // FBCODE_CAFFE2 } TEST(BackendDeviceTest, ToAten) { diff --git a/test/cpp/lazy/test_cache.cpp b/test/cpp/lazy/test_cache.cpp index a6da9bccbd25e0..53bd6af147ebaf 100644 --- a/test/cpp/lazy/test_cache.cpp +++ b/test/cpp/lazy/test_cache.cpp @@ -4,6 +4,8 @@ #include #include #include +#include +#include namespace torch { namespace lazy { @@ -22,7 +24,6 @@ class CacheNode : public Node { const Output& operand(size_t i) const override { TORCH_INTERNAL_ASSERT(false, "Can't access operand[i] of test node"); } - private: std::string str_; }; @@ -59,5 +60,32 @@ TEST(CacheTest, BasicTest) { EXPECT_EQ(cache.Get(c->node_hash()), nullptr); } +class CacheNodeWithShape : public TsNode { + public: + explicit CacheNodeWithShape(const Shape& shape) + : TsNode(OpKind(), shape, /* num_outputs */ 1, /* seed */ 0){} +}; + +TEST(CacheTest, ShapeCacheTestForDynamicShape) { + // enable dynamic shape + FLAGS_ltc_enable_dynamic_shapes = true; + + CacheNodeWithShape nodes[] = { + CacheNodeWithShape(Shape(c10::kFloat, {2, 4})), + CacheNodeWithShape(Shape(c10::kFloat, {4, 2})) }; + + /* + * Make sure the cached shape for node (2, 4) is not used for node (4, 2) + */ + for (auto& node : nodes) { + EXPECT_EQ(node.shape(), node.GetOpShape([&]() { + return node.shape(); + })); + } + + // reset the flag + FLAGS_ltc_enable_dynamic_shapes = false; +} + } // namespace lazy } // namespace torch diff --git a/test/cpp/lazy/test_ir.cpp b/test/cpp/lazy/test_ir.cpp index 326f7a9092c00c..d07530b29e7dfd 100644 --- a/test/cpp/lazy/test_ir.cpp +++ b/test/cpp/lazy/test_ir.cpp @@ -1,10 +1,15 @@ #include +#include #include #include #include +#include #include #include +#include +#include +#include namespace torch { namespace lazy { @@ -23,7 +28,6 @@ class TestLeafNode : public Node { const Output& operand(size_t i) const override { TORCH_INTERNAL_ASSERT(false, "Can't access operand[i] of leaf node"); } - private: size_t param_; }; @@ -51,22 +55,22 @@ TEST(IrTest, MetaDataTest) { node = MakeNode(1); auto metaWithEmptyDebug = node->metadata(); EXPECT_EQ(metaWithEmptyDebug.scope.size(), 0); - EXPECT_EQ(metaWithEmptyDebug.frame_info.size(), 0); + EXPECT_EQ(metaWithEmptyDebug.frame_info.size(), 1); { ScopePusher scope("TestScope"); node = MakeNode(1); auto metaWithScope = node->metadata(); EXPECT_EQ(metaWithScope.scope, "TestScope.1"); - EXPECT_EQ(metaWithScope.frame_info.size(), 0); + EXPECT_EQ(metaWithScope.frame_info.size(), 1); } SourceLocation dummySourceLocation; dummySourceLocation.file = "file"; dummySourceLocation.function = "function"; dummySourceLocation.line = 10; - RegisterGetFrameInfo( - [&]() -> std::vector { return {dummySourceLocation}; }); + GetPythonFramesFunction() = + [&]() -> std::vector { return {dummySourceLocation}; }; node = MakeNode(1); auto metaWithSourceLoc = node->metadata(); EXPECT_EQ(metaWithSourceLoc.scope.size(), 0); @@ -77,7 +81,7 @@ TEST(IrTest, MetaDataTest) { FLAGS_torch_lazy_ir_debug = restore_FLAGS_torch_lazy_ir_debug; } -TEST(IrTest, TsNode) { +TEST(IrTest, TsNodeTest) { NodePtr node1 = MakeNode( OpKind(at::aten::view), Shape(), @@ -96,5 +100,28 @@ TEST(IrTest, TsNode) { EXPECT_TRUE(leafptr != nullptr); } +TEST(IrTest, DimensionNodeTest) { + + const size_t DIM0 = 5; + const size_t DIM1 = 8; + NodePtr node1 = MakeNode( + OpKind(at::aten::view), + Shape(c10::kFloat, {DIM0, DIM1}), + /*num_outputs*/ 1, + /*hash_seed*/ kHashSeed); + + auto size0 = std::dynamic_pointer_cast(MakeNode(Value{node1}, 0)); + auto size1 = 
std::dynamic_pointer_cast(MakeNode(Value{node1}, 1)); + + ASSERT_EQ(DIM0, size0->getStaticValue()); + ASSERT_EQ(DIM1, size1->getStaticValue()); + + auto add_dim = std::dynamic_pointer_cast(MakeNode(Value{size0}, Value{size1})); + ASSERT_EQ(DIM0 + DIM1, add_dim->getStaticValue()); + + auto mul_dim = std::dynamic_pointer_cast(MakeNode(Value{size0}, Value{size1})); + ASSERT_EQ(DIM0 * DIM1, mul_dim->getStaticValue()); +} + } // namespace lazy } // namespace torch diff --git a/test/cpp/lazy/test_ir_util.cpp b/test/cpp/lazy/test_ir_util.cpp index bb29cff6f6b316..6c85c0184323c1 100644 --- a/test/cpp/lazy/test_ir_util.cpp +++ b/test/cpp/lazy/test_ir_util.cpp @@ -22,18 +22,6 @@ class IrUtilNode : public Node { operands_as_outputs_.emplace_back(v.node.get(), v.index); operands_.push_back(std::move(v.node)); } - - const std::vector& operands() const override { - return operands_as_outputs_; - } - - const Output& operand(size_t i) const override { - return operands_as_outputs_.at(i); - } - - private: - std::vector operands_; - std::vector operands_as_outputs_; }; /* a diff --git a/test/cpp/lazy/test_lazy_ops.cpp b/test/cpp/lazy/test_lazy_ops.cpp new file mode 100644 index 00000000000000..c1319429c1811c --- /dev/null +++ b/test/cpp/lazy/test_lazy_ops.cpp @@ -0,0 +1,10727 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace torch { +namespace lazy { + + +// Lazy Tensor is disabled in FBCODE until addressing non-virtual methods (e.g. sizes) in TensorImpl +#ifndef FBCODE_CAFFE2 + +namespace { + // This registers the torchscript backend, without which lazy device won't work +static bool inline init_backend(){ + torch::lazy::InitTorchScriptBackend(); + return true; +} +static const bool backend_initialized = init_backend(); + +} + +class LazyTsTest : public ::testing::Test { + protected: + void SetUp() override; + + void TearDown() override; + + static void CommonSetup() {} + + void ExpectCounterNotChanged( + const std::string& counter_regex, + const std::unordered_set* ignore_set) {} + + void ExpectCounterChanged(const std::string& counter_regex, + const std::unordered_set* ignore_set) { + } + + void ResetCounters() {} + + private: + void MakeEndSnapshot() {} +}; + +class LazyOpsTestBase : public LazyTsTest { + protected: + static void SetUpTestCase() {} +}; + +void LazyTsTest::SetUp() { + (void)backend_initialized; // avoid unused parameter warning + at::manual_seed(42); + torch::lazy::LazyGraphExecutor::Get()->SetRngSeed(torch::lazy::BackendDevice(), 42); +} + +void LazyTsTest::TearDown() {} + +namespace { +using torch::lazy::DebugUtil; + +class LazyOpsTest : public LazyOpsTestBase {}; + +static inline bool IsCuda() { + return torch::lazy::getBackend()->EagerFallbackDeviceType() == at::kCUDA; +} + +static inline at::DeviceType DefaultDevice() { + return torch::lazy::getBackend()->EagerFallbackDeviceType(); +} + + +} // namespace + +TEST_F(LazyOpsTest, TestScalarTensor) { + torch::Tensor scalar_tensor = torch::scalar_tensor( + 1., torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_scalar_tensor = torch::scalar_tensor( + 1., torch::TensorOptions(torch::kFloat).device(torch::kLazy)); + AllClose(scalar_tensor, lazy_scalar_tensor); + }); +} + +TEST_F(LazyOpsTest, TestClone) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = 
CopyToDevice(a, device); + torch::Tensor lazy_b = lazy_a.clone(); + AllClose(a, lazy_b); + lazy_a.add_(1.0); + AllClose(a, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestTo) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestIsFloatingPoint) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + bool is_float = torch::is_floating_point(a); + bool lazy_is_float = torch::is_floating_point(lazy_a); + EXPECT_EQ(is_float, lazy_is_float); + }); +} + +TEST_F(LazyOpsTest, TestIsSigned) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + bool is_signed = torch::is_signed(a); + bool lazy_is_signed = torch::is_signed(lazy_a); + EXPECT_EQ(is_signed, lazy_is_signed); + }); +} + +TEST_F(LazyOpsTest, TestCastByte) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Byte(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Byte(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCastChar) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Char(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Char(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCastShort) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Short(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Short(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCastInt) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Int(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Int(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCastLong) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Long(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Long(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCastFloat) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::_cast_Float(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::_cast_Float(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestRetainType) { + torch::Tensor 
lazy_a = torch::zeros( + {2, 2}, torch::TensorOptions(torch::kByte).device(torch::kLazy)); + torch::Tensor lazy_b = torch::ones( + {2, 2}, torch::TensorOptions(torch::kByte).device(torch::kLazy)); + torch::Tensor lazy_c = lazy_a + lazy_b; + EXPECT_EQ(lazy_c.scalar_type(), torch::ScalarType::Byte); +} + +TEST_F(LazyOpsTest, TestLogicalTypeWithInterop) { + torch::Tensor query = + torch::rand({2, 12, 20, 64}, + torch::TensorOptions(torch::kFloat).device(torch::kLazy)); + torch::Tensor key = + torch::rand({2, 12, 64, 20}, + torch::TensorOptions(torch::kFloat).device(torch::kLazy)); + torch::Tensor scores = + torch::matmul(query, key) / + torch::scalar_tensor( + 8, torch::TensorOptions(torch::kDouble).device(torch::kLazy)); + torch::Tensor p_attn = torch::softmax(scores, /*dim=*/-1); + EXPECT_EQ(p_attn.scalar_type(), torch::ScalarType::Float); +} + +TEST_F(LazyOpsTest, TestAdd) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::add(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::add(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddHalf) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kHalf).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kHalf).device(DefaultDevice())); + torch::Tensor c = torch::add(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::add(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddMixedPrecision) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kHalf).device(DefaultDevice())); + torch::Tensor c = torch::add(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::add(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddInPlace) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor c = a.add_(b); + torch::Tensor lazy_c = lazy_a.add_(lazy_b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddScalar) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar b(1); + torch::Tensor c = torch::add(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_c = torch::add(lazy_a, b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddScalarInPlace) { + torch::Scalar b(1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.add_(b); + torch::Tensor lazy_c = lazy_a.add_(b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestAddZeroSizeDim) { + torch::Tensor a = torch::rand( + {0, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {1, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::add(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::add(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestSub) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::sub(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::sub(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestSubInPlace) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor c = a.sub_(b); + torch::Tensor lazy_c = lazy_a.sub_(lazy_b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestSubScalar) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar b(1); + torch::Tensor c = torch::sub(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_c = torch::sub(lazy_a, b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestSubScalarInPlace) { + torch::Scalar b(1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.sub_(b); + torch::Tensor lazy_c = lazy_a.sub_(b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMul) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::mul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::mul(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMulInPlace) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor c = 
a.mul_(b); + torch::Tensor lazy_c = lazy_a.mul_(lazy_b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMulScalar) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar b(3); + torch::Tensor c = torch::mul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_c = torch::mul(lazy_a, b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMulScalarInPlace) { + torch::Scalar b(3); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.mul_(b); + torch::Tensor lazy_c = lazy_a.mul_(b); + AllClose(a, lazy_a); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestDiv) { + for (torch::ScalarType scalar_type1 : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type1) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type1)) + : torch::randint(0, 100, {3, 4}, + torch::TensorOptions(scalar_type1)); + for (torch::ScalarType scalar_type2 : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor b = + isFloatingType(scalar_type2) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type2)) + : torch::randint(1, 100, {3, 4}, + torch::TensorOptions(scalar_type2)); + torch::Tensor c = torch::div(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::div(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); + } + } +} + +TEST_F(LazyOpsTest, TestDivWithRoundingMode) { + c10::optional rounding_modes[] = {"trunc", "floor", + c10::nullopt}; + for (const auto& rounding_mode : rounding_modes) { + for (torch::ScalarType scalar_type1 : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + int lower_bound = (scalar_type1 == torch::kByte) ? 0 : -100; + torch::Tensor a = + isFloatingType(scalar_type1) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type1)) + : torch::randint(lower_bound, 50, {3, 4}, + torch::TensorOptions(scalar_type1)); + for (torch::ScalarType scalar_type2 : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, + torch::kInt, torch::kLong}) { + torch::Tensor b = + isFloatingType(scalar_type2) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type2)) + : torch::randint(51, 100, {3, 4}, + torch::TensorOptions(scalar_type2)); + torch::Tensor c = torch::div(a, b, rounding_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::div(lazy_a, lazy_b, rounding_mode); + AllClose(c, lazy_c); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestDivInPlace) { + for (torch::ScalarType scalar_type1 : {torch::kFloat}) { + torch::Tensor a = + isFloatingType(scalar_type1) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type1)) + : torch::randint(0, 100, {3, 4}, + torch::TensorOptions(scalar_type1)); + for (torch::ScalarType scalar_type2 : {torch::kFloat}) { + torch::Tensor b = + isFloatingType(scalar_type2) + ? 
torch::rand({3, 4}, torch::TensorOptions(scalar_type2)) + : torch::randint(1, 100, {3, 4}, + torch::TensorOptions(scalar_type2)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.div_(b); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.div_(lazy_b); + ; + AllClose(c, lazy_c); + }); + } + } +} + +TEST_F(LazyOpsTest, TestDivInPlaceWithRoundingMode) { + c10::optional rounding_modes[] = {"trunc", "floor", + c10::nullopt}; + for (const auto& rounding_mode : rounding_modes) { + for (torch::ScalarType scalar_type1 : {torch::kFloat}) { + torch::Tensor a = + isFloatingType(scalar_type1) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type1)) + : torch::randint(-100, 100, {3, 4}, + torch::TensorOptions(scalar_type1)); + for (torch::ScalarType scalar_type2 : {torch::kFloat}) { + torch::Tensor b = + isFloatingType(scalar_type2) + ? torch::rand({3, 4}, torch::TensorOptions(scalar_type2)) + : torch::randint(1, 100, {3, 4}, + torch::TensorOptions(scalar_type2)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.div_(b, rounding_mode); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.div_(lazy_b, rounding_mode); + AllClose(c, lazy_c); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestDivScalar) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 1, 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool is_float : {true, false}) { + torch::Scalar b = is_float ? torch::Scalar(3.0) : torch::Scalar(3); + torch::Tensor c = torch::div(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_c = torch::div(lazy_a, b); + AllClose(c, lazy_c); + }); + } + } +} + +TEST_F(LazyOpsTest, TestDivScalarInPlace) { + for (torch::ScalarType scalar_type : {torch::kFloat}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 1, 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool is_float : {true, false}) { + torch::Scalar b = is_float ? 
torch::Scalar(3.0) : torch::Scalar(3); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.div_(b); + torch::Tensor lazy_c = lazy_a.div_(b); + AllClose(c, lazy_c); + }); + } + } +} + +TEST_F(LazyOpsTest, TestDivOut) { + for (torch::ScalarType scalar_type : {torch::kFloat, torch::kDouble}) { + torch::Tensor a = torch::rand( + {3, 4}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 4}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 4}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::div_out(c, a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::empty({3, 4}, lazy_b.options()); + torch::div_out(lazy_c, lazy_a, lazy_b); + AllClose(c, lazy_c); + }); + } +} + +TEST_F(LazyOpsTest, TestRsubScalar) { + torch::Tensor input = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar other(1.5); + torch::Scalar alpha(2.5); + torch::Tensor result = torch::rsub(input, other, alpha); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::rsub(lazy_input, other, alpha); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestNe) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::ne(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::ne(lazy_a, lazy_b); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestNeInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = a.clone(); + b[0] += 1; + a.ne_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.ne_(lazy_b); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestEq) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + torch::Tensor c = torch::eq(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::eq(lazy_a, lazy_b); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestEqInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + b[0] += 1; + torch::Tensor a_copy = a.clone(); + a.eq_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.eq_(lazy_b); + AllClose(lazy_a, a); + }); +} + +TEST_F(LazyOpsTest, TestGe) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + torch::Tensor c = torch::ge(a, b); + 
ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::ge(lazy_a, lazy_b); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestGeInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + b[0] += 1; + b[1] -= 1; + torch::Tensor a_copy = a.clone(); + a.ge_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.ge_(lazy_b); + AllClose(lazy_a, a); + }); +} + +TEST_F(LazyOpsTest, TestLe) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + torch::Tensor c = torch::le(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::le(lazy_a, lazy_b); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestLeInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + b[0] += 1; + b[1] -= 1; + torch::Tensor a_copy = a.clone(); + a.le_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.le_(lazy_b); + AllClose(lazy_a, a); + }); +} + +TEST_F(LazyOpsTest, TestGt) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::add(a.clone(), torch::ones_like(a)); + torch::Tensor c = torch::gt(b, a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::gt(lazy_b, lazy_a); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestGtInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + b[0] += 1; + b[1] -= 1; + torch::Tensor a_copy = a.clone(); + a.gt_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.gt_(lazy_b); + AllClose(lazy_a, a); + }); +} + +TEST_F(LazyOpsTest, TestLt) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::add(a.clone(), torch::ones_like(a)); + torch::Tensor c = torch::lt(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::lt(lazy_a, lazy_b); + AllEqual(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestLtInplace) { + torch::Tensor a = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.clone(); + b[0] += 1; + b[1] -= 1; + torch::Tensor a_copy = a.clone(); + a.lt_(b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + lazy_a.lt_(lazy_b); + AllClose(lazy_a, a); + }); +} + +TEST_F(LazyOpsTest, TestNeScalar) { + torch::Tensor input = torch::ones({2, 3}); + 
torch::Scalar other(float(0)); + torch::Tensor result = torch::ne(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::ne(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestEqScalar) { + torch::Tensor input = torch::ones({2, 3}); + torch::Scalar other(float(1)); + torch::Tensor result = torch::eq(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::eq(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestGeScalar) { + torch::Tensor input = torch::ones({2, 3}); + torch::Scalar other(float(1)); + torch::Tensor result = torch::ge(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::ge(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestGeScalarInplace) { + torch::Tensor input = torch::arange( + -1., 1.5, 0.5, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar other(float(0)); + torch::Tensor input_copy = input.clone(); + input.ge_(other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + lazy_input.ge_(other); + AllClose(lazy_input, input); + }); +} + +TEST_F(LazyOpsTest, TestLeScalar) { + torch::Tensor input = torch::ones({2, 3}); + torch::Scalar other(float(1)); + torch::Tensor result = torch::le(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::le(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestLeScalarInplace) { + torch::Tensor input = torch::arange( + -1., 1.5, 0.5, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar other(float(0)); + torch::Tensor input_copy = input.clone(); + input.le_(other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + lazy_input.le_(other); + AllClose(lazy_input, input); + }); +} + +TEST_F(LazyOpsTest, TestGtScalar) { + torch::Tensor input = torch::ones({2, 3}); + torch::Scalar other(float(0.5)); + torch::Tensor result = torch::gt(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::gt(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestGtScalarInplace) { + torch::Tensor input = torch::arange( + -1., 1.5, 0.5, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar other(float(0)); + torch::Tensor input_copy = input.clone(); + input.gt_(other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + lazy_input.gt_(other); + AllClose(lazy_input, input); + }); +} + +TEST_F(LazyOpsTest, TestLtScalar) { + torch::Tensor input = torch::ones({2, 3}); + torch::Scalar other(float(1.5)); + torch::Tensor result = torch::lt(input, other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::lt(lazy_input, other); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, 
TestLtScalarInplace) { + torch::Tensor input = torch::arange( + -1., 1.5, 0.5, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar other(float(0)); + torch::Tensor input_copy = input.clone(); + input.lt_(other); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + lazy_input.lt_(other); + AllClose(lazy_input, input); + }); +} + +TEST_F(LazyOpsTest, TestIntegerAdd) { + std::vector types( + {torch::kByte, torch::kChar, torch::kShort, torch::kInt, torch::kLong}); + + ForEachDevice([&](const torch::Device& device) { + for (auto type : types) { + torch::Tensor a = + torch::randint(0, 63, {2, 2}, torch::TensorOptions(type)); + torch::Tensor b = + torch::randint(0, 63, {2, 2}, torch::TensorOptions(type)); + torch::Scalar one = + isIntegralType(type) ? torch::Scalar(1) : torch::Scalar(1.0); + torch::Tensor c = torch::add(b, one); + + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::add(lazy_b, one); + + AllEqual(c, lazy_c); + } + }); +} + +TEST_F(LazyOpsTest, TestSVD) { + static const int dims[] = {4, 7}; + for (auto m : dims) { + for (auto n : dims) { + torch::Tensor a = torch::rand( + {m, n}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + auto b = torch::svd(a, /*some=*/true, /*compute_uv=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::svd(lazy_a, /*some=*/true, /*compute_uv=*/true); + // The U and V matrices might have different sign for column vectors, so + // cannot be compared if not by absolute value. + AllClose(std::get<0>(b).abs(), std::get<0>(lazy_b).abs(), /*rtol=*/1e-3, + /*atol=*/1e-4); + torch::Tensor diag = std::get<1>(b); + torch::Tensor lazy_diag = std::get<1>(lazy_b); + ASSERT_EQ(diag.sizes(), lazy_diag.sizes()); + AllClose(diag, lazy_diag, /*rtol=*/1e-3, + /*atol=*/1e-4); + AllClose(std::get<2>(b).abs(), std::get<2>(lazy_b).abs(), /*rtol=*/1e-3, + /*atol=*/1e-4); + }); + } + } +} + +TEST_F(LazyOpsTest, TestQR) { + static const int dims[] = {4, 7}; + for (auto m : dims) { + for (auto n : dims) { + torch::Tensor a = torch::rand( + {m, n}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + auto b = torch::qr(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::qr(lazy_a); + AllClose(std::get<0>(b).abs(), std::get<0>(lazy_b).abs(), /*rtol=*/1e-3, + /*atol=*/1e-4); + AllClose(std::get<1>(b).abs(), std::get<1>(lazy_b).abs(), /*rtol=*/1e-3, + /*atol=*/1e-4); + }); + } + } +} + +TEST_F(LazyOpsTest, TestSymEig) { + static const int dims[] = {4, 7}; + for (auto m : dims) { + for (bool eigenvectors : {true, false}) { + for (bool upper : {true, false}) { + torch::Tensor a = torch::rand( + {m, m}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor sym_a = a.mm(a.t()); + auto b = torch::symeig(sym_a, eigenvectors, upper); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(sym_a, device); + auto lazy_b = torch::symeig(lazy_a, eigenvectors, upper); + AllClose(std::get<0>(b), std::get<0>(lazy_b), /*rtol=*/3e-2, + /*atol=*/1e-2); + if (eigenvectors) { + AllClose(std::get<1>(b).abs(), std::get<1>(lazy_b).abs(), + /*rtol=*/3e-2, + /*atol=*/1e-2); + } else { + EXPECT_EQ(std::get<1>(b).sizes(), std::get<1>(lazy_b).sizes()); + } + }); + } + } + } +} + +TEST_F(LazyOpsTest, 
TestCholesky) { + static const int dims[] = {4, 7}; + for (auto m : dims) { + for (bool upper : {true, false}) { + torch::Tensor a = torch::rand( + {3, m, m}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor pd_a = + torch::matmul(a, torch::transpose(a, 1, 2)) + + torch::eye( + m, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + auto b = torch::cholesky(pd_a, upper); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(pd_a, device); + auto lazy_b = torch::cholesky(lazy_a, upper); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } + } +} + +TEST_F(LazyOpsTest, TestLogDet) { + static const int dims[] = {4, 7}; + for (auto m : dims) { + torch::Tensor a = torch::rand( + {3, m, m}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor pd_a = + torch::matmul(a, torch::transpose(a, 1, 2)) + + torch::eye(m, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::logdet(pd_a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(pd_a, device); + torch::Tensor lazy_b = torch::logdet(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } +} + +TEST_F(LazyOpsTest, TestTriangularSolve) { + static const int dims[] = {4, 7}; + for (bool batched_a : {true, false}) { + for (bool batched_b : {true, false}) { + for (auto m : dims) { + for (auto n : dims) { + for (bool upper : {true, false}) { + for (bool transpose : {true, false}) { + for (bool unitriangular : {true, false}) { + torch::Tensor a = + torch::randn({m, m}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice())); + torch::Tensor b = + torch::randn({m, n}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice())); + a = batched_a ? a.expand({3, m, m}).clone() : a; + b = batched_b ? 
b.expand({3, m, n}).clone() : b; + auto result = torch::triangular_solve( + b, a, /*upper=*/upper, /*transpose=*/transpose, + /*unitriangular=*/unitriangular); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + auto lazy_result = torch::triangular_solve( + lazy_b, lazy_a, /*upper=*/upper, /*transpose=*/transpose, + /*unitriangular=*/unitriangular); + AllClose(std::get<0>(result), std::get<0>(lazy_result), + /*rtol=*/1e-3, /*atol=*/1e-4); + AllClose(std::get<1>(result), std::get<1>(lazy_result), + /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } + } + } + } + } + } + } +} + +TEST_F(LazyOpsTest, TestKthValue) { + torch::Tensor a = torch::rand( + {4, 5, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int k = 1; k <= 3; ++k) { + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + for (bool keepdim : {false, true}) { + auto b = torch::kthvalue(a, k, dim, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::kthvalue(lazy_a, k, dim, keepdim); + AllClose(std::get<0>(b), std::get<0>(lazy_b)); + AllEqual(std::get<1>(b), std::get<1>(lazy_b)); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestTopK) { + torch::Tensor a = torch::rand( + {4, 5, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int k = 1; k <= 3; ++k) { + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + for (bool largest : {false, true}) { + auto b = torch::topk(a, k, dim, largest, /*sorted=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::topk(lazy_a, k, dim, largest, /*sorted=*/true); + AllClose(std::get<0>(b), std::get<0>(lazy_b)); + AllEqual(std::get<1>(b), std::get<1>(lazy_b)); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestSort) { + torch::Tensor a = torch::rand( + {4, 5, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int k = 1; k <= 3; ++k) { + for (int dim = 0; dim < 3; ++dim) { + for (bool descending : {false, true}) { + auto b = torch::sort(a, dim, descending); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::sort(lazy_a, dim, descending); + AllClose(std::get<0>(b), std::get<0>(lazy_b)); + AllEqual(std::get<1>(b), std::get<1>(lazy_b)); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestSortDescWithMinValue) { + std::vector values{-128, 100}; + torch::Tensor input = + torch::tensor(values, torch::TensorOptions(torch::kChar)); + auto output = torch::sort(input, /*dim=*/0, /*descending=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + auto lazy_output = torch::sort(lazy_input, /*dim=*/0, /*descending=*/true); + AllEqual(std::get<0>(output), std::get<0>(lazy_output)); + AllEqual(std::get<1>(output), std::get<1>(lazy_output)); + }); +} + +TEST_F(LazyOpsTest, TestArgSort) { + torch::Tensor a = torch::rand( + {4, 5, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int k = 1; k <= 3; ++k) { + for (int dim = 0; dim < 3; ++dim) { + for (bool descending : {false, true}) { + torch::Tensor b = torch::argsort(a, dim, descending); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argsort(lazy_a, dim, descending); + AllEqual(b, lazy_b); + }); 
+ } + } + } +} + +TEST_F(LazyOpsTest, TestMin) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::min(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::min(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMax) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::max(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::max(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestUnaryMin) { + torch::Tensor input = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::min(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::min(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestUnaryMax) { + torch::Tensor input = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::max(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestAll) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? 
torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor b = torch::all(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::all(lazy_a); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAllDim) { + torch::Tensor a = torch::randint( + 0, 5, {2, 3, 4}, + torch::TensorOptions(torch::kByte).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::all(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::all(lazy_a, dim, /*keepdim=*/false); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAllDimKeep) { + torch::Tensor a = torch::randint( + 0, 5, {2, 3, 4}, + torch::TensorOptions(torch::kByte).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::all(a, dim, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::all(lazy_a, dim, /*keepdim=*/true); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAmax) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (bool keepdim : {false, true}) { + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor values = torch::amax(input, {dim}, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_values = + torch::amax(lazy_input, {dim}, /*keepdim=*/keepdim); + AllClose(values, lazy_values); + }); + } + for (int dim1 = -rank; dim1 < rank; ++dim1) { + for (int dim2 = -rank; dim2 < rank; ++dim2) { + if ((dim1 == dim2) || (dim1 == rank + dim2) || (dim2 == rank + dim1)) + continue; + torch::Tensor values = + torch::amax(input, {dim1, dim2}, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_values = + torch::amax(lazy_input, {dim1, dim2}, /*keepdim=*/keepdim); + AllClose(values, lazy_values); + }); + } + } + } + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("xla::amax", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestAmin) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (bool keepdim : {false, true}) { + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor values = torch::amin(input, {dim}, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_values = + torch::amin(lazy_input, {dim}, /*keepdim=*/keepdim); + AllClose(values, lazy_values); + }); + } + for (int dim1 = -rank; dim1 < rank; ++dim1) { + for (int dim2 = -rank; dim2 < rank; ++dim2) { + if ((dim1 == dim2) || (dim1 == rank + dim2) || (dim2 == rank + dim1)) + continue; + torch::Tensor values = + torch::amin(input, {dim1, dim2}, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_values = 
+ torch::amin(lazy_input, {dim1, dim2}, /*keepdim=*/keepdim); + AllClose(values, lazy_values); + }); + } + } + } + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("xla::amin", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestAny) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor b = torch::any(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::any(lazy_a); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAnyDim) { + torch::Tensor a = torch::randint( + 0, 5, {2, 3, 4}, + torch::TensorOptions(torch::kByte).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::any(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::any(lazy_a, dim, /*keepdim=*/false); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAnyDimKeep) { + torch::Tensor a = torch::randint( + 0, 5, {2, 3, 4}, + torch::TensorOptions(torch::kByte).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::any(a, dim, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::any(lazy_a, dim, /*keepdim=*/true); + EqualValues(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestMean) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::mean(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::mean(lazy_a); + ASSERT_EQ(b.sizes(), lazy_b.sizes()); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestMeanCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::mean(a, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::mean(lazy_a, torch::kDouble); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestMeanInDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::mean(a, {dim}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::mean(lazy_a, {dim}); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestMeanInDims) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::mean(a, dims); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::mean(lazy_a, dims); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, 
TestMeanInDimsKeepCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::mean(a, dims, true, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::mean(lazy_a, dims, true, torch::kDouble); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestMeanInDimOut) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::empty( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::mean_out(b, a, {dim}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::empty({4, 4}, lazy_a.options()); + torch::mean_out(lazy_b, lazy_a, {dim}); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestStd) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto unbiased : {true, false}) { + torch::Tensor b = torch::std(a, unbiased); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::std(lazy_a, unbiased); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestStdInDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (auto unbiased : {true, false}) { + for (auto keepdim : {true, false}) { + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::std(a, {dim}, unbiased, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::std(lazy_a, {dim}, unbiased, keepdim); + AllClose(b, lazy_b); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestStdWithCorrection) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // int rank = a.dim(); + c10::optional corrections[] = {1, 2, c10::nullopt}; + for (const auto& correction : corrections) { + for (auto keepdim : {true, false}) { + for (const auto& dim : + std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::std(a, dim, correction, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::std(lazy_a, dim, correction, keepdim); + AllClose(b, lazy_b); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestStdMeanWithCorrection) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // int rank = a.dim(); + c10::optional corrections[] = {1, 2, c10::nullopt}; + for (const auto& correction : corrections) { + for (auto keepdim : {true, false}) { + for (const auto& dim : + std::vector>{{0, 1}, {-3, -2}}) { + auto b = torch::std_mean(a, dim, correction, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::std_mean(lazy_a, dim, correction, keepdim); + AllClose(std::get<0>(b), std::get<0>(lazy_b)); + AllClose(std::get<1>(b), std::get<1>(lazy_b)); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestSum) { + torch::Tensor a = torch::rand( + {4, 3, 4}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::sum(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestSumCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::sum(a, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a, torch::kDouble); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestSumU8) { + torch::Tensor a = torch::ones( + {256}, torch::TensorOptions(torch::kByte).device(DefaultDevice())); + torch::Tensor b = torch::sum(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestSumInDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::sum(a, {dim}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a, {dim}); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestSumInDims) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::sum(a, dims); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a, dims); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestSumInDimsKeep) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::sum(a, dims, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sum(lazy_a, dims, /*keepdim=*/true); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestSumInDimsKeepCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + torch::Tensor b = torch::sum(a, dims, /*keepdim=*/true, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = + torch::sum(lazy_a, dims, /*keepdim=*/true, torch::kDouble); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestVar) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (bool unbiased : {true, false}) { + torch::Tensor b = torch::var(a, unbiased); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::var(lazy_a, unbiased); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestVarWithDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + for (bool keepDim : {true, false}) { + for (bool unbiased : 
{true, false}) { + torch::Tensor b = torch::var(a, dims, unbiased, keepDim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::var(lazy_a, dims, unbiased, keepDim); + AllClose(b, lazy_b); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestVarWithCorrection) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + c10::optional corrections[] = {1, 2, c10::nullopt}; + for (const auto& dim : std::vector>{{0, 1}, {-3, -2}}) { + for (bool keepDim : {true, false}) { + for (const auto& correction : corrections) { + torch::Tensor b = torch::var(a, dim, correction, keepDim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::var(lazy_a, dim, correction, keepDim); + AllClose(b, lazy_b); + }); + } + } + } + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::var", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestVarMeanWithCorrection) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + c10::optional corrections[] = {1, 2, c10::nullopt}; + for (const auto& dim : std::vector>{{0, 1}, {-3, -2}}) { + for (const auto& correction : corrections) { + for (auto keepdim : {true, false}) { + auto b = torch::var_mean(a, dim, correction, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_b = torch::var_mean(lazy_a, dim, correction, keepdim); + AllClose(std::get<0>(b), std::get<0>(lazy_b)); + AllClose(std::get<1>(b), std::get<1>(lazy_b)); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxInDim) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + for (bool keepdim : {false, true}) { + auto values_indices = torch::max(input, dim, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + auto lazy_values_indices = + torch::max(lazy_input, dim, /*keepdim=*/keepdim); + AllClose(std::get<0>(values_indices), std::get<0>(lazy_values_indices)); + AllEqual(std::get<1>(values_indices), std::get<1>(lazy_values_indices)); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMinInDim) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + for (bool keepdim : {false, true}) { + auto values_indices = torch::min(input, dim, /*keepdim=*/keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + auto lazy_values_indices = + torch::min(lazy_input, dim, /*keepdim=*/keepdim); + AllClose(std::get<0>(values_indices), std::get<0>(lazy_values_indices)); + AllEqual(std::get<1>(values_indices), std::get<1>(lazy_values_indices)); + }); + } + } +} + +TEST_F(LazyOpsTest, TestNorm) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::norm(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestNormInDim) { + torch::Tensor a = torch::rand( + 
{4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::norm(a, 2, {dim}, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a, 2, {dim}, /*keepdim=*/false); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestNormInDims) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{1, 2}, {-2, -1}}) { + torch::Tensor b = torch::norm(a, 2, dims, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a, 2, dims, /*keepdim=*/false); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestNormInDimsKeep) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{1, 2}, {-2, -1}}) { + torch::Tensor b = torch::norm(a, 2, dims, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a, 2, dims, /*keepdim=*/true); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestNormalTwoTensor) { + at::Tensor mean = at::zeros({10, 10, 10}, at::dtype(at::kFloat)); + at::Tensor std = at::ones({10, 10, 10}, at::dtype(at::kFloat)); + ForEachDevice([&](const torch::Device& device) { + at::Tensor lazy_mean = CopyToDevice(mean, device); + at::Tensor lazy_std = CopyToDevice(std, device); + at::Tensor lazy_normal = at::normal(lazy_mean, lazy_std); + double res_mean = lazy_normal.mean().item().toDouble(); + double res_std = lazy_normal.std().item().toDouble(); + EXPECT_GT(res_mean, -0.06); + EXPECT_LT(res_mean, 0.06); + EXPECT_GT(res_std, 0.94); + EXPECT_LT(res_std, 1.06); + }); +} + +TEST_F(LazyOpsTest, TestNormalDoubleMean) { + at::Tensor std = at::ones({10, 10, 10}, at::dtype(at::kFloat)); + ForEachDevice([&](const torch::Device& device) { + at::Tensor lazy_std = CopyToDevice(std, device); + at::Tensor lazy_normal = at::normal(0, lazy_std); + double res_mean = lazy_normal.mean().item().toDouble(); + double res_std = lazy_normal.std().item().toDouble(); + EXPECT_GT(res_mean, -0.06); + EXPECT_LT(res_mean, 0.06); + EXPECT_GT(res_std, 0.94); + EXPECT_LT(res_std, 1.06); + }); +} + +TEST_F(LazyOpsTest, TestNormalDoubleStd) { + at::Tensor mean = at::zeros({10, 10, 10}, at::dtype(at::kFloat)); + ForEachDevice([&](const torch::Device& device) { + at::Tensor lazy_mean = CopyToDevice(mean, device); + at::Tensor lazy_normal = at::normal(lazy_mean, 1); + double res_mean = lazy_normal.mean().item().toDouble(); + double res_std = lazy_normal.std().item().toDouble(); + EXPECT_GT(res_mean, -0.06); + EXPECT_LT(res_mean, 0.06); + EXPECT_GT(res_std, 0.94); + EXPECT_LT(res_std, 1.06); + }); +} + +TEST_F(LazyOpsTest, TestNormalInPlace) { + at::Tensor a = at::zeros({10, 10, 10}, at::dtype(at::kFloat)); + ForEachDevice([&](const torch::Device& device) { + at::Tensor lazy_a = CopyToDevice(a, device); + lazy_a.normal_(/*mean=*/0, /*std=*/1); + double res_mean = lazy_a.mean().item().toDouble(); + double res_std = lazy_a.std().item().toDouble(); + EXPECT_GT(res_mean, -0.06); + EXPECT_LT(res_mean, 0.06); + EXPECT_GT(res_std, 0.94); + EXPECT_LT(res_std, 1.06); + }); +} + +TEST_F(LazyOpsTest, TestUniformInPlace) { + const double eps = 1e-3; + at::Tensor a = 
at::zeros({10, 10, 10}, at::dtype(at::kFloat)); + ForEachDevice([&](const torch::Device& device) { + at::Tensor lazy_a = CopyToDevice(a, device); + lazy_a.uniform_(/*from=*/0, /*to=*/1); + at::Tensor cpu_a = ToCpuTensor(lazy_a); + double res_min = cpu_a.min().item().toDouble(); + double res_max = cpu_a.max().item().toDouble(); + EXPECT_GT(res_min, 0.0 - eps); + EXPECT_LT(res_max, 1.0 + eps); + }); +} + +TEST_F(LazyOpsTest, TestRandomInPlace) { + for (auto dtype : {torch::kFloat, torch::kDouble, torch::kByte, torch::kChar, + torch::kShort, torch::kInt, torch::kLong}) { + const double eps = 0.2; + torch::Tensor a = torch::zeros({10, 10, 10}, torch::TensorOptions(dtype)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + lazy_a.random_(/*from=*/0, /*to=*/10); + double res_mean = lazy_a.sum().item().toDouble() / a.numel(); + double res_min = lazy_a.min().item().toDouble(); + double res_max = lazy_a.max().item().toDouble(); + EXPECT_GT(res_mean, 4.5 - eps); + EXPECT_LT(res_mean, 4.5 + eps); + EXPECT_EQ(res_min, 0.0); + EXPECT_EQ(res_max, 9.0); + }); + } +} + +TEST_F(LazyOpsTest, TestRandomInPlaceDefaultFrom) { + for (auto dtype : {torch::kFloat, torch::kDouble, torch::kByte, torch::kChar, + torch::kShort, torch::kInt, torch::kLong}) { + const double eps = 0.2; + torch::Tensor a = torch::zeros({10, 10, 10}, torch::TensorOptions(dtype)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + lazy_a.random_(/*to=*/10); + double res_mean = lazy_a.sum().item().toDouble() / a.numel(); + double res_min = lazy_a.min().item().toDouble(); + double res_max = lazy_a.max().item().toDouble(); + EXPECT_GT(res_mean, 4.5 - eps); + EXPECT_LT(res_mean, 4.5 + eps); + EXPECT_EQ(res_min, 0.0); + EXPECT_EQ(res_max, 9.0); + }); + } +} + +TEST_F(LazyOpsTest, TestRandomInPlaceDefault) { + for (auto dtype : {torch::kFloat, torch::kDouble, torch::kByte, torch::kChar, + torch::kShort, torch::kInt, torch::kLong}) { + auto input = torch::zeros({10}, torch::TensorOptions(dtype)); + ForEachDevice([&](const torch::Device& device) { + auto lazyInput = CopyToDevice(input, device); + lazyInput.random_(); + auto output = ToCpuTensor(lazyInput); + EXPECT_TRUE(torch::all(output.ne(input)).item()); + }); + } +} + +TEST_F(LazyOpsTest, TestNormGeneral) { + torch::Tensor a = torch::randn( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::norm(a, 3.5); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a, 3.5); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestNormNuclear) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::norm(a, 1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::norm(lazy_a, 1); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestFrobeniusNorm) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::frobenius_norm(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::frobenius_norm(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestFrobeniusNormInDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+ for (int dim : {1, -2}) {
+ torch::Tensor b = torch::frobenius_norm(a, {dim}, /*keepdim=*/false);
+ ForEachDevice([&](const torch::Device& device) {
+ torch::Tensor lazy_a = CopyToDevice(a, device);
+ torch::Tensor lazy_b =
+ torch::frobenius_norm(lazy_a, {dim}, /*keepdim=*/false);
+ AllClose(b, lazy_b);
+ });
+ }
+}
+
+TEST_F(LazyOpsTest, TestFrobeniusNormInDims) {
+ torch::Tensor a = torch::rand(
+ {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+ for (auto dims : std::vector<std::vector<int64_t>>{{1, 2}, {-2, -1}}) {
+ torch::Tensor b = torch::frobenius_norm(a, dims, /*keepdim=*/false);
+ ForEachDevice([&](const torch::Device& device) {
+ torch::Tensor lazy_a = CopyToDevice(a, device);
+ torch::Tensor lazy_b =
+ torch::frobenius_norm(lazy_a, dims, /*keepdim=*/false);
+ AllClose(b, lazy_b);
+ });
+ }
+}
+
+TEST_F(LazyOpsTest, TestGroupNorm) {
+ int num_channels = 6;
+ torch::Tensor input =
+ torch::rand({20, num_channels, 10, 10},
+ torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+ torch::Tensor weight =
+ torch::rand({num_channels},
+ torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+ torch::Tensor bias =
+ torch::rand({num_channels},
+ torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+ double eps = 1e-05;
+ for (int num_groups : {3, 6, 1}) {
+ torch::Tensor output =
+ torch::group_norm(input, num_groups, weight, bias, eps,
+ /*cudnn_enabled=*/false);
+ ForEachDevice([&](const torch::Device& device) {
+ torch::Tensor lazy_input = CopyToDevice(input, device);
+ torch::Tensor lazy_weight = CopyToDevice(weight, device);
+ torch::Tensor lazy_bias = CopyToDevice(bias, device);
+ torch::Tensor lazy_output =
+ torch::group_norm(lazy_input, num_groups, lazy_weight, lazy_bias, eps,
+ /*cudnn_enabled=*/false);
+ AllClose(output, lazy_output, /*rtol=*/1e-3, /*atol=*/1e-5);
+ });
+ }
+}
+
+TEST_F(LazyOpsTest, TestGroupNormBackward) {
+ int num_channels = 6;
+ torch::Tensor input =
+ torch::rand({2, num_channels, 5, 5}, torch::TensorOptions(torch::kFloat)
+ .device(DefaultDevice())
+ .requires_grad(true));
+ torch::Tensor weight =
+ torch::rand({num_channels}, torch::TensorOptions(torch::kFloat)
+ .device(DefaultDevice())
+ .requires_grad(true));
+ torch::Tensor bias =
+ torch::rand({num_channels}, torch::TensorOptions(torch::kFloat)
+ .device(DefaultDevice())
+ .requires_grad(true));
+ double eps = 1e-05;
+ for (bool undef_weight : {true, false}) {
+ for (int num_groups : {3, 6, 1}) {
+ auto testfn =
+ [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor {
+ return torch::group_norm(
+ /*input=*/inputs[0], num_groups, inputs[1], inputs[2],
+ /*eps=*/eps,
+ /*cudnn_enabled=*/false);
+ };
+ torch::Tensor undef;
+ ForEachDevice([&](const torch::Device& device) {
+ TestBackward(
+ {input, undef_weight ? undef : weight, undef_weight ?
undef : bias}, + device, testfn, + /*rtol=*/1e-3, /*atol=*/1e-3, + /*derivative_level=*/2); + }); + } + } +} + +TEST_F(LazyOpsTest, TestInstanceNorm) { + int batch = 5; + int num_channels = 20; + torch::Tensor input = + torch::rand({batch, num_channels, 10, 10}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({num_channels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = + torch::rand({num_channels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_mean = + torch::zeros({num_channels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_var = + torch::ones({num_channels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double momentum = 0.1; + double eps = 1e-05; + torch::Tensor output = torch::instance_norm( + input, weight, bias, running_mean, running_var, + /*use_input_stats=*/true, momentum, eps, /*cudnn_enabled=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_bias = CopyToDevice(bias, device); + torch::Tensor lazy_running_mean = CopyToDevice(running_mean, device); + torch::Tensor lazy_running_var = CopyToDevice(running_var, device); + torch::Tensor lazy_output = torch::instance_norm( + lazy_input, lazy_weight, lazy_bias, lazy_running_mean, lazy_running_var, + /*use_input_stats=*/true, momentum, eps, /*cudnn_enabled=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLayerNorm) { + torch::Tensor input = + torch::rand({20, 10, 10, 10}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double eps = 1e-05; + torch::Tensor undef; + for (bool undef_weight : {true, false}) { + for (int64_t normalized_size : {2, 3}) { + std::vector normalized_shape(normalized_size, 10); + torch::Tensor weight = torch::rand( + normalized_shape, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = torch::rand( + normalized_shape, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::layer_norm(input, normalized_shape, + undef_weight ? undef : weight, + undef_weight ? undef : bias, eps, + /*cudnn_enabled=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = + undef_weight ? undef : CopyToDevice(weight, device); + torch::Tensor lazy_bias = + undef_weight ? 
undef : CopyToDevice(bias, device); + torch::Tensor lazy_output = torch::layer_norm( + lazy_input, normalized_shape, lazy_weight, lazy_bias, eps, + /*cudnn_enabled=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3, /*atol=*/1e-5); + }); + } + } +} + +TEST_F(LazyOpsTest, TestLayerNormBackward) { + torch::Tensor input = + torch::rand({2, 3, 3, 3}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + double eps = 1e-05; + for (bool undef_weight : {true, false}) { + for (int64_t normalized_size : {2, 3}) { + std::vector normalized_shape(normalized_size, 3); + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::layer_norm( + /*input=*/inputs[0], normalized_shape, inputs[1], inputs[2], + /*eps=*/eps, + /*cudnn_enabled=*/false); + }; + torch::Tensor weight = + torch::rand(normalized_shape, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor bias = + torch::rand(normalized_shape, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor undef; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {input, undef_weight ? undef : weight, undef_weight ? undef : bias}, + device, testfn, + /*rtol=*/1e-3, /*atol=*/1e-4, /*derivative_level=*/2); + }); + } + } +} + +TEST_F(LazyOpsTest, TestNuclearNorm) { + torch::Tensor a = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::nuclear_norm(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::nuclear_norm(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestPairwiseDistance) { + torch::Tensor x1 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor x2 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double eps = 1e-6; + for (bool keepdim : {false, true}) { + for (double p : {1, 2, 3, 4}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = + torch::pairwise_distance(x1, x2, p, eps, keepdim); + torch::Tensor lazy_x1 = CopyToDevice(x1, device); + torch::Tensor lazy_x2 = CopyToDevice(x2, device); + torch::Tensor lazy_output = + torch::pairwise_distance(lazy_x1, lazy_x2, p, eps, keepdim); + AllClose(output, lazy_output, /*rtol=*/1e-5, /*atol=*/1e-5); + }); + } + } +} + +TEST_F(LazyOpsTest, TestCosineSimilarity) { + torch::Tensor x1 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor x2 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double eps = 1e-8; + int rank = x1.dim(); + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::cosine_similarity(x1, x2, dim, eps); + torch::Tensor lazy_x1 = CopyToDevice(x1, device); + torch::Tensor lazy_x2 = CopyToDevice(x2, device); + torch::Tensor lazy_output = + torch::cosine_similarity(lazy_x1, lazy_x2, dim, eps); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestCosineEmbeddingLoss) { + torch::Tensor input1 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor input2 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::rand( + {4}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + for (double margin : {0., 0.2}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::cosine_embedding_loss( + input1, input2, target, margin, reduction); + torch::Tensor lazy_input1 = CopyToDevice(input1, device); + torch::Tensor lazy_input2 = CopyToDevice(input2, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = torch::cosine_embedding_loss( + lazy_input1, lazy_input2, lazy_target, margin, reduction); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestHingeEmbeddingLoss) { + torch::Tensor input = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + for (double margin : {0., 0.2}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = + torch::hinge_embedding_loss(input, target, margin, reduction); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = torch::hinge_embedding_loss( + lazy_input, lazy_target, margin, reduction); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestTripletMarginLoss) { + torch::Tensor anchor = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor positive = torch::abs(torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor negative = torch::neg(torch::abs(torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())))); + double eps = 1e-6; + for (double margin : {0., 0.2}) { + for (double p : {1, 2, 3, 4}) { + for (bool swap : {false, true}) { + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::triplet_margin_loss( + anchor, positive, negative, margin, p, eps, swap, reduction); + torch::Tensor lazy_anchor = CopyToDevice(anchor, device); + torch::Tensor lazy_positive = CopyToDevice(positive, device); + torch::Tensor lazy_negative = CopyToDevice(negative, device); + torch::Tensor lazy_output = torch::triplet_margin_loss( + lazy_anchor, lazy_positive, lazy_negative, margin, p, eps, swap, + reduction); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestBinaryCrossEntropy) { + int batch = 10; + int classes = 5; + torch::Tensor input = + torch::rand({batch, classes}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = + torch::rand({batch, classes}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({batch, classes}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor undef; + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + for (bool undef_weight : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::binary_cross_entropy( + input, target, undef_weight ? 
undef : weight, reduction); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_weight = + undef_weight ? undef : CopyToDevice(weight, device); + torch::Tensor lazy_output = torch::binary_cross_entropy( + lazy_input, lazy_target, lazy_weight, reduction); + AllClose(output, lazy_output, /*rtol=*/1e-4, /*atol=*/1e-5); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMarginRankingLoss) { + torch::Tensor input1 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor input2 = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + for (double margin : {0., 0.2}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::margin_ranking_loss( + input1, input2, target, margin, reduction); + torch::Tensor lazy_input1 = CopyToDevice(input1, device); + torch::Tensor lazy_input2 = CopyToDevice(input2, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = torch::margin_ranking_loss( + lazy_input1, lazy_input2, lazy_target, margin, reduction); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestBCEWithLogits) { + int batch = 10; + int classes = 5; + torch::Tensor input = + torch::rand({batch, classes}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = + torch::rand({batch, classes}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {classes}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor pos_weight = torch::rand( + {classes}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor undef; + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + for (bool undef_weight : {false, true}) { + for (bool undef_pos_weight : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::binary_cross_entropy_with_logits( + input, target, undef_weight ? undef : weight, + undef_pos_weight ? undef : pos_weight, reduction); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_weight = + undef_weight ? undef : CopyToDevice(weight, device); + torch::Tensor lazy_pos_weight = + undef_pos_weight ? 
undef : CopyToDevice(pos_weight, device); + torch::Tensor lazy_output = torch::binary_cross_entropy_with_logits( + lazy_input, lazy_target, lazy_weight, lazy_pos_weight, reduction); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestKlDiv) { + torch::Tensor input = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (bool log_target : {true, false}) { + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = + torch::kl_div(input, target, reduction, log_target); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = + torch::kl_div(lazy_input, lazy_target, reduction, log_target); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestProd) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::prod(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::prod(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestProdCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::prod(a, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::prod(lazy_a, torch::kDouble); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestProdInDim) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::prod(a, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::prod(lazy_a, dim); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestProdInDimKeepCast) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::prod(a, dim, /*keepdim=*/true, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = + torch::prod(lazy_a, dim, /*keepdim=*/true, torch::kDouble); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestProdInDimKeep) { + torch::Tensor a = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor b = torch::prod(a, dim, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::prod(lazy_a, dim, /*keepdim=*/true); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestCumSum) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = 
CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim); + AllClose(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumSumCast) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim, torch::kDouble); + AllClose(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumSumLong) { + torch::Tensor input = torch::randint( + 1000, {4, 3, 4}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumSumCastLong) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim, torch::kLong); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim, torch::kLong); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumProd) { + torch::Tensor input = torch::rand( + {4, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumprod(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumprod(lazy_input, dim); + AllClose(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumProdCast) { + torch::Tensor input = torch::mul( + torch::rand({4, 3, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())), + 10); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumprod(input, dim, torch::kDouble); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumprod(lazy_input, dim, torch::kDouble); + AllClose(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumProdLong) { + torch::Tensor input = torch::randint( + 7, {2, 3}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestCumProdCastLong) { + torch::Tensor input = + torch::rand({2, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 7; + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::cumsum(input, dim, torch::kLong); + ForEachDevice([&](const 
torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::cumsum(lazy_input, dim, torch::kLong); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMin) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::argmin(a, c10::nullopt, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmin(lazy_a, c10::nullopt, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestArgMinDim) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmin(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmin(lazy_a, dim, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMinDimKeep) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmin(a, dim, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmin(lazy_a, dim, /*keepdim=*/true); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMinSameValue) { + torch::Tensor a = torch::ones( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::argmin(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmin(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestArgMinWrapper) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmin(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmin(lazy_a, dim, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMax) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::argmax(a, c10::nullopt, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmax(lazy_a, c10::nullopt, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestArgMaxDim) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmax(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmax(lazy_a, dim, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMaxDimKeep) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmax(a, dim, /*keepdim=*/true); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = 
CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmax(lazy_a, dim, /*keepdim=*/true); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestArgMaxSameValue) { + torch::Tensor a = torch::ones( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::argmax(a, c10::nullopt, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmax(lazy_a, c10::nullopt, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestArgMaxWrapper) { + torch::Tensor a = torch::rand( + {4, 4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor b = torch::argmax(a, dim, /*keepdim=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::argmax(lazy_a, dim, /*keepdim=*/false); + AllEqual(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestAsin) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::asin(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::asin(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAsinh) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::asinh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::asinh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAsinhInPlace) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::asinh_(a); + torch::Tensor lazy_b = torch::asinh_(lazy_a); + AllClose(a, lazy_a, /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestSin) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::sin(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sin(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestSinh) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::sinh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sinh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAcos) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::acos(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::acos(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAcosh) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100; + 
torch::Tensor b = torch::acosh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::acosh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAcoshInPlace) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::acosh_(a); + torch::Tensor lazy_b = torch::acosh_(lazy_a); + AllClose(a, lazy_a, /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestCos) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::cos(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::cos(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestCosh) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::cosh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::cosh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAtan) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::atan(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::atan(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAtanh) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::atanh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::atanh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAtanhInPlace) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::atanh_(a); + torch::Tensor lazy_b = torch::atanh_(lazy_a); + AllClose(a, lazy_a, /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestAtan2) { + torch::Tensor a = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::atan2(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::atan2(lazy_a, lazy_b); + AllClose(c, lazy_c, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestTan) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::tan(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = 
torch::tan(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestTanh) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::tanh(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::tanh(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestClampMinMax) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar min_val(0.311); + torch::Scalar max_val(0.409); + torch::Tensor b = torch::clamp(a, min_val, max_val); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::clamp(lazy_a, min_val, max_val); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMin) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar min_val(0.311); + torch::Tensor b = torch::clamp(a, min_val, c10::nullopt); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::clamp(lazy_a, min_val, c10::nullopt); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMax) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar max_val(0.409); + torch::Tensor b = torch::clamp(a, c10::nullopt, max_val); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::clamp(lazy_a, c10::nullopt, max_val); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMinExplicit) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar min_val(0.311); + torch::Tensor b = torch::clamp_min(a, min_val); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::clamp_min(lazy_a, min_val); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMaxExplicit) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar max_val(0.409); + torch::Tensor b = torch::clamp_max(a, max_val); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::clamp_max(lazy_a, max_val); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMinExplicitInPlace) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar min_val(0.311); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::clamp_min_(a, min_val); + torch::Tensor lazy_b = torch::clamp_min_(lazy_a, min_val); + AllClose(a, lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestClampMaxExplicitInPlace) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar max_val(0.409); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = torch::clamp_max_(a, max_val); + torch::Tensor lazy_b = torch::clamp_max_(lazy_a, max_val); + AllClose(a, lazy_a); + 
AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestCeil) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::ceil(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::ceil(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestFloor) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::floor(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::floor(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestRound) { + torch::Tensor a = torch::cat( + {torch::randn( + {8}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0, + // Special case: 0.5, -0.5. lazy::Round impl rounds to -1/1 whereas + // lazy::RoundToEven properly implements bankers rounding. + torch::tensor( + {-0.5, 0.5}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice()))}, + 0); + torch::Tensor b = torch::round(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::round(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestTrunc) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::trunc(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::trunc(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestFrac) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = torch::frac(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::frac(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestNeg) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::neg(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::neg(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseNot) { + std::vector types( + {torch::kByte, torch::kChar, torch::kShort, torch::kInt, torch::kLong}); + + ForEachDevice([&](const torch::Device& device) { + for (auto type : types) { + torch::Tensor a = + torch::randint(0, 63, {2, 2}, torch::TensorOptions(type)); + torch::Tensor b = torch::bitwise_not(a); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::bitwise_not(lazy_a); + AllEqual(b, lazy_b); + } + }); +} + +TEST_F(LazyOpsTest, TestBitwiseNotInPlace) { + std::vector types( + {torch::kByte, torch::kChar, torch::kShort, torch::kInt, torch::kLong}); + + ForEachDevice([&](const torch::Device& device) { + for (auto type : types) { + torch::Tensor a = + torch::randint(0, 63, {2, 2}, torch::TensorOptions(type)); + torch::Tensor lazy_a = CopyToDevice(a, device); + a.bitwise_not_(); + lazy_a.bitwise_not_(); + AllEqual(a, lazy_a); + } + }); +} + +TEST_F(LazyOpsTest, TestSign) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + 
torch::Tensor b = torch::sign(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sign(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestSignByte) { + torch::Tensor a = torch::randint( + 256, {2, 2}, torch::TensorOptions(torch::kByte).device(DefaultDevice())); + torch::Tensor b = torch::sign(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sign(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestAbs) { + torch::Tensor a = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::abs(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::abs(lazy_a); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestAbsByte) { + torch::Tensor a = torch::randint( + 256, {2, 2}, torch::TensorOptions(torch::kByte).device(DefaultDevice())); + torch::Tensor b = torch::abs(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::abs(lazy_a); + AllEqual(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestEmptyLike) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::empty_like(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::empty_like(lazy_a); + EXPECT_EQ(b.sizes(), lazy_b.sizes()); + }); +} + +TEST_F(LazyOpsTest, TestEmptyLikeOptions) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::empty_like( + a, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::empty_like( + lazy_a, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + EXPECT_EQ(b.sizes(), lazy_b.sizes()); + }); +} + +TEST_F(LazyOpsTest, TestEmpty) { + torch::Tensor a = torch::zeros( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = torch::empty( + {2, 2}, torch::TensorOptions(torch::kFloat).device(device)); + EXPECT_EQ(a.sizes(), lazy_a.sizes()); + }); +} + +TEST_F(LazyOpsTest, TestZeroInPlace) { + torch::Tensor input = torch::ones( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazyInput = CopyToDevice(input, device); + auto& output = torch::zero_(input); + auto& lazyOutput = torch::zero_(lazyInput); + AllClose(output, lazyOutput); + }); +} + +TEST_F(LazyOpsTest, TestZerosLike) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::zeros_like(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::zeros_like(lazy_a); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestZerosLikeOptions) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::zeros_like( + a, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::zeros_like( + lazy_a, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestZeros) { + torch::Tensor a = torch::zeros( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = torch::zeros( + {2, 2}, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestOnes) { + torch::Tensor a = torch::ones( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = + torch::ones({2, 2}, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestOnesLike) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::ones_like(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::ones_like(lazy_a); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestOnesLikeOptions) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::ones_like( + a, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::ones_like( + lazy_a, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestFull) { + torch::Tensor a = + torch::full({2, 2}, 3.1165, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = torch::full( + {2, 2}, 3.1165, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestFullLike) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::full_like(a, 3.1165); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::full_like(lazy_a, 3.1165); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestFullLikeOptions) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::full_like( + a, 3.1165, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::full_like( + lazy_a, 3.1165, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestARange) { + for (auto& ranges : std::vector>{{0.0, 100.0, 0.5}, + {0.0, -100.0, -0.5}}) { + torch::Tensor a = torch::arange( + ranges[0], ranges[1], ranges[2], + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = + torch::arange(ranges[0], ranges[1], ranges[2], + torch::TensorOptions(torch::kFloat).device(device)); + AllClose(a, lazy_a); + }); + } +} + 
+TEST_F(LazyOpsTest, TestARangeOut) { + torch::Tensor a = torch::randn( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto& ranges : std::vector>{{0.0, 100.0, 0.5}, + {0.0, -100.0, -0.5}}) { + torch::Tensor b = torch::arange_out(a, ranges[0], ranges[1], ranges[2]); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = + torch::arange_out(lazy_a, ranges[0], ranges[1], ranges[2]); + AllClose(b, lazy_b); + }); + } +} + +TEST_F(LazyOpsTest, TestDimARange) { + torch::Tensor like = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a = torch::_dim_arange(like, 1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_like = CopyToDevice(like, device); + torch::Tensor lazy_a = torch::_dim_arange(lazy_like, 1); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestBartlettWindow) { + int window_length = 10; + for (bool periodic : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::bartlett_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + + torch::Tensor lazy_output = torch::bartlett_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(device)); + AllClose(output, lazy_output, /*rtol=*/1e-5, /*atol=*/1e-7); + }); + } +} + +TEST_F(LazyOpsTest, TestBlackmanWindow) { + int window_length = 10; + for (bool periodic : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::blackman_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_output = torch::blackman_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(device)); + AllClose(output, lazy_output, /*rtol=*/1e-5, /*atol=*/1e-7); + }); + } +} + +TEST_F(LazyOpsTest, TestHammingWindow) { + double alpha = 0.54; + double beta = 0.46; + int window_length = 10; + for (bool periodic : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::hamming_window( + window_length, periodic, alpha, beta, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_output = torch::hamming_window( + window_length, periodic, alpha, beta, + torch::TensorOptions(torch::kFloat).device(device)); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestHannWindow) { + int window_length = 10; + for (bool periodic : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor output = torch::hann_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_output = torch::hann_window( + window_length, periodic, + torch::TensorOptions(torch::kFloat).device(device)); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestLogSigmoid) { + torch::Tensor a = torch::empty( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + a.uniform_(-1.0, 1.0); + torch::Tensor b = torch::log_sigmoid(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::log_sigmoid(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLogSigmoidForward) { + torch::Tensor a = torch::empty( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); 
+ a.uniform_(-1.0, 1.0); + auto tuple = torch::log_sigmoid_forward(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + auto lazy_tuple = torch::log_sigmoid_forward(lazy_a); + AllClose(std::get<0>(tuple), std::get<0>(lazy_tuple), + /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(std::get<1>(tuple), std::get<1>(lazy_tuple), + /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLogsumexp) { + torch::Tensor a = torch::rand( + {3, 4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (auto dims : std::vector>{{0, 1}, {-3, -2}}) { + for (bool keepdim : {false, true}) { + torch::Tensor b = torch::logsumexp(a, dims, keepdim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::logsumexp(lazy_a, dims, keepdim); + AllClose(b, lazy_b); + }); + } + } +} + +TEST_F(LazyOpsTest, TestSiLU) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::silu(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::silu(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); + ExpectCounterChanged("lazy::silu_out", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestSigmoid) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::sigmoid(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sigmoid(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestMatmul_1x1) { + torch::Tensor a = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::matmul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::matmul(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMatmul_2x1) { + torch::Tensor a = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::matmul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::matmul(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMatmul_1x2) { + torch::Tensor a = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::matmul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::matmul(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMatmul_2x2) { + torch::Tensor a = torch::rand( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4, 3}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::matmul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::matmul(lazy_a, lazy_b); + AllClose(c, lazy_c, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestMatmulBcast) { + torch::Tensor a = + torch::rand({4, 2, 3, 2, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = + torch::rand({2, 1, 4, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::matmul(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::matmul(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestDot) { + torch::Tensor a = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::dot(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::dot(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestTensorDot) { + torch::Tensor a = torch::rand( + {6, 4, 8}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4, 7, 8}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector<int64_t> dims_a = {1, 2}; + std::vector<int64_t> dims_b = {0, 2}; + torch::Tensor c = torch::tensordot(a, b, dims_a, dims_b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::tensordot(lazy_a, lazy_b, dims_a, dims_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestGer) { + torch::Tensor a = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::ger(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::ger(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMv) { + torch::Tensor a = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::mv(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::mv(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestMvOut) { + torch::Tensor a = torch::rand( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::mv_out(c, a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = 
CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::empty({4}, lazy_b.options()); + torch::mv_out(lazy_c, lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestBatchAddBatchMatMul) { + torch::Tensor a = torch::rand( + {3, 6, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 6, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {3, 4, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar alpha = 0.5; + torch::Scalar beta = 1.5; + torch::Tensor d = torch::baddbmm(a, b, c, beta, alpha); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::baddbmm(lazy_a, lazy_b, lazy_c, beta, alpha); + AllClose(d, lazy_d, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestBatchAddBatchMatMulInPlace) { + torch::Tensor a = torch::rand( + {3, 6, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 6, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {3, 4, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar alpha = 0.5; + torch::Scalar beta = 1.5; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor d = a.baddbmm_(b, c, beta, alpha); + torch::Tensor lazy_d = lazy_a.baddbmm_(lazy_b, lazy_c, beta, alpha); + AllClose(d, lazy_d, /*rtol=*/1e-3, /*atol=*/1e-4); + AllClose(a, lazy_a, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestBatchMatMul) { + torch::Tensor a = torch::rand( + {3, 6, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 4, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::bmm(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::bmm(lazy_a, lazy_b); + AllClose(c, lazy_c, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestChainMatMul) { + torch::Tensor a = torch::rand( + {5, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {4, 6}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {6, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor d = torch::rand( + {2, 7}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor result = torch::chain_matmul({a, b, c, d}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = CopyToDevice(d, device); + torch::Tensor lazy_result = + torch::chain_matmul({lazy_a, lazy_b, lazy_c, lazy_d}); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestLinear) { + torch::Tensor input = torch::rand( + {2, 4}, 
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = torch::rand( + {3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor result = torch::linear(input, weight); + torch::Tensor result_with_bias = torch::linear(input, weight, bias); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_bias = CopyToDevice(bias, device); + torch::Tensor lazy_result = torch::linear(lazy_input, lazy_weight); + torch::Tensor lazy_result_with_bias = + torch::linear(lazy_input, lazy_weight, lazy_bias); + AllClose(result, lazy_result, /*rtol=*/1e-2, /*atol=*/1e-4); + AllClose(result_with_bias, lazy_result_with_bias, /*rtol=*/1e-2, + /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestPinverse) { + torch::Tensor input = torch::rand( + {4, 6}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor result = torch::pinverse(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::pinverse(lazy_input); + AllClose(result, lazy_result, /*rtol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestEinsumOuter) { + torch::Tensor a = torch::rand( + {5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "i,j->ij"; + torch::Tensor c = torch::einsum(equation, {a, b}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::einsum(equation, {lazy_a, lazy_b}); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestEinsumOuterBackward) { + torch::Tensor a = torch::rand({5}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor b = torch::rand({5}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + std::string equation = "i,j->ij"; + auto testfn = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::einsum(equation, inputs); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({a, b}, device, testfn, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestEinsumBatchMatMul) { + torch::Tensor a = torch::rand( + {3, 2, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 5, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "bij,bjk->bik"; + torch::Tensor c = torch::einsum(equation, {a, b}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::einsum(equation, {lazy_a, lazy_b}); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestEinsumPyTorchLowerBilinear) { + torch::Tensor a = torch::rand( + {3, 5, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor l = torch::rand( + {2, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor r = torch::rand( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = 
"bn,anm,bm->ba"; + torch::Tensor c = torch::einsum(equation, {l, a, r}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_l = CopyToDevice(l, device); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_r = CopyToDevice(r, device); + torch::Tensor lazy_c = torch::einsum(equation, {lazy_l, lazy_a, lazy_r}); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestEinsumPyTorchLowerDiagonal) { + torch::Tensor input = torch::rand( + {3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "ii->i"; + torch::Tensor result = torch::einsum(equation, {input}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::einsum(equation, {lazy_input}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestEinsumPyTorchLowerBatchDiagonal) { + torch::Tensor input = torch::rand( + {4, 3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "...ii->...i"; + torch::Tensor result = torch::einsum(equation, {input}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::einsum(equation, {lazy_input}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestEinsumPyTorchLowerBatchPermute) { + torch::Tensor input = + torch::rand({2, 3, 4, 5}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "...ij->...ji"; + torch::Tensor result = torch::einsum(equation, {input}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::einsum(equation, {lazy_input}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestEinsumPyTorchLowerRepeatedAxis) { + torch::Tensor x = torch::rand( + {2, 3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor y = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::string equation = "ijj,k->ik"; + torch::Tensor result = torch::einsum(equation, {x, y}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_x = CopyToDevice(x, device); + torch::Tensor lazy_y = CopyToDevice(y, device); + torch::Tensor lazy_result = torch::einsum(equation, {lazy_x, lazy_y}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestBilinear) { + int batch_size = 16; + int in1_features = 4; + int in2_features = 6; + int out_features = 8; + torch::Tensor input1 = + torch::rand({batch_size, in1_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor input2 = + torch::rand({batch_size, in2_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({out_features, in1_features, in2_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = + torch::rand({out_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input1 = CopyToDevice(input1, device); + torch::Tensor lazy_input2 = CopyToDevice(input2, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_bias = CopyToDevice(bias, device); + torch::Tensor result = torch::bilinear(input1, input2, weight, bias); + torch::Tensor lazy_result = + 
torch::bilinear(lazy_input1, lazy_input2, lazy_weight, lazy_bias); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestUpsampleNearest2D) { + int batch_size = 2; + int h = 5; + int w = 5; + int uh = 8; + int uw = 8; + int chans = 2; + torch::Tensor input = + torch::rand({batch_size, chans, h, w}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = torch::upsample_nearest2d(input, {uh, uw}); + torch::Tensor lazy_result = torch::upsample_nearest2d(lazy_input, {uh, uw}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestUpsampleNearest2DBackward) { + int batch_size = 2; + int h = 5; + int w = 5; + int uh = 8; + int uw = 8; + int chans = 2; + auto testfn = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::upsample_nearest2d(inputs[0], {uh, uw}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({batch_size, chans, h, w}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestUpsampleNearest2DWithScale) { + int batch_size = 2; + int h = 5; + int w = 5; + int chans = 2; + double scale_h = 2.5; + double scale_w = 3.4; + torch::Tensor input = + torch::rand({batch_size, chans, h, w}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = torch::upsample_nearest2d( + input, c10::nullopt, at::ArrayRef<double>{scale_h, scale_w}); + torch::Tensor lazy_result = torch::upsample_nearest2d( + lazy_input, c10::nullopt, at::ArrayRef<double>{scale_h, scale_w}); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestUpsampleNearest2DBackwardWithScale) { + int batch_size = 2; + int h = 5; + int w = 5; + int chans = 2; + double scale_h = 2.5; + double scale_w = 3.4; + auto testfn = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::upsample_nearest2d(inputs[0], c10::nullopt, + at::ArrayRef<double>{scale_h, scale_w}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({batch_size, chans, h, w}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestUpsampleBilinear2D) { + int batch_size = 2; + int h = 5; + int w = 5; + int uh = 8; + int uw = 8; + int chans = 2; + for (bool align_corners : {true, false}) { + torch::Tensor input = torch::rand( + {batch_size, chans, h, w}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = + torch::upsample_bilinear2d(input, {uh, uw}, align_corners); + torch::Tensor lazy_result = + torch::upsample_bilinear2d(lazy_input, {uh, uw}, align_corners); + AllClose(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestUpsampleBilinear2DBackward) { + int batch_size = 2; + int h = 5; + int w = 5; + int uh = 8; + int uw = 8; + int chans = 2; + for (bool align_corners : {true, false}) { + auto testfn = + [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::upsample_bilinear2d(inputs[0], {uh, uw}, align_corners); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({batch_size, chans, h, w}, + 
torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestAddCMul) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor d = torch::addcmul(a, b, c, 3.1165); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::addcmul(lazy_a, lazy_b, lazy_c, 3.1165); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestAddCDiv) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = + torch::abs(torch::rand( + {2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice()))) + + 1.0; + torch::Tensor d = torch::addcdiv(a, b, c, 3.1165); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::addcdiv(lazy_a, lazy_b, lazy_c, 3.1165); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestAddCDivWithBroadcast) { + torch::Tensor a = torch::rand( + {1, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = + torch::abs(torch::rand( + {1, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice()))) + + 1.0; + torch::Tensor d = torch::addcdiv(a, b, c, 3.1165); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::addcdiv(lazy_a, lazy_b, lazy_c, 3.1165); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestSize) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + for (int dim = -rank; dim < rank; ++dim) { + EXPECT_EQ(torch::size(input, dim), torch::size(lazy_input, dim)); + } + }); +} + +TEST_F(LazyOpsTest, TestSelect) { + std::vector<int64_t> input_sizes = {14, 24, 8}; + int rank = input_sizes.size(); + for (int dim = -rank; dim < rank; ++dim) { + auto testfn = + [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::select(inputs[0], dim, 0); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand(input_sizes, torch::TensorOptions(torch::kFloat) + .requires_grad(true))}, + device, testfn); + }); + }; +} + +TEST_F(LazyOpsTest, TestBernoulliScalarProb) { + torch::Tensor input = torch::zeros( + 1000, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::bernoulli(lazy_input, 0.1); + double frac = 
lazy_output.sum().item().toDouble() / input.numel(); + EXPECT_GT(frac, 0.06); + EXPECT_LT(frac, 0.14); + }); +} + +TEST_F(LazyOpsTest, TestBernoulliTensorProb) { + std::vector<float> prob_values(1000, 0.1); + torch::Tensor input = torch::tensor( + prob_values, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::bernoulli(lazy_input); + double frac = lazy_output.sum().item().toDouble() / input.numel(); + EXPECT_GT(frac, 0.06); + EXPECT_LT(frac, 0.14); + }); +} + +TEST_F(LazyOpsTest, TestBernoulliScalarProbInPlace) { + torch::Tensor input = torch::zeros( + 1000, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + lazy_input.bernoulli_(0.1); + double frac = lazy_input.sum().item().toDouble() / input.numel(); + EXPECT_GT(frac, 0.06); + EXPECT_LT(frac, 0.14); + }); +} + +TEST_F(LazyOpsTest, TestBernoulliTensorProbInPlace) { + torch::Tensor input = torch::zeros( + 1000, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor prob = torch::scalar_tensor( + 0.1, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_prob = CopyToDevice(prob, device); + lazy_input.bernoulli_(lazy_prob); + double frac = lazy_input.sum().item().toDouble() / input.numel(); + EXPECT_GT(frac, 0.06); + EXPECT_LT(frac, 0.14); + }); +} + +TEST_F(LazyOpsTest, TestDropout) { + torch::Tensor a = torch::rand( + {17, 21}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::dropout(lazy_a, 0.1, /*train=*/true); + double prob = + static_cast<double>(lazy_b.cpu().ne(0.0f).sum().item().toDouble()) / + a.numel(); + EXPECT_GT(prob, 0.86); + EXPECT_LT(prob, 0.94); + }); +} + +TEST_F(LazyOpsTest, TestDropoutInPlace) { + torch::Tensor a = torch::rand( + {17, 21}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::dropout_(lazy_a, 0.1, /*train=*/true); + double prob = + static_cast<double>(lazy_a.cpu().ne(0.0f).sum().item().toDouble()) / + a.numel(); + EXPECT_GT(prob, 0.85); + EXPECT_LT(prob, 0.94); + }); +} + +TEST_F(LazyOpsTest, TestRandperm) { + unsigned n = 5; + torch::Tensor shuffle = torch::randperm( + n, torch::TensorOptions(torch::kLong).device(torch::kLazy)); + torch::Tensor shuffle_cpu = CopyToDevice(shuffle, torch::kCPU); + std::vector<int64_t> shuffle_data(shuffle_cpu.data_ptr<int64_t>(), + shuffle_cpu.data_ptr<int64_t>() + n); + EXPECT_TRUE(shuffle_data.size() == n && + torch::lazy::IsPermutation(shuffle_data)); +} + +TEST_F(LazyOpsTest, TestSlice) { + torch::Tensor a = + torch::rand({32, 24, 16}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::slice(a, 1, 0, 16, 1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::slice(lazy_a, 1, 0, 16, 1); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestTake) { + torch::Tensor a = torch::rand( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::randint( + 16, {5}, 
torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor c = torch::take(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::take(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestTakeBackward) { + auto testfn = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::take(inputs[0], inputs[1]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({4, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)), + torch::randint( + 16, {5}, + torch::TensorOptions(torch::kLong).device(DefaultDevice()))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestStack) { + torch::Tensor a = torch::rand( + {2, 4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {2, 4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = a.dim() + 1; + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor d = torch::stack({a, b, c}, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::stack({lazy_a, lazy_b, lazy_c}, dim); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestCat) { + torch::Tensor a = torch::rand( + {2, 1, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {2, 3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int dim : {1, -2}) { + torch::Tensor d = torch::cat({a, b, c}, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::cat({lazy_a, lazy_b, lazy_c}, dim); + EXPECT_TRUE(d.sizes() == lazy_d.sizes() && d.dtype() == lazy_d.dtype()); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestUnbind) { + torch::Tensor input = torch::rand( + {4, 3, 7}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + std::vector<torch::Tensor> output = torch::unbind(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + std::vector<torch::Tensor> lazy_output = torch::unbind(lazy_input, dim); + ASSERT_EQ(output.size(), lazy_output.size()); + for (size_t i = 0; i < output.size(); ++i) { + AllClose(output[i], lazy_output[i]); + } + }); + } +} + +TEST_F(LazyOpsTest, TestRepeat) { + std::vector<std::vector<int64_t>> repeats_list = {{4, 2}, {4, 2, 3}}; + std::vector<std::vector<int64_t>> input_size_list = {{3}, {2, 4}}; + for (const auto& repeats : repeats_list) { + for (const auto& input_size : input_size_list) { + torch::Tensor input = torch::rand( + input_size, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = input.repeat(repeats); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = lazy_input.repeat(repeats); + 
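// The eager `output` computed above is the reference; the lazy repeat result is checked against it on every device for each (repeats, input_size) combination. + 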
AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestGather) { + torch::Tensor a = torch::rand( + {3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::empty( + {3, 3}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int i = 0; i < 3; i++) { + for (int j = 0; j < 3; j++) { + b[i][j] = (i + j) % 3; + } + } + for (bool sparse_grad : {false, true}) { + torch::Tensor c = torch::gather(a, 1, b, sparse_grad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::gather(lazy_a, 1, lazy_b, sparse_grad); + AllClose(c, lazy_c); + }); + } +} + +TEST_F(LazyOpsTest, TestScatter) { + torch::Tensor a = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 5}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int dim = 0; dim < 2; ++dim) { + for (int i = 0; i < 3; i++) { + for (int j = 0; j < 5; j++) { + c[i][j] = (i + j) % c.sizes()[dim]; + } + } + torch::Tensor d = torch::scatter(a, dim, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, dim, lazy_c, lazy_b); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestScatterR1) { + torch::Tensor a = torch::rand( + {5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + c[0] = 1; + c[1] = 3; + torch::Tensor d = torch::scatter(a, 0, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, 0, lazy_c, lazy_b); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestScatterR3) { + torch::Tensor a = torch::rand( + {3, 5, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 4, 2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int i = 0; i < 3; i++) { + for (int j = 0; j < 4; j++) { + for (int k = 0; k < 2; k++) { + c[i][j][k] = (i + j + k) % 4; + } + } + } + torch::Tensor d = torch::scatter(a, 1, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, 1, lazy_c, lazy_b); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestScatterBiggerSource) { + torch::Tensor a = torch::rand( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {8, 8}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {4, 4}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for 
(int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + c[i][j] = (i + j) % 4; + } + } + for (int dim = 0; dim < 2; ++dim) { + torch::Tensor d = torch::scatter(a, dim, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, dim, lazy_c, lazy_b); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestScatterScalar) { + torch::Tensor a = torch::rand( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar b = 1.0f; + torch::Tensor c = torch::empty( + {4, 4}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + c[i][j] = (i + j) % 4; + } + } + for (int dim = 0; dim < 2; ++dim) { + torch::Tensor d = torch::scatter(a, dim, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, dim, lazy_c, b); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestScatterReduceAdd) { + torch::Tensor a = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 5}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int dim = 0; dim < 2; ++dim) { + for (int i = 0; i < 3; i++) { + for (int j = 0; j < 5; j++) { + c[i][j] = (i + j) % c.sizes()[dim]; + } + } + torch::Tensor d = torch::scatter(a, dim, c, b, "add"); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter(lazy_a, dim, lazy_c, lazy_b, "add"); + AllClose(d, lazy_d); + }); + } + + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::scatter_out", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestScatterAdd) { + torch::Tensor a = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 5}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int dim = 0; dim < 2; ++dim) { + for (int i = 0; i < 3; i++) { + for (int j = 0; j < 5; j++) { + c[i][j] = (i + j) % c.sizes()[dim]; + } + } + torch::Tensor d = torch::scatter_add(a, dim, c, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::scatter_add(lazy_a, dim, lazy_c, lazy_b); + AllClose(d, lazy_d); + }); + } +} + +TEST_F(LazyOpsTest, TestScatterAddInPlace) { + torch::Tensor b = torch::rand( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {4, 4}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + c[i][j] = (i + j) % 4; + } + } + for (int dim = 0; dim < 2; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = 
torch::rand( + {4, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor d = a.scatter_add_(dim, c, b); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = lazy_a.scatter_add_(dim, lazy_c, lazy_b); + AllClose(d, lazy_d); + AllClose(a, lazy_a); + }); + } +} + +TEST_F(LazyOpsTest, TestIndexSelect) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (torch::ScalarType index_scalar_type : {torch::kInt, torch::kLong}) { + torch::Tensor b = torch::empty( + {2}, torch::TensorOptions(index_scalar_type).device(DefaultDevice())); + b[0] = 0; + b[1] = 2; + for (auto offset : {-2, 0}) { + torch::Tensor c0 = torch::index_select(a, 0 + offset, b); + torch::Tensor c1 = torch::index_select(a, 1 + offset, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c0 = torch::index_select(lazy_a, 0 + offset, lazy_b); + torch::Tensor lazy_c1 = torch::index_select(lazy_a, 1 + offset, lazy_b); + AllEqual(c0, lazy_c0); + AllEqual(c1, lazy_c1); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestIndexSelectRank0) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor a = + isFloatingType(scalar_type) + ? 
torch::rand( + {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor b = torch::scalar_tensor( + 2, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor c0 = torch::index_select(a, 0, b); + torch::Tensor c1 = torch::index_select(a, 1, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c0 = torch::index_select(lazy_a, 0, lazy_b); + torch::Tensor lazy_c1 = torch::index_select(lazy_a, 1, lazy_b); + AllEqual(c0, lazy_c0); + AllEqual(c1, lazy_c1); + }); + } +} + +TEST_F(LazyOpsTest, TestInverse) { + if (IsCuda()) { + // TODO(whc) debug failure on cuda, lazy_b comes back transposed + GTEST_SKIP(); + } + torch::Tensor a = torch::randn( + {5, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::inverse(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::inverse(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestIsnan) { + torch::Tensor a = torch::tensor( + {1.0, 2.0, std::nan("1"), 4.0}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::isnan(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::isnan(lazy_a); + AllEqual(b, lazy_b); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::isnan", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestExpand) { + torch::Tensor a = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.expand({2, 3, 4}, /*implicit=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = lazy_a.expand({2, 3, 4}, /*implicit=*/false); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestExpandBack) { + torch::Tensor a = torch::rand( + {3, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = a.expand({3, 4}, /*implicit=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = lazy_a.expand({3, 4}, /*implicit=*/false); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestExpandAs) { + torch::Tensor a = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::native::expand_as(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::native::expand_as(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestEye) { + int n = 5; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor out = torch::eye( + n, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_out = + torch::eye(n, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(out, lazy_out); + }); +} + +TEST_F(LazyOpsTest, TestEyeWide) { + int lines = 3; + int cols = 5; + ForEachDevice([&](const torch::Device& 
device) { + torch::Tensor out = + torch::eye(lines, cols, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_out = torch::eye( + lines, cols, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(out, lazy_out); + }); +} + +TEST_F(LazyOpsTest, TestEyeNarrow) { + int lines = 5; + int cols = 3; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor out = + torch::eye(lines, cols, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_out = torch::eye( + lines, cols, torch::TensorOptions(torch::kFloat).device(device)); + AllClose(out, lazy_out); + }); +} + +TEST_F(LazyOpsTest, TestBroadcastTensors) { + torch::Tensor a = torch::rand( + {2, 1, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector<torch::Tensor> c = torch::broadcast_tensors({a, b}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + std::vector<torch::Tensor> lazy_c = torch::broadcast_tensors({lazy_a, lazy_b}); + ASSERT_EQ(c.size(), lazy_c.size()); + for (size_t i = 0; i < c.size(); ++i) { + AllClose(c[i], lazy_c[i]); + } + }); +} + +TEST_F(LazyOpsTest, TestOneIndex) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = torch::index(params, {indices}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_result = torch::index(lazy_params, {lazy_indices}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestOneIndexTransfer) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = torch::index(params, {indices}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_result = torch::index(lazy_params, {indices}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestNonzero) { + torch::Tensor a = torch::zeros( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + a[0][1] = 1.0; + a[1][0] = 2.0; + a[3][1] = 3.0; + torch::Tensor b = torch::nonzero(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::nonzero(lazy_a); + AllClose(b, lazy_b); + + if (DebugUtil::ExperimentEnabled("nonzero")) { + // If the nonzero support is enabled, we must not see any aten:: calls. + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + } + ResetCounters(); + }); +} + +TEST_F(LazyOpsTest, TestMaskedSelect) { + torch::Tensor a = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::randint( + 0, 2, {5}, torch::TensorOptions(torch::kBool).device(DefaultDevice())); + torch::Tensor c = torch::masked_select(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::masked_select(lazy_a, lazy_b); + AllClose(c, lazy_c); + + if (DebugUtil::ExperimentEnabled("masked_select")) { + // If the masked_select support is enabled, we must not see any aten:: + // calls. + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + } + ResetCounters(); + }); +} + +TEST_F(LazyOpsTest, TestMaskedScatter) { + torch::Tensor a = torch::rand( + {3, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::randint( + 0, 2, {3, 5}, torch::TensorOptions(torch::kBool).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {15}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor d = torch::masked_scatter(a, b, c); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::masked_scatter(lazy_a, lazy_b, lazy_c); + AllClose(d, lazy_d); + + if (DebugUtil::ExperimentEnabled("masked_scatter")) { + // If the masked_select support is enabled, we must not see any aten:: + // calls. + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + } + ResetCounters(); + }); +} + +TEST_F(LazyOpsTest, TestMultiIndexHeadNull) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices_null; + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = + torch::index(params, {indices_null, indices_0, indices_1}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_result = torch::index( + lazy_params, {indices_null, lazy_indices_0, lazy_indices_1}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestMultiIndexMiddleNull) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_null; + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = + torch::index(params, {indices_0, indices_null, indices_1}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_result = torch::index( + lazy_params, {lazy_indices_0, indices_null, lazy_indices_1}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestMultiIndexTailNull) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_null; + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = + torch::index(params, {indices_0, indices_1, indices_null}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_result = torch::index( + lazy_params, {lazy_indices_0, lazy_indices_1, indices_null}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestMultiIndexMiddleBroadcast) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 1, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = torch::index(params, {indices_0, indices_1}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_result = + torch::index(lazy_params, {lazy_indices_0, lazy_indices_1}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestMultiIndexTailBroadcast) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 1, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 1}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor result = torch::index(params, {indices_0, indices_1}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_result = + torch::index(lazy_params, {lazy_indices_0, lazy_indices_1}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestMaskIndex) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {2, 2}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {2, 2}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices = torch::randint( + 0, 2, {2, 2}, + torch::TensorOptions(torch::kBool).device(DefaultDevice())); + torch::Tensor result = torch::index(params, {indices}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_result = torch::index(lazy_params, {lazy_indices}); + AllEqual(result, lazy_result); + }); + } +} + +TEST_F(LazyOpsTest, TestOneIndexPut) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor values = + isFloatingType(scalar_type) + ? torch::rand( + {3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = + torch::index_put(params, {indices}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = + torch::index_put(lazy_params, {lazy_indices}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestOneIndexPutInPlace) { + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor values = + torch::ones({3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + ForEachDevice([&](const torch::Device& device) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_params = CopyToDevice(params.clone(), device); + torch::Tensor result = + torch::index_put_(params, {indices}, values, accumulate); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put_(lazy_params, {lazy_indices}, + lazy_values, accumulate); + AllEqual(result, lazy_result); + AllEqual(params, lazy_params); + }); + } + } +} + +TEST_F(LazyOpsTest, TestOneIndexPutTransfer) { + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = + torch::ones({3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = + torch::index_put(params, {indices}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = + torch::index_put(lazy_params, {indices}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPut) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {5, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = + torch::index_put(params, {indices_0, indices_1}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {lazy_indices_0, lazy_indices_1}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPutHeadNull) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_null; + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {3, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = torch::index_put( + params, {indices_null, indices_0, indices_1}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {indices_null, lazy_indices_0, lazy_indices_1}, + lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPutMiddleNull) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_null; + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {3, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = torch::index_put( + params, {indices_0, indices_null, indices_1}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {lazy_indices_0, indices_null, lazy_indices_1}, + lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPutTailNull) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_null; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 3, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {3, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = torch::index_put( + params, {indices_0, indices_1, indices_null}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {lazy_indices_0, lazy_indices_1, indices_null}, + lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPutMiddleBroadcast) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 1, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {5, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = + torch::index_put(params, {indices_0, indices_1}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {lazy_indices_0, lazy_indices_1}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMultiIndexPutTailBroadcast) { + torch::Tensor indices_0 = torch::randint( + -3, 3, {2, 1, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor indices_1 = torch::randint( + -3, 3, {2, 1}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {5, 6, 7}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + torch::Tensor result = + torch::index_put(params, {indices_0, indices_1}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices_0 = CopyToDevice(indices_0, device); + torch::Tensor lazy_indices_1 = CopyToDevice(indices_1, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::index_put( + lazy_params, {lazy_indices_0, lazy_indices_1}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestMaskIndexPut) { + torch::Tensor indices = + torch::tensor({0, 1}, + torch::TensorOptions(torch::kByte).device(DefaultDevice())) + .to(torch::kBool); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor params = + isFloatingType(scalar_type) + ? 
torch::rand( + {2, 2}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {2, 2}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor values = torch::ones( + {2}, torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + torch::Tensor result = + torch::index_put(params, {indices}, values, accumulate); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_params = CopyToDevice(params, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = + torch::index_put(lazy_params, {lazy_indices}, lazy_values, accumulate); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexPutImpl) { + torch::Tensor indices = torch::randint( + -3, 3, {2, 4, 3}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor values = + torch::ones({3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + for (bool accumulate : {false, true}) { + if (accumulate && IsCuda()) { + GTEST_SKIP(); + } + ForEachDevice([&](const torch::Device& device) { + torch::Tensor params = + isFloatingType(scalar_type) + ? torch::rand( + {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {4, 3, 5, 6, 7}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_params = CopyToDevice(params.clone(), device); + torch::Tensor result = torch::_index_put_impl_( + params, {indices}, values, accumulate, /*unsafe=*/true); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + torch::Tensor lazy_values = CopyToDevice(values, device); + torch::Tensor lazy_result = torch::_index_put_impl_( + lazy_params, {lazy_indices}, lazy_values, accumulate, /*unsafe=*/true); + AllEqual(result, lazy_result); + AllEqual(params, lazy_params); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexFillWithScalar) { + torch::Tensor index = torch::tensor( + {0, 2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Scalar value = 42; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? 
torch::rand( + {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::index_fill(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_result = + torch::index_fill(lazy_base, dim, lazy_index, value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexFillWithScalarInPlace) { + torch::Tensor index = torch::tensor( + {0, 2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Scalar value = 42; + int rank = 3; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {3, 4, 5}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.index_fill_(dim, index, value); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_result = lazy_base.index_fill_(dim, lazy_index, value); + AllEqual(result, lazy_result); + AllEqual(base, lazy_base); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexFillWithTensor) { + torch::Tensor index = torch::tensor( + {0, 2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor value = torch::scalar_tensor( + 42, torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::index_fill(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_fill(lazy_base, dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexFillWithTensorInPlace) { + torch::Tensor index = torch::tensor( + {0, 2}, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor value = torch::scalar_tensor( + 42, torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = 3; + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor base = + isFloatingType(scalar_type) + ? 
torch::rand( + {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {3, 4, 5}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.index_fill_(dim, index, value); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + lazy_base.index_fill_(dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + AllEqual(base, lazy_base); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexFillRank0) { + torch::Tensor index = torch::scalar_tensor( + 2, torch::TensorOptions(torch::kLong).device(DefaultDevice())); + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {3, 4, 5}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor value = torch::scalar_tensor( + 42, torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor result = torch::index_fill(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_fill(lazy_base, dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexAdd) { + int index_size = 10; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + for (torch::ScalarType index_scalar_type : {torch::kInt, torch::kLong}) { + torch::Tensor index = torch::randint( + 0, base.size(dim), {index_size}, + torch::TensorOptions(index_scalar_type).device(DefaultDevice())); + std::vector value_sizes(base.sizes().begin(), + base.sizes().end()); + int canonical_dim = dim < 0 ? dim + rank : dim; + value_sizes[canonical_dim] = index_size; + torch::Tensor value = + isFloatingType(scalar_type) + ? 
torch::rand( + value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, value_sizes, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor result = torch::index_add(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_add(lazy_base, dim, lazy_index, lazy_value); + AllClose(result, lazy_result); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestIndexAddInPlace) { + int index_size = 10; + int rank = 3; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {5, 3, 7}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor index = torch::randint( + 0, base.size(dim), {index_size}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + std::vector value_sizes(base.sizes().begin(), + base.sizes().end()); + int canonical_dim = dim < 0 ? dim + rank : dim; + value_sizes[canonical_dim] = index_size; + torch::Tensor value = + isFloatingType(scalar_type) + ? torch::rand( + value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, value_sizes, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.index_add_(dim, index, value); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + lazy_base.index_add_(dim, lazy_index, lazy_value); + AllClose(result, lazy_result); + AllClose(base, lazy_base); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexAddRank0) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor index = torch::randint( + 0, base.size(dim), at::IntArrayRef{}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + std::vector value_sizes(base.sizes().begin(), + base.sizes().end()); + int canonical_dim = dim < 0 ? dim + rank : dim; + value_sizes[canonical_dim] = 1; + torch::Tensor value = + isFloatingType(scalar_type) + ? 
torch::rand( + value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor result = torch::index_add(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_add(lazy_base, dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexCopy) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor index = torch::randperm( + base.size(dim), + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor value = + isFloatingType(scalar_type) + ? torch::rand( + base.sizes(), + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, base.sizes(), + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor result = torch::index_copy(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_copy(lazy_base, dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexCopyInPlace) { + if (IsCuda()) { + GTEST_SKIP(); + } + int index_size = 10; + int rank = 3; + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, {5, 3, 7}, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor index = torch::randint( + 0, base.size(dim), {index_size}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + std::vector value_sizes(base.sizes().begin(), + base.sizes().end()); + int canonical_dim = dim < 0 ? dim + rank : dim; + value_sizes[canonical_dim] = index_size; + torch::Tensor value = + isFloatingType(scalar_type) + ? 
torch::rand( + value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint(100, value_sizes, + torch::TensorOptions(scalar_type) + .device(DefaultDevice())); + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.index_copy_(dim, index, value); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + lazy_base.index_copy_(dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + AllEqual(base, lazy_base); + }); + } + } +} + +TEST_F(LazyOpsTest, TestIndexCopyRank0) { + for (torch::ScalarType scalar_type : + {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt, + torch::kLong}) { + torch::Tensor base = + isFloatingType(scalar_type) + ? torch::rand( + {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, {5, 3, 7}, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + int rank = base.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor index = torch::randint( + 0, base.size(dim), at::IntArrayRef{}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + std::vector value_sizes(base.sizes().begin(), + base.sizes().end()); + int canonical_dim = dim < 0 ? dim + rank : dim; + value_sizes[canonical_dim] = 1; + torch::Tensor value = + isFloatingType(scalar_type) + ? torch::rand( + value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())) + : torch::randint( + 100, value_sizes, + torch::TensorOptions(scalar_type).device(DefaultDevice())); + torch::Tensor result = torch::index_copy(base, dim, index, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_index = CopyToDevice(index, device); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = + torch::index_copy(lazy_base, dim, lazy_index, lazy_value); + AllEqual(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestRelu) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::relu(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::relu(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReluInPlace) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = torch::relu_(input); + torch::Tensor lazy_output = torch::relu_(lazy_input); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestHardshrink) { + torch::Tensor input = torch::randn( + {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::hardshrink(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::hardshrink(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestHardSigmoid) { + torch::Tensor input = torch::randn( + {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::hardsigmoid(input); + ForEachDevice([&](const 
torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::hardsigmoid(lazy_input);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestHardSigmoidInPlace) {
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor input = torch::randn(
        {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = torch::hardsigmoid_(input);
    torch::Tensor lazy_output = torch::hardsigmoid_(lazy_input);
    AllClose(input, lazy_input);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestHardSigmoidBackward) {
  auto testfn = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor {
    return torch::hardsigmoid(inputs[0]);
  };
  ForEachDevice([&](const torch::Device& device) {
    TestBackward({torch::randn({10}, torch::TensorOptions(torch::kFloat)
                                         .device(DefaultDevice())
                                         .requires_grad(true))},
                 device, testfn);
  });
}

TEST_F(LazyOpsTest, TestSoftshrink) {
  torch::Tensor input = torch::randn(
      {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output = torch::softshrink(input);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::softshrink(lazy_input);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestHardtanh) {
  torch::Tensor input = torch::randn(
      {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output = torch::hardtanh(input);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::hardtanh(lazy_input);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestHardtanhInPlace) {
  torch::Tensor input = torch::randn(
      {10}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = torch::hardtanh_(input);
    torch::Tensor lazy_output = torch::hardtanh_(lazy_input);
    AllClose(output, lazy_output);
    AllClose(input, lazy_input);
  });
}

TEST_F(LazyOpsTest, TestLeakyRelu) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  double negative_slope = 0.01;
  torch::Tensor output = torch::leaky_relu(input, negative_slope);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::leaky_relu(lazy_input, negative_slope);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestLeakyReluInPlace) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  double negative_slope = 0.01;
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = torch::leaky_relu_(input, negative_slope);
    torch::Tensor lazy_output = torch::leaky_relu_(lazy_input, negative_slope);
    AllClose(output, lazy_output);
    AllClose(input, lazy_input);
  });
}

TEST_F(LazyOpsTest, TestExp) {
  torch::Tensor a = torch::rand(
      {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor b = torch::exp(a);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_a = CopyToDevice(a, device);
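    // The unary math tests that follow (exp, expm1, log, log2, log10, log1p,
    // erf, erfc, erfinv, sqrt, rsqrt, reciprocal) share this pattern: build an
    // eager reference, replay the op on the lazy tensor, and compare with
    // AllClose at rtol=1e-3 / atol=1e-5.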
torch::Tensor lazy_b = torch::exp(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestExpm1) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::expm1(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::expm1(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLog) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::log(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::log(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLog2) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::log2(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::log2(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLog10) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::log10(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::log10(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLog1p) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::log1p(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::log1p(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestErf) { + torch::Tensor a = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::erf(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::erf(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestErfc) { + torch::Tensor a = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::erfc(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::erfc(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestErfinv) { + torch::Tensor a = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::erfinv(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::erfinv(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestSqrt) { + torch::Tensor a = torch::abs(torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor b = torch::sqrt(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::sqrt(lazy_a); + AllClose(b, lazy_b, 
/*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestRsqrt) { + torch::Tensor a = torch::abs(torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor b = torch::rsqrt(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::rsqrt(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestReciprocal) { + torch::Tensor a = torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::reciprocal(a); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::reciprocal(lazy_a); + AllClose(b, lazy_b, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowTensorScalar) { + torch::Tensor base = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar exponent = 4.09; + torch::Tensor result = torch::pow(base, exponent); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_result = torch::pow(lazy_base, exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowTensorScalarInPlace) { + torch::Tensor base = torch::rand( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar exponent = 4.09; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.pow_(exponent); + torch::Tensor lazy_result = lazy_base.pow_(exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(base, lazy_base, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowTensorTensor) { + torch::Tensor base = torch::abs(torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor exponent = torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor result = torch::pow(base, exponent); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_exponent = CopyToDevice(exponent, device); + torch::Tensor lazy_result = torch::pow(lazy_base, lazy_exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowTensorTensorInPlace) { + torch::Tensor base = torch::abs(torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor exponent = torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base.clone(), device); + torch::Tensor result = base.pow_(exponent); + torch::Tensor lazy_exponent = CopyToDevice(exponent, device); + torch::Tensor lazy_result = lazy_base.pow_(lazy_exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + AllClose(base, lazy_base, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowTensorTensorBroadcast) { + torch::Tensor base = torch::abs(torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Tensor exponent = torch::rand( + {4, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor result = torch::pow(base, exponent); + 
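  // The {4, 1} exponent broadcasts across the {4, 2} base, so this checks that
  // the lazy backend applies the same broadcasting rules as eager pow.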
ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_exponent = CopyToDevice(exponent, device); + torch::Tensor lazy_result = torch::pow(lazy_base, lazy_exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowScalarTensor) { + torch::Scalar base = 3.5; + torch::Tensor exponent = torch::rand({4, 2}); + torch::Tensor result = torch::pow(base, exponent); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_exponent = CopyToDevice(exponent, device); + torch::Tensor lazy_result = torch::pow(base, lazy_exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestPowIntExponent) { + torch::Tensor base = torch::abs(torch::rand( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()))); + torch::Scalar exponent = 3; + torch::Tensor result = torch::pow(base, exponent); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_base = CopyToDevice(base, device); + torch::Tensor lazy_result = torch::pow(lazy_base, exponent); + AllClose(result, lazy_result, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestFmodScalar) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Scalar divisor = 2.0; + torch::Tensor b = torch::fmod(a, divisor); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = torch::fmod(lazy_a, divisor); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestFmodScalarInPlace) { + torch::Scalar divisor = 2.0; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = + torch::rand( + {2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = a.fmod_(divisor); + torch::Tensor lazy_b = lazy_a.fmod_(divisor); + AllClose(b, lazy_b); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestFmodTensor) { + torch::Tensor a = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 10.0; + torch::Tensor c = torch::fmod(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::fmod(lazy_a, lazy_b); + AllClose(c, lazy_c); + }); +} + +TEST_F(LazyOpsTest, TestFmodTensorInPlace) { + torch::Tensor b = + torch::rand({2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 10.0; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = + torch::rand( + {2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.fmod_(b); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.fmod_(lazy_b); + AllClose(c, lazy_c); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestRemainderScalar) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Scalar divisor = -2.0; + torch::Tensor b = torch::remainder(a, divisor); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + 
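    // remainder follows the sign of the divisor (-2.0 here), unlike fmod
    // above, which follows the dividend; this exercises negative results.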
torch::Tensor lazy_b = torch::remainder(lazy_a, divisor); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestRemainderScalarInPlace) { + torch::Scalar divisor = -2.0; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = + torch::randn( + {2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor b = a.remainder_(divisor); + torch::Tensor lazy_b = lazy_a.remainder_(divisor); + AllClose(b, lazy_b); + AllClose(a, lazy_a); + }); +} + +TEST_F(LazyOpsTest, TestRemainderTensor) { + torch::Tensor a = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor b = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 10.0; + torch::Tensor c = torch::remainder(a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = torch::remainder(lazy_a, lazy_b); + AllClose(c, lazy_c, /*rtol=*/1e-4, /*atol=*/1e-6); + }); +} + +TEST_F(LazyOpsTest, TestRemainderTensorInPlace) { + torch::Tensor b = + torch::randn( + {2, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 10.0; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor a = + torch::randn( + {2, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())) * + 100.0; + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor c = a.remainder_(b); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.remainder_(lazy_b); + AllClose(c, lazy_c, /*rtol=*/1e-4, /*atol=*/1e-6); + AllClose(a, lazy_a, /*rtol=*/1e-4, /*atol=*/1e-6); + }); +} + +TEST_F(LazyOpsTest, TestWhere) { + torch::Tensor a = torch::rand( + {3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 3}, torch::TensorOptions(torch::kByte).device(DefaultDevice())); + for (int i = 0; i < 3; ++i) { + for (int j = 0; j < 3; ++j) { + c[i][j] = i == j; + } + } + torch::Tensor d = torch::where(c, a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::where(lazy_c, lazy_a, lazy_b); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestWhereBroadcast) { + torch::Tensor a = torch::rand( + {3, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::zeros( + {}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::empty( + {3, 3}, torch::TensorOptions(torch::kByte).device(DefaultDevice())); + for (int i = 0; i < 3; ++i) { + for (int j = 0; j < 3; ++j) { + c[i][j] = i == j; + } + } + torch::Tensor d = torch::where(c, a, b); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = torch::where(lazy_c, lazy_a, lazy_b); + AllClose(d, lazy_d); + }); +} + +TEST_F(LazyOpsTest, TestThreshold) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + float threshold = 0.4; + float value 
= 20;
  torch::Tensor output = torch::threshold(input, threshold, value);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::threshold(lazy_input, threshold, value);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestThresholdBackward) {
  float threshold = 0.4;
  float value = 20;

  auto testFunction = [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor {
    return torch::threshold(inputs[0], threshold, value);
  };

  ForEachDevice([&](const torch::Device& device) {
    TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat)
                                                .device(DefaultDevice())
                                                .requires_grad(true))},
                 device, testFunction);
  });
}

TEST_F(LazyOpsTest, TestThresholdInPlace) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output = input.clone();
  float threshold = 0.4;
  float value = 20;
  torch::threshold_(output, threshold, value);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_output = CopyToDevice(input, device);
    torch::threshold_(lazy_output, threshold, value);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestElu) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Scalar alpha = 0.5;
  torch::Scalar scale = 2.5;
  torch::Scalar input_scale = 1.5;
  torch::Tensor output = torch::elu(input, alpha, scale, input_scale);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::elu(lazy_input, alpha, scale, input_scale);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestEluInPlace) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Scalar alpha = 0.5;
  torch::Scalar scale = 2.5;
  torch::Scalar input_scale = 1.5;
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = torch::elu_(input, alpha, scale, input_scale);
    torch::Tensor lazy_output =
        torch::elu_(lazy_input, alpha, scale, input_scale);
    AllClose(output, lazy_output);
    AllClose(input, lazy_input);
  });
}

TEST_F(LazyOpsTest, TestSelu) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output = torch::selu(input);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::selu(lazy_input);
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestSeluInPlace) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = torch::selu_(input);
    torch::Tensor lazy_output = torch::selu_(lazy_input);
    AllClose(output, lazy_output);
    AllClose(input, lazy_input);
  });
}

TEST_F(LazyOpsTest, TestCelu) {
  torch::Tensor input =
      torch::rand({2, 1, 4, 6},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Scalar alpha = 2.5;
  torch::Tensor output = torch::celu(input, alpha);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input =
CopyToDevice(input, device); + torch::Tensor lazy_output = torch::celu(lazy_input, alpha); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestCeluInPlace) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar alpha = 2.5; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = torch::celu_(input, alpha); + torch::Tensor lazy_output = torch::celu_(lazy_input, alpha); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestGelu) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::gelu(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::gelu(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestAddMatMul) { + int in_channels = 32; + int out_channels = 320; + int labels = 50; + torch::Tensor input = + torch::rand({in_channels, out_channels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({out_channels, labels}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = torch::rand( + {labels}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test beta != 1. through the CPU interop. + for (double beta : {1., 2.}) { + torch::Tensor output = torch::addmm(bias, input, weight, /*beta=*/beta); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_bias = CopyToDevice(bias, device); + torch::Tensor lazy_output = + torch::addmm(lazy_bias, lazy_input, lazy_weight, /*beta=*/beta); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestEmbedding) { + torch::Tensor a = torch::rand( + {32, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor i = torch::randint( + 0, 31, {3, 4}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor b = + torch::embedding(a, i, /*padding_idx=*/0, /*scale_grad_by_freq=*/false, + /*sparse=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_i = CopyToDevice(i, device); + torch::Tensor lazy_b = torch::embedding(lazy_a, lazy_i, /*padding_idx=*/0, + /*scale_grad_by_freq=*/false, + /*sparse=*/false); + AllClose(b, lazy_b); + }); +} + +TEST_F(LazyOpsTest, TestOneHot) { + int num_classes = 5; + torch::Tensor input = torch::randint( + 0, num_classes, {10}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor output = torch::one_hot(input, num_classes); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::one_hot(lazy_input, num_classes); + AllEqual(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTranspose) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::t(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::t(lazy_input); + AllClose(output, 
lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTransposeInPlace) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.t_(); + torch::Tensor lazy_output = lazy_input.t_(); + EXPECT_EQ(lazy_output.sizes(), output.sizes()); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestReshape) { + torch::Tensor input = + torch::rand({32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::reshape(input, {-1, 320}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::reshape(lazy_input, {-1, 320}); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestResize) { + // Testing a resize_() with target size bigger than original size is not + // possible, as we fill with zeros, while pytorch fills with random garbage. + torch::Tensor input = torch::rand( + {2, 2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor saved_input = input.clone(); + input.resize_({3, 3}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(saved_input, device); + lazy_input.resize_({3, 3}); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestViewResize) { + torch::Tensor input = torch::zeros( + {8, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor saved_input = input.clone(); + torch::Tensor output = input.view({4, 4}); + output.resize_({3, 3}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(saved_input, device); + torch::Tensor lazy_output = lazy_input.view({4, 4}); + lazy_output.resize_({3, 3}); + AllClose(input, lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestView) { + torch::Tensor input = + torch::rand({32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = input.view({-1, 320}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = lazy_input.view({-1, 320}); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestViewMod) { + torch::Tensor input = + torch::zeros({32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor one = torch::tensor( + 1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = input.view({-1, 320}); + output.add_(one, 1.0); + input.add_(one, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor xinput = torch::zeros( + {32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(xinput, device); + torch::Tensor lazy_one = CopyToDevice(one, device); + torch::Tensor lazy_output = lazy_input.view({-1, 320}); + lazy_output.add_(lazy_one, 1.0); + lazy_input.add_(lazy_one, 1.0); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestViewModComplex) { + torch::Tensor input = + torch::zeros({32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor one = torch::tensor( + 1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); 
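  // Both views below alias the same storage as `input`; an in-place add
  // through either view must also be observable through the base tensor and
  // the other view, on the lazy device as well as in eager mode.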
  torch::Tensor output1 = input.view({-1, 320});
  output1.add_(one, 1.0);
  torch::Tensor output2 = input.view({-1, 160});
  output2.add_(one, 1.0);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor xinput = torch::zeros(
        {32, 20, 4, 4},
        torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
    torch::Tensor lazy_input = CopyToDevice(xinput, device);
    torch::Tensor lazy_one = CopyToDevice(one, device);
    torch::Tensor lazy_output1 = lazy_input.view({-1, 320});
    lazy_output1.add_(lazy_one, 1.0);
    torch::Tensor lazy_output2 = lazy_input.view({-1, 160});
    lazy_output2.add_(lazy_one, 1.0);
    AllClose(output1, lazy_output1);
    AllClose(output2, lazy_output2);
  });
}

TEST_F(LazyOpsTest, TestViewOfViewMod) {
  torch::Tensor input =
      torch::zeros({32, 20, 4, 4},
                   torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor one = torch::tensor(
      1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output1 = input.view({-1, 320});
  output1.add_(one, 1.0);
  torch::Tensor output2 = output1.view({-1, 160});
  output2.add_(one, 1.0);
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor xinput = torch::zeros(
        {32, 20, 4, 4},
        torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
    torch::Tensor lazy_input = CopyToDevice(xinput, device);
    torch::Tensor lazy_one = CopyToDevice(one, device);
    torch::Tensor lazy_output1 = lazy_input.view({-1, 320});
    lazy_output1.add_(lazy_one, 1.0);
    torch::Tensor lazy_output2 = lazy_output1.view({-1, 160});
    lazy_output2.add_(lazy_one, 1.0);
    AllClose(output1, lazy_output1);
    AllClose(output2, lazy_output2);
  });
}

TEST_F(LazyOpsTest, TestViewSqueezeAddInPlace) {
  torch::Tensor input = torch::zeros(
      {2, 3, 1}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  std::vector<int64_t> view_size = {2, 3, 1, 1};
  int squeeze_dim = 2;
  torch::Tensor one = torch::tensor(
      1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor output = input.view(view_size);
    output.squeeze_(squeeze_dim);
    output.add_(one, 1.0);
    torch::Tensor lazy_one = CopyToDevice(one, device);
    torch::Tensor lazy_output = lazy_input.view(view_size);
    lazy_output.squeeze_(squeeze_dim);
    lazy_output.add_(lazy_one, 1.0);
    AllClose(output, lazy_output);
    AllClose(input, lazy_input);
  });
}

TEST_F(LazyOpsTest, TestUnsafeView) {
  torch::Tensor input =
      torch::rand({32, 20, 4, 4},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  torch::Tensor output = torch::_unsafe_view(input, {-1, 320});
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor lazy_input = CopyToDevice(input, device);
    torch::Tensor lazy_output = torch::_unsafe_view(lazy_input, {-1, 320});
    AllClose(output, lazy_output);
  });
}

TEST_F(LazyOpsTest, TestNarrow) {
  torch::Tensor a =
      torch::rand({8, 10, 4, 4},
                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
  for (int64_t dim : {1, -3}) {
    for (int64_t start : {2, -8}) {
      torch::Tensor b = a.narrow(dim, start, 6);
      ForEachDevice([&](const torch::Device& device) {
        torch::Tensor lazy_a = CopyToDevice(a, device);
        torch::Tensor lazy_b = lazy_a.narrow(dim, start, 6);
        AllClose(b, lazy_b);
      });
    }
  }
}

TEST_F(LazyOpsTest, TestNarrowUpdate) {
  for (int64_t dim : {1, -2}) {
    for (int64_t start : {2, -6}) {
      torch::Tensor a = torch::rand(
          {3, 8, 3},
torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = torch::rand( + {3, 4, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = a.narrow(dim, start, 4); + c.add_(b, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.narrow(dim, start, 4); + lazy_c.add_(lazy_b, 1.0); + AllClose(c, lazy_c); + }); + } + } +} + +TEST_F(LazyOpsTest, TestNarrowUpdateBaseCheck) { + for (int64_t dim : {0, -2}) { + for (int64_t start : {2, -6}) { + torch::Tensor a = torch::zeros( + {8, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = torch::ones( + {4, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = a.narrow(dim, start, 4); + c.add_(b, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.narrow(dim, start, 4); + lazy_c.add_(lazy_b, 1.0); + AllClose(a, lazy_a); + }); + } + } +} + +TEST_F(LazyOpsTest, TestNarrowUpdateTwoSlices) { + for (int64_t dim : {0, -2}) { + for (int64_t start0 : {2, -6}) { + for (int64_t start1 : {6, -2}) { + torch::Tensor a = torch::zeros( + {8, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = torch::ones( + {2, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = b + 1; + torch::Tensor d = a.narrow(dim, start0, 2); + torch::Tensor e = a.narrow(dim, start1, 2); + d.add_(b, 1.0); + e.add_(c, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + torch::Tensor lazy_d = lazy_a.narrow(dim, start0, 2); + torch::Tensor lazy_e = lazy_a.narrow(dim, start1, 2); + lazy_d.add_(lazy_b, 1.0); + lazy_e.add_(lazy_c, 1.0); + AllClose(d, lazy_d); + AllClose(e, lazy_e); + AllClose(a, lazy_a); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestNarrowUpdateView) { + for (int64_t dim : {0, -3}) { + for (int64_t start : {2, -6}) { + torch::Tensor a = torch::rand( + {8, 2, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = torch::rand( + {4, 6}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = a.narrow(dim, start, 4); + torch::Tensor d = c.view({4, 6}); + d.add_(b, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.narrow(dim, start, 4); + torch::Tensor lazy_d = lazy_c.view({4, 6}); + lazy_d.add_(lazy_b, 1.0); + AllClose(d, lazy_d); + }); + } + } +} + +TEST_F(LazyOpsTest, TestNarrowInNarrowUpdate) { + for (int64_t dim : {1, -2}) { + for (int64_t start0 : {1, -7}) { + for (int64_t start1 : {1, -5}) { + torch::Tensor a = torch::rand( + {3, 8, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor a_copy = a.clone(); + torch::Tensor b = torch::rand( + {3, 2, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = a.narrow(dim, start0, 6); + torch::Tensor d = 
c.narrow(dim, start1, 2); + d.add_(b, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a_copy, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = lazy_a.narrow(dim, start0, 6); + torch::Tensor lazy_d = lazy_c.narrow(dim, start1, 2); + lazy_d.add_(lazy_b, 1.0); + AllClose(a, lazy_a); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestNarrowCopy) { + for (int64_t dim : {1, -3}) { + for (int64_t start : {2, -8}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {8, 10, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = input.narrow_copy(dim, start, 6); + input.add_(1); + torch::Tensor lazy_result = lazy_input.narrow_copy(dim, start, 6); + lazy_input.add_(1); + AllClose(result, lazy_result); + }); + } + } +} + +TEST_F(LazyOpsTest, TestViewAs) { + torch::Tensor input = + torch::rand({32, 20, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor empty = torch::empty({32, 320}); + torch::Tensor output = input.view_as(empty); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_empty = CopyToDevice(empty, device); + torch::Tensor lazy_output = lazy_input.view_as(lazy_empty); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestLogSoftmax) { + torch::Tensor input = + torch::rand({5, 3, 4, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::log_softmax(input, dim); + torch::Tensor lazy_output = torch::log_softmax(lazy_input, dim); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestLogSoftmaxCast) { + torch::Tensor input = + torch::rand({5, 3, 4, 2}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::log_softmax(input, dim, torch::kDouble); + torch::Tensor lazy_output = + torch::log_softmax(lazy_input, dim, torch::kDouble); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestLogSoftmaxWrapper) { + torch::Tensor input = + torch::rand({10, 2, 6, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = + torch::_log_softmax(input, dim, /*half_to_float=*/false); + torch::Tensor lazy_output = + torch::_log_softmax(lazy_input, dim, /*half_to_float=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestSoftmax) { + torch::Tensor input = + torch::rand({10, 2, 6, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::softmax(input, dim); + torch::Tensor lazy_output = 
torch::softmax(lazy_input, dim); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestSoftmaxCast) { + torch::Tensor input = + torch::rand({10, 2, 6, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::softmax(input, dim, torch::kDouble); + torch::Tensor lazy_output = torch::softmax(lazy_input, dim, torch::kDouble); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestSoftmaxWrapper) { + torch::Tensor input = + torch::rand({10, 2, 6, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = + torch::_softmax(input, dim, /*half_to_float=*/false); + torch::Tensor lazy_output = + torch::_softmax(lazy_input, dim, /*half_to_float=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3); + } + }); +} + +TEST_F(LazyOpsTest, TestSoftplus) { + torch::Tensor input = + torch::rand({2, 1, 4, 6}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::softplus(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::softplus(lazy_input); + AllClose(output, lazy_output, /*rtol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestMaxPool1D) { + torch::Tensor input = torch::rand( + {1, 16, 56}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = + torch::max_pool1d(input, /*kernel_size=*/{kernel_size}, + /*stride=*/{stride}, + /*padding=*/{padding}, /*dilation=*/{dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::max_pool1d(lazy_input, + /*kernel_size=*/{kernel_size}, + /*stride=*/{stride}, + /*padding=*/{padding}, + /*dilation=*/{dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2D) { + torch::Tensor input = + torch::rand({1, 4, 14, 14}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. 
+ for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::max_pool2d(lazy_input, + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2DWithIndices) { + torch::Tensor input = + torch::rand({1, 4, 14, 14}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + auto outputs = torch::max_pool2d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + auto lazy_outputs = torch::max_pool2d_with_indices( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(std::get<0>(outputs), std::get<0>(lazy_outputs)); + AllClose(std::get<1>(outputs), std::get<1>(lazy_outputs)); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2DNonSquare) { + torch::Tensor input = + torch::rand({1, 4, 14, 14}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 4; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size + 1}, + /*stride=*/{stride, stride + 1}, + /*padding=*/{padding, padding + 1}, + /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max_pool2d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size + 1}, + /*stride=*/{stride, stride + 1}, + /*padding=*/{padding, padding + 1}, + /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3D) { + torch::Tensor input = + torch::rand({1, 1, 8, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. 
+ for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DWithIndices) { + torch::Tensor input = + torch::rand({1, 1, 8, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + auto outputs = torch::max_pool3d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + auto lazy_outputs = torch::max_pool3d_with_indices( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + + AllClose(std::get<0>(outputs), std::get<0>(lazy_outputs)); + AllClose(std::get<1>(outputs), std::get<1>(lazy_outputs)); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DIncompleteAttributes) { + torch::Tensor input = + torch::rand({1, 1, 8, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{}, + /*padding=*/{padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{}, + /*padding=*/{padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DNonSquare) { + torch::Tensor input = + torch::rand({1, 1, 8, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 4; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool3d( + input, + /*kernel_size=*/{kernel_size, kernel_size + 1, kernel_size}, + /*stride=*/{stride, stride + 1, stride}, + /*padding=*/{padding, padding + 1, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size + 1, kernel_size}, + /*stride=*/{stride, stride + 1, stride}, + /*padding=*/{padding, padding + 1, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2DNoBatch) { + torch::Tensor input = torch::rand( + {4, 14, 14}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::max_pool2d(lazy_input, + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DNoBatch) { + torch::Tensor input = + torch::rand({1, 8, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output = torch::max_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::max_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool1D) { + torch::Tensor input = torch::rand( + {4, 1, 28}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + torch::Tensor output = + torch::avg_pool1d(input, /*kernel_size=*/{kernel_size}, + /*stride=*/{stride}, + /*padding=*/{padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::avg_pool1d(lazy_input, + /*kernel_size=*/{kernel_size}, + /*stride=*/{stride}, + /*padding=*/{padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool2D) { + torch::Tensor input = + torch::rand({2, 1, 14, 14}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + // torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::avg_pool2d(lazy_input, + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output.to(torch::kCPU)); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool2DNonSquare) { + torch::Tensor input = + torch::rand({2, 1, 14, 14}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 4; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size + 1}, + /*stride=*/{stride, stride + 1}, + /*padding=*/{padding, padding + 1}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::avg_pool2d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size + 1}, + /*stride=*/{stride, stride + 1}, + /*padding=*/{padding, padding + 1}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3D) { + torch::Tensor input = + torch::rand({1, 1, 7, 7, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::avg_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3DIncompleteAttributes) { + torch::Tensor input = + torch::rand({1, 1, 7, 7, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{}, + /*padding=*/{padding, padding, padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::avg_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{}, + /*padding=*/{padding, padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3DNonSquare) { + torch::Tensor input = + torch::rand({1, 1, 7, 7, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 4; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool3d( + input, + /*kernel_size=*/{kernel_size, kernel_size + 1, kernel_size}, + /*stride=*/{stride, stride + 1, stride}, + /*padding=*/{padding, padding + 1, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::avg_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size + 1, kernel_size}, + /*stride=*/{stride, stride + 1, stride}, + /*padding=*/{padding, padding + 1, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool2DNoBatch) { + torch::Tensor input = torch::rand( + {1, 7, 7}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool2d( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::avg_pool2d(lazy_input, + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3DNoBatch) { + torch::Tensor input = + torch::rand({1, 7, 7, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + torch::Tensor output = torch::avg_pool3d( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::avg_pool3d( + lazy_input, + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool2D) { + torch::Tensor input = + torch::rand({4, 1, 28, 28}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int64_t output_size : {7, 4}) { + torch::Tensor output = + torch::adaptive_avg_pool2d(input, {output_size, output_size}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::adaptive_avg_pool2d(lazy_input, {output_size, output_size}); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool3D) { + torch::Tensor input = + torch::rand({9, 4, 56, 28, 28}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int64_t output_size : {7, 4}) { + torch::Tensor output = torch::adaptive_avg_pool3d( + input, {output_size, output_size, output_size}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::adaptive_avg_pool3d( + lazy_input, {output_size, output_size, output_size}); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool3DNoBatch) { + torch::Tensor input = + torch::rand({3, 56, 28, 28}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int64_t output_size : {7, 4}) { + torch::Tensor output = torch::adaptive_avg_pool3d( + input, {output_size, output_size, output_size}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::adaptive_avg_pool3d( + lazy_input, {output_size, output_size, output_size}); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, 
TestAdaptiveAvgPool2DNoBatch) { + torch::Tensor input = torch::rand( + {1, 56, 56}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int64_t output_size : {7, 8}) { + torch::Tensor output = + torch::adaptive_avg_pool2d(input, {output_size, output_size}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::adaptive_avg_pool2d(lazy_input, {output_size, output_size}); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestMaxUnpool2D) { + int kernel_size = 2; + torch::Tensor input = + torch::rand({2, 2, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output; + torch::Tensor indices; + std::tie(output, indices) = torch::max_pool2d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + + std::vector<int64_t> output_size({input.size(2), input.size(3)}); + at::Tensor utensor = + torch::max_unpool2d(output, indices, output_size); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_output = CopyToDevice(output, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + at::Tensor lazy_utensor = + torch::max_unpool2d(lazy_output, lazy_indices, output_size); + AllClose(utensor, lazy_utensor); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxUnpool3D) { + int kernel_size = 2; + torch::Tensor input = + torch::rand({1, 1, 4, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + // Test dilation through the CPU interop. + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output; + torch::Tensor indices; + std::tie(output, indices) = torch::max_pool3d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + + std::vector<int64_t> output_size( + {input.size(2), input.size(3), input.size(4)}); + at::Tensor utensor = torch::max_unpool3d( + output, indices, output_size, /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_output = CopyToDevice(output, device); + torch::Tensor lazy_indices = CopyToDevice(indices, device); + at::Tensor lazy_utensor = + torch::max_unpool3d(lazy_output, lazy_indices, output_size, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}); + AllClose(utensor, lazy_utensor); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestNllLoss) { + + // TODO(whc) debug divide-by-zero failure under ASAN + GTEST_SKIP(); + + int batch = 6; + int classes = 2; + // TODO(asuhan): Fix the torch::kDouble case.
+ for (auto dtype : {torch::kFloat}) { + for (int ignore_index : {-1, 0, 1, 5}) { + for (bool def_weight : {false, true}) { + torch::Tensor input = + torch::rand({batch, classes}, + torch::TensorOptions(dtype).device(DefaultDevice())); + torch::Tensor target = torch::randint( + std::min(ignore_index, 0), classes, {batch}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor weight; + if (def_weight) { + weight = torch::rand( + {classes}, torch::TensorOptions(dtype).device(DefaultDevice())); + } + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + torch::Tensor output = + torch::nll_loss(/*self=*/input, /*target=*/target, + /*weight=*/weight, + /*reduction=*/reduction, + /*ignore_index=*/ignore_index); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_weight = + def_weight ? CopyToDevice(weight, device) : torch::Tensor(); + torch::Tensor lazy_output = torch::nll_loss( + /*self=*/lazy_input, /*target=*/lazy_target, + /*weight=*/lazy_weight, + /*reduction=*/reduction, /*ignore_index=*/ignore_index); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestNllLoss2d) { + int batch = 6; + int classes = 2; + int height = 3; + int width = 3; + // TODO(asuhan): Fix the torch::kDouble case. + for (auto dtype : {torch::kFloat}) { + for (int ignore_index : {-1, 0, 1, 5}) { + for (bool def_weight : {false, true}) { + torch::Tensor input = + torch::rand({batch, classes, height, width}, + torch::TensorOptions(dtype).device(DefaultDevice())); + torch::Tensor target = torch::randint( + std::min(ignore_index, 0), classes, {batch, height, width}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor weight; + if (def_weight) { + weight = torch::rand( + {classes}, torch::TensorOptions(dtype).device(DefaultDevice())); + } + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + torch::Tensor output = + torch::nll_loss2d(/*self=*/input, /*target=*/target, + /*weight=*/weight, + /*reduction=*/reduction, + /*ignore_index=*/ignore_index); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_weight = + def_weight ? 
CopyToDevice(weight, device) : torch::Tensor(); + torch::Tensor lazy_output = torch::nll_loss2d( + /*self=*/lazy_input, /*target=*/lazy_target, + /*weight=*/lazy_weight, + /*reduction=*/reduction, /*ignore_index=*/ignore_index); + AllClose(output, lazy_output); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestSmoothL1Loss) { + torch::Tensor input = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + for (double beta : {0.25, 1.}) { + torch::Tensor output = + torch::smooth_l1_loss(input, target, reduction, beta); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = + torch::smooth_l1_loss(lazy_input, lazy_target, reduction, beta); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestL1Loss) { + torch::Tensor input = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + torch::Tensor output = torch::l1_loss(input, target, reduction); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = + torch::l1_loss(lazy_input, lazy_target, reduction); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestL1LossBackward) { + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + auto testfn = + [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::l1_loss(inputs[0], inputs[1], reduction); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestMseLoss) { + torch::Tensor input = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor target = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + torch::Tensor output = torch::mse_loss(input, target, reduction); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_target = CopyToDevice(target, device); + torch::Tensor lazy_output = + torch::mse_loss(lazy_input, lazy_target, reduction); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestMseLossBackward) { + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + auto testfn = + [&](const std::vector<torch::Tensor>& inputs) -> torch::Tensor { + return torch::mse_loss(inputs[0], inputs[1], reduction); + }; + ForEachDevice([&](const torch::Device&
device) { + TestBackward({torch::rand({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestBatchNorm1D) { + int num_features = 3; + torch::Tensor input = + torch::rand({2, num_features, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = + torch::rand({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_mean = + torch::zeros({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_var = + torch::ones({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double momentum = 0.1; + double eps = 0.5; + torch::Tensor undef; + for (bool training : {true, false}) { + for (bool undef_weight_bias : {false, true}) { + torch::Tensor output = torch::batch_norm( + /*input=*/input, /*weight=*/undef_weight_bias ? undef : weight, + /*bias=*/undef_weight_bias ? undef : bias, + /*running_mean=*/running_mean, /*running_var=*/running_var, + /*training=*/training, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = + undef_weight_bias ? undef : CopyToDevice(weight, device); + torch::Tensor lazy_bias = + undef_weight_bias ? undef : CopyToDevice(bias, device); + torch::Tensor lazy_running_mean = CopyToDevice(running_mean, device); + torch::Tensor lazy_running_var = CopyToDevice(running_var, device); + torch::Tensor lazy_output = torch::batch_norm( + /*input=*/lazy_input, /*weight=*/lazy_weight, /*bias=*/lazy_bias, + /*running_mean=*/lazy_running_mean, /*running_var=*/lazy_running_var, + /*training=*/training, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3, /*atol=*/1e-5); + }); + } + } +} + +TEST_F(LazyOpsTest, TestBatchNorm2D) { + int num_features = 3; + torch::Tensor input = + torch::rand({2, num_features, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = + torch::rand({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor bias = + torch::rand({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_mean = + torch::zeros({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_var = + torch::ones({num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + double momentum = 0.1; + double eps = 0.5; + torch::Tensor undef; + for (bool training : {true, false}) { + for (bool undef_weight_bias : {false, true}) { + torch::Tensor output = torch::batch_norm( + /*input=*/input, /*weight=*/undef_weight_bias ? undef : weight, + /*bias=*/undef_weight_bias ? undef : bias, + /*running_mean=*/running_mean, /*running_var=*/running_var, + /*training=*/training, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = + undef_weight_bias ? 
undef : CopyToDevice(weight, device); + torch::Tensor lazy_bias = + undef_weight_bias ? undef : CopyToDevice(bias, device); + torch::Tensor lazy_running_mean = CopyToDevice(running_mean, device); + torch::Tensor lazy_running_var = CopyToDevice(running_var, device); + torch::Tensor lazy_output = torch::batch_norm( + /*input=*/lazy_input, /*weight=*/lazy_weight, /*bias=*/lazy_bias, + /*running_mean=*/lazy_running_mean, /*running_var=*/lazy_running_var, + /*training=*/training, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + AllClose(output, lazy_output, /*rtol=*/1e-3, /*atol=*/1e-5); + }); + } + } +} + +TEST_F(LazyOpsTest, TestDim) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + EXPECT_EQ(input.dim(), lazy_input.dim()); + }); +} + +TEST_F(LazyOpsTest, TestContiguous) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::native::contiguous(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::native::contiguous(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestSqueezeAll) { + torch::Tensor input = + torch::rand({2, 1, 3, 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::squeeze(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::squeeze(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestSqueezeAllInPlace) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {2, 1, 3, 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.squeeze_(); + torch::Tensor lazy_output = lazy_input.squeeze_(); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + ASSERT_EQ(input.dim(), lazy_input.dim()); + for (int64_t dim_idx = 0; dim_idx < input.dim(); ++dim_idx) { + ASSERT_EQ(input.size(dim_idx), lazy_input.size(dim_idx)); + } + }); +} + +TEST_F(LazyOpsTest, TestSqueezeOne) { + torch::Tensor input = + torch::rand({2, 1, 3, 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::squeeze(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::squeeze(lazy_input, dim); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestSqueezeOneInPlace) { + int rank = 4; + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {2, 1, 3, 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.squeeze_(dim); + torch::Tensor lazy_output = lazy_input.squeeze_(dim); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + ASSERT_EQ(input.dim(), lazy_input.dim()); + for (int64_t dim_idx = 0; dim_idx < input.dim(); ++dim_idx) { + ASSERT_EQ(input.size(dim_idx), lazy_input.size(dim_idx)); 
+ } + }); + } +} + +TEST_F(LazyOpsTest, TestUnsqueeze) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim() + 1; + for (int dim = -rank; dim < rank; ++dim) { + torch::Tensor output = torch::unsqueeze(input, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::unsqueeze(lazy_input, dim); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestUnsqueezeInPlace) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim() + 1; + for (int dim = -rank; dim < rank; ++dim) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.unsqueeze_(dim); + torch::Tensor lazy_output = lazy_input.unsqueeze_(dim); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + ASSERT_EQ(input.dim(), lazy_input.dim()); + for (int64_t dim_idx = 0; dim_idx < input.dim(); ++dim_idx) { + ASSERT_EQ(input.size(dim_idx), lazy_input.size(dim_idx)); + } + }); + } +} + +TEST_F(LazyOpsTest, TestMaskedFill) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor mask = torch::randint( + 0, 2, {2, 3}, torch::TensorOptions(torch::kBool).device(DefaultDevice())); + torch::Scalar value(42); + torch::Tensor result = torch::masked_fill(input, mask, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_mask = CopyToDevice(mask, device); + torch::Tensor lazy_result = torch::masked_fill(lazy_input, lazy_mask, value); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestMaskedFillInPlace) { + torch::Scalar value(42); + torch::Tensor mask = torch::randint( + 0, 2, {2, 3}, torch::TensorOptions(torch::kBool).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_mask = CopyToDevice(mask, device); + torch::Tensor result = input.masked_fill_(mask, value); + torch::Tensor lazy_result = lazy_input.masked_fill_(lazy_mask, value); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestMaskedFillBroadcast) { + torch::Tensor input = + torch::rand({2, 5, 4, 3}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor mask = torch::randint( + 0, 2, {4, 1}, torch::TensorOptions(torch::kBool).device(DefaultDevice())); + torch::Scalar value(42); + torch::Tensor result = torch::masked_fill(input, mask, value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_mask = CopyToDevice(mask, device); + torch::Tensor lazy_result = torch::masked_fill(lazy_input, lazy_mask, value); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestFill) { + torch::Scalar value(42); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::empty( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = torch::fill_(input, value); + torch::Tensor lazy_result = 
torch::fill_(lazy_input, value); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestFillWithRank0) { + torch::Tensor value = torch::scalar_tensor(42); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::empty( + {2, 3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = torch::fill_(input, value); + torch::Tensor lazy_value = CopyToDevice(value, device); + torch::Tensor lazy_result = torch::fill_(lazy_input, value); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestPermute) { + torch::Tensor input = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector<std::vector<int64_t>> dims_permutations = { + {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}}; + int rank = input.dim(); + for (std::vector<int64_t> dims_permutation : dims_permutations) { + for (bool negative_dims : {false, true}) { + if (negative_dims) { + std::for_each(dims_permutation.begin(), dims_permutation.end(), + [rank](int64_t& dim) { dim -= rank; }); + } + torch::Tensor output = input.permute(dims_permutation); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = lazy_input.permute(dims_permutation); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestPermuteMod) { + std::vector<std::vector<int64_t>> dims_permutations = { + {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}}; + std::vector<int64_t> input_sizes = {2, 3, 4}; + int rank = input_sizes.size(); + for (std::vector<int64_t> dims_permutation : dims_permutations) { + for (bool negative_dims : {false, true}) { + if (negative_dims) { + std::for_each(dims_permutation.begin(), dims_permutation.end(), + [rank](int64_t& dim) { dim -= rank; }); + } + torch::Tensor input = torch::zeros( + input_sizes, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor one = torch::tensor( + 1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = input.permute(dims_permutation); + output.add_(one, 1.0); + input.add_(one, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor xinput = torch::zeros( + input_sizes, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(xinput, device); + torch::Tensor lazy_one = CopyToDevice(one, device); + torch::Tensor lazy_output = lazy_input.permute(dims_permutation); + lazy_output.add_(lazy_one, 1.0); + lazy_input.add_(lazy_one, 1.0); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); + } + } +} + +TEST_F(LazyOpsTest, TestFlip) { + torch::Tensor input = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector<std::vector<int64_t>> dim_powerset = { + {0}, {1}, {2}, {0, 1}, {1, 2}, {2, 0}, {0, 1, 2}}; + for (std::vector<int64_t> flip_dims : dim_powerset) { + for (bool negative_dims : {false, true}) { + if (negative_dims) { + std::for_each(flip_dims.begin(), flip_dims.end(), + [](int64_t& dim) { dim -= 3; }); + } + torch::Tensor output = torch::flip(input, flip_dims); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::flip(lazy_input, flip_dims); + AllClose(output, lazy_output); + }); + } + } +} + +TEST_F(LazyOpsTest, TestPixelShuffle) { + torch::Tensor input = +
torch::rand({5, 18, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int upscale_factor = 3; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = torch::pixel_shuffle(input, upscale_factor); + torch::Tensor lazy_output = torch::pixel_shuffle(lazy_input, upscale_factor); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestSumToSize) { + torch::Tensor input = + torch::rand({4, 6, 3, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector<int64_t> out_size = {4, 1, 1, 7}; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.sum_to_size(out_size); + torch::Tensor lazy_output = lazy_input.sum_to_size(out_size); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTransposeDims) { + torch::Tensor input = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int dim0 = 0; + int dim1 = 2; + torch::Tensor output = torch::transpose(input, dim0, dim1); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::transpose(lazy_input, dim0, dim1); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTransposeDimsMod) { + std::vector<int64_t> input_sizes = {2, 3, 4}; + int dim0 = 0; + int dim1 = 2; + torch::Tensor input = torch::zeros( + input_sizes, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor one = torch::tensor( + 1.0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::transpose(input, dim0, dim1); + output.add_(one, 1.0); + input.add_(one, 1.0); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor xinput = torch::zeros( + input_sizes, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(xinput, device); + torch::Tensor lazy_one = CopyToDevice(one, device); + torch::Tensor lazy_output = torch::transpose(lazy_input, dim0, dim1); + lazy_output.add_(lazy_one, 1.0); + lazy_input.add_(lazy_one, 1.0); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestTransposeDimsInPlace) { + torch::Tensor input = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int dim0 = 0; + int dim1 = 2; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.transpose_(dim0, dim1); + torch::Tensor lazy_output = lazy_input.transpose_(dim0, dim1); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestSplit) { + torch::Tensor input = torch::rand( + {7, 8, 9}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + int rank = input.dim(); + for (int split_size : {2, 3}) { + for (int dim = -rank; dim < rank; ++dim) { + std::vector<torch::Tensor> outputs = torch::split(input, split_size, dim); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + std::vector<torch::Tensor> lazy_outputs = + torch::split(lazy_input, split_size, dim); + ASSERT_EQ(outputs.size(), lazy_outputs.size()); + for (size_t i = 0; i < outputs.size(); ++i) { + AllClose(outputs[i], lazy_outputs[i]); + } + }); + } + } +} + +TEST_F(LazyOpsTest, TestSplitEmpty) { + torch::Tensor input = torch::rand( + {0},
torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  int split_size = 0;
+  int dim = 0;
+  std::vector<torch::Tensor> outputs = torch::split(input, split_size, dim);
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_input = CopyToDevice(input, device);
+    std::vector<torch::Tensor> lazy_outputs =
+        torch::split(lazy_input, split_size, dim);
+    ASSERT_EQ(outputs.size(), lazy_outputs.size());
+    for (size_t i = 0; i < outputs.size(); ++i) {
+      AllClose(outputs[i], lazy_outputs[i]);
+    }
+  });
+}
+
+TEST_F(LazyOpsTest, TestSplitWithSizes) {
+  torch::Tensor input =
+      torch::rand({15, 15, 15},
+                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  int rank = input.dim();
+  for (int dim = -rank; dim < rank; ++dim) {
+    std::vector<torch::Tensor> outputs =
+        torch::split_with_sizes(input, {4, 5, 6}, dim);
+    ForEachDevice([&](const torch::Device& device) {
+      torch::Tensor lazy_input = CopyToDevice(input, device);
+      std::vector<torch::Tensor> lazy_outputs =
+          torch::split_with_sizes(lazy_input, {4, 5, 6}, dim);
+      ASSERT_EQ(outputs.size(), lazy_outputs.size());
+      for (size_t i = 0; i < outputs.size(); ++i) {
+        AllClose(outputs[i], lazy_outputs[i]);
+      }
+    });
+  }
+}
+
+TEST_F(LazyOpsTest, TestCrossImplicitDim) {
+  std::vector<std::vector<int64_t>> dim_sizes = {
+      {4, 5, 3}, {4, 3, 5}, {3, 4, 5}};
+  for (auto dim_size : dim_sizes) {
+    torch::Tensor input = torch::rand(
+        dim_size, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+    torch::Tensor other = torch::rand(
+        dim_size, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+    torch::Tensor result = torch::cross(input, other);
+    ForEachDevice([&](const torch::Device& device) {
+      torch::Tensor lazy_input = CopyToDevice(input, device);
+      torch::Tensor lazy_other = CopyToDevice(other, device);
+      torch::Tensor lazy_result = torch::cross(lazy_input, lazy_other);
+      AllClose(result, lazy_result);
+    });
+  }
+}
+
+TEST_F(LazyOpsTest, TestCrossExplicitDim) {
+  std::vector<int64_t> dim_size = {3, 3};
+  torch::Tensor input = torch::rand(
+      dim_size, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  torch::Tensor other = torch::rand(
+      dim_size, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  int rank = dim_size.size();
+  for (int dim = -rank; dim < rank; ++dim) {
+    torch::Tensor result = torch::cross(input, other, dim);
+    ForEachDevice([&](const torch::Device& device) {
+      torch::Tensor lazy_input = CopyToDevice(input, device);
+      torch::Tensor lazy_other = CopyToDevice(other, device);
+      torch::Tensor lazy_result = torch::cross(lazy_input, lazy_other, dim);
+      AllClose(result, lazy_result);
+    });
+  }
+}
+
+TEST_F(LazyOpsTest, TestCrossZeroDim) {
+  torch::Tensor input =
+      torch::rand({0, 1, 3, 0},
+                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  torch::Tensor result = torch::cross(input, input);
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_input = CopyToDevice(input, device);
+    torch::Tensor lazy_result = torch::cross(lazy_input, lazy_input);
+    AllClose(result, lazy_result);
+  });
+}
+
+TEST_F(LazyOpsTest, TestTriu) {
+  int size = 5;
+  torch::Tensor input =
+      torch::rand({size, size},
+                  torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  // Test all diagonals and out of bounds (must be no-op).
+ for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::triu(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::triu(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTriuNonSquare) { + int size = 5; + torch::Tensor input = + torch::rand({size, size + 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::triu(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::triu(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTriuBatch) { + int size = 5; + int batch_size = 3; + torch::Tensor input = + torch::rand({batch_size, size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::triu(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::triu(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTril) { + int size = 5; + torch::Tensor input = + torch::rand({size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::tril(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::tril(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTrilNonSquare) { + int size = 5; + torch::Tensor input = + torch::rand({size, size + 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::tril(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::tril(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTrilBatch) { + int size = 5; + int batch_size = 3; + torch::Tensor input = + torch::rand({batch_size, size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::tril(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::tril(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestTriuInPlace) { + int size = 5; + // Test all diagonals and out of bounds (must be no-op). 
+ for (int diagonal = -size; diagonal <= size; ++diagonal) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.triu_(diagonal); + torch::Tensor lazy_output = lazy_input.triu_(diagonal); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); + } +} + +TEST_F(LazyOpsTest, TestTrilInPlace) { + int size = 5; + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = input.tril_(diagonal); + torch::Tensor lazy_output = lazy_input.tril_(diagonal); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); + } +} + +TEST_F(LazyOpsTest, TestTrace) { + int n = 5; + torch::Tensor input = torch::rand( + {n, n}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::trace(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::trace(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTraceWide) { + int lines = 3; + int cols = 5; + torch::Tensor input = + torch::rand({lines, cols}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::trace(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::trace(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestTraceNarrow) { + int lines = 5; + int cols = 3; + torch::Tensor input = + torch::rand({lines, cols}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor output = torch::trace(input); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::trace(lazy_input); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestDiagRank1) { + int size = 7; + torch::Tensor input = torch::rand( + {size}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -2 * size; diagonal <= 2 * size; ++diagonal) { + torch::Tensor output = torch::diag(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::diag(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagRank2) { + int size = 7; + torch::Tensor input = + torch::rand({size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). 
+ for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::diag(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::diag(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagFlat) { + torch::Tensor input = + torch::rand({4, 3, 6, 7}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int diagonal = -10; diagonal < 10; ++diagonal) { + torch::Tensor output = torch::diagflat(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::diagflat(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagonal) { + int size = 5; + torch::Tensor input = + torch::rand({size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::diagonal(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::diagonal(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagonalUpdate) { + int size = 5; + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + auto input = torch::rand({size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + auto input_clone = input.clone(); + auto output = torch::diagonal(input, diagonal); + output.add_(1); + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_clone, device); + torch::Tensor lazy_output = torch::diagonal(lazy_input, diagonal); + lazy_output.add_(1); + + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagonalNonSquare) { + int size = 5; + torch::Tensor input = + torch::rand({size, size + 1}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). + for (int diagonal = -size; diagonal <= size; ++diagonal) { + torch::Tensor output = torch::diagonal(input, diagonal); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::diagonal(lazy_input, diagonal); + AllClose(output, lazy_output); + }); + } +} + +TEST_F(LazyOpsTest, TestDiagonalBatch) { + int size = 5; + int batch_size = 3; + int dim1 = 1; + int dim2 = 2; + torch::Tensor input = + torch::rand({batch_size, size, size}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + // Test all diagonals and out of bounds (must be no-op). 
+  for (int diagonal = -size; diagonal <= size; ++diagonal) {
+    torch::Tensor output =
+        torch::diagonal(input, diagonal, /*dim1=*/dim1, /*dim2=*/dim2);
+    ForEachDevice([&](const torch::Device& device) {
+      torch::Tensor lazy_input = CopyToDevice(input, device);
+      torch::Tensor lazy_output =
+          torch::diagonal(lazy_input, diagonal, /*dim1=*/dim1, /*dim2=*/dim2);
+      AllClose(output, lazy_output);
+    });
+  }
+}
+
+TEST_F(LazyOpsTest, TestFlatten) {
+  torch::Tensor input = torch::rand({4, 7, 5, 3});
+  int rank = input.dim();
+  for (int pos_start_dim = 0; pos_start_dim < rank; ++pos_start_dim) {
+    for (int pos_end_dim = pos_start_dim; pos_end_dim < rank; ++pos_end_dim) {
+      for (bool negative_start_dim : {false, true}) {
+        for (bool negative_end_dim : {false, true}) {
+          int start_dim =
+              negative_start_dim ? pos_start_dim - rank : pos_start_dim;
+          int end_dim = negative_end_dim ? pos_end_dim - rank : pos_end_dim;
+          torch::Tensor output = torch::flatten(input, start_dim, end_dim);
+          ForEachDevice([&](const torch::Device& device) {
+            torch::Tensor lazy_input = CopyToDevice(input, device);
+            torch::Tensor lazy_output =
+                torch::flatten(lazy_input, start_dim, end_dim);
+            AllClose(output, lazy_output);
+          });
+        }
+      }
+    }
+  }
+}
+
+TEST_F(LazyOpsTest, TestLogicalAnd) {
+  for (torch::ScalarType scalar_type1 :
+       {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt,
+        torch::kLong}) {
+    torch::Tensor lhs =
+        isFloatingType(scalar_type1)
+            ? torch::rand({3, 4}, torch::TensorOptions(scalar_type1))
+            : torch::randint(0, 100, {3, 4},
+                             torch::TensorOptions(scalar_type1));
+    for (torch::ScalarType scalar_type2 :
+         {torch::kFloat, torch::kByte, torch::kChar, torch::kShort, torch::kInt,
+          torch::kLong}) {
+      torch::Tensor rhs =
+          isFloatingType(scalar_type2)
+              ? torch::rand({3, 4}, torch::TensorOptions(scalar_type2))
+              : torch::randint(1, 100, {3, 4},
+                               torch::TensorOptions(scalar_type2));
+      torch::Tensor result = torch::logical_and(lhs, rhs);
+      ForEachDevice([&](const torch::Device& device) {
+        torch::Tensor lazy_lhs = CopyToDevice(lhs, device);
+        torch::Tensor lazy_rhs = CopyToDevice(rhs, device);
+        torch::Tensor lazy_result = torch::logical_and(lazy_lhs, lazy_rhs);
+        AllEqual(result, lazy_result);
+      });
+    }
+  }
+
+  ExpectCounterNotChanged("aten::.*", GetIgnoredCounters());
+  ExpectCounterChanged("xla::logical_and_out", GetIgnoredCounters());
+}
+
+TEST_F(LazyOpsTest, TestBitwiseAnd) {
+  torch::Tensor lhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Tensor rhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Tensor result = lhs.__and__(rhs);
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_lhs = CopyToDevice(lhs, device);
+    torch::Tensor lazy_rhs = CopyToDevice(rhs, device);
+    torch::Tensor lazy_result = lazy_lhs.__and__(lazy_rhs);
+    AllEqual(result, lazy_result);
+  });
+}
+
+TEST_F(LazyOpsTest, TestBitwiseAndInPlace) {
+  torch::Tensor lhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Tensor rhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_lhs = CopyToDevice(lhs, device);
+    torch::Tensor result = lhs.__iand__(rhs);
+    torch::Tensor lazy_rhs = CopyToDevice(rhs, device);
+    torch::Tensor lazy_result = lazy_lhs.__iand__(lazy_rhs);
+    AllEqual(result, lazy_result);
+    AllEqual(lhs, lazy_lhs);
+  });
+}
+
+TEST_F(LazyOpsTest, TestBitwiseAndScalar) {
+  torch::Tensor lhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Scalar rhs(123456789);
+  torch::Tensor result = lhs.__and__(rhs);
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_lhs = CopyToDevice(lhs, device);
+    torch::Tensor lazy_result = lazy_lhs.__and__(rhs);
+    AllEqual(result, lazy_result);
+  });
+}
+
+TEST_F(LazyOpsTest, TestBitwiseAndScalarInPlace) {
+  torch::Tensor lhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Scalar rhs(123456789);
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_lhs = CopyToDevice(lhs, device);
+    torch::Tensor result = lhs.__iand__(rhs);
+    torch::Tensor lazy_result = lazy_lhs.__iand__(rhs);
+    AllEqual(result, lazy_result);
+    AllEqual(lhs, lazy_lhs);
+  });
+}
+
+TEST_F(LazyOpsTest, TestBitwiseAndPromotion) {
+  torch::Tensor input = torch::rand(
+      {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice()));
+  torch::Tensor view = input.reshape(-1);
+  torch::Tensor result = torch::__and__(view.gt(0), view.ne(0));
+  ForEachDevice([&](const torch::Device& device) {
+    torch::Tensor lazy_input = CopyToDevice(input, device);
+    torch::Tensor lazy_view = lazy_input.reshape(-1);
+    torch::Tensor lazy_result = torch::__and__(lazy_view.gt(0), lazy_view.ne(0));
+    AllEqual(result, lazy_result);
+  });
+}
+
+TEST_F(LazyOpsTest, TestBitwiseOr) {
+  torch::Tensor lhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Tensor rhs = torch::randint(0, std::numeric_limits<int32_t>::max(),
+                                     {4, 2}, torch::TensorOptions(torch::kInt));
+  torch::Tensor result = lhs.__or__(rhs);
+
ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor lazy_rhs = CopyToDevice(rhs, device); + torch::Tensor lazy_result = lazy_lhs.__or__(lazy_rhs); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseOrInPlace) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Tensor rhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor result = lhs.__ior__(rhs); + torch::Tensor lazy_rhs = CopyToDevice(rhs, device); + torch::Tensor lazy_result = lazy_lhs.__ior__(lazy_rhs); + AllEqual(result, lazy_result); + AllEqual(lhs, lazy_lhs); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseOrScalar) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Scalar rhs(123456789); + torch::Tensor result = lhs.__or__(rhs); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor lazy_result = lazy_lhs.__or__(rhs); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseOrScalarInPlace) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Scalar rhs(123456789); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor result = lhs.__ior__(rhs); + torch::Tensor lazy_result = lazy_lhs.__ior__(rhs); + AllEqual(result, lazy_result); + AllEqual(lhs, lazy_lhs); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseXor) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Tensor rhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Tensor result = lhs.__xor__(rhs); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor lazy_rhs = CopyToDevice(rhs, device); + torch::Tensor lazy_result = lazy_lhs.__xor__(lazy_rhs); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseXorInPlace) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Tensor rhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor result = lhs.__ixor__(rhs); + torch::Tensor lazy_rhs = CopyToDevice(rhs, device); + torch::Tensor lazy_result = lazy_lhs.__ixor__(lazy_rhs); + AllEqual(result, lazy_result); + AllEqual(lhs, lazy_lhs); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseXorScalar) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + torch::Scalar rhs(123456789); + torch::Tensor result = lhs.__xor__(rhs); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor lazy_result = lazy_lhs.__xor__(rhs); + AllEqual(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestBitwiseXorScalarInPlace) { + torch::Tensor lhs = torch::randint(0, std::numeric_limits::max(), + {4, 2}, torch::TensorOptions(torch::kInt)); + 
torch::Scalar rhs(123456789); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_lhs = CopyToDevice(lhs, device); + torch::Tensor result = lhs.__ixor__(rhs); + torch::Tensor lazy_result = lazy_lhs.__ixor__(rhs); + AllEqual(result, lazy_result); + AllEqual(lhs, lazy_lhs); + }); +} + +TEST_F(LazyOpsTest, TestLshift) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor shift_amount = torch::randint( + 16, input.sizes(), torch::TensorOptions().device(DefaultDevice())); + torch::Tensor result = torch::__lshift__(input, shift_amount); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_shift_amount = CopyToDevice(shift_amount, device); + torch::Tensor lazy_result = torch::__lshift__(lazy_input, lazy_shift_amount); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestLshiftInPlace) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor shift_amount = torch::randint( + 16, input.sizes(), torch::TensorOptions().device(DefaultDevice())); + torch::Tensor result = input.__ilshift__(shift_amount); + torch::Tensor lazy_shift_amount = CopyToDevice(shift_amount, device); + torch::Tensor lazy_result = lazy_input.__ilshift__(lazy_shift_amount); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestLshiftScalar) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar shift_amount = 3; + torch::Tensor result = torch::__lshift__(input, shift_amount); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::__lshift__(lazy_input, shift_amount); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestLshiftScalarInPlace) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar shift_amount = 3; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = input.__ilshift__(shift_amount); + torch::Tensor lazy_result = lazy_input.__ilshift__(shift_amount); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestRshift) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor shift_amount = torch::randint( + 16, input.sizes(), torch::TensorOptions().device(DefaultDevice())); + torch::Tensor result = torch::__rshift__(input, shift_amount); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_shift_amount = CopyToDevice(shift_amount, device); + torch::Tensor lazy_result = torch::__rshift__(lazy_input, lazy_shift_amount); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestRshiftInPlace) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor shift_amount = torch::randint( + 16, input.sizes(), 
torch::TensorOptions().device(DefaultDevice())); + torch::Tensor result = input.__irshift__(shift_amount); + torch::Tensor lazy_shift_amount = CopyToDevice(shift_amount, device); + torch::Tensor lazy_result = lazy_input.__irshift__(lazy_shift_amount); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestRshiftScalar) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar shift_amount = 3; + torch::Tensor result = torch::__rshift__(input, shift_amount); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_result = torch::__rshift__(lazy_input, shift_amount); + AllClose(result, lazy_result); + }); +} + +TEST_F(LazyOpsTest, TestRshiftScalarInPlace) { + torch::Tensor input = torch::ones( + {4, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar shift_amount = 3; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor result = input.__irshift__(shift_amount); + torch::Tensor lazy_result = lazy_input.__irshift__(shift_amount); + AllClose(result, lazy_result); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestMeshgrid) { + torch::Tensor a = torch::rand( + {3}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor b = torch::rand( + {2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor c = torch::rand( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + auto d = torch::meshgrid({a, b, c}); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_a = CopyToDevice(a, device); + torch::Tensor lazy_b = CopyToDevice(b, device); + torch::Tensor lazy_c = CopyToDevice(c, device); + auto lazy_d = torch::meshgrid({lazy_a, lazy_b, lazy_c}); + EXPECT_EQ(d.size(), lazy_d.size()); + for (size_t i = 0; i < d.size(); ++i) { + AllClose(d[i], lazy_d[i]); + } + }); +} + +TEST_F(LazyOpsTest, TestConstantPad) { + torch::Tensor input = torch::rand( + {4, 2, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 2, 3, 4, 5, 6}; + float pad_value = 5; + torch::Tensor output = torch::constant_pad_nd(input, pad, pad_value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::constant_pad_nd(lazy_input, pad, pad_value); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestConstantPadIncomplete) { + torch::Tensor input = torch::rand( + {4, 2, 5}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 2}; + float pad_value = 5; + torch::Tensor output = torch::constant_pad_nd(input, pad, pad_value); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::constant_pad_nd(lazy_input, pad, pad_value); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReflectionPad2dRank3) { + torch::Tensor input = torch::rand( + {2, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{2, 2, 2, 2}; + torch::Tensor output = torch::reflection_pad2d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::reflection_pad2d(lazy_input, pad); + 
AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReflectionPad2dRank4) { + torch::Tensor input = + torch::rand({2, 2, 3, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{2, 2, 2, 2}; + torch::Tensor output = torch::reflection_pad2d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::reflection_pad2d(lazy_input, pad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReflectionPad2dBackward) { + std::vector pad{2, 3, 1, 2}; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::reflection_pad2d(inputs[0], pad); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({1, 2, 4, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad1d) { + torch::Tensor input = torch::rand( + {1, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 2}; + torch::Tensor output = torch::replication_pad1d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::replication_pad1d(lazy_input, pad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad1dZeroPad) { + torch::Tensor input = torch::rand( + {1, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 0}; + torch::Tensor output = torch::replication_pad1d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::replication_pad1d(lazy_input, pad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad1dBackward) { + std::vector pad{2, 3}; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::replication_pad1d(inputs[0], pad); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad2d) { + torch::Tensor input = torch::rand( + {1, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 2, 2, 1}; + torch::Tensor output = torch::replication_pad2d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::replication_pad2d(lazy_input, pad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad2dZeroPad) { + torch::Tensor input = torch::rand( + {1, 3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector pad{1, 0, 0, 1}; + torch::Tensor output = torch::replication_pad2d(input, pad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = torch::replication_pad2d(lazy_input, pad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestReplicationPad2dBackward) { + std::vector pad{2, 3, 1, 1}; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::replication_pad2d(inputs[0], pad); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 3, 4}, 
torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestAsStrided) { + torch::Tensor input = torch::rand( + {128, 320}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector size = {128, 20, 4, 4}; + std::vector stride = {320, 16, 4, 1}; + torch::Tensor output = + torch::as_strided(input, /*size=*/size, /*stride=*/stride); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::as_strided(lazy_input, /*size=*/size, /*stride=*/stride); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestAsStridedInPlace) { + torch::Tensor input = torch::rand( + {128, 320}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector size = {128, 20, 4, 4}; + std::vector stride = {320, 16, 4, 1}; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor output = + torch::as_strided_(input, /*size=*/size, /*stride=*/stride); + torch::Tensor lazy_output = + torch::as_strided_(lazy_input, /*size=*/size, /*stride=*/stride); + AllClose(output, lazy_output); + AllClose(input, lazy_input); + }); +} + +TEST_F(LazyOpsTest, TestAsStridedWithOffset) { + torch::Tensor input = torch::rand( + {4, 8, 2}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector size = {4, 4, 2}; + std::vector stride = {8, 2, 1}; + int64_t storage_offset = 4; + torch::Tensor output = + torch::as_strided(input, /*size=*/size, /*stride=*/stride, + /*storage_offset=*/storage_offset); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_output = + torch::as_strided(lazy_input, /*size=*/size, /*stride=*/stride, + /*storage_offset=*/storage_offset); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestAsStridedWithInplaceCopy) { + torch::Tensor grad = torch::ones( + {4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + std::vector size = {4}; + std::vector stride = {1}; + torch::Tensor output = torch::zeros({4}, grad.options()); + output.as_strided(size, stride).copy_(grad); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_grad = CopyToDevice(grad, device); + torch::Tensor lazy_output = torch::zeros({4}, lazy_grad.options()); + lazy_output.as_strided(size, stride).copy_(lazy_grad); + AllClose(output, lazy_output); + }); +} + +TEST_F(LazyOpsTest, TestEmptyStrided) { + std::vector size = {4, 4, 2}; + std::vector stride = {8, 2, 1}; + torch::Tensor output = torch::empty_strided(/*size=*/size, /*stride=*/stride); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_output = + torch::empty_strided(/*size=*/size, /*stride=*/stride); + EXPECT_EQ(output.sizes(), lazy_output.sizes()); + EXPECT_EQ(output.strides(), lazy_output.strides()); + }); +} + +TEST_F(LazyOpsTest, TestAvgPool2DBackward) { + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::avg_pool2d(inputs[0], + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 1, 7, 7}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3DBackward) { + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::avg_pool3d( + inputs[0], + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({1, 1, 7, 7, 7}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool2DNoBatchBackward) { + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::avg_pool2d(inputs[0], + /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 7, 7}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAvgPool3DNoBatchBackward) { + int kernel_size = 2; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (bool count_include_pad : {true, false}) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::avg_pool3d( + inputs[0], + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*ceil_mode=*/ceil_mode, + /*count_include_pad=*/count_include_pad); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 7, 7, 7}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool3DNoBatchBackward) { + if (IsCuda()) { + GTEST_SKIP(); + } + for (int64_t output_size : {7, 4}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::adaptive_avg_pool3d( + inputs[0], {output_size, output_size, output_size}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 56, 28, 28}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool3DBackward) { + if (IsCuda()) { + GTEST_SKIP(); + } + for (int64_t output_size : {7, 4}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::adaptive_avg_pool3d( + inputs[0], {output_size, output_size, output_size}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({4, 1, 56, 28, 28}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool2DBackward) { + for (int64_t output_size : {7, 8}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::adaptive_avg_pool2d(inputs[0], {output_size, output_size}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({4, 1, 56, 56}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestAdaptiveAvgPool2DNoBatchBackward) { + for (int64_t output_size : {7, 8}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::adaptive_avg_pool2d(inputs[0], {output_size, output_size}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({1, 56, 56}, torch::TensorOptions(torch::kFloat) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestConv2D) { + int in_channels = 4; + int out_channels = 4; + int kernel_size = 3; + for (int stride = 1; stride <= 3; ++stride) { + for (int padding = 0; padding <= 2; ++padding) { + for (bool with_bias : {true, false}) { + for (int dilation = 1; dilation <= 3; ++dilation) { + for (int groups : + {1, 2, 4}) { // covers normal, grouped, depthwise conv. + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {1, in_channels, 7, 7}, + torch::TensorOptions(torch::kDouble).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {out_channels, in_channels / groups, kernel_size, + kernel_size}, + torch::TensorOptions(torch::kDouble).device(DefaultDevice())); + torch::Tensor bias = + with_bias ? 
torch::rand({out_channels}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice())) + : torch::Tensor(); + + torch::Tensor lazy_input = CopyToDevice(input, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_bias = + with_bias ? CopyToDevice(bias, device) : torch::Tensor(); + + torch::Tensor output = + torch::conv2d(input, weight, bias, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, groups); + torch::Tensor lazy_output = + torch::conv2d(lazy_input, lazy_weight, lazy_bias, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, groups); + AllClose(output, lazy_output); + }); + } + } + } + } + } +} + +TEST_F(LazyOpsTest, TestConv2DBackward) { + int in_channels = 4; + int out_channels = 4; + int kernel_size = 3; + for (int stride = 1; stride <= 3; ++stride) { + for (int padding = 0; padding <= 2; ++padding) { + for (bool with_bias : {true, false}) { + for (int dilation = 1; dilation <= 3; ++dilation) { + for (int groups : + {1, 2, 4}) { // covers normal, grouped, depthwise conv. + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::conv2d(inputs[0], inputs[1], inputs[2], + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, + /*dilation=*/{dilation, dilation}, groups); + }; + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor bias = + with_bias ? torch::rand({out_channels}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice())) + : torch::Tensor(); + TestBackward({torch::rand({1, in_channels, 7, 7}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({out_channels, in_channels / groups, + kernel_size, kernel_size}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)), + bias}, + device, testfn); + }); + } + }; + } + } + } +} + +TEST_F(LazyOpsTest, TestTransposedConv2DBackward) { + int in_channels = 4; + int out_channels = 4; + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (int dilation = 1; dilation <= 2; ++dilation) { + for (int output_padding = 0; + output_padding < std::max(stride, dilation); ++output_padding) { + for (bool with_bias : {true, false}) { + for (int groups : + {1, 2, 4}) { // covers normal, grouped, depthwise conv. + auto testfn = [&](const std::vector& inputs) + -> torch::Tensor { + return torch::conv_transpose2d( + inputs[0], inputs[1], inputs[2], + /*stride=*/{stride, stride + 1}, + /*padding=*/{padding, padding + 1}, + /*output_padding=*/output_padding, + /*groups=*/groups, + /*dilation=*/{dilation, dilation + 1}); + }; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand( + {4, out_channels, 7, 7}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor weight = + torch::rand({out_channels, in_channels / groups, + kernel_size, kernel_size}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor bias = + with_bias ? 
torch::rand({in_channels}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)) + : torch::Tensor(); + TestBackward({input, weight, bias}, device, testfn, + /*rtol=*/1e-5, /*atol=*/1e-5); + }); + } + }; + } + } + } + } +} + +TEST_F(LazyOpsTest, TestConv3DBackward) { + int in_channels = 4; + int out_channels = 4; + int kernel_size = 3; + for (int stride = 1; stride <= 3; ++stride) { + for (int padding = 1; padding <= 2; ++padding) { + for (bool with_bias : {true, false}) { + for (int dilation = 1; dilation <= 2; ++dilation) { + for (int groups : + {1, 2, 4}) { // covers normal, grouped, depthwise conv. + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::conv3d(inputs[0], inputs[1], inputs[2], + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + groups); + }; + + ForEachDevice([&](const torch::Device& device) { + torch::Tensor bias = + with_bias ? torch::rand({out_channels}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice())) + : torch::Tensor(); + TestBackward({torch::rand({4, in_channels, 7, 7, 7}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({out_channels, in_channels / groups, + kernel_size, kernel_size, kernel_size}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)), + bias}, + device, testfn); + }); + } + }; + } + } + } +} + +TEST_F(LazyOpsTest, TestTransposedConv3DBackward) { + int in_channels = 4; + int out_channels = 4; + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + for (int dilation = 1; dilation <= 2; ++dilation) { + for (int output_padding = 0; + output_padding < std::max(stride, dilation); ++output_padding) { + for (bool with_bias : {true, false}) { + for (int groups : + {1, 2, 4}) { // covers normal, grouped, depthwise conv. + auto testfn = [&](const std::vector& inputs) + -> torch::Tensor { + return torch::conv_transpose3d( + inputs[0], inputs[1], inputs[2], + /*stride=*/{stride, stride + 1, stride}, + /*padding=*/{padding, padding + 1, stride}, + /*output_padding=*/output_padding, + /*groups=*/groups, + /*dilation=*/{dilation, dilation + 1, dilation}); + }; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = + torch::rand({4, out_channels, 7, 7, 7}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor weight = + torch::rand({out_channels, in_channels / groups, + kernel_size, kernel_size, kernel_size}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor bias = + with_bias ? torch::rand({in_channels}, + torch::TensorOptions(torch::kDouble) + .device(DefaultDevice()) + .requires_grad(true)) + : torch::Tensor(); + TestBackward({input, weight, bias}, device, testfn); + }); + } + }; + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2DBackward) { + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_pool2d( + inputs[0], /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{1, 1}, + /*ceil_mode=*/ceil_mode); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 2, 8, 8}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DBackward) { + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_pool3d( + inputs[0], + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, /*dilation=*/{1, 1, 1}, + /*ceil_mode=*/ceil_mode); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({1, 2, 4, 4, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool2DNoBatchBackward) { + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_pool2d( + inputs[0], /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{1, 1}, + /*ceil_mode=*/ceil_mode); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({2, 8, 8}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxPool3DNoBatchBackward) { + int kernel_size = 3; + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_pool3d( + inputs[0], + /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, /*dilation=*/{1, 1, 1}, + /*ceil_mode=*/ceil_mode); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({2, 4, 4, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxUnpool2DBackward) { + int kernel_size = 2; + torch::Tensor input = + torch::rand({2, 2, 8, 8}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. 
+ for (bool ceil_mode : {false, true}) { + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output; + torch::Tensor indices; + std::tie(output, indices) = torch::max_pool2d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size}, + /*stride=*/{stride, stride}, + /*padding=*/{padding, padding}, /*dilation=*/{dilation, dilation}, + /*ceil_mode=*/ceil_mode); + + std::vector output_size({input.size(2), input.size(3)}); + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_unpool2d(inputs[0], inputs[1], output_size); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward({output.requires_grad_(true), indices}, device, + testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestMaxUnpool3DBackward) { + int kernel_size = 2; + torch::Tensor input = + torch::rand({1, 1, 4, 4, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (int stride = 1; stride <= 2; ++stride) { + for (int padding = 0; padding <= 1; ++padding) { + // Test ceil_mode=true through the CPU interop. + for (bool ceil_mode : {false, true}) { + for (int dilation = 1; dilation <= 2; ++dilation) { + torch::Tensor output; + torch::Tensor indices; + std::tie(output, indices) = torch::max_pool3d_with_indices( + input, /*kernel_size=*/{kernel_size, kernel_size, kernel_size}, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}, + /*dilation=*/{dilation, dilation, dilation}, + /*ceil_mode=*/ceil_mode); + + std::vector output_size( + {input.size(2), input.size(3), input.size(4)}); + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::max_unpool3d(inputs[0], inputs[1], output_size, + /*stride=*/{stride, stride, stride}, + /*padding=*/{padding, padding, padding}); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward({output.requires_grad_(true), indices}, device, + testfn); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestTanhBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::tanh(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 2}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestSigmoidBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::sigmoid(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 2}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestLogSigmoidBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::log_sigmoid(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 2}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn, /*rtol=*/1e-3, /*atol=*/1e-5); + }); +} + +TEST_F(LazyOpsTest, TestLogSoftmaxBackward) { + for (int dim = -4; dim < 4; ++dim) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::log_softmax(inputs[0], dim); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({5, 3, 4, 2}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn, /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } +} + 
+TEST_F(LazyOpsTest, TestSoftmaxBackward) { + for (int dim = -4; dim < 4; ++dim) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::softmax(inputs[0], dim); + }; + + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({5, 3, 4, 2}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn, /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } +} + +TEST_F(LazyOpsTest, TestSoftplusBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::softplus(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn, /*rtol=*/1e-4); + }); +} + +TEST_F(LazyOpsTest, TestReluBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::relu(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestRreluBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::rrelu(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestHardshrinkBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::hardshrink(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::randn({100}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestSoftshrinkBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::softshrink(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::randn({100}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestHardtanhBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::hardtanh(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::randn({100}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestEluBackward) { + torch::Scalar alpha = 0.5; + torch::Scalar scale = 2.5; + torch::Scalar input_scale = 1.5; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::elu(inputs[0], alpha, scale, input_scale); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestGeluBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::gelu(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 3}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + ExpectCounterChanged("lazy::gelu_backward", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, 
TestLeakyReluBackward) { + double negative_slope = 0.01; + auto testfn = [=](const std::vector& inputs) -> torch::Tensor { + return torch::leaky_relu(inputs[0], negative_slope); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 1, 4, 6}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestTransposeBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::t(inputs[0]); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({2, 3}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestAddMatMulBackward) { + int in_channels = 32; + int out_channels = 320; + int labels = 50; + // Test beta != 1. through the CPU interop. + for (double beta : {1., 2.}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::addmm(inputs[0], inputs[1], inputs[2], /*beta=*/beta); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({torch::rand({labels}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({in_channels, out_channels}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)), + torch::rand({out_channels, labels}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); + } +} + +TEST_F(LazyOpsTest, TestBinaryCrossEntropyBackward) { + int batch = 6; + int classes = 2; + // TODO(asuhan): Fix the torch::kDouble case. + for (auto dtype : {torch::kFloat}) { + for (bool def_weight : {false, true}) { + torch::Tensor input = torch::rand( + {batch, classes}, torch::TensorOptions(dtype).requires_grad(true)); + torch::Tensor target = + torch::rand({batch, classes}, torch::TensorOptions(dtype)); + torch::Tensor weight; + if (def_weight) { + weight = torch::rand({batch, classes}, torch::TensorOptions(dtype)); + } + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::binary_cross_entropy( + /*self=*/inputs[0], /*target=*/inputs[1], + /*weight=*/inputs[2], + /*reduction=*/reduction); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target, weight}, device, testfn, /*rtol=*/1e-4, + /*atol=*/1e-7); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestNllLossBackward) { + // TODO(whc) debug divide-by-zero failure under ASAN + GTEST_SKIP(); + + int batch = 6; + int classes = 2; + // TODO(asuhan): Fix the torch::kDouble case. 
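+  // Targets are drawn from [min(ignore_index, 0), classes) so that a negative
+  // ignore_index can actually appear in the target tensor; entries equal to
+  // ignore_index are excluded from both the loss and its gradient.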
+ for (auto dtype : {torch::kFloat}) { + for (int ignore_index : {-1, 0, 1, 5}) { + for (bool def_weight : {false, true}) { + torch::Tensor input = + torch::rand({batch, classes}, torch::TensorOptions(dtype) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor target = torch::randint( + std::min(ignore_index, 0), classes, {batch}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor weight; + if (def_weight) { + weight = torch::rand( + {classes}, torch::TensorOptions(dtype).device(DefaultDevice())); + } + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::nll_loss( + /*self=*/inputs[0], /*target=*/inputs[1], + /*weight=*/inputs[2], + /*reduction=*/reduction, /*ignore_index=*/ignore_index); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target, weight}, device, testfn, /*rtol=*/1e-5, + /*atol=*/1e-8); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestNllLoss2dBackward) { + int batch = 6; + int classes = 2; + int height = 3; + int width = 3; + // TODO(asuhan): Fix the torch::kDouble case. + for (auto dtype : {torch::kFloat}) { + for (int ignore_index : {-1, 0, 1, 5}) { + for (bool def_weight : {false, true}) { + torch::Tensor input = torch::rand({batch, classes, height, width}, + torch::TensorOptions(dtype) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor target = torch::randint( + std::min(ignore_index, 0), classes, {batch, height, width}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + torch::Tensor weight; + if (def_weight) { + weight = torch::rand( + {classes}, torch::TensorOptions(dtype).device(DefaultDevice())); + } + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::nll_loss2d( + /*self=*/inputs[0], /*target=*/inputs[1], + /*weight=*/inputs[2], + /*reduction=*/reduction, /*ignore_index=*/ignore_index); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target, weight}, device, testfn, /*rtol=*/1e-5, + /*atol=*/1e-8); + }); + } + } + } + } +} + +TEST_F(LazyOpsTest, TestSmoothL1LossBackward) { + torch::Tensor input = torch::randn({2, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor target = torch::randn( + {2, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + for (double beta : {0.25, 1.}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::smooth_l1_loss(/*input=*/inputs[0], /*target=*/inputs[1], + /*reduction=*/reduction, /*beta=*/beta); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target}, device, testfn, /*rtol=*/1e-5, + /*atol=*/1e-8); + }); + } + } +} + +TEST_F(LazyOpsTest, TestViewBackward) { + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return inputs[0].view({-1, 320}); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward( + {torch::rand({32, 20, 4, 4}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true))}, + device, testfn); + }); +} + +TEST_F(LazyOpsTest, TestBatchNorm2DBackward) { + double 
momentum = 0.1; + double eps = 0.5; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::batch_norm( + /*input=*/inputs[0], /*weight=*/inputs[1], /*bias=*/inputs[2], + /*running_mean=*/inputs[3], /*running_var=*/inputs[4], + /*training=*/true, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + }; + int num_features = 3; + torch::Tensor undef; + for (bool undef_weight_bias : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand({2, num_features, 4, 4}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor weight = + undef_weight_bias + ? undef + : torch::rand({num_features}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor bias = + undef_weight_bias + ? undef + : torch::rand({num_features}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor running_mean = torch::zeros( + {num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_var = torch::ones( + {num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + TestBackward({input, weight, bias, running_mean, running_var}, device, + testfn, + /*rtol=*/1e-3, /*atol=*/1e-4); + }); + } +} + +TEST_F(LazyOpsTest, TestBatchNorm3DBackward) { + double momentum = 0.1; + double eps = 0.5; + auto testfn = [&](const std::vector& inputs) -> torch::Tensor { + return torch::batch_norm( + /*input=*/inputs[0], /*weight=*/inputs[1], /*bias=*/inputs[2], + /*running_mean=*/inputs[3], /*running_var=*/inputs[4], + /*training=*/true, /*momentum=*/momentum, /*eps=*/eps, + /*cudnn_enabled=*/false); + }; + int num_features = 3; + torch::Tensor undef; + for (bool undef_weight_bias : {false, true}) { + ForEachDevice([&](const torch::Device& device) { + torch::Tensor input = torch::rand({2, num_features, 4, 4, 2}, + torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor weight = + undef_weight_bias + ? undef + : torch::rand({num_features}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor bias = + undef_weight_bias + ? 
undef + : torch::rand({num_features}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor running_mean = torch::zeros( + {num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor running_var = torch::ones( + {num_features}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + TestBackward({input, weight, bias, running_mean, running_var}, device, + testfn, + /*rtol=*/1e-3, /*atol=*/1e-3); + }); + } +} + +TEST_F(LazyOpsTest, TestBCEWithLogitsBackward) { + int batch = 10; + int classes = 5; + torch::Tensor undef; + for (torch::Reduction::Reduction reduction : + {torch::Reduction::None, torch::Reduction::Mean, + torch::Reduction::Sum}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::binary_cross_entropy_with_logits( + /*input=*/inputs[0], /*target=*/inputs[1], /*weight=*/inputs[2], + /*pos_weight=*/inputs[3], + /*reduction=*/reduction); + }; + for (bool undef_weight : {false, true}) { + for (bool undef_pos_weight : {false, true}) { + torch::Tensor input = + torch::rand({batch, classes}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor target = + torch::rand({batch, classes}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor weight = + undef_weight + ? undef + : torch::rand({classes}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice())); + torch::Tensor pos_weight = + undef_pos_weight + ? undef + : torch::rand({classes}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target, weight, pos_weight}, device, testfn, + /*rtol=*/1e-3, /*atol=*/1e-5); + }); + } + } + } +} + +TEST_F(LazyOpsTest, TestKlDivBackward) { + torch::Tensor input = torch::rand({4, 3}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor target = torch::rand({4, 3}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + for (torch::Reduction::Reduction reduction : + {torch::Reduction::Mean, torch::Reduction::Sum, + torch::Reduction::None}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::kl_div(/*self=*/inputs[0], /*target=*/inputs[1], reduction); + }; + ForEachDevice([&](const torch::Device& device) { + TestBackward({input, target}, device, testfn, /*rtol=*/1e-4, + /*atol=*/1e-5); + }); + } +} + +TEST_F(LazyOpsTest, TestEmbeddingBackward) { + int num_weights = 32; + for (int padding_idx = -1; padding_idx < num_weights; ++padding_idx) { + for (bool scale_grad_by_freq : {false, true}) { + auto testfn = + [&](const std::vector& inputs) -> torch::Tensor { + return torch::embedding(inputs[0], inputs[1], + /*padding_idx=*/padding_idx, + /*scale_grad_by_freq=*/scale_grad_by_freq, + /*sparse=*/false); + }; + ForEachDevice([&](const torch::Device& device) { + torch::Tensor weight = + torch::rand({num_weights, 7}, torch::TensorOptions(torch::kFloat) + .device(DefaultDevice()) + .requires_grad(true)); + torch::Tensor indices = torch::randint( + num_weights, {3, 9, 4}, + torch::TensorOptions(torch::kLong).device(DefaultDevice())); + TestBackward({weight, indices}, device, testfn, /*rtol=*/1e-5, + /*atol=*/1e-8); + }); + } + } +} + +TEST_F(LazyOpsTest, TestAmpForeachNonFiniteCheckAndUnscale) { + if (IsCuda()) { + // TODO(whc) debug failure on cuda + GTEST_SKIP(); + } + + 
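+  // _amp_foreach_non_finite_check_and_unscale_ multiplies each grad tensor by
+  // inv_scale in place and sets found_inf to 1 if any element is non-finite.
+  // grads0 (all finite) should come back as grads0 * inv_scale with found_inf
+  // still 0; grads1 contains a NaN, so found_inf should flip to 1.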
torch::Tensor grads0 = torch::tensor( + {1, 2, 3, 4}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor grads1 = torch::tensor( + {1.0, 2.0, std::nan("1"), 4.0}, + torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor inv_scale = torch::scalar_tensor( + 0.2, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor found_inf = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor grads_output0 = grads0 * inv_scale; + torch::Tensor found_inf_output0 = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor found_inf_output1 = torch::scalar_tensor( + 1, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ForEachDevice([&](const torch::Device& device) { + if (grads0.device() == at::kCPU) { + GTEST_SKIP(); + } + torch::Tensor lazy_grads0 = CopyToDevice(grads0, device); + torch::Tensor lazy_inv_scale = CopyToDevice(inv_scale, device); + torch::Tensor lazy_found_inf = CopyToDevice(found_inf, device); + torch::_amp_foreach_non_finite_check_and_unscale_(lazy_grads0, lazy_found_inf, + lazy_inv_scale); + AllClose(grads_output0, lazy_grads0, /*rtol=*/1e-2, /*atol=*/1e-4); + AllEqual(found_inf_output0, lazy_found_inf); + + torch::Tensor lazy_grads1 = CopyToDevice(grads1, device); + torch::_amp_foreach_non_finite_check_and_unscale_(lazy_grads1, lazy_found_inf, + lazy_inv_scale); + AllEqual(found_inf_output1, lazy_found_inf); + }); +} + +TEST_F(LazyOpsTest, TestAmpUpdateScale) { + torch::Tensor growth_tracker = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kInt32).device(DefaultDevice())); + torch::Tensor current_scale = torch::scalar_tensor( + 4, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor found_inf = torch::scalar_tensor( + 1, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor not_found_inf = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + float scale_growth_factor = 2.0; + float scale_backoff_factor = 0.5; + int growth_interval = 3; + + torch::Tensor growth_tracker_result0 = torch::scalar_tensor( + 1, torch::TensorOptions(torch::kInt32).device(DefaultDevice())); + torch::Tensor current_scale_result0 = torch::scalar_tensor( + 4, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor growth_tracker_result1 = torch::scalar_tensor( + 2, torch::TensorOptions(torch::kInt32).device(DefaultDevice())); + torch::Tensor current_scale_result1 = torch::scalar_tensor( + 4, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor growth_tracker_result2 = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kInt32).device(DefaultDevice())); + torch::Tensor current_scale_result2 = torch::scalar_tensor( + 8, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor growth_tracker_result3 = torch::scalar_tensor( + 0, torch::TensorOptions(torch::kInt32).device(DefaultDevice())); + torch::Tensor current_scale_result3 = torch::scalar_tensor( + 4, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + + ForEachDevice([&](const torch::Device& device) { + if (growth_tracker.device() == at::kCPU) { + GTEST_SKIP(); + } + torch::Tensor lazy_growth_tracker = CopyToDevice(growth_tracker, device); + torch::Tensor lazy_current_scale = CopyToDevice(current_scale, device); + torch::Tensor lazy_found_inf = CopyToDevice(found_inf, device); + torch::Tensor 
lazy_not_found_inf = CopyToDevice(not_found_inf, device); + + torch::_amp_update_scale_(lazy_current_scale, lazy_growth_tracker, + lazy_not_found_inf, scale_growth_factor, + scale_backoff_factor, growth_interval); + AllClose(current_scale_result0, lazy_current_scale, /*rtol=*/1e-2, + /*atol=*/1e-4); + AllEqual(growth_tracker_result0, lazy_growth_tracker); + + torch::_amp_update_scale_(lazy_current_scale, lazy_growth_tracker, + lazy_not_found_inf, scale_growth_factor, + scale_backoff_factor, growth_interval); + AllClose(current_scale_result1, lazy_current_scale, /*rtol=*/1e-2, + /*atol=*/1e-4); + AllEqual(growth_tracker_result1, lazy_growth_tracker); + + // torch::_amp_update_scale_ returns the reference of current_scale + lazy_current_scale = torch::_amp_update_scale_( + lazy_current_scale, lazy_growth_tracker, lazy_not_found_inf, + scale_growth_factor, scale_backoff_factor, growth_interval); + AllClose(current_scale_result2, lazy_current_scale, /*rtol=*/1e-2, + /*atol=*/1e-4); + AllEqual(growth_tracker_result2, lazy_growth_tracker); + + lazy_current_scale = torch::_amp_update_scale_( + lazy_current_scale, lazy_growth_tracker, lazy_found_inf, + scale_growth_factor, scale_backoff_factor, growth_interval); + AllClose(current_scale_result3, lazy_current_scale, /*rtol=*/1e-2, + /*atol=*/1e-4); + AllEqual(growth_tracker_result3, lazy_growth_tracker); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::_amp_update_scale_", + GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestEarlySyncLiveTensors) { + torch::Tensor scalar_tensor = torch::scalar_tensor( + 1., torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar scalar1 = scalar_tensor.item(); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_scalar_tensor = CopyToDevice(scalar_tensor, device); + torch::Scalar scalar2 = lazy_scalar_tensor.item(); + ASSERT_EQ(scalar1.to(), scalar2.to()); + }); + if (DebugUtil::ExperimentEnabled("early_sync")) { + ExpectCounterChanged("EarlySyncLiveTensorsCount", + GetIgnoredCounters()); + } else { + ExpectCounterNotChanged("EarlySyncLiveTensorsCount", + GetIgnoredCounters()); + } + ExpectCounterChanged("aten::_local_scalar_dense", + GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerp) { + torch::Tensor start = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor res = torch::lerp(start, end, weight); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_start = CopyToDevice(start, device); + torch::Tensor lazy_end = CopyToDevice(end, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_res = torch::lerp(lazy_start, lazy_end, lazy_weight); + AllClose(res, lazy_res); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerpScalar) { + torch::Tensor start = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar weight = torch::Scalar(3.0); + torch::Tensor res = torch::lerp(start, end, weight); + ForEachDevice([&](const torch::Device& device) { + 
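+    // As in TestLerp above, the reference is the eager torch::lerp result,
+    // i.e. start + weight * (end - start) elementwise; here the weight is the
+    // scalar 3.0 rather than a tensor.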
torch::Tensor lazy_start = CopyToDevice(start, device); + torch::Tensor lazy_end = CopyToDevice(end, device); + torch::Tensor lazy_res = torch::lerp(lazy_start, lazy_end, weight); + AllClose(res, lazy_res); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerpInplace) { + torch::Tensor input = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor input_copy = input.clone(); + input.lerp_(end, weight); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + torch::Tensor lazy_end = CopyToDevice(end, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + lazy_input.lerp_(lazy_end, lazy_weight); + AllClose(lazy_input, input); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerpScalarInplace) { + torch::Tensor input = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar weight = torch::Scalar(3.0); + torch::Tensor input_copy = input.clone(); + input.lerp_(end, weight); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_input = CopyToDevice(input_copy, device); + torch::Tensor lazy_end = CopyToDevice(end, device); + lazy_input.lerp_(lazy_end, weight); + AllClose(lazy_input, input); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerpOut) { + torch::Tensor start = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor weight = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor res = torch::empty( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + ; + torch::lerp_out(res, start, end, weight); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_start = CopyToDevice(start, device); + torch::Tensor lazy_end = CopyToDevice(end, device); + torch::Tensor lazy_weight = CopyToDevice(weight, device); + torch::Tensor lazy_res = torch::empty({3, 4}, lazy_start.options()); + torch::lerp_out(lazy_res, lazy_start, lazy_end, lazy_weight); + AllClose(res, lazy_res); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +TEST_F(LazyOpsTest, TestLerpScalarOut) { + torch::Tensor start = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Tensor end = torch::rand( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::Scalar weight = torch::Scalar(3.0); + torch::Tensor res = torch::empty( + {3, 4}, torch::TensorOptions(torch::kFloat).device(DefaultDevice())); + torch::lerp_out(res, start, end, weight); + ForEachDevice([&](const torch::Device& device) { + torch::Tensor lazy_start = CopyToDevice(start, 
device); + torch::Tensor lazy_end = CopyToDevice(end, device); + torch::Tensor lazy_res = torch::empty({3, 4}, lazy_start.options()); + torch::lerp_out(lazy_res, lazy_start, lazy_end, weight); + AllClose(res, lazy_res); + }); + ExpectCounterNotChanged("aten::.*", GetIgnoredCounters()); + ExpectCounterChanged("lazy::lerp", GetIgnoredCounters()); +} + +#endif // FBCODE_CAFFE2 + +} // namespace lazy +} // namespace torch diff --git a/test/cpp/lazy/test_lazy_ops_util.cpp b/test/cpp/lazy/test_lazy_ops_util.cpp new file mode 100644 index 00000000000000..6f12f960e7af2a --- /dev/null +++ b/test/cpp/lazy/test_lazy_ops_util.cpp @@ -0,0 +1,194 @@ +#include + +#include +#include +#include +#include + +#include +#include + + +namespace torch { +namespace lazy { +namespace { + +bool IsLtcTensor(const at::Tensor& tensor) { + return dynamic_cast(tensor.unsafeGetTensorImpl()); +} + +std::unordered_set* CreateIgnoredCounters() { + std::unordered_set* icounters = + new std::unordered_set(); + // Add below the counters whose name need to be ignored when doing + // is-any-counter-changed assertins. + icounters->insert("aten::rand"); + return icounters; +} + +} // namespace + +const std::unordered_set* GetIgnoredCounters() { + static const std::unordered_set* icounters = + CreateIgnoredCounters(); + return icounters; +} + +at::Tensor ToCpuTensor(const at::Tensor& tensor) { + // tensor.to() implicitly triggers a sync if t.device=torch::kLazy. + return tensor.to(torch::kCPU); +} + +torch::Tensor CopyToDevice(const torch::Tensor& tensor, + const torch::Device& device) { + return tensor.clone().to(device, /*non_blocking=*/false, /*copy=*/true); +} + +bool EqualValues(at::Tensor tensor1, at::Tensor tensor2) { + tensor1 = ToCpuTensor(tensor1); + tensor2 = ToCpuTensor(tensor2); + if (torch::isnan(tensor1).any().item()) { + EXPECT_TRUE(EqualValues(torch::isnan(tensor1), torch::isnan(tensor2))); + tensor1.nan_to_num_(); + tensor2.nan_to_num_(); + } + if (tensor1.sizes() != tensor2.sizes() || + tensor1.dtype() != tensor2.dtype()) { + std::cerr << "Different shape:\n" + << tensor1.dtype() << " " << tensor1.sizes() << "\n-vs-\n" + << tensor2.dtype() << " " << tensor2.sizes() << "\n"; + return false; + } + at::ScalarType type1 = tensor1.scalar_type(); + at::ScalarType type2 = tensor2.scalar_type(); + if (type1 != type2) { + tensor1 = tensor1.toType(type2); + } + bool equal = tensor1.equal(tensor2); + return equal; +} + +bool EqualValuesNoElementTypeCheck(at::Tensor tensor1, at::Tensor tensor2) { + tensor1 = ToCpuTensor(tensor1); + tensor2 = ToCpuTensor(tensor2); + if (tensor1.sizes() != tensor2.sizes()) { + std::cerr << "Different shape:\n" + << tensor1.dtype() << " " << tensor1.sizes() << "\n-vs-\n" + << tensor2.dtype() << " " << tensor2.sizes() << "\n"; + return false; + } + at::ScalarType type1 = tensor1.scalar_type(); + at::ScalarType type2 = tensor2.scalar_type(); + if (type1 != type2) { + tensor1 = tensor1.toType(type2); + } + bool equal = tensor1.equal(tensor2); + return equal; +} + +void ForEachDevice(const std::function& devfn) { + // Currently TorchScript backend only supports one type of hardware per process, + // which is set by env. And the ordinal is always 0 given distributed training/ + // multi-device is not supported yet. 
+ auto device = torch::lazy::BackendDevice(); + torch::Device torch_device = torch::lazy::backendDeviceToAtenDevice(device); + devfn(torch_device); +} + +bool CloseValues(at::Tensor tensor1, at::Tensor tensor2, double rtol, + double atol) { + tensor1 = ToCpuTensor(tensor1); + tensor2 = ToCpuTensor(tensor2); + if (torch::isnan(tensor1).any().item()) { + EXPECT_TRUE(EqualValues(torch::isnan(tensor1), torch::isnan(tensor2))); + tensor1.nan_to_num_(); + tensor2.nan_to_num_(); + } + if (tensor1.sizes() != tensor2.sizes() || + tensor1.dtype() != tensor2.dtype()) { + std::cerr << "Different shape:\n" + << tensor1.dtype() << " " << tensor1.sizes() << "\n-vs-\n" + << tensor2.dtype() << " " << tensor2.sizes() << "\n"; + return false; + } + bool equal = tensor1.allclose(tensor2, rtol, atol); + return equal; +} + +std::string GetTensorTextGraph(at::Tensor tensor) { + torch::lazy::LazyTensorPtr lazy_tensor = torch::lazy::TryGetLtcTensor(tensor); + return torch::lazy::DumpUtil::ToText({lazy_tensor->GetIrValue().node.get()}); +} + +std::string GetTensorDotGraph(at::Tensor tensor) { + torch::lazy::LazyTensorPtr lazy_tensor = torch::lazy::TryGetLtcTensor(tensor); + return torch::lazy::DumpUtil::ToDot({lazy_tensor->GetIrValue().node.get()}); +} + +void TestBackward( + const std::vector& inputs, const torch::Device& device, + const std::function&)>& + testfn, + double rtol, double atol, int derivative_level) { + std::vector input_vars; + std::vector xinput_vars; + std::vector inputs_w_grad; + std::vector xinputs_w_grad; + for (size_t i = 0; i < inputs.size(); ++i) { + const torch::Tensor& input = inputs[i]; + if (input.defined()) { + torch::Tensor oinput = + input.clone().detach().set_requires_grad(input.requires_grad()); + input_vars.push_back(oinput); + + torch::Tensor xinput = CopyToDevice(input, device) + .detach() + .set_requires_grad(input.requires_grad()); + xinput_vars.push_back(xinput); + if (input.requires_grad()) { + inputs_w_grad.push_back(oinput); + xinputs_w_grad.push_back(xinput); + } + } else { + input_vars.emplace_back(); + xinput_vars.emplace_back(); + } + } + + torch::Tensor output = testfn(input_vars); + torch::Tensor xoutput = testfn(xinput_vars); + torch::lazy::AllClose(output, xoutput, rtol, atol); + + std::vector outs = {output}; + std::vector xouts = {xoutput}; + for (int d = 1; d <= derivative_level; ++d) { + // Check grad of sum(outs) w.r.t inputs_w_grad. 
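+    // Summing every output that requires grad yields a scalar, so a single
+    // torch::autograd::grad call returns d(sum)/d(input) for all inputs at
+    // once; the eager and lazy gradients are then compared pairwise below.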
+ torch::Tensor sum = torch::zeros_like(outs[0]).sum(); + torch::Tensor xsum = torch::zeros_like(xouts[0]).sum(); + for (size_t i = 0; i < outs.size(); ++i) { + if (outs[i].requires_grad()) { + sum += outs[i].sum(); + xsum += xouts[i].sum(); + } + } + // Calculating higher order derivative requires create_graph=true + bool create_graph = d != derivative_level; + outs = torch::autograd::grad({sum}, inputs_w_grad, /*grad_outputs=*/{}, + /*retain_graph=*/c10::nullopt, + /*create_graph=*/create_graph, + /*allow_unused=*/true); + xouts = torch::autograd::grad({xsum}, xinputs_w_grad, /*grad_outputs=*/{}, + /*retain_graph=*/c10::nullopt, + /*create_graph=*/create_graph, + /*allow_unused=*/true); + for (size_t i = 0; i < outs.size(); ++i) { + ASSERT_EQ(outs[i].defined(), xouts[i].defined()); + if (outs[i].defined()) { + AllClose(outs[i], xouts[i], rtol, atol); + } + } + } +} + +} // namespace lazy +} // namespace torch diff --git a/test/cpp/lazy/test_lazy_ops_util.h b/test/cpp/lazy/test_lazy_ops_util.h new file mode 100644 index 00000000000000..6dc26b48be9518 --- /dev/null +++ b/test/cpp/lazy/test_lazy_ops_util.h @@ -0,0 +1,68 @@ +#pragma once + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +namespace torch { +namespace lazy { + +const std::unordered_set* GetIgnoredCounters(); + +// Converts an at::Tensor(device=torch::kLazy) to at::Tensor(device=torch::kCPU) +// This at::Tensor can be torch::Tensor which is a Variable, or at::Tensor which +// know nothing about autograd. If the input tensor is already a CPU tensor, it +// will be returned. Needed because EqualValues and AllClose require CPU tensors +// on both sides. +at::Tensor ToCpuTensor(const at::Tensor& tensor); + +// Helper function to copy a tensor to device. 
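+// The result is always a fresh copy (clone() followed by to(device, copy=true)
+// in the .cpp above), so in-place ops on the device copy never modify the
+// original reference tensor.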
+torch::Tensor CopyToDevice(const torch::Tensor& tensor, + const torch::Device& device); + +bool EqualValues(at::Tensor tensor1, at::Tensor tensor2); + +bool EqualValuesNoElementTypeCheck(at::Tensor tensor1, at::Tensor tensor2); + +bool CloseValues(at::Tensor tensor1, at::Tensor tensor2, double rtol = 1e-5, + double atol = 1e-8); + +static inline void AllClose(at::Tensor tensor, at::Tensor xla_tensor, + double rtol = 1e-5, double atol = 1e-8) { + EXPECT_TRUE(CloseValues(tensor, xla_tensor, rtol, atol)); +} + +static inline void AllClose(at::Tensor tensor, torch::lazy::LazyTensor& xla_tensor, + double rtol = 1e-5, double atol = 1e-8) { + EXPECT_TRUE( + CloseValues(tensor, xla_tensor.ToTensor(/*detached=*/false), rtol, atol)); +} + +static inline void AllEqual(at::Tensor tensor, at::Tensor xla_tensor) { + EXPECT_TRUE(EqualValues(tensor, xla_tensor)); +} + +void ForEachDevice(const std::function& devfn); + +std::string GetTensorTextGraph(at::Tensor tensor); + +std::string GetTensorDotGraph(at::Tensor tensor); + +std::string GetTensorHloGraph(at::Tensor tensor); + +void TestBackward( + const std::vector& inputs, const torch::Device& device, + const std::function&)>& + testfn, + double rtol = 1e-5, double atol = 1e-8, int derivative_level = 1); + +} // namespace lazy +} // namespace torch diff --git a/test/cpp/lazy/test_misc.cpp b/test/cpp/lazy/test_misc.cpp index 45b54fd2824b73..b2f941c42dd6ba 100644 --- a/test/cpp/lazy/test_misc.cpp +++ b/test/cpp/lazy/test_misc.cpp @@ -71,6 +71,11 @@ TEST(HashTest, Sanity) { auto b = std::vector({1, 1, 2, 3, 5, 8, 12}); test_hash_repeatable_sensitive(a, b); test_hash_repeatable_sensitive(c10::ArrayRef(a), c10::ArrayRef(b)); + + // vector is a special case bc it is implemented as vector + auto bool_a = std::vector({true, false, false, true}); + auto bool_b = std::vector({true, true, false, true}); + test_hash_repeatable_sensitive(bool_a, bool_b); } } // namespace lazy diff --git a/test/cpp/lazy/test_symbolic_shape.cpp b/test/cpp/lazy/test_symbolic_shape.cpp new file mode 100644 index 00000000000000..7fac64f44839f2 --- /dev/null +++ b/test/cpp/lazy/test_symbolic_shape.cpp @@ -0,0 +1,132 @@ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace torch { +namespace lazy { + +// Lazy Tensor is disabled in FBCODE until addressing non-virtual methods (e.g. 
+// sizes) in TensorImpl +#ifndef FBCODE_CAFFE2 + +namespace { +// This registers the torchscript backend, without which lazy device won't work +torch::lazy::BackendRegistrar g_registrar(GetTSBackendImpl()); + +static inline at::DeviceType DefaultDevice() { + return torch::lazy::getBackend()->EagerFallbackDeviceType(); +} + +std::vector getIsSymbolic(at::Tensor& lazy_tensor) { + auto ltc_tensor = GetLtcTensor(lazy_tensor); + Value ir_val = ltc_tensor->GetIrValue(); + const Shape& shape = ir_val->shape(); + return shape.is_symbolic().value(); +} + +class LazyShapeTest : public ::testing::Test { + protected: + static void SetUpTestCase() {} + void SetUp() override { + at::manual_seed(42); + torch::lazy::LazyGraphExecutor::Get()->SetRngSeed( + torch::lazy::BackendDevice(), 42); + FLAGS_ltc_enable_symbolic_shapes = true; + } + void TearDown() override { + FLAGS_ltc_enable_symbolic_shapes = false; + } +}; + +class DynamicInputShapeNode : public Node { + public: + explicit DynamicInputShapeNode(Shape& shape) + : Node( + OpKind(), + /* num_outputs */ 1, + /* hash_func */ + [&](bool /*bakeInSizes*/) -> hash_t { return 0; }), + shape_(shape) {} + ~DynamicInputShapeNode() override = default; + + const std::vector& operands() const override { + TORCH_INTERNAL_ASSERT(false, "Can't access operands of test node"); + } + + const Output& operand(size_t i) const override { + TORCH_INTERNAL_ASSERT(false, "Can't access operand[i] of test node"); + } + const Shape& shape(size_t i) const override { + return shape_; + } + c10::ArrayRef shapes() const override { + return {shape_}; + } + + private: + Shape shape_; +}; + +} // namespace + +Tensor tensorWithSymbolicShape( + const std::vector& sizes, + const std::vector& is_symbolic) { + Shape shape = Shape(torch::kFloat32, sizes); + Shape shape_with_symbolic = shape.with_symbolic_dims(is_symbolic); + auto n = torch::lazy::MakeNode(shape_with_symbolic); + auto device = BackendDevice(); + auto lt = torch::lazy::LazyTensor::Create(n, device); + return torch::lazy::CreateAtenFromLtcTensor(lt); +} + +TEST_F(LazyShapeTest, TestMulBasic) { + // Basic propagation + torch::Tensor a = tensorWithSymbolicShape({2, 2}, {true, false}); + torch::Tensor b = tensorWithSymbolicShape({2, 2}, {true, false}); + torch::Tensor res = torch::mul(a, b); + + std::vector expected = {true, false}; + EXPECT_EQ(getIsSymbolic(res), expected); + + // Test when some inputs are symbolic + a = tensorWithSymbolicShape({2, 2}, {true, true}); + b = tensorWithSymbolicShape({2, 2}, {true, false}); + res = torch::mul(a, b); + + // This is not {true, false}, as the SSA shape propagation + // is not able to simplify + // expandedSizes.append(sizeB if sizeA == 1 else sizeA) + // in broadcast() in shape_functions_1.h + // due to sizeA being symbolic + expected = {true, true}; + EXPECT_EQ(getIsSymbolic(res), expected); + + // Test correct handling of broadcasting dim + a = tensorWithSymbolicShape({2, 2}, {false, true}); + b = tensorWithSymbolicShape({2, 1}, {true, false}); + res = torch::mul(a, b); + + expected = {false, true}; + EXPECT_EQ(getIsSymbolic(res), expected); + + // Test correct handling of scalar values + a = tensorWithSymbolicShape({2, 2}, {false, true}); + res = torch::mul(a, 3); + expected = {false, true}; + EXPECT_EQ(getIsSymbolic(res), expected); +}; +#endif // FBCODE_CAFFE2 + +} // namespace lazy +} // namespace torch diff --git a/test/cpp/lazy/test_tensor_impl.cpp b/test/cpp/lazy/test_tensor_impl.cpp index 2a7f2893c72496..8d968f620b6b24 100644 --- a/test/cpp/lazy/test_tensor_impl.cpp +++ 
b/test/cpp/lazy/test_tensor_impl.cpp @@ -6,12 +6,14 @@ namespace torch { namespace lazy { -// TODO(alanwaketan): Update the following unit tests once the TorchScript backend is merged. +#ifdef FBCODE_CAFFE2 +// Lazy Tensor is disabled in FBCODE until addressing non-virtual methods (e.g. sizes) in TensorImpl TEST(LazyTensorImplTest, BasicThrow) { EXPECT_THROW({ auto input = torch::rand({0, 1, 3, 0}, torch::TensorOptions(torch::kFloat).device("lazy")); }, ::c10::Error); } +#endif // FBCODE_CAFFE2 } // namespace lazy } // namespace torch diff --git a/test/cpp/profiler/containers.cpp b/test/cpp/profiler/containers.cpp new file mode 100644 index 00000000000000..60e6d0f238b185 --- /dev/null +++ b/test/cpp/profiler/containers.cpp @@ -0,0 +1,76 @@ +#include +#include +#include +#include + +#include + +#include +#include +#include + +TEST(ProfilerTest, AppendOnlyList) { + const int n = 4096; + torch::profiler::impl::AppendOnlyList list; + for (const auto i : c10::irange(n)) { + list.emplace_back(i); + ASSERT_EQ(list.size(), i + 1); + } + + int expected = 0; + for (const auto i : list) { + ASSERT_EQ(i, expected++); + } + ASSERT_EQ(expected, n); + + list.clear(); + ASSERT_EQ(list.size(), 0); +} + +TEST(ProfilerTest, AppendOnlyList_ref) { + const int n = 512; + torch::profiler::impl::AppendOnlyList, 64> list; + std::vector*> refs; + for (const auto _ : c10::irange(n)) { + refs.push_back(list.emplace_back()); + } + + for (const auto i : c10::irange(n)) { + *refs.at(i) = {i, 0}; + } + + int expected = 0; + for (const auto& i : list) { + ASSERT_EQ(i.first, expected++); + } +} + +// Test that we can convert TSC measurements back to wall clock time. +TEST(ProfilerTest, clock_converter) { + const int n = 10001; + torch::profiler::impl::ApproximateClockToUnixTimeConverter converter; + std::vector pairs; + for (const auto i : c10::irange(n)) { + pairs.push_back(torch::profiler::impl::ApproximateClockToUnixTimeConverter::measurePair()); + } + auto count_to_ns = converter.makeConverter(); + std::vector deltas; + for (const auto& i : pairs) { + deltas.push_back(i.t_ - count_to_ns(i.approx_t_)); + } + std::sort(deltas.begin(), deltas.end()); + + // In general it's not a good idea to put clocks in unit tests as it leads + // to flakiness. We mitigate this by: + // 1) Testing the clock itself. While the time to complete a task may + // vary, two clocks measuring the same time should be much more + // consistent. + // 2) Only testing the interquartile range. Context switches between + // calls to the two timers do occur and can result in hundreds of + // nanoseconds of noise, but such switches are only a few percent + // of cases. + // 3) We're willing to accept a somewhat large bias which can emerge from + // differences in the cost of calling each clock. 
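+  // deltas is sorted, so deltas[n / 2] is the median bias between the two
+  // clocks and deltas[n * 3 / 4] - deltas[n / 4] is the interquartile range;
+  // both bounds below are nanosecond-scale.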
+ EXPECT_LT(std::abs(deltas[n / 2]), 200); + EXPECT_LT(deltas[n * 3 / 4] - deltas[n / 4], 50); +} diff --git a/test/cpp/tensorexpr/test_base.h b/test/cpp/tensorexpr/test_base.h index 4a8e667de3acfd..510cad45001281 100644 --- a/test/cpp/tensorexpr/test_base.h +++ b/test/cpp/tensorexpr/test_base.h @@ -78,7 +78,7 @@ static void assertAllEqual(const std::vector& vec, const T& val) { template static void assertAllEqual(const std::vector& v1, const std::vector& v2) { ASSERT_EQ(v1.size(), v2.size()); - for (int i = 0; i < v1.size(); i++) { + for (size_t i = 0; i < v1.size(); ++i) { ASSERT_EQ(v1[i], v2[i]); } } diff --git a/test/cpp/tensorexpr/test_external_calls.cpp b/test/cpp/tensorexpr/test_external_calls.cpp index 60335ed55b494b..76fba75444318f 100644 --- a/test/cpp/tensorexpr/test_external_calls.cpp +++ b/test/cpp/tensorexpr/test_external_calls.cpp @@ -2,8 +2,17 @@ #include +#include +#include +#include +#include +#include + #include +#include +#include #include +#include #include #include #include @@ -11,6 +20,9 @@ #include #include +#include +#include + #include #include #include @@ -884,7 +896,7 @@ TEST(ExternalCall, Inlining) { return MatmulResult.load(i, j) + FloatImm::make(3.0f); }); - StmtPtr root_stmt = alloc(std::vector( + StmtPtr root_stmt = alloc(std::vector( {A.stmt(), B.stmt(), MatmulResult.stmt(), Result.stmt()})); LoopNest l(root_stmt, {Result.buf()}); @@ -923,5 +935,130 @@ TEST(ExternalCall, Inlining) { ASSERT_TRUE(at::allclose(nnc_result, ref)); } +TEST(ExternalCall, JitCustomFusionOp) { + const char* custom_op_schema_literal = + "nnc_custom::add_mul(Tensor a, Tensor b, Tensor c) -> Tensor"; + const char* external_func_name = "nnc_add_mul"; + + auto add_mul_lowering_func = + [external_func_name]( + const std::vector& inputs, + const std::vector& output_shape, + const c10::optional& output_type, + at::Device device) { + auto output_dtype = Dtype(*output_type); + torch::jit::tensorexpr::BufHandle result_buf( + "nnc_add_mul_res_buf", output_shape, output_dtype); + const torch::jit::tensorexpr::BufHandle& a = + c10::get(inputs[0]); + const torch::jit::tensorexpr::BufHandle& b = + c10::get(inputs[1]); + const torch::jit::tensorexpr::BufHandle& c = + c10::get(inputs[1]); + torch::jit::tensorexpr::StmtPtr s = + torch::jit::tensorexpr::ExternalCall::make( + result_buf, external_func_name, {a, b, c}, {}); + return Tensor(result_buf.node(), s); + }; + + auto add_mul_external_func = [](int64_t bufs_num, + void** buf_data, + int64_t* buf_ranks, + int64_t* buf_dims, + int64_t* buf_strides, + int8_t* buf_dtypes, + int64_t args_num, + int64_t* extra_args) {}; + + torch::jit::RegisterOperators reg({Operator( + custom_op_schema_literal, + [](const Node* node) -> Operation { + return [](Stack& _stack) { + auto a = std::move(peek(_stack, 0, 3)).toTensor(); + auto b = std::move(peek(_stack, 1, 3)).toTensor(); + auto c = std::move(peek(_stack, 2, 3)).toTensor(); + drop(_stack, 3); + auto result = (a + b) * c; + pack(_stack, std::move(result)); + return 0; + }; + }, + c10::AliasAnalysisKind::FROM_SCHEMA)}); + + auto& custom_operator_set = torch::jit::tensorexpr::getCustomOperatorSet(); + custom_operator_set.insert({custom_op_schema_literal}); + + auto& te_lowering_registry = torch::jit::tensorexpr::getNNCLoweringRegistry(); + te_lowering_registry.insert( + parseSchema(custom_op_schema_literal), add_mul_lowering_func); + + auto& te_nnc_func_registry = torch::jit::tensorexpr::getNNCFunctionRegistry(); + te_nnc_func_registry[external_func_name] = add_mul_external_func; + + std::string graph_string = 
R"IR( + graph(%a : Float(10, 20, strides=[20, 1], device=cpu), + %b : Float(10, 20, strides=[20, 1], device=cpu), + %c : Float(10, 20, strides=[20, 1], device=cpu)): + %res : Float(10, 20, strides=[20, 1], device=cpu) = nnc_custom::add_mul(%a, %b, %c) + return (%res))IR"; + + auto graph = std::make_shared(); + torch::jit::parseIR(graph_string, graph.get()); + + std::string shape_compute_python_string = R"PY( + def computOutput(a: List[int], b: List[int], c: List[int]): + expandedSizes: List[int] = [] + dimsA = len(a) + dimsB = len(b) + dimsC = len(c) + ndim = max(dimsA, dimsB, dimsC) + for i in range(ndim): + offset = ndim - 1 - i + dimA = dimsA - 1 - offset + dimB = dimsB - 1 - offset + dimC = dimsC - 1 - offset + sizeA = a[dimA] if (dimA >= 0) else 1 + sizeB = b[dimB] if (dimB >= 0) else 1 + sizeC = a[dimC] if (dimC >= 0) else 1 + + if sizeA != sizeB and sizeB != sizeC and sizeA != 1 and sizeB != 1 and sizeC != 1: + # TODO: only assertion error is bound in C++ compilation right now + raise AssertionError( + "The size of tensor a {} must match the size of tensor b (" + "{} and c {}) at non-singleton dimension {}".format(sizeA, sizeB, sizeC, i) + ) + + expandedSizes.append(max(sizeA, sizeB, sizeC)) + + return expandedSizes + )PY"; + auto cu_ptr = torch::jit::compile(shape_compute_python_string); + torch::jit::GraphFunction* gf = + (torch::jit::GraphFunction*)&cu_ptr->get_function("computOutput"); + ASSERT_TRUE(gf); + +#ifdef TORCH_ENABLE_LLVM + auto static_graph_case = graph->copy(); + FuseTensorExprs(static_graph_case, 1); + torch::jit::testing::FileCheck() + .check("prim::TensorExprGroup_") + ->check("nnc_custom::add_mul") + ->run(*static_graph_case); + + auto dynamic_graph_case = graph->copy(); + auto custom_op = torch::jit::getOperatorForLiteral(custom_op_schema_literal); + ASSERT_TRUE(custom_op); + torch::jit::RegisterShapeComputeGraphForSchema( + custom_op->schema(), gf->graph()); + FuseTensorExprs(dynamic_graph_case, 1, false, true); + torch::jit::testing::FileCheck() + .check("prim::TensorExprGroup_") + ->check("nnc_custom::add_mul") + ->run(*dynamic_graph_case); +#else + torch::jit::testing::FileCheck().check("nnc_custom::add_mul")->run(*graph); +#endif +} + } // namespace jit } // namespace torch diff --git a/test/cpp/tensorexpr/test_memdependency.cpp b/test/cpp/tensorexpr/test_memdependency.cpp index 535ef439deaf7d..2e4c5bdb737ae0 100644 --- a/test/cpp/tensorexpr/test_memdependency.cpp +++ b/test/cpp/tensorexpr/test_memdependency.cpp @@ -274,7 +274,7 @@ TEST(MemDependency, BoundSubtractMultiDim) { if (x.size() != y.size()) { return false; } - for (auto i = 0; i < x.size(); ++i) { + for (auto i = 0U; i < x.size(); ++i) { if (!indexBoundsEquals(x[i], y[i])) { return false; } @@ -338,7 +338,7 @@ TEST(MemDependency, BoundSubtractMultiDimSymbolic) { if (x.size() != y.size()) { return false; } - for (auto i = 0; i < x.size(); ++i) { + for (auto i = 0U; i < x.size(); ++i) { if (!indexBoundsEquals(x[i], y[i])) { return false; } diff --git a/test/cpp/tensorexpr/test_ops.cpp b/test/cpp/tensorexpr/test_ops.cpp index e4c9155ff60c05..6ad9cb2a54a32f 100644 --- a/test/cpp/tensorexpr/test_ops.cpp +++ b/test/cpp/tensorexpr/test_ops.cpp @@ -24,7 +24,7 @@ TEST(Ops, Sum) { constexpr int N = 16; std::vector testDims = {{0}, {1}, {0, 1}}; std::vector> outputShapes = {{N}, {M}, {}}; - for (int idx = 0; idx < testDims.size(); idx++) { + for (unsigned idx = 0; idx < testDims.size(); idx++) { const auto& dims = testDims[idx]; const auto& outShape = outputShapes[idx]; diff --git 
a/test/cpp/tensorexpr/test_quantization.cpp b/test/cpp/tensorexpr/test_quantization.cpp index 9df2503a608ca9..82eb8573cff500 100644 --- a/test/cpp/tensorexpr/test_quantization.cpp +++ b/test/cpp/tensorexpr/test_quantization.cpp @@ -1,6 +1,6 @@ #include -#include +#include #include #include #include @@ -221,7 +221,99 @@ TEST_F(Quantization, QuantAddDequantUInt8) { CHECK_EQ(check, 1); } -TEST_F(Quantization, QuantUpsampleNearest2dDequantUInt8) { +TEST_F(Quantization, QuantSigmoidDequantUInt8) { + const auto graph_string = R"IR( + graph(%x1 : Float(2, 2, strides=[2, 1], device=cpu)): + %2 : int = prim::Constant[value=13]() + %qz1 : int = prim::Constant[value=13]() + %qs1 : float = prim::Constant[value=0.1]() + %q1 : QUInt8(2, 2) = aten::quantize_per_tensor(%x1, %qs1, %qz1, %2) + %qa : QUInt8(2, 2) = aten::sigmoid(%q1) + %6 : Float(2, 2) = aten::dequantize(%qa) + return (%6))IR"; + auto graph = std::make_shared(); + parseIR(graph_string, &*graph); + + auto x1 = at::rand({2, 2}, TensorOptions(kCPU).dtype(at::kFloat)); + auto q1 = at::quantize_per_tensor(x1, 0.1f, 13, at::kQUInt8); + auto qs = at::sigmoid(q1); + auto y_expected = at::dequantize(qs); + + TensorExprKernel k(graph); + std::vector inputs = {x1}; + StmtPtr s = k.getCodeGenStmt(); + + std::vector stack = fmap(inputs); + k.run(stack); + auto y = stack[0].toTensor(); + bool check = at::allclose(y_expected, y); + if (!check) { + std::cout << "x1:\n" << x1 << std::endl; + std::cout << "q1:\n" << q1 << std::endl; + std::cout << "qs:\n" << qs << std::endl; + std::cout << "y_expected:\n" << y_expected << std::endl; + std::cout << "y:\n" << y << std::endl; + } + CHECK_EQ(check, 1); +} + +at::Tensor quantized_mul( + at::Tensor x1, + at::Tensor x2, + double scale, + int64_t zero) { + const auto op = + c10::Dispatcher::singleton() + .findSchemaOrThrow("quantized::mul", "") + .typed(); + return op.call(x1, x2, scale, zero); +} + +TEST_F(Quantization, QuantMulDequantUInt8) { + const auto graph_string = R"IR( + graph(%x1 : Float(2, 2, strides=[2, 1], device=cpu), %x2 : Float(2, 2, strides=[2, 1], device=cpu)): + %2 : int = prim::Constant[value=13]() + %qz1 : int = prim::Constant[value=13]() + %qs1 : float = prim::Constant[value=0.1]() + %qz2 : int = prim::Constant[value=13]() + %qs2 : float = prim::Constant[value=0.1]() + %qza : int = prim::Constant[value=13]() + %qsa : float = prim::Constant[value=0.1]() + %q1 : QUInt8(2, 2) = aten::quantize_per_tensor(%x1, %qs1, %qz1, %2) + %q2 : QUInt8(2, 2) = aten::quantize_per_tensor(%x2, %qs2, %qz2, %2) + %qa : QUInt8(2, 2) = quantized::mul(%q1, %q2, %qsa, %qza) + %6 : Float(2, 2) = aten::dequantize(%qa) + return (%6))IR"; + auto graph = std::make_shared(); + parseIR(graph_string, &*graph); + + auto x1 = at::rand({2, 2}, TensorOptions(kCPU).dtype(at::kFloat)); + auto x2 = at::rand({2, 2}, TensorOptions(kCPU).dtype(at::kFloat)); + auto q1 = at::quantize_per_tensor(x1, 0.1f, 13, at::kQUInt8); + auto q2 = at::quantize_per_tensor(x2, 0.1f, 13, at::kQUInt8); + auto qa = quantized_mul(q1, q2, 0.1f, 13); + auto y_expected = at::dequantize(qa); + + TensorExprKernel k(graph); + std::vector inputs = {x1, x2}; + StmtPtr s = k.getCodeGenStmt(); + + std::vector stack = fmap(inputs); + k.run(stack); + auto y = stack[0].toTensor(); + bool check = at::allclose(y_expected, y); + if (!check) { + std::cout << "x1:\n" << x1 << std::endl; + std::cout << "q1:\n" << q1 << std::endl; + std::cout << "x2:\n" << x2 << std::endl; + std::cout << "q2:\n" << q2 << std::endl; + std::cout << "y_expected:\n" << y_expected << std::endl; + 
std::cout << "y:\n" << y << std::endl; + } + CHECK_EQ(check, 1); +} + +TEST_F(Quantization, QuantUpsampleNearst2dDequantUInt8) { const auto graph_string = R"IR( graph(%x : Float(1, 1, 4, 4, strides=[16, 16, 4, 1], device=cpu)): %2 : int = prim::Constant[value=13]() diff --git a/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py b/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py index f5ba770898665f..807d27a20230dd 100644 --- a/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py +++ b/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py @@ -9,6 +9,7 @@ import torch import torch.distributed as dist from torch.distributed import rpc +from torch.distributed import distributed_c10d from torch.distributed._shard import ( shard_parameter, sharded_tensor, @@ -1464,6 +1465,92 @@ def test_gather_uneven(self) -> None: else: self.assertIsNone(full_tensor) + @with_comms + @skip_if_lt_x_gpu(4) + @requires_nccl() + def test_sharded_tensor_to_cpu(self): + cpu_spec = ChunkShardingSpec( + dim=0, + placements=[ + "rank:0/cpu", + "rank:1/cpu", + "rank:2/cpu", + "rank:3/cpu", + ], + ) + spec = ChunkShardingSpec( + dim=0, + placements=[ + "rank:0/cuda:0", + "rank:1/cuda:1", + "rank:2/cuda:2", + "rank:3/cuda:3", + ], + ) + h, w = 10, 20 + gloo_pg = dist.new_group(backend="gloo") + + # CPU sharded tensor should return the same instance (no copy) + st_cpu = sharded_tensor.zeros(cpu_spec, h, w, process_group=gloo_pg) + new_st_cpu = st_cpu.cpu() + self.assertEqual(st_cpu, new_st_cpu) + + # GPU sharded tensor to cpu + st = sharded_tensor.zeros(spec, h, w) + # test ability to move st to CPU + spec_before_move = st.sharding_spec() + new_st = st.cpu(process_group=gloo_pg) + # return a copy of orginal st + self.assertNotEqual(st, new_st) + # check the spec is still ChunkShardingSpec + spec_after_move = new_st.sharding_spec() + self.assertIsInstance(spec_after_move, ChunkShardingSpec) + # now it should be ProcessGroupGloo since it's on CPU + self.assertIsInstance(new_st._process_group, distributed_c10d.ProcessGroupGloo) + # test specs before and after the move almost the same except placement device + self.assertEqual(spec_before_move.dim, spec_after_move.dim) + self.assertEqual(len(spec_before_move.placements), len(spec_after_move.placements)) + for i, remote_device_after in enumerate(spec_after_move.placements): + remote_device_before = spec_before_move.placements[i] + self.assertEqual(remote_device_before.rank(), remote_device_after.rank()) + self.assertEqual(str(remote_device_after.device()), "cpu") + + # ensure metdata also get changed to CPU + metas = new_st.metadata().shards_metadata + for meta in metas: + self.assertEqual(str(meta.placement.device()), "cpu") + + # Test if a mixed sharded tensor (ShardedTensor with different devices) to cpu + mixed_spec = ChunkShardingSpec( + dim=0, + placements=[ + "rank:0/cpu", + "rank:1/cpu", + "rank:2/cuda:2", + "rank:3/cuda:3", + ], + ) + + st = sharded_tensor.zeros(mixed_spec, h, w, process_group=gloo_pg) + new_st = st.cpu() + # return a copy of orginal st + self.assertNotEqual(st, new_st) + # check the spec is still ChunkShardingSpec + spec_after_move = new_st.sharding_spec() + self.assertIsInstance(spec_after_move, ChunkShardingSpec) + # test specs before and after the move almost the same except placement device + self.assertEqual(mixed_spec.dim, spec_after_move.dim) + self.assertEqual(len(mixed_spec.placements), len(spec_after_move.placements)) + for i, remote_device_after in enumerate(spec_after_move.placements): + 
remote_device_before = mixed_spec.placements[i] + self.assertEqual(remote_device_before.rank(), remote_device_after.rank()) + self.assertEqual(str(remote_device_after.device()), "cpu") + + # ensure metdata also get changed to CPU + metas = new_st.metadata().shards_metadata + for meta in metas: + self.assertEqual(str(meta.placement.device()), "cpu") + @skip_if_lt_x_gpu(4) @requires_nccl() def test_uneven_shards(self): diff --git a/test/distributed/_shard/sharding_spec/test_sharding_spec.py b/test/distributed/_shard/sharding_spec/test_sharding_spec.py index 30aa3a12609794..a0e13d80d93e18 100644 --- a/test/distributed/_shard/sharding_spec/test_sharding_spec.py +++ b/test/distributed/_shard/sharding_spec/test_sharding_spec.py @@ -318,6 +318,22 @@ def _infer_enum_sharding_spec_case(self): self.assertTrue(isinstance(spec, EnumerableShardingSpec)) self.assertEqual(spec.shards, shards_metadata) + shards_metadata = [ + ShardMetadata( + shard_offsets=[0], + shard_sizes=[16], + placement="cuda:0", + ), + ShardMetadata( + shard_offsets=[16], + shard_sizes=[9], + placement="cuda:1", + ) + ] + spec = _infer_sharding_spec_from_shards_metadata(shards_metadata) + self.assertTrue(isinstance(spec, EnumerableShardingSpec)) + self.assertEqual(spec.shards, shards_metadata) + shards_metadata = [ ShardMetadata( shard_offsets=[0, 0], diff --git a/test/distributed/_shard/test_replicated_tensor.py b/test/distributed/_shard/test_replicated_tensor.py new file mode 100644 index 00000000000000..474fbfb90aaa37 --- /dev/null +++ b/test/distributed/_shard/test_replicated_tensor.py @@ -0,0 +1,76 @@ +# Owner(s): ["oncall: distributed"] + +import torch + +import torch.distributed as dist + +from torch.testing._internal.common_distributed import ( + requires_nccl, + skip_if_lt_x_gpu, +) + +from torch.testing._internal.distributed._shard.sharded_tensor import ( + ShardedTensorTestBase, + with_comms, +) +from torch.distributed._shard.replicated_tensor import ReplicatedTensor + + +class TestReplicatedTensor(ShardedTensorTestBase): + + @with_comms(init_rpc=False) + @skip_if_lt_x_gpu(4) + @requires_nccl() + def test_replicated_tensor_basics(self): + local_tensor = torch.ones(3, 3, device=f"cuda:{self.rank}") * 4 + replica_tensor = ReplicatedTensor(local_tensor) + print(replica_tensor.process_group) + # validate it's a replicated tensor by checking values on all rank + validated = replica_tensor.validate() + self.assertEqual(validated, True) + res = replica_tensor + 2 + self.assertIsInstance(res, torch.Tensor) + self.assertNotIsInstance(res, ReplicatedTensor) + self.assertEqual(res, torch.ones(3, 3) * 6) + + # modify local tensor on certain rank, and test if validation raise + if self.rank == 2: + local_tensor += 3 + + with self.assertRaisesRegex(ValueError, 'have different values'): + replica_tensor.validate() + + @with_comms(init_rpc=False) + @skip_if_lt_x_gpu(4) + @requires_nccl() + def test_replicated_tensor_inter_op_replicated_tensor(self): + local_tensor = torch.ones(3, 3, device=f"cuda:{self.rank}") + replica_tensor1 = ReplicatedTensor(local_tensor * 4) + replica_tensor2 = ReplicatedTensor(local_tensor * 6) + + new_tensor = replica_tensor1 * replica_tensor2 + self.assertIsInstance(new_tensor, ReplicatedTensor) + self.assertEqual(new_tensor, torch.ones(3, 3) * 24) + + # test replicated tensor inter-op with different pgs + new_pg = dist.new_group(ranks=[1, 2, 3]) + replica_tensor_new_group = ReplicatedTensor(local_tensor * 3, process_group=new_pg) + + with self.assertRaisesRegex(RuntimeError, 'must be in the same'): + 
replica_tensor_new_group * replica_tensor1 + + + @with_comms(init_rpc=False) + @skip_if_lt_x_gpu(4) + @requires_nccl() + def test_replicated_tensor_inter_op_tensor(self): + local_tensor = torch.ones(3, 3, device=f"cuda:{self.rank}") * 4 + replica_tensor = ReplicatedTensor(local_tensor) + + local_rand_tensor = torch.randn(3, 3, device=f"cuda:{self.rank}") + + new_tensor = replica_tensor + local_rand_tensor + self.assertIsInstance(new_tensor, torch.Tensor) + self.assertNotIsInstance(new_tensor, ReplicatedTensor) + + self.assertEqual(new_tensor, local_tensor + local_rand_tensor) diff --git a/test/distributed/elastic/agent/server/test/local_elastic_agent_test.py b/test/distributed/elastic/agent/server/test/local_elastic_agent_test.py index a931f3ef1d4e29..9c5a395054900b 100644 --- a/test/distributed/elastic/agent/server/test/local_elastic_agent_test.py +++ b/test/distributed/elastic/agent/server/test/local_elastic_agent_test.py @@ -38,8 +38,8 @@ from torch.distributed.rpc.backend_registry import BackendType from torch.testing._internal.common_utils import ( TEST_WITH_DEV_DBG_ASAN, - sandcastle_skip_if, TEST_WITH_TSAN, + sandcastle_skip_if, ) @@ -170,11 +170,26 @@ def _check_env_function(): "TORCHELASTIC_MAX_RESTARTS", "TORCHELASTIC_RUN_ID", "TORCHELASTIC_USE_AGENT_STORE", + "NCCL_ASYNC_ERROR_HANDLING", ] for var in env_vars: _ = os.environ[var] +def _check_env_value(key: str, expected: str): + # checks if the env var ``key`` matches ``value`` + # this function is intended to be used as the entrypoint to the elastic run + if key not in os.environ: + raise RuntimeError(f"Environment variable {key} not found in os.environ") + else: + actual = os.getenv(key) + if expected != actual: + raise RuntimeError( + f"os.environ['{key}']={actual}" + f" does not equal the expected value: {expected}" + ) + + def acquire_available_port(): """ Uses sockets to acquire an available port from the os for use. @@ -184,10 +199,7 @@ def acquire_available_port(): the port as quickly as possible. 
""" addrs = socket.getaddrinfo( - host="localhost", - port=None, - family=socket.AF_UNSPEC, - type=socket.SOCK_STREAM + host="localhost", port=None, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM ) for addr in addrs: @@ -398,7 +410,6 @@ def run_test_with_backend(self, backend: str, test_to_run: Callable): test_to_run() - def dummy_compute(self): res = self.run_agent(Conf(entrypoint=dummy_compute, local_world_size=2)) self.assertFalse(res.is_failed()) @@ -406,21 +417,15 @@ def dummy_compute(self): self.assertIsInstance(return_value, torch.Tensor) self.assertEqual((100, 100), return_value.shape) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_dummy_compute_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.dummy_compute) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_dummy_compute_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.dummy_compute) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_dummy_compute_etcd_v2(self): self.run_test_with_backend(backend="etcd-v2", test_to_run=self.dummy_compute) @@ -430,23 +435,19 @@ def run_happy_function(self): self.assertIsNone(res.return_values[0]) self.assertIsNone(res.return_values[1]) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_happy_function_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.run_happy_function) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_happy_function_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.run_happy_function) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_happy_function_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.run_happy_function) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.run_happy_function + ) def check_master_addr_port_override(self): master_addr = "test_host" @@ -463,17 +464,17 @@ def check_master_addr_port_override(self): self.assertFalse(res.is_failed()) self.assertIsNone(res.return_values[0]) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_check_master_addr_port_override_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.check_master_addr_port_override) + self.run_test_with_backend( + backend="etcd", test_to_run=self.check_master_addr_port_override + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_check_master_addr_port_override_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.check_master_addr_port_override) + 
self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.check_master_addr_port_override + ) def run_check_env_function(self): # just checks that all env vars that we need to set on the user script @@ -481,11 +482,47 @@ def run_check_env_function(self): res = self.run_agent(Conf(entrypoint=_check_env_function, local_world_size=1)) self.assertFalse(res.is_failed()) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + def run_check_nccl_async_error_handling_env(self): + # make sure NCCL_ASYNC_ERROR_HANDLING set in os.environ is honored + with patch.dict(os.environ, {"NCCL_ASYNC_ERROR_HANDLING": "0"}): + res = self.run_agent( + Conf( + entrypoint=_check_env_value, + local_world_size=1, + args=("NCCL_ASYNC_ERROR_HANDLING", "0"), + ) + ) + self.assertFalse(res.is_failed()) + + def run_check_nccl_async_error_handling_env_default(self): + # if not present in env var it should default to 1 + res = self.run_agent( + Conf( + entrypoint=_check_env_value, + local_world_size=1, + args=("NCCL_ASYNC_ERROR_HANDLING", "1"), + ) + ) + self.assertFalse(res.is_failed()) + + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_check_env_function_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.run_check_env_function) + self.run_test_with_backend( + backend="etcd", test_to_run=self.run_check_env_function + ) + + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") + def test_run_check_nccl_async_error_handling_env_c10d(self): + self.run_test_with_backend( + backend="c10d", test_to_run=self.run_check_nccl_async_error_handling_env + ) + + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") + def test_run_check_nccl_async_error_handling_env_default_c10d(self): + self.run_test_with_backend( + backend="c10d", + test_to_run=self.run_check_nccl_async_error_handling_env_default, + ) def run_function_with_return_value(self): res = self.run_agent(Conf(entrypoint=_echo, args=("foo",), local_world_size=2)) @@ -493,44 +530,38 @@ def run_function_with_return_value(self): self.assertEqual("foo", res.return_values[0]) self.assertEqual("foo", res.return_values[1]) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_function_with_return_value_c10d(self): - self.run_test_with_backend(backend="c10d", test_to_run=self.run_function_with_return_value) + self.run_test_with_backend( + backend="c10d", test_to_run=self.run_function_with_return_value + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_function_with_return_value_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.run_function_with_return_value) + self.run_test_with_backend( + backend="etcd", test_to_run=self.run_function_with_return_value + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_function_with_return_value_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.run_function_with_return_value) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.run_function_with_return_value + ) def simple_dist_sum(self): res = 
self.run_agent(Conf(entrypoint=_dist_sum, local_world_size=2)) self.assertFalse(res.is_failed()) # _dist_sum internally checks that the sum computed is valid - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_simple_dist_sum_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.simple_dist_sum) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_simple_dist_sum_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.simple_dist_sum) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_simple_dist_sum_etcd_v2(self): self.run_test_with_backend(backend="etcd-v2", test_to_run=self.simple_dist_sum) @@ -556,21 +587,27 @@ def run_distributed_sum_homogeneous(self): "test incompatible with dev/dbg asan or tsan", ) def test_run_distributed_sum_homogeneous_c10d(self): - self.run_test_with_backend(backend="c10d", test_to_run=self.run_distributed_sum_homogeneous) + self.run_test_with_backend( + backend="c10d", test_to_run=self.run_distributed_sum_homogeneous + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_run_distributed_sum_homogeneous_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.run_distributed_sum_homogeneous) + self.run_test_with_backend( + backend="etcd", test_to_run=self.run_distributed_sum_homogeneous + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_run_distributed_sum_homogeneous_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.run_distributed_sum_homogeneous) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.run_distributed_sum_homogeneous + ) def run_distributed_sum_heterogeneous(self): # sums all ranks on 3 agents; each running 1, 2, 3 workers respectively @@ -593,23 +630,23 @@ def run_distributed_sum_heterogeneous(self): ranks.update(run_results.return_values.keys()) self.assertSetEqual(set(range(1 + 2 + 3)), ranks) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_distributed_sum_heterogeneous_c10d(self): - self.run_test_with_backend(backend="c10d", test_to_run=self.run_distributed_sum_heterogeneous) + self.run_test_with_backend( + backend="c10d", test_to_run=self.run_distributed_sum_heterogeneous + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_distributed_sum_heterogeneous_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.run_distributed_sum_heterogeneous) + self.run_test_with_backend( + backend="etcd", test_to_run=self.run_distributed_sum_heterogeneous + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_distributed_sum_heterogeneous_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", 
test_to_run=self.run_distributed_sum_heterogeneous) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.run_distributed_sum_heterogeneous + ) def run_sad_function(self): """ @@ -632,21 +669,15 @@ def run_sad_function(self): self.assertEqual(data["message"], failure_data["message"]) self.assertEqual(int(data["extraInfo"]["timestamp"]), failure.timestamp) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_sad_function_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.run_sad_function) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_sad_function_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.run_sad_function) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_sad_function_etcd_v2(self): self.run_test_with_backend(backend="etcd-v2", test_to_run=self.run_sad_function) @@ -663,23 +694,23 @@ def run_bipolar_function(self): self.assertEqual(WorkerState.FAILED, agent.get_worker_group().state) self.assertTrue(agent._total_execution_time > 0) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_bipolar_function_c10d(self): - self.run_test_with_backend(backend="c10d", test_to_run=self.run_bipolar_function) + self.run_test_with_backend( + backend="c10d", test_to_run=self.run_bipolar_function + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_bipolar_function_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.run_bipolar_function) + self.run_test_with_backend( + backend="etcd", test_to_run=self.run_bipolar_function + ) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_run_bipolar_function_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.run_bipolar_function) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.run_bipolar_function + ) def correct_rank_assignment_heterogeneous(self): node_configs = [ @@ -710,14 +741,18 @@ def correct_rank_assignment_heterogeneous(self): "test incompatible with dev/dbg asan or tsan", ) def test_correct_rank_assignment_heterogeneous_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.correct_rank_assignment_heterogeneous) + self.run_test_with_backend( + backend="etcd", test_to_run=self.correct_rank_assignment_heterogeneous + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_correct_rank_assignment_heterogeneous_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.correct_rank_assignment_heterogeneous) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.correct_rank_assignment_heterogeneous + ) def correct_rank_assignment_homogeneous(self): node_configs = [ @@ -744,14 +779,18 @@ def 
correct_rank_assignment_homogeneous(self): "test incompatible with dev/dbg asan or tsan", ) def test_correct_rank_assignment_homogeneous_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.correct_rank_assignment_homogeneous) + self.run_test_with_backend( + backend="etcd", test_to_run=self.correct_rank_assignment_homogeneous + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_correct_rank_assignment_homogeneous_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.correct_rank_assignment_homogeneous) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.correct_rank_assignment_homogeneous + ) def assert_rank_consistency( self, @@ -853,14 +892,18 @@ def double_agent_fault_tolerance(self): "test incompatible with dev/dbg asan or tsan", ) def test_double_agent_fault_tolerance_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.double_agent_fault_tolerance) + self.run_test_with_backend( + backend="etcd", test_to_run=self.double_agent_fault_tolerance + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_double_agent_fault_tolerance_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.double_agent_fault_tolerance) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.double_agent_fault_tolerance + ) def double_agent_elastic(self): """ @@ -907,21 +950,27 @@ def double_agent_elastic(self): "test incompatible with dev/dbg asan or tsan", ) def test_double_agent_elastic_c10d(self): - self.run_test_with_backend(backend="c10d", test_to_run=self.double_agent_elastic) + self.run_test_with_backend( + backend="c10d", test_to_run=self.double_agent_elastic + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_double_agent_elastic_etcd(self): - self.run_test_with_backend(backend="etcd", test_to_run=self.double_agent_elastic) + self.run_test_with_backend( + backend="etcd", test_to_run=self.double_agent_elastic + ) @unittest.skipIf( TEST_WITH_DEV_DBG_ASAN or TEST_WITH_TSAN, "test incompatible with dev/dbg asan or tsan", ) def test_double_agent_elastic_etcd_v2(self): - self.run_test_with_backend(backend="etcd-v2", test_to_run=self.double_agent_elastic) + self.run_test_with_backend( + backend="etcd-v2", test_to_run=self.double_agent_elastic + ) def torch_rpc(self): """ @@ -1056,21 +1105,15 @@ def barrier_failed(self, barrier_mock): self.assertFalse(res.is_failed()) barrier_mock.assert_called_once() - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_barrier_failed_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.barrier_failed) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_barrier_failed_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.barrier_failed) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_barrier_failed_etcd_v2(self): self.run_test_with_backend(backend="etcd-v2", test_to_run=self.barrier_failed) @@ -1089,20 +1132,14 @@ def 
shutdown_called(self, start_processes_mock): agent.run("worker") pcontext_mock.close.assert_called_once() - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_shutdown_called_c10d(self): self.run_test_with_backend(backend="c10d", test_to_run=self.shutdown_called) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_shutdown_called_etcd(self): self.run_test_with_backend(backend="etcd", test_to_run=self.shutdown_called) - @sandcastle_skip_if( - TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan" - ) + @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan") def test_shutdown_called_etcd_v2(self): self.run_test_with_backend(backend="etcd-v2", test_to_run=self.shutdown_called) diff --git a/test/distributed/fsdp/test_fsdp_clip_grad_norm.py b/test/distributed/fsdp/test_fsdp_clip_grad_norm.py new file mode 100644 index 00000000000000..a88eb2deeb5378 --- /dev/null +++ b/test/distributed/fsdp/test_fsdp_clip_grad_norm.py @@ -0,0 +1,117 @@ +# Owner(s): ["oncall: distributed"] + +import sys +from math import inf + +import torch +from torch import distributed as dist +from torch.distributed.fsdp.fully_sharded_data_parallel import ( + FullyShardedDataParallel as FSDP, + CPUOffload, + _calc_grad_norm, +) +from torch.nn import utils as nn_utils +from torch.testing._internal.common_distributed import skip_if_lt_x_gpu +from torch.testing._internal.common_fsdp import ( + DeterministicModel, + FSDPTest, + _collect_total_grad_norm_fsdp, + _collect_total_grad_norm_local, +) +from torch.testing._internal.common_utils import ( + TEST_WITH_DEV_DBG_ASAN, + run_tests, + parametrize, + instantiate_parametrized_tests, +) + + +if not dist.is_available(): + print("Distributed not available, skipping tests", file=sys.stderr) + sys.exit(0) + +if TEST_WITH_DEV_DBG_ASAN: + print( + "Skip dev-asan as torch + multiprocessing spawn have known issues", + file=sys.stderr, + ) + sys.exit(0) + + +class TestClipGradNorm(FSDPTest): + def _run_fsdp_one_iteration(self, norm_type, nested_fsdp, cpu_offload): + """Test FSDP with clip grad norm.""" + fsdp_model = DeterministicModel(nested_fsdp, cpu_offload=cpu_offload) + local_model = DeterministicModel(False) + input = torch.rand(14, 2, device=self.rank) + fsdp_model = FSDP(fsdp_model, cpu_offload=cpu_offload) + self.assertTrue(len(input) >= self.world_size) + out = local_model(input[: self.world_size]) + out.sum().backward() + in_data = torch.tensor(input[self.rank], device=self.rank) + out_fsdp = fsdp_model(in_data) + out_fsdp.sum().backward() + total_norms_fsdp = _collect_total_grad_norm_fsdp( + fsdp_model, norm_type, self.rank + ) + total_norms_local = _collect_total_grad_norm_local(local_model, norm_type) + total_norms_local /= self.world_size + norm_cap = total_norms_fsdp / 2.0 + self.assertEqual(total_norms_local, total_norms_fsdp) + fsdp_model.clip_grad_norm_(norm_cap, norm_type=norm_type) + nn_utils.clip_grad_norm_( + local_model.parameters(), norm_cap, norm_type=norm_type + ) + total_norms_after_clip_fsdp = _collect_total_grad_norm_fsdp( + fsdp_model, norm_type, self.rank + ) + total_norms_after_clip_local = _collect_total_grad_norm_local( + local_model, norm_type + ) + self.assertTrue(total_norms_after_clip_fsdp <= norm_cap) + self.assertEqual(total_norms_after_clip_local, 
total_norms_after_clip_fsdp) + + @skip_if_lt_x_gpu(2) + @parametrize("norm_type", [2.0, inf]) + @parametrize("nested_fsdp", [True, False]) + @parametrize( + "cpu_offload", + [CPUOffload(offload_params=True), CPUOffload(offload_params=False)], + ) + def test_fsdp_clip_grad_norm(self, norm_type, nested_fsdp, cpu_offload): + """Test FSDP with clip grad norm.""" + self._run_fsdp_one_iteration(norm_type, nested_fsdp, cpu_offload) + + +class TestCalcuGradNorm(FSDPTest): + @skip_if_lt_x_gpu(2) + @parametrize("norm_type", [2.0, inf]) + @parametrize("nested_fsdp", [True, False]) + def test_fsdp_calc_grad_norm(self, norm_type, nested_fsdp): + """Test grad norm cal API.""" + model = FSDP(DeterministicModel(nested_fsdp)) + input = torch.rand(15, 2, device=self.rank) + out = model(input) + out.sum().backward() + total_norm = _calc_grad_norm(model.params_with_grad, norm_type) + total_norm_expected = _collect_total_grad_norm_local(model, norm_type) + self.assertEqual(total_norm, total_norm_expected) + + @skip_if_lt_x_gpu(2) + @parametrize("norm_type", [1.3, 2.5]) + def test_fsdp_calc_grad_norm_error(self, norm_type): + """Test the abnormal cases of grad norm cal API.""" + model = DeterministicModel(False) + input = torch.rand(12, 2, device=self.rank) + out = model(input) + out.sum().backward() + error_msg = f"Order {norm_type} not supported for matrix norm" + with self.assertRaisesRegex(RuntimeError, error_msg): + total_norm = _calc_grad_norm(model.parameters(), norm_type) + + +instantiate_parametrized_tests(TestClipGradNorm) +instantiate_parametrized_tests(TestCalcuGradNorm) + +if __name__ == "__main__": + run_tests() diff --git a/test/distributed/fsdp/test_fsdp_comm.py b/test/distributed/fsdp/test_fsdp_comm.py index 86cbaebb086327..f86880ff21e24d 100644 --- a/test/distributed/fsdp/test_fsdp_comm.py +++ b/test/distributed/fsdp/test_fsdp_comm.py @@ -6,6 +6,7 @@ import torch from torch import distributed as dist from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp.fully_sharded_data_parallel import ShardingStrategy from torch.testing._internal.common_distributed import skip_if_lt_x_gpu from torch.testing._internal.common_fsdp import FSDPTest, NestedWrappedModule from torch.testing._internal.common_utils import ( @@ -38,10 +39,15 @@ class TestCommunication(FSDPTest): "use_no_sync", [False, True], ) + @parametrize( + "sharding_strategy", + [ShardingStrategy.SHARD_GRAD_OP, None], + ) def test_communication( self, nested_model: bool, use_no_sync: bool, + sharding_strategy: ShardingStrategy, ): """ Tests FSDP's communication cost in terms of calls to collective @@ -60,10 +66,14 @@ def test_communication( group = dist.distributed_c10d._get_default_group() device = torch.device("cuda") if nested_model: - model = NestedWrappedModule(group, wrap_fsdp=True) - fsdp_model: FSDP = FSDP(model, group).to(device) + model = NestedWrappedModule(group, wrap_fsdp=True, sharding_strategy=sharding_strategy) + fsdp_model: FSDP = FSDP(model, group, sharding_strategy=sharding_strategy).to(device) else: - fsdp_model: FSDP = self._get_wrapped_model(group, cuda_first=False) + fsdp_model: FSDP = self._get_wrapped_model( + group, + cuda_first=False, + config={"sharding_strategy": sharding_strategy}, + ) batch = fsdp_model.module.get_input(device) # Count the number of FSDP instances @@ -74,10 +84,16 @@ def test_communication( # Count the number of all-gathers and reduce-scatters by mocking # `_all_gather_base()` and `_reducer_scatter_base()` - # Both with and without `no_sync()`: - # 
Forward: `num_fsdp` all-gathers + # + # with `no_sync()`: + # Forward: when no_sync mode, root will not free full parameters, + # thus there will be `num_fsdp-1` all-gathers. + # Backward: `num_fsdp` - 1 all-gathers (only excluding the root) + # without `no_sync()`: + # Forward: all instances free full parameters, thus there will be `` + # `num_fsdp` all-gathers. # Backward: `num_fsdp` - 1 all-gathers (only excluding the root) - expected_num_all_gather_no_sync = num_fsdp + (num_fsdp - 1) + expected_num_all_gather_no_sync = (num_fsdp - 1) + (num_fsdp - 1) expected_num_all_gather_sync = num_fsdp + (num_fsdp - 1) expected_num_reduce_scatter_no_sync = 0 expected_num_reduce_scatter_sync = num_fsdp @@ -92,7 +108,7 @@ def reset_mocks(): if use_no_sync: # Check the communication cost when using `no_sync()` - for _ in range(num_no_sync_iters): + for i in range(num_no_sync_iters): reset_mocks() with fsdp_model.no_sync(): output = fsdp_model(*batch) @@ -100,33 +116,69 @@ def reset_mocks(): loss.backward() num_all_gather = mock_all_gather.call_count num_reduce_scatter = mock_reduce_scatter.call_count - assert num_all_gather == expected_num_all_gather_no_sync, \ - f"Expected {expected_num_all_gather_no_sync} " \ - f"all-gathers but saw {num_all_gather} all-gathers " \ + # in the first iteration, all fsdp instances including root + # need to all_gather shards in the forward pass. + if i == 0: + expected_num_all_gather_no_sync_updated = expected_num_all_gather_no_sync + 1 + # in the first iteration, all fsdp instances need to all_gather shards + # in the forward pass + if sharding_strategy == ShardingStrategy.SHARD_GRAD_OP: + expected_num_all_gather_no_sync_updated = num_fsdp + else: + expected_num_all_gather_no_sync_updated = expected_num_all_gather_no_sync + # full parameters are not freed after first iteration in the no_sync mode + if sharding_strategy == ShardingStrategy.SHARD_GRAD_OP: + expected_num_all_gather_no_sync_updated = 0 + self.assertEqual( + num_all_gather, + expected_num_all_gather_no_sync_updated, + f"Expected {expected_num_all_gather_no_sync_updated} " + f"all-gathers but saw {num_all_gather} all-gathers " f"when using `no_sync()`" - assert num_reduce_scatter == \ - expected_num_reduce_scatter_no_sync, \ - f"Expected {expected_num_reduce_scatter_no_sync} " \ - f"reduce-scatters but saw {num_reduce_scatter} " \ + ) + self.assertEqual( + num_reduce_scatter, + expected_num_reduce_scatter_no_sync, + f"Expected {expected_num_reduce_scatter_no_sync} " + f"reduce-scatters but saw {num_reduce_scatter} " "reduce-scatters when using `no_sync()`" + ) # Check the normal communication cost (when not using `no_sync()`) - for _ in range(num_sync_iters): + for i in range(num_sync_iters): reset_mocks() output = fsdp_model(*batch) loss = fsdp_model.module.get_loss(batch, output) loss.backward() num_all_gather = mock_all_gather.call_count num_reduce_scatter = mock_reduce_scatter.call_count - assert num_all_gather == expected_num_all_gather_sync, \ - f"Expected {expected_num_all_gather_sync} all-gathers " \ - f"but saw {num_all_gather} all-gathers when not using " \ + # previous non-sync iteration does not free full parameters for + # the root instance. 
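# The communication-cost assertions above rely on a simple counting pattern:
# temporarily replace the two collectives with MagicMocks, run an iteration,
# and compare call_count with the expected number of all-gathers and
# reduce-scatters. A stripped-down sketch of that pattern (the private names
# mirror the ones mocked in the test; no real process group is touched since
# the mocks are never forwarded to NCCL):
from unittest import mock

import torch.distributed as dist

with mock.patch.object(dist, "_all_gather_base") as mock_all_gather, \
        mock.patch.object(dist, "_reduce_scatter_base") as mock_reduce_scatter:
    # Stand-in for one forward/backward of the wrapped model: each FSDP unit
    # would gather its full parameters and reduce-scatter its gradients.
    dist._all_gather_base(None, None)
    dist._reduce_scatter_base(None, None)
    assert mock_all_gather.call_count == 1
    assert mock_reduce_scatter.call_count == 1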
+ if use_no_sync and i == 0: + expected_num_all_gather_sync_updated = expected_num_all_gather_sync - 1 + # previous non-sync iteration does not free full parameters + if sharding_strategy == ShardingStrategy.SHARD_GRAD_OP: + expected_num_all_gather_sync_updated = 0 + else: + expected_num_all_gather_sync_updated = expected_num_all_gather_sync + # no need to all_gather shards in the backward pass when in + # SHARD_GRAD_OP mode + if sharding_strategy == ShardingStrategy.SHARD_GRAD_OP: + expected_num_all_gather_sync_updated = num_fsdp + self.assertEqual( + num_all_gather, + expected_num_all_gather_sync_updated, + f"Expected {expected_num_all_gather_sync_updated} all-gathers " + f"but saw {num_all_gather} all-gathers when not using " "`no_sync()`" - assert num_reduce_scatter == \ - expected_num_reduce_scatter_sync, \ - f"Expected {expected_num_reduce_scatter_sync} reduce-" \ - f"scatters but saw {num_reduce_scatter} reduce-scatters " \ + ) + self.assertEqual( + num_reduce_scatter, + expected_num_reduce_scatter_sync, + f"Expected {expected_num_reduce_scatter_sync} reduce-" + f"scatters but saw {num_reduce_scatter} reduce-scatters " "when not using `no_sync()`" + ) instantiate_parametrized_tests(TestCommunication) diff --git a/test/distributed/fsdp/test_fsdp_core.py b/test/distributed/fsdp/test_fsdp_core.py index ef91d4db083603..7ea54f27ce6c8f 100644 --- a/test/distributed/fsdp/test_fsdp_core.py +++ b/test/distributed/fsdp/test_fsdp_core.py @@ -1,6 +1,7 @@ # Owner(s): ["oncall: distributed"] import functools +import itertools import sys from unittest import mock @@ -18,6 +19,7 @@ NestedWrappedModule, NestedWrappedModuleWithDelay, TransformerWithSharedParams, + subtest_name ) from torch.testing._internal.common_utils import ( TEST_WITH_DEV_DBG_ASAN, @@ -26,8 +28,8 @@ run_tests, ) -from torch.distributed.fsdp import CPUOffload -from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch +from torch.distributed.fsdp import CPUOffload, MixedPrecision +from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch, ShardingStrategy if not dist.is_available(): @@ -41,6 +43,23 @@ ) sys.exit(0) +params = "cpu_offload,backward_prefetch,sharding_strategy" +cpu_offload_config = [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] +backward_prefetch_config = [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] +sharding_strategy_config = [ShardingStrategy.SHARD_GRAD_OP, None] +configs = list(itertools.product(cpu_offload_config, + backward_prefetch_config, + sharding_strategy_config)) +test_name_mapping = { + str(CPUOffload(offload_params=True)): "offload_true", + str(CPUOffload(offload_params=False)): "offload_false", + str(BackwardPrefetch.BACKWARD_PRE): "prefetch_pre", + str(BackwardPrefetch.BACKWARD_POST): "prefetch_post", + str(ShardingStrategy.SHARD_GRAD_OP): "shard_grad_op", +} + +subtest_name = functools.partial(subtest_name, test_name_mapping) + class TestParityWithDDP(FSDPTest): """ @@ -63,15 +82,8 @@ def _get_init_modes_for_test(self, cpu_offload): return modes @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_nested_wrapped_model(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + def test_nested_wrapped_model(self, cpu_offload, backward_prefetch, sharding_strategy): init_modes = 
self._get_init_modes_for_test(cpu_offload) for fsdp_init_mode in init_modes: with self.subTest(fsdp_init_mode=fsdp_init_mode): @@ -80,18 +92,39 @@ def test_nested_wrapped_model(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + sharding_strategy=sharding_strategy, ) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_nested_all_wrapped_model(self, cpu_offload, backward_prefetch): + @parametrize("cpu_offload", cpu_offload_config) + @parametrize("sharding_strategy", sharding_strategy_config) + @parametrize("mixed_precision", [True, False]) + def test_nested_wrapped_model_single_iteration_mixed_precision( + self, + cpu_offload, + sharding_strategy, + mixed_precision + ): + init_modes = self._get_init_modes_for_test(cpu_offload) + mixed_precision = MixedPrecision() if mixed_precision else None + for fsdp_init_mode in init_modes: + with self.subTest(fsdp_init_mode=fsdp_init_mode): + self._test_identical_outputs( + NestedWrappedModule, + # Only run one step for comparison, as usually grad scaler + # is needed to avoid NaN after first step. + num_steps=1, + fsdp_init_mode=fsdp_init_mode, + cpu_offload=cpu_offload, + sharding_strategy=sharding_strategy, + mixed_precision=mixed_precision, + ) + + + @skip_if_lt_x_gpu(2) + @parametrize(params, configs, subtest_name) + @parametrize("clip_norm_type", [2.0, None]) + def test_nested_all_wrapped_model(self, cpu_offload, backward_prefetch, sharding_strategy, clip_norm_type): init_modes = self._get_init_modes_for_test(cpu_offload) for fsdp_init_mode in init_modes: with self.subTest(fsdp_init_mode=fsdp_init_mode): @@ -101,18 +134,14 @@ def test_nested_all_wrapped_model(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + norm_type=clip_norm_type, + sharding_strategy=sharding_strategy, ) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_transformer_parameterized(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + @parametrize("clip_norm_type", [2.0, None]) + def test_transformer_parameterized(self, cpu_offload, backward_prefetch, sharding_strategy, clip_norm_type): init_modes = self._get_init_modes_for_test(cpu_offload) for fsdp_init_mode in init_modes: with self.subTest(fsdp_init_mode=fsdp_init_mode): @@ -121,18 +150,13 @@ def test_transformer_parameterized(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + norm_type=clip_norm_type, + sharding_strategy=sharding_strategy, ) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_delayed_optim_step(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + def test_delayed_optim_step(self, cpu_offload, backward_prefetch, sharding_strategy): # We use a model with a long CUDA delay right before the optimizer step. 
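# The @parametrize(params, configs, subtest_name) pattern used throughout this
# file expands each test over the cross product of config values and gives
# every combination a readable suffix through a name mapping. A minimal sketch
# of the expansion and naming (hypothetical labels; the real ones come from
# test_name_mapping above):
import itertools

offload = ["offload_true", "offload_false"]
prefetch = ["prefetch_pre", "prefetch_post", "no_prefetch"]
sharding = ["shard_grad_op", "full_shard"]

subtests = ["_".join(combo) for combo in itertools.product(offload, prefetch, sharding)]
assert len(subtests) == 2 * 3 * 2                       # 12 generated subtests
assert subtests[0] == "offload_true_prefetch_pre_shard_grad_op"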
# This tests our streams logic, and that we don't start the allgather # until after the optimization step completes. @@ -147,18 +171,12 @@ def test_delayed_optim_step(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + sharding_strategy=sharding_strategy, ) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_delayed_reduce_scatter(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + def test_delayed_reduce_scatter(self, cpu_offload, backward_prefetch, sharding_strategy): # We insert a delay in the torch.distributed._reduce_scatter_base op, so that # the post_backward_stream takes much longer than the backward pass. # This tests that we properly block at the end of the backward pass for @@ -174,21 +192,16 @@ def test_delayed_reduce_scatter(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + sharding_strategy=sharding_strategy, ) def _dummy_ddp_fn(self, model): return DummyDDP(model) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_mixture_of_experts(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + @parametrize("clip_norm_type", [2.0, None]) + def test_mixture_of_experts(self, cpu_offload, backward_prefetch, sharding_strategy, clip_norm_type): init_modes = self._get_init_modes_for_test(cpu_offload) for fsdp_init_mode in init_modes: with self.subTest(fsdp_init_mode=fsdp_init_mode): @@ -200,18 +213,13 @@ def test_mixture_of_experts(self, cpu_offload, backward_prefetch): fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + norm_type=clip_norm_type, + sharding_strategy=sharding_strategy, ) @skip_if_lt_x_gpu(2) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)] - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_mixture_of_experts_with_delay_before_free(self, cpu_offload, backward_prefetch): + @parametrize(params, configs, subtest_name) + def test_mixture_of_experts_with_delay_before_free(self, cpu_offload, backward_prefetch, sharding_strategy): init_modes = self._get_init_modes_for_test(cpu_offload) for fsdp_init_mode in init_modes: with self.subTest(fsdp_init_mode=fsdp_init_mode): @@ -222,15 +230,21 @@ def test_mixture_of_experts_with_delay_before_free(self, cpu_offload, backward_p fsdp_init_mode=fsdp_init_mode, cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, + sharding_strategy=sharding_strategy, ) class TestParamInit(FSDPTest): @skip_if_lt_x_gpu(2) - def test_param_change_after_init(self): + @parametrize("mixed_precision", [True, False]) + def test_param_change_after_init(self, mixed_precision): group = dist.distributed_c10d._get_default_group() # Establish reference behavior. 
- model = self._get_wrapped_model(group, cuda_first=False) + mixed_precision = MixedPrecision() if mixed_precision else None + config = {"mixed_precision": mixed_precision} + model = self._get_wrapped_model( + group, mixed_precision=mixed_precision, cuda_first=False + ) model.eval() # no dropout for this test input = model.module.get_input(torch.device("cuda")) ref_output = model(*input) @@ -284,10 +298,15 @@ def _test_output_backward_hooks(self, model): @skip_if_lt_x_gpu(2) @parametrize("cuda_first", [False, True]) - def test_register_functions_called(self, cuda_first): + @parametrize("mixed_precision", [True, False]) + def test_register_functions_called(self, cuda_first, mixed_precision): """Tests that _register_{pre|post}_backward_hooks called during forward.""" group = dist.distributed_c10d._get_default_group() - model = self._get_wrapped_model(group, cuda_first=cuda_first) + mixed_precision = MixedPrecision() if mixed_precision else None + config = {"mixed_precision": mixed_precision} + model = self._get_wrapped_model( + group, mixed_precision=mixed_precision, cuda_first=cuda_first + ) input = model.module.get_input(torch.device("cuda")) model._register_post_backward_hooks = mock.MagicMock(return_value=None) model._register_pre_backward_hooks = mock.MagicMock(return_value=None) @@ -300,11 +319,19 @@ def test_register_functions_called(self, cuda_first): class TestNoGrad(FSDPTest): @skip_if_lt_x_gpu(2) - def test_transformer_no_grad(self): + @parametrize("mixed_precision", [True, False]) + def test_transformer_no_grad(self, mixed_precision): group = dist.distributed_c10d._get_default_group() - model = self._get_wrapped_model(group, cuda_first=False) + mixed_precision = MixedPrecision() if mixed_precision else None + config = {"mixed_precision": mixed_precision} + model = self._get_wrapped_model(group, config=config, cuda_first=False) # Train model for a step - self._train_for_several_steps(model, num_steps=1, autocast=False) + self._train_for_several_steps( + model, + num_steps=1, + autocast=False, + mixed_precision=config["mixed_precision"] + ) model.eval() # no dropout for this test @@ -321,6 +348,8 @@ def test_transformer_no_grad(self): instantiate_parametrized_tests(TestHooks) instantiate_parametrized_tests(TestParityWithDDP) +instantiate_parametrized_tests(TestNoGrad) +instantiate_parametrized_tests(TestParamInit) if __name__ == "__main__": run_tests() diff --git a/test/distributed/fsdp/test_fsdp_grad_acc.py b/test/distributed/fsdp/test_fsdp_grad_acc.py new file mode 100644 index 00000000000000..f2569266c34711 --- /dev/null +++ b/test/distributed/fsdp/test_fsdp_grad_acc.py @@ -0,0 +1,261 @@ +# Owner(s): ["oncall: distributed"] + +import contextlib +import itertools +import sys +from dataclasses import dataclass +from typing import List, Optional, Tuple + +import torch +from torch import distributed as dist +from torch.distributed.fsdp import CPUOffload +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch +from torch.testing._internal.common_distributed import skip_if_lt_x_gpu +from torch.testing._internal.common_fsdp import FSDPTest +from torch.testing._internal.common_utils import ( + TEST_WITH_DEV_DBG_ASAN, + instantiate_parametrized_tests, + parametrize, + run_tests, +) + +if not dist.is_available(): + print("Distributed not available, skipping tests", file=sys.stderr) + sys.exit(0) + +if TEST_WITH_DEV_DBG_ASAN: + print( + "Skip dev-asan as torch + multiprocessing spawn have known 
issues", + file=sys.stderr, + ) + sys.exit(0) + + +@dataclass +class _GradAccConfig: + """ + This configures how gradients are accumulated in :meth:`_test_grad_acc`. + Each instance of this class represents ``num_iters``-many consecutive + iterations, where the ``no_sync()`` context manager is used or not as given + by ``use_no_sync``. + + Attributes: + use_no_sync (bool): Indicates whether to use the ``no_sync()`` context + manager as the way to accumulate gradients. + num_iters (int): Number of iterations to accumulate gradients. + """ + use_no_sync: bool + num_iters: int + + def __repr__(self) -> str: + # Override to remove any spaces in the string to appease the internal + # build's test name parser + return ( + f"(use_no_sync={self.use_no_sync}," + f"num_iters={self.num_iters})" + ) + + +@dataclass +class _GradAccConfigs: + """ + This wraps a :class:`list` of :class:`_GradAccConfig` instances with the + sole purpose of overriding :meth:`__repr__` to remove spaces. + """ + configs: List[_GradAccConfig] + + def __repr__(self) -> str: + # Override to remove any spaces in the string to appease the internal + # build's test name parser + return ( + "[" + ",".join(config.__repr__() for config in self.configs) + "]" + ) + + +class TestGradAcc(FSDPTest): + """Tests ``FullyShardedDataParallel``'s gradient accumulation via both its + ``no_sync()`` context manager and without the context manager.""" + + def _test_grad_acc( + self, + batch_dim: int, + configs: List[_GradAccConfig], + cpu_offload: CPUOffload, + backward_prefetch: Optional[BackwardPrefetch], + ): + """ + Tests gradient accumulation by comparing a run that trains sequentially + through some batches while accumulating gradients with a run that + trains on the concatenation of those batches in a single iteration. + + The last iteration always synchronizes gradients regardless of what is + specified by the last element of ``configs``. + + Arguments: + batch_dim (int): Batch dimension in the input tensor to be passed + into the model for the forward pass. + configs (List[_GradAccConfig]): :class:`list` of configurations + specifying how gradients are accumulated; for example, a list + corresponding to [(False, 2), (True, 2), (False, 2)] indicates + to accumulate over 2 + 2 + 2 = 6 total iterations, where the + first two do not use ``no_sync()``, the middle two do use + ``no_sync()``, and the final two again do not use + ``no_sync()``. + cpu_offload (CPUOffload): Configures CPU offloading. + backward_prefetch (Optional[BackwardPrefetch]): Specifies at which + point to prefetch the next layer's full parameters during the + backward pass, if at all. 
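# The comparison described above can be reproduced locally without FSDP: for a
# loss that sums over samples, accumulating .backward() over several batches
# yields the same gradients as one backward over their concatenation. A
# minimal sketch of that reference behaviour (plain single-process torch):
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2, bias=False)
batches = [torch.randn(3, 4) for _ in range(3)]

# accumulate gradients over the individual batches
model.zero_grad()
for b in batches:
    model(b).sum().backward()
acc_grad = model.weight.grad.clone()

# single backward over the concatenated batch
model.zero_grad()
model(torch.cat(batches, dim=0)).sum().backward()
ref_grad = model.weight.grad.clone()

torch.testing.assert_close(acc_grad, ref_grad)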
+ """ + # Gradient accumulation outside `no_sync()` is not currently compatible + # with CPU offloading + if cpu_offload.offload_params and \ + any(not config.use_no_sync for config in configs): + return + old_allow_tf32 = torch.backends.cuda.matmul.allow_tf32 + try: + # Disable TF32 to prevent floating point drift + torch.backends.cuda.matmul.allow_tf32 = False + + # Initialize the FSDP model and optimizer + group = dist.distributed_c10d._get_default_group() + fsdp_model: FSDP = self._get_wrapped_model( + group, cuda_first=False, add_bn=False, + config={ + "cpu_offload": cpu_offload, + "backward_prefetch": backward_prefetch, + }, + ) # disable BN since the test uses varying batch sizes + fsdp_model.eval() # disable dropout + device = torch.device("cuda") + optim = torch.optim.SGD( + fsdp_model.parameters(), lr=0.01, momentum=0.9, + ) + + # Generate the sequence of batches, each containing the same data + # but permuted + def permute_tensor(x: torch.Tensor): + return x.view(-1)[torch.randperm(x.numel())].view_as(x) + + batch: Tuple[torch.Tensor, ...] = \ + fsdp_model.module.get_input(device) + batches: List[Tuple[torch.Tensor, ...]] = [batch] + num_iters_to_acc = sum(config.num_iters for config in configs) + for _ in range(num_iters_to_acc - 1): + batches.append(tuple(permute_tensor(t) for t in batch)) + for (batch1, batch2) in itertools.combinations(batches, r=2): + for t1, t2 in zip(batch1, batch2): + assert not torch.all(t1 == t2), \ + "Check the test to make sure that batches are distinct" + + # Concatenate the batches along the given batch dimension + concat_batch: Tuple[torch.Tensor, ...] = tuple( + torch.cat(ts, dim=batch_dim) for ts in zip(*batches) + ) + + # Establish reference gradients using the concatenated batch + fsdp_model.zero_grad() + output = fsdp_model(*concat_batch) + ref_loss = fsdp_model.module.get_loss(concat_batch, output) + ref_loss.backward() + ref_grads = [ + p.grad.detach().clone() for p in fsdp_model.parameters() + ] + + # Compute and accumulate the gradients + fsdp_model.zero_grad() + losses = [] + batch_idx = 0 + for config in configs: + sync_context = fsdp_model.no_sync() if config.use_no_sync \ + else contextlib.suppress() + with sync_context: + for _ in range(config.num_iters): + if batch_idx == num_iters_to_acc - 1: + break # always sync on the last iteration + batch = batches[batch_idx] + batch_idx += 1 + output = fsdp_model(*batch) + loss = fsdp_model.module.get_loss(batch, output) + loss.backward() + losses.append(loss) + output = fsdp_model(*batches[-1]) + loss = fsdp_model.module.get_loss(batches[-1], output) + loss.backward() + losses.append(loss) + acc_loss = sum(losses) + acc_grads = [ + p.grad.detach().clone() for p in fsdp_model.parameters() + ] + + # Compare the losses and gradients + torch.testing.assert_close(ref_loss, acc_loss) + self.assertEqual(len(ref_grads), len(acc_grads)) + for ref_grad, acc_grad in zip(ref_grads, acc_grads): + self.assertEqual(ref_grad.device, acc_grad.device) + self.assertEqual(ref_grad.size(), acc_grad.size()) + self.assertEqual(ref_grad.dtype, acc_grad.dtype) + torch.testing.assert_close(ref_grad, acc_grad) + + # Check that the optimizer step does not error + optim.step() + finally: + torch.backends.cuda.matmul.allow_tf32 = old_allow_tf32 + + @skip_if_lt_x_gpu(2) + @parametrize( + "configs", + [ + _GradAccConfigs([ + _GradAccConfig(use_no_sync=True, num_iters=3), + _GradAccConfig(use_no_sync=False, num_iters=3), + _GradAccConfig(use_no_sync=True, num_iters=3), + ]), + _GradAccConfigs([ + 
_GradAccConfig(use_no_sync=False, num_iters=3), + _GradAccConfig(use_no_sync=True, num_iters=3), + _GradAccConfig(use_no_sync=False, num_iters=3), + ]), + ] + ) + @parametrize( + "cpu_offload", + [CPUOffload(offload_params=False), CPUOffload(offload_params=True)], + ) + @parametrize( + "backward_prefetch", + [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None], + ) + def test_grad_acc( + self, + configs: _GradAccConfigs, + cpu_offload: CPUOffload, + backward_prefetch: Optional[BackwardPrefetch], + ): + """ + Tests gradient accumulation. + + This exercises gradient accumulation inside and outside the + ``no_sync()`` context manager, in particular by interleaving the two. + It tests both interleaving starting with (and ending with, resp.) + inside versus outside ``no_sync()`` to ensure that initial conditions + (and final conditions, resp.) do not affect the correctness. This test + also checks for compatibility with the CPU offload and backward + prefetch options. + + NOTE: Gradient accumulation without using the ``no_sync()`` context + manager is not currently compatible with CPU offloading, so those tests + are vacuous. + """ + self._test_grad_acc( + batch_dim=1, + configs=configs.configs, + cpu_offload=cpu_offload, + backward_prefetch=backward_prefetch, + ) + + +instantiate_parametrized_tests(TestGradAcc) + +if __name__ == "__main__": + run_tests() diff --git a/test/distributed/fsdp/test_fsdp_mixed_precision.py b/test/distributed/fsdp/test_fsdp_mixed_precision.py new file mode 100644 index 00000000000000..d2295a93f1c9d1 --- /dev/null +++ b/test/distributed/fsdp/test_fsdp_mixed_precision.py @@ -0,0 +1,426 @@ +# Owner(s): ["oncall: distributed"] + +import sys +import contextlib +from functools import partial +from itertools import product + +import torch +import torch.cuda.nccl as nccl +import torch.nn as nn +from torch import distributed as dist +from torch.distributed.fsdp import ( + FullyShardedDataParallel as FSDP, + CPUOffload, + MixedPrecision, + BackwardPrefetch, + ShardingStrategy, +) +from torch.testing._internal.common_distributed import skip_if_lt_x_gpu +from torch.testing._internal.common_fsdp import ( + FSDPTest, + subtest_name, +) +from torch.testing._internal.common_utils import ( + instantiate_parametrized_tests, + parametrize, + run_tests, + TEST_WITH_DEV_DBG_ASAN, +) +from torch.testing._internal.common_cuda import CUDA11OrLater + + +if not dist.is_available(): + print("Distributed not available, skipping tests", file=sys.stderr) + sys.exit(0) + +if TEST_WITH_DEV_DBG_ASAN: + print( + "Skip dev-asan as torch + multiprocessing spawn have known issues", + file=sys.stderr, + ) + sys.exit(0) + +# Various mixed precision configs to test under. +default_mp = MixedPrecision() + +nccl_supports_bf16 = ( + CUDA11OrLater and dist.is_nccl_available() and nccl.version() >= (2, 10) +) + +mp_configs = [default_mp] + +if nccl_supports_bf16: + mp_diff_reduce = MixedPrecision(reduce_dtype=torch.bfloat16) + mp_diff_buffer = MixedPrecision(buffer_dtype=torch.bfloat16) + mp_diff_buffer_and_reduce = MixedPrecision(buffer_dtype=torch.bfloat16, reduce_dtype=torch.float32) + mp_configs.extend([ + mp_diff_reduce, mp_diff_buffer, mp_diff_buffer_and_reduce, + ]) + +# Buffer original dtype, which can differ from model params. 
+buffer_orig_dtype = torch.float64 + +params = "mp_config,cpu_offload,backward_prefetch,full_precision_param_dtype" +cpu_offload_config = [ + CPUOffload(offload_params=True), CPUOffload(offload_params=False) +] +backward_prefetch_config = [ + BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST +] +full_precision_param_dtype_config = [torch.float32, torch.float64] +configs = list(product( + mp_configs, + cpu_offload_config, + backward_prefetch_config, + full_precision_param_dtype_config, +)) + +test_name_mapping = { + str(CPUOffload(offload_params=True)): "offload_true", + str(CPUOffload(offload_params=False)): "offload_false", + str(BackwardPrefetch.BACKWARD_PRE): "prefetch_pre", + str(BackwardPrefetch.BACKWARD_POST): "prefetch_post", + str(default_mp): "mp_fp16", + str(torch.float32): "fp32", + str(torch.float64): "fp64", +} + +if nccl_supports_bf16: + test_name_mapping.update({ + str(mp_diff_reduce): "mp_diff_reduce", + str(mp_diff_buffer): "mp_diff_buffer", + str(mp_diff_buffer_and_reduce): "mp_diff_buffer_reduce", + }) + +subtest_name = partial(subtest_name, test_name_mapping) + +@contextlib.contextmanager +def patch_reduce_scatter(new_reduce_scatter): + """ + Patches dist._reduce_scatter_base with a new reduce_scatter_base and + restores upon exiting. Used for validation of mixed precision + """ + orig_reduce_scatter = dist._reduce_scatter_base + dist._reduce_scatter_base = new_reduce_scatter + try: + yield + finally: + dist._reduce_scatter_base = orig_reduce_scatter + +class LinearMixedPrecision(nn.Module): + """ + A linear module with extra checks for mixed precision training. + """ + def __init__(self, param_dtype): + super().__init__() + self.lin = nn.Linear(10, 10, bias=False).to(param_dtype) + self.register_buffer('buffer', torch.randn((1, 2), dtype=buffer_orig_dtype)) + + def forward(self, tup): + # Param and input should be the mixed precision type + inp, cls, fsdp, mp_config, full_precision_param_dtype = tup + expected_param_type = mp_config.param_dtype + expected_buffer_type = mp_config.buffer_dtype + cls.assertEqual(inp.dtype, expected_param_type) + # Buffer should be in specified precision as well. + cls.assertEqual(self.buffer.dtype, expected_buffer_type) + + # In FSDP, self.params should point to the right type. + num_active_fsdp = 0 + for fsdp_module in FSDP.fsdp_modules(fsdp): + fsdp_managed_params = fsdp_module.params + # Single param assumption + cls.assertEqual(1, len(fsdp_managed_params)) + for param in fsdp_managed_params: + # FSDP unit is currently active if it is not using the param + # local shard. This supports both FULL_SHARD and SHARD_GRAD_OP + # cases. In FULL_SHARD, we have the additional property that + # param._full_param_padded has not been freed. + is_fsdp_unit_active = ( + param._is_sharded and + (param.data.data_ptr() != param._local_shard.data_ptr()) + ) + if is_fsdp_unit_active: + num_active_fsdp += 1 + # This FSDP unit is active, verify param points to mixed + cls.assertEqual(param.dtype, expected_param_type) + # _rebuild_full_param should have also freed the fp16 shard. + cls.assertEqual(0, param._mp_shard.storage().size()) + elif param._is_sharded: + # This FSDP unit is not active as full param has been + # freed or not yet allocated. Ensure param points to full + # precision param. + cls.assertEqual(param.dtype, full_precision_param_dtype) + # We should have gotten at least one active FSDP unit for sharded + # (world size > 1) cases. 
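# The "is this FSDP unit active" test above reduces to a storage-aliasing
# check: a parameter that still points at its local shard has been resharded,
# while one pointing at newly gathered memory is in use. A plain-tensor
# illustration of that data_ptr() comparison (no FSDP involved):
import torch

local_shard = torch.zeros(4)
resharded_view = local_shard      # same storage -> "not active"
gathered_full = torch.zeros(8)    # fresh storage -> "active"

assert resharded_view.data_ptr() == local_shard.data_ptr()
assert gathered_full.data_ptr() != local_shard.data_ptr()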
For cases where param is not sharded + # (ie world_size == 1) it is a bit hard to check if FSDP unit is active + # as we'd always point to the local shard, so we rely on the forward + # pass self.lin(inp) working well and inp being reduced precision to + # implicitly validate that the param is indeed in the reduced precision. + if cls.world_size > 1: + cls.assertGreater(num_active_fsdp, 0) + + return (self.lin(inp), cls, fsdp, mp_config, full_precision_param_dtype) + + +class TestFSDPMixedPrecision(FSDPTest): + @property + def world_size(self): + raise ValueError("To be implemented by child classes") + + def _get_simple_nested_model(self, param_dtype, *fsdp_args, **fsdp_kwargs): + model = FSDP( + nn.Sequential( + FSDP(LinearMixedPrecision(param_dtype).cuda(), *fsdp_args, **fsdp_kwargs), + LinearMixedPrecision(param_dtype).cuda(), + ), + *fsdp_args, + **fsdp_kwargs, + ) + return model + + def _get_simple_model(self, param_dtype, *fsdp_args, **fsdp_kwargs): + model = FSDP(LinearMixedPrecision(param_dtype).cuda(), *fsdp_args, **fsdp_kwargs) + return model + + def _validate_mp_shard_freed(self, fsdp_model): + """ + Ensures that the mixed precision shard is greed for all FSDP units. + """ + fsdp_units = FSDP.fsdp_modules(fsdp_model) + for fsdp in fsdp_units: + for param in fsdp.params: + self.assertEqual(0, param._mp_shard.storage().size()) + + def _reduce_scatter_base_validate_mp( + self, + orig_reduce_scatter, + mp_config, + *args, + **kwargs + ): + """ + Performs dist._reduce_scatter_base but verifies mixed precision settings + before. This is to test mixed precision is working as expected during + backward pass. + """ + tensors = [] + for x in args: + if isinstance(x, torch.Tensor): + tensors.append(x) + for _, x in kwargs.items(): + if isinstance(x, torch.Tensor): + tensors.append(x) + + # reduce_dtype has higher priority than param_dtype, because mixed_precision + # supports overriding param_dtype with reduce_dtype to control the + # reduction precision. In the case where reduce_dtype == param_dtype + # this tests that gradients are in the expected precision as well. + expected_dtype = mp_config.reduce_dtype + for t in tensors: + self.assertEqual(expected_dtype, t.dtype) + + return orig_reduce_scatter(*args, **kwargs) + + def _run_test_mixed_precision_e2e( + self, + mp_config, + cpu_offload, + backward_prefetch, + full_precision_param_dtype, + sharding_strategy, + ): + torch.cuda.set_device(self.rank) + fsdp_models = [ + self._get_simple_model( + param_dtype=full_precision_param_dtype, + sharding_strategy=sharding_strategy, + cpu_offload=cpu_offload, + mixed_precision=mp_config, + backward_prefetch=backward_prefetch + ), + self._get_simple_nested_model( + param_dtype=full_precision_param_dtype, + sharding_strategy=sharding_strategy, + cpu_offload=cpu_offload, + mixed_precision=mp_config, + backward_prefetch=backward_prefetch + ), + ] + for model in fsdp_models: + if not cpu_offload.offload_params: + model.cuda() + + # Patch reduce_scatter to add validation for mixed precision types. + orig_reduce_scatter = dist._reduce_scatter_base + test_reduce_scatter = partial( + self._reduce_scatter_base_validate_mp, orig_reduce_scatter, mp_config, + ) + with patch_reduce_scatter(test_reduce_scatter): + optim = torch.optim.Adam(model.parameters()) + + for _ in range(3): + inp = torch.randn(3, 10).cuda() + # Forward pass of LinearMixedPrecision check casting of + # inputs, params, buffers. + act, *_ = model( + (inp, self, model, mp_config, full_precision_param_dtype) + ) + # Buffers should be casted. 
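# _reduce_scatter_base_validate_mp above follows a generic "check, then
# delegate" pattern: inspect every tensor argument's dtype before handing the
# call to the original collective. A self-contained sketch of that wrapper
# shape using functools.partial (torch.add stands in for the collective here):
from functools import partial

import torch

def validate_dtype_then_call(orig_fn, expected_dtype, *args, **kwargs):
    for x in list(args) + list(kwargs.values()):
        if isinstance(x, torch.Tensor):
            assert x.dtype == expected_dtype, f"unexpected dtype {x.dtype}"
    return orig_fn(*args, **kwargs)

checked_add = partial(validate_dtype_then_call, torch.add, torch.float16)
out = checked_add(torch.ones(2, dtype=torch.float16),
                  torch.ones(2, dtype=torch.float16))
assert out.dtype == torch.float16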
+ for buf in model.buffers(): + self.assertEqual(buf.dtype, mp_config.buffer_dtype) + # p._mp_shard should be freed. + if model.params[0]._is_sharded: # i.e. world_size > 1 + # TODO: free the mixed precision shard after forward + # when world_size == 1 as well, currently when + # world_size == 1 it is only freed after backward. + self._validate_mp_shard_freed(model) + + loss = act.sum() + self.assertEqual(loss.dtype, mp_config.param_dtype) + # Will run patched reduce scatter that validates mixed_precision + # types in backward. + loss.backward() + # Buffers stay casted even after backwards. + for buf in model.buffers(): + self.assertEqual(buf.dtype, mp_config.buffer_dtype) + # p._mp_shard should be freed. + self._validate_mp_shard_freed(model) + + # Ensure params and grads are in full precision + for param in model.parameters(): + self.assertEqual(param.dtype, full_precision_param_dtype) + if param.grad is not None: + self.assertEqual(param.grad.dtype, full_precision_param_dtype) + + optim.step() + + # Summon full params should be in full precision + with model.summon_full_params(): + # It is not expected for summon_full_params to allocate + # a mixed precision shard. + self._validate_mp_shard_freed(model) + params = list(model.parameters()) + for p in params: + self.assertEqual(p.dtype, full_precision_param_dtype) + + # Note that buffers are cast only once and only restored + # to the original buffer dtype in state_dict, so + # summon_full_params is not expected to restore buffer + # types to their original. + named_buffers = dict(model.named_buffers()) + for k, v in named_buffers.items(): + self.assertEqual(v.dtype, mp_config.buffer_dtype) + + # state_dict should be in full precision + state_dict = {k: v.clone() for k, v in model.state_dict().items()} + for name, tensor in state_dict.items(): + # Parameters and buffers are checkpointed in their + # original dtypes, which may be different. + if name in named_buffers.keys(): + self.assertEqual(tensor.dtype, buffer_orig_dtype) + else: + self.assertEqual( + tensor.dtype, full_precision_param_dtype, + f"{name}: {tensor.dtype} vs {full_precision_param_dtype}" + ) + + # After state_dict, buffer's dtype should have been restored + # to the mixed precision one. + for buf in model.buffers(): + self.assertEqual(buf.dtype, mp_config.buffer_dtype) + + +class TestFSDPMixedPrecisionSharded(TestFSDPMixedPrecision): + + @property + def world_size(self): + return 2 + + @skip_if_lt_x_gpu(2) + def test_mixed_precision_no_reshard_after_forward(self): + # Note that we don't exercise all possible different configs so as to + # not increase test TTS too much. + mp = default_mp if not nccl_supports_bf16 else mp_diff_buffer_and_reduce + self._run_test_mixed_precision_e2e( + mp_config=mp, + cpu_offload=CPUOffload(offload_params=True), + backward_prefetch=None, + full_precision_param_dtype=torch.float64, + sharding_strategy=ShardingStrategy.SHARD_GRAD_OP, + ) + + @skip_if_lt_x_gpu(2) + @parametrize(params, configs, subtest_name) + def test_mixed_precision_e2e_full_shard( + self, + mp_config, + cpu_offload, + backward_prefetch, + full_precision_param_dtype + ): + self._run_test_mixed_precision_e2e( + mp_config, + cpu_offload, + backward_prefetch, + full_precision_param_dtype, + ShardingStrategy.FULL_SHARD, + ) + + @skip_if_lt_x_gpu(2) + def test_mixed_precision_embedding_table(self): + # Basic test to ensure int inputs are not casted which would break + # modules such as embedding tables. 
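# Illustrative sketch, not part of the diff: why integer inputs must be left
# uncast. nn.Embedding indexes with integer tensors, so casting every input to
# the low-precision param_dtype would break embedding lookups.
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=100, embedding_dim=8)
idx = torch.randint(0, 100, (4,))   # int64 indices, as an embedding expects
out = emb(idx)                      # works: indices keep their integer dtype
# emb(idx.to(torch.float16))        # would raise: embedding indices must be integral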
+        mp_config = MixedPrecision()
+        model = self._get_wrapped_model(
+            group=torch.distributed.distributed_c10d._get_default_group(),
+            config={"mixed_precision": mp_config}
+        )
+        optim = torch.optim.SGD(model.parameters(), lr=0.1)
+        for _ in range(6):
+            inp = model.module.get_input(torch.device("cuda"))
+            # This would fail if we cast integer module inputs such as for
+            # embedding tables.
+            output = model(*inp)
+            loss = model.module.get_loss(inp, output).cuda()
+            self.assertEqual(loss.dtype, mp_config.param_dtype)
+            model.module.run_backward(loss)
+            optim.step()
+
+class TestFSDPMixedPrecisionUnsharded(TestFSDPMixedPrecision):
+    """
+    Smaller test suite for unsharded param (i.e. world_size == 1) case.
+    """
+    @property
+    def world_size(self):
+        return 1
+
+    @skip_if_lt_x_gpu(1)
+    def test_mixed_precision_no_reshard_after_forward(self):
+        # Note that we don't exercise all possible different configs so as to
+        # not increase test TTS too much.
+        mp = default_mp if not nccl_supports_bf16 else mp_diff_buffer_and_reduce
+        self._run_test_mixed_precision_e2e(
+            mp_config=mp,
+            cpu_offload=CPUOffload(offload_params=True),
+            backward_prefetch=None,
+            full_precision_param_dtype=torch.float64,
+            sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
+        )
+
+    @skip_if_lt_x_gpu(1)
+    def test_mixed_precision_e2e_full_shard(self):
+        mp = default_mp if not nccl_supports_bf16 else mp_diff_buffer_and_reduce
+        self._run_test_mixed_precision_e2e(
+            mp_config=mp,
+            cpu_offload=CPUOffload(offload_params=True),
+            backward_prefetch=None,
+            full_precision_param_dtype=torch.float64,
+            sharding_strategy=ShardingStrategy.FULL_SHARD,
+        )
+
+instantiate_parametrized_tests(TestFSDPMixedPrecisionSharded)
+
+if __name__ == "__main__":
+    run_tests()
diff --git a/test/distributed/fsdp/test_fsdp_no_sync.py b/test/distributed/fsdp/test_fsdp_no_sync.py
deleted file mode 100644
index 1016de3fe6af0c..00000000000000
--- a/test/distributed/fsdp/test_fsdp_no_sync.py
+++ /dev/null
@@ -1,166 +0,0 @@
-# Owner(s): ["oncall: distributed"]
-
-import itertools
-import sys
-from typing import List, Optional, Tuple
-
-import torch
-from torch import distributed as dist
-from torch.distributed.fsdp import CPUOffload
-from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
-from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch
-from torch.testing._internal.common_distributed import skip_if_lt_x_gpu
-from torch.testing._internal.common_fsdp import FSDPTest
-from torch.testing._internal.common_utils import (
-    TEST_WITH_DEV_DBG_ASAN,
-    instantiate_parametrized_tests,
-    parametrize,
-    run_tests,
-)
-
-if not dist.is_available():
-    print("Distributed not available, skipping tests", file=sys.stderr)
-    sys.exit(0)
-
-if TEST_WITH_DEV_DBG_ASAN:
-    print(
-        "Skip dev-asan as torch + multiprocessing spawn have known issues",
-        file=sys.stderr,
-    )
-    sys.exit(0)
-
-
-class TestNoSync(FSDPTest):
-    """Tests ``FullyShardedDataParallel``'s gradient accumulation via its
-    ``no_sync()`` context manager."""
-
-    def _test_no_sync(
-        self,
-        batch_dim: int,
-        num_iters_to_acc: int,
-        cpu_offload: CPUOffload,
-        backward_prefetch: Optional[BackwardPrefetch],
-    ):
-        """
-        Tests ``no_sync()`` by comparing a run that trains sequentially through
-        some batches while accumulating gradients with a run that trains on the
-        concatenation of those batches in a single iteration. The number of
-        batches, i.e. the number of iterations for which to accumulate
-        gradients, is given by ``num_iters_to_acc``.
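# Illustrative sketch, not part of the diff: the gradient-accumulation pattern
# this test compares against a single concatenated batch. All but the last
# microbatch run under no_sync(), so gradients accumulate locally; the final
# backward outside the context triggers the usual gradient synchronization.
# The loss computation here is a placeholder.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def accumulate_and_step(fsdp_model: FSDP, optim, microbatches):
    optim.zero_grad()
    with fsdp_model.no_sync():
        for batch in microbatches[:-1]:
            fsdp_model(batch).sum().backward()     # accumulate, no communication
    fsdp_model(microbatches[-1]).sum().backward()  # gradients synchronized here
    optim.step()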
- - Arguments: - batch_dim (int): Batch dimension in the input tensor to be passed - into the model for the forward pass. - num_iters_to_acc (int): Number of iterations for which to - accumulate gradients; all but the last iteration are run using - the ``no_sync()`` context manager so that gradients are not - synchronized until the final iteration. - cpu_offload (CPUOffload): Configures CPU offloading. - backward_prefetch (Optional[BackwardPrefetch]): Specifies at which - point to prefetch the next layer's full parameters during the - backward pass, if at all. - """ - old_allow_tf32 = torch.backends.cuda.matmul.allow_tf32 - try: - # Disable TF32 to prevent floating point drift - torch.backends.cuda.matmul.allow_tf32 = False - - # Initialize the FSDP model and optimizer - group = dist.distributed_c10d._get_default_group() - fsdp_model: FSDP = self._get_wrapped_model( - group, cuda_first=False, add_bn=False, - cpu_offload=cpu_offload, backward_prefetch=backward_prefetch, - ) # disable BN since the test uses varying batch sizes - fsdp_model.eval() # disable dropout - device = torch.device("cuda") - optim = torch.optim.SGD(fsdp_model.parameters(), lr=0.01, momentum=0.9) - - # Generate the sequence of batches, each containing the same data but - # permuted - def permute_tensor(x: torch.Tensor): - return x.view(-1)[torch.randperm(x.numel())].view_as(x) - - batch: Tuple[torch.Tensor, ...] = fsdp_model.module.get_input(device) - batches: List[Tuple[torch.Tensor, ...]] = [batch] - for _ in range(num_iters_to_acc - 1): - batches.append(tuple(permute_tensor(t) for t in batch)) - for (batch1, batch2) in itertools.combinations(batches, r=2): - for t1, t2 in zip(batch1, batch2): - assert not torch.all(t1 == t2) - - # Concatenate the batches along the given batch dimension - concat_batch: Tuple[torch.Tensor, ...] 
= tuple( - torch.cat(ts, dim=batch_dim) for ts in zip(*batches) - ) - - # Establish reference gradients using the concatenated batch - fsdp_model.zero_grad() - output = fsdp_model(*concat_batch) - ref_loss = fsdp_model.module.get_loss(concat_batch, output) - ref_loss.backward() - ref_grads = [p.grad.detach().clone() for p in fsdp_model.parameters()] - - # Compute the gradients by accumulating via `no_sync()` - fsdp_model.zero_grad() - losses = [] - with fsdp_model.no_sync(): - for batch in batches[:-1]: # accumulate for all but the last batch - output = fsdp_model(*batch) - loss = fsdp_model.module.get_loss(batch, output) - loss.backward() - losses.append(loss) - output = fsdp_model(*batches[-1]) - loss = fsdp_model.module.get_loss(batches[-1], output) - loss.backward() - losses.append(loss) - acc_loss = sum(losses) - acc_grads = [p.grad.detach().clone() for p in fsdp_model.parameters()] - - # Compare the losses and gradients - torch.testing.assert_allclose(ref_loss, acc_loss) - assert len(ref_grads) == len(acc_grads) - for ref_grad, acc_grad in zip(ref_grads, acc_grads): - assert ref_grad.device == acc_grad.device - assert ref_grad.size() == acc_grad.size() - assert ref_grad.dtype == acc_grad.dtype - torch.testing.assert_allclose(ref_grad, acc_grad) - - # Check that the optimizer step does not error - optim.step() - finally: - torch.backends.cuda.matmul.allow_tf32 = old_allow_tf32 - - @skip_if_lt_x_gpu(2) - @parametrize( - "num_iters_to_acc", - [2, 4], - ) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=False), CPUOffload(offload_params=True)], - ) - @parametrize( - "backward_prefetch", - [BackwardPrefetch.BACKWARD_PRE, BackwardPrefetch.BACKWARD_POST, None] - ) - def test_no_sync( - self, - num_iters_to_acc: int, - cpu_offload: CPUOffload, - backward_prefetch: Optional[BackwardPrefetch], - ): - """Tests the ``no_sync()`` context manager.""" - assert num_iters_to_acc >= 2, \ - "Accumulate for at least 2 iterations to be nontrivial" - self._test_no_sync( - batch_dim=1, - num_iters_to_acc=num_iters_to_acc, - cpu_offload=cpu_offload, - backward_prefetch=backward_prefetch, - ) - - -instantiate_parametrized_tests(TestNoSync) - -if __name__ == "__main__": - run_tests() diff --git a/test/distributed/fsdp/test_fsdp_optim_state.py b/test/distributed/fsdp/test_fsdp_optim_state.py new file mode 100644 index 00000000000000..cfe22062d356e5 --- /dev/null +++ b/test/distributed/fsdp/test_fsdp_optim_state.py @@ -0,0 +1,591 @@ +# Owner(s): ["oncall: distributed"] + +import sys +from typing import Any, Dict, List, Type + +import torch +from torch import distributed as dist +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp.fully_sharded_data_parallel import ( + OptimStateKeyType, +) +from torch.testing._internal.common_distributed import skip_if_lt_x_gpu +from torch.testing._internal.common_fsdp import FSDPTest +from torch.testing._internal.common_utils import ( + TEST_WITH_DEV_DBG_ASAN, + instantiate_parametrized_tests, + parametrize, + run_tests, +) + +if not dist.is_available(): + print("Distributed not available, skipping tests", file=sys.stderr) + sys.exit(0) + +if TEST_WITH_DEV_DBG_ASAN: + print( + "Skip dev-asan as torch + multiprocessing spawn have known issues", + file=sys.stderr, + ) + sys.exit(0) + + +class Bias(torch.nn.Module): + """This module applies a 1D additive bias with dimension ``dim``.""" + def __init__(self, dim: int) -> None: + super().__init__() + assert dim > 0 + torch.manual_seed(0) + self.bias = 
torch.nn.Parameter(torch.randn((dim,))) + + def forward(self, x): + return x + self.bias + + +class BlockA(torch.nn.Module): + """ + Used to define interesting nested structure for FSDP wrapping. + BlockA + Bias0 + bias + weight + Bias1 + bias + """ + def __init__(self, in_dim: int, out_dim: int) -> None: + super().__init__() + assert all(v > 0 for v in (in_dim, out_dim)) + torch.manual_seed(0) + self.bias_module0 = Bias(out_dim) + self.weight = torch.nn.Parameter(torch.randn((in_dim, out_dim))) + self.bias_module1 = Bias(out_dim) + self.relu = torch.nn.ReLU() + + def forward(self, x): + x = x @ self.weight + x = self.bias_module0(x) + x = self.relu(x) # ensure biases have different gradients + x = self.bias_module1(x) + return x + +class BlockB(torch.nn.Module): + """ + Used to define interesting nested structure for FSDP wrapping. + BlockB + weight + Bias + bias + Bias + bias + """ + def __init__(self, in_dim: int, out_dim: int) -> None: + super().__init__() + assert all(v > 0 for v in (in_dim, out_dim)) + torch.manual_seed(0) + self.weight = torch.nn.Parameter(torch.randn((in_dim, out_dim))) + self.bias_module0 = Bias(out_dim) + self.bias_module1 = Bias(out_dim) + self.relu = torch.nn.ReLU() + + def forward(self, x): + x = x @ self.weight + x = self.bias_module0(x) + x = self.relu(x) # ensure biases have different gradients + x = self.bias_module1(x) + return x + + +class NestedModel(torch.nn.Module): + def __init__(self) -> None: + super().__init__() + self.block0 = BlockB(5, 7) + self.block1 = BlockB(7, 7) + self.bias = torch.nn.Parameter(torch.randn((5,))) + self.block2 = torch.nn.Sequential( + BlockA(7, 9), + BlockA(9, 9), + BlockB(9, 5), + ) + self.relu = torch.nn.ReLU() + + def forward(self, x) -> torch.Tensor: + x = self.relu(self.block0(x)) + x = self.relu(self.block1(x)) + x = self.relu(self.block2(x)) + x = x + self.bias + return x + + def get_input(self, device): + BATCH_SIZE = 8 + return (torch.randn((BATCH_SIZE, 5)).to(device),) + + def get_loss(self, inp, output): + return output.sum() + + def run_backward(self, loss): + loss.backward() + + @staticmethod + def wrap(model, group=None) -> torch.nn.Module: + # Flatten Bias0; then flatten weight and Bias1 together into `block1` + model.block1.bias_module0 = FSDP( + model.block1.bias_module0, process_group=group, + ) + model.block1 = FSDP(model.block1, process_group=group) + # Flatten Bias0; flatten Bias1; then flatten weight into `block2[1]` + model.block2[1].bias_module0 = FSDP( + model.block2[1].bias_module0, process_group=group, + ) + model.block2[1].bias_module1 = FSDP( + model.block2[1].bias_module1, process_group=group, + ) + model.block2[1] = FSDP(model.block2[1], process_group=group) + # Flatten weight, Bias, bias into `block2[2]` + model.block2[2] = FSDP(model.block2[2], process_group=group) + return model + + @staticmethod + def wrap_alt(model, group=None) -> torch.nn.Module: + model.block0.bias_module0 = FSDP( + model.block0.bias_module0, process_group=group, + ) + model.block0 = FSDP(model.block0, process_group=group) + return model + + # NOTE: We exclude `self.bias` from either parameter group to test the + # case where the optimizer input does not include all model parameters + def param_group0(self) -> List[torch.nn.Parameter]: + # Use `block1`'s parameters for the first parameter group to deviate + # from the `model.parameters()` order + return list(self.block1.parameters()) + + def param_group1(self) -> List[torch.nn.Parameter]: + # Deviate from the `model.parameters()` order further by rearranging + # 
`block2`'s parameters to be before `block0`'s parameters + return list(self.block2.parameters()) + \ + list(self.block0.parameters()) + + +class TestFSDPOptimState(FSDPTest): + def _init_nested_model( + self, + wrap: bool, + wrap_alt: bool = False, # ignored if `wrap=False` + device: torch.device = torch.device("cuda"), + group=None, + optim_class: Type[torch.optim.Optimizer] = torch.optim.Adam, + use_multiple_param_groups: bool = False, + ): + model = NestedModel().to(device) + if wrap: + model = NestedModel.wrap_alt(model, group) if wrap_alt \ + else NestedModel.wrap(model, group) + if not use_multiple_param_groups: + optim_input = list(model.parameters()) + else: + optim_input = [ + {"params": model.param_group0()}, + {"params": model.param_group1(), "weight_decay": 0.9} + ] + optim = optim_class(optim_input, lr=0.01) + return model, optim, optim_input + + def _init_transformer_model( + self, + wrap: bool, + device: torch.device = torch.device("cuda"), + group=None, + optim_class: Type[torch.optim.Optimizer] = torch.optim.Adam, + use_multiple_param_groups: bool = False, + ): + assert not use_multiple_param_groups, \ + "Multiple parameter groups for the transformer is not implemented" + if group is None: + group = dist.distributed_c10d._get_default_group() + model = self._get_wrapped_model(group=group).to(device) if wrap \ + else self._get_nonwrapped_model(group=group).to(device) + model.eval() # disable dropout for determinism + optim = optim_class(model.parameters(), lr=0.01) + return model, optim, None + + def _step_model( + self, + model: torch.nn.Module, + optim: torch.optim.Optimizer, + device: torch.device = torch.device("cuda"), + num_iters: int = 1, + ) -> List[float]: + """Performs a forward pass, backward pass, and optimizer step + ``num_iters``-many times, and returns the per-iteration losses.""" + torch.manual_seed(0) # set seed for determinism + losses = [] + module = model.module if hasattr(model, "module") else model + for _ in range(num_iters): + inp = module.get_input(device) + output = model(*inp) + loss = module.get_loss(inp, output).to(device) + losses.append(loss.item()) + module.run_backward(loss) + optim.step() + return losses + + def _broadcast_full_osd(self, full_osd: Dict[str, Any], group=None): + """Broadcasts the full optimizer state dict in place of using + ``torch.save()`` and ``torch.load()`` so that all ranks can have it.""" + obj_list = [full_osd] + dist.broadcast_object_list( + obj_list, src=0, group=group, + ) + full_osd = obj_list[0] + return full_osd + + def _are_equal_states( + self, + state1: Dict[str, Any], + state2: Dict[str, Any], + ) -> bool: + """Checks if ``state1`` and ``state2`` contain the same mappings.""" + if set(state1.keys()) != set(state2.keys()): + return False + for state_name, value1 in state1.items(): + value2 = state2[state_name] + if type(value1) != type(value2): + return False + if torch.is_tensor(value1): # tensor state + assert torch.is_tensor(value2) + # Check the values on CPU to be device-agnostic + value1 = value1.cpu() + value2 = value2.cpu() + if value1.shape != value2.shape or \ + not torch.all(torch.isclose(value1, value2)): + return False + else: # non-tensor state + if value1 != value2: + return False + return True + + def _check_same_state( + self, + full_osd, + ref_osd, + check_same_param_keys: bool, + ): + """Checks that ``full_osd`` and ``ref_osd`` have the same "state" part. + If ``check_same_param_keys=True``, then checks that the parameter keys + match (e.g. 
when both should be parameter names), and does not check + the parameter keys otherwise.""" + assert "state" in ref_osd + self.assertTrue("state" in full_osd) + ref_osd_state = ref_osd["state"] + full_osd_state = full_osd["state"] + if check_same_param_keys: + # Check parameter keys are the same + ref_osd_param_ids = set(ref_osd_state.keys()) + full_osd_param_ids = set(full_osd_state.keys()) + self.assertTrue(ref_osd_param_ids == full_osd_param_ids) + for param_id, param_state in full_osd_state.items(): + for state_name, value in param_state.items(): + ref_value = ref_osd_state[param_id][state_name] + self.assertEqual(value, ref_value) + return + # Otherwise, only require the parameter keys to be isomorphic (e.g. + # between IDs and names) + ref_osd_states = list(ref_osd["state"].values()) + full_osd_states = list(full_osd["state"].values()) + assert len(ref_osd_states) == len(full_osd_states) + # Use brute-force quadratic-time comparison since it is hard to + # hash a tensor by value instead of by object + for full_osd_state in full_osd_states: + # Check for at least one match (may be > 1 in toy edge cases, e.g. + # multiple biases); nonetheless, each having >= 1 match and the two + # lists having equal length imply that the list contents are equal + self.assertTrue(any( + self._are_equal_states(full_osd_state, ref_osd_state) + for ref_osd_state in ref_osd_states + )) + + def _check_same_param_groups( + self, + full_osd, + ref_osd, + check_same_param_keys: bool, + ): + """Checks that ``full_osd`` and ``ref_osd`` have the same + "param_groups" part. If ``check_same_param_keys=True`, then checks that + the parameter keys match (e.g. when both should be parameter names), + and does not check the parameter keys otherwise.""" + assert "param_groups" in ref_osd + self.assertTrue("param_groups" in full_osd) + ref_osd_param_groups = ref_osd["param_groups"] + full_osd_param_groups = full_osd["param_groups"] + self.assertTrue(len(full_osd_param_groups), len(ref_osd_param_groups)) + if self.rank == 0: + for full_osd_pg, ref_osd_pg in zip( + full_osd_param_groups, ref_osd_param_groups, + ): + self.assertEqual( + set(full_osd_pg.keys()), set(ref_osd_pg.keys()), + ) + for name, full_osd_value in full_osd_pg.items(): + if name == "params" and not check_same_param_keys: + continue + self.assertEqual(full_osd_value, ref_osd_pg[name]) + + @skip_if_lt_x_gpu(2) + @parametrize("use_multiple_param_groups", [False, True]) + @parametrize("rank0_only", [False, True]) + def test_full_optim_state_dict_nested( + self, + use_multiple_param_groups: bool, + rank0_only: bool, + ) -> None: + """ + Tests :meth:`full_optim_state_dict` by comparing the returned dict for + an FSDP-wrapped model with that of an equivalent non-wrapped model. + + The parameter groups in the "param_groups" part and the values in the + "state" part should be the same, but the parameter keys may be + different (e.g. the full optimizer state dict uses parameter names + while the non-wrapped equivalent uses parameter IDs). 
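# Illustrative sketch, not part of the diff: the save side of the flow this test
# exercises. full_optim_state_dict() consolidates the sharded optimizer state
# into a single name-keyed dict; the checkpoint path is a placeholder.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def save_optim_checkpoint(model: FSDP, optim: torch.optim.Optimizer, rank: int) -> None:
    full_osd = FSDP.full_optim_state_dict(model, optim)  # gathered across ranks
    if rank == 0:
        torch.save(full_osd, "optim_ckpt.pt")  # placeholder path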
+ """ + NUM_ITERS = 3 + model1, optim1, optim_input = self._init_nested_model( + wrap=True, use_multiple_param_groups=use_multiple_param_groups, + ) + losses1 = self._step_model(model1, optim1, num_iters=NUM_ITERS) + full_osd = FSDP.full_optim_state_dict( + model1, optim1, optim_input, rank0_only=rank0_only, + ) + # Non-target ranks get an empty state dict + if rank0_only and self.rank != 0: + self.assertEqual(len(full_osd), 0) + return + model2, optim2, _ = self._init_nested_model( + wrap=False, use_multiple_param_groups=use_multiple_param_groups, + ) + losses2 = self._step_model(model2, optim2, num_iters=NUM_ITERS) + ref_osd = optim2.state_dict() + # Check the losses to eliminate model drift as a source of error + for i, (l1, l2) in enumerate(zip(losses1, losses2)): + assert l1 == l2, f"Losses differ on iter {i}: {l1:.5f} {l2:.5f}" + # Do not check the parameter keys since the full optimizer state dict + # uses parameter names, while the non-wrapped equivalent uses parameter + # IDs + check_same_param_keys = False + self._check_same_param_groups( + full_osd, ref_osd, check_same_param_keys=check_same_param_keys, + ) + self._check_same_state( + full_osd, ref_osd, check_same_param_keys=check_same_param_keys, + ) + + # Require 4 GPUs since we test halving the world size + @skip_if_lt_x_gpu(4) + @parametrize("use_multiple_param_groups", [False, True]) + @parametrize("wrap_alt", [False, True]) + @parametrize("halve_world_size", [False, True]) + def test_shard_full_optim_state_dict_nested( + self, + use_multiple_param_groups: bool, + wrap_alt: bool, + halve_world_size: bool, + ): + """Tests :meth:`shard_full_optim_state_dict` for a non-FSDP-root model + with nested FSDP instances.""" + self._test_shard_full_optim_state( + model_class="nested", + use_multiple_param_groups=use_multiple_param_groups, + halve_world_size=halve_world_size, + wrap_alt=wrap_alt, + ) + + # Require 4 GPUs since we test halving the world size + @skip_if_lt_x_gpu(4) + def test_shard_full_optim_state_dict_transformer(self) -> None: + """Tests :meth:`shard_full_optim_state_dict` for an FSDP-root + transformer model with shared parameters.""" + self._test_shard_full_optim_state( + model_class="transformer", use_multiple_param_groups=False, + halve_world_size=True, + ) + + def _test_shard_full_optim_state( + self, + model_class: str, + use_multiple_param_groups: bool, + halve_world_size: bool, + **new_model_kwargs, + ): + """ + (1) Runs a model with full world size for K iterations to generate a + full optimizer state dict; + (2) initializes a model with halved world size and possibly different + FSDP wrapping scheme (based on ``new_model_kwargs``); + (3) shards the full optimizer state dict from (1) according to the + halved-world-size model; + (4) runs the halved-world-size model for K iterations; and + (5) checks that the sharded optimizer state dict from (3) matches the + halved-world-size model's local optimizer state dict, meaning that the + former could have equivalently been loaded into the local optimizer. 
+ """ + NUM_ITERS = 3 + initializer = self._init_nested_model if model_class == "nested" \ + else self._init_transformer_model if model_class == "transformer" \ + else None + assert initializer is not None, f"Unsupported model: {model_class}" + # Run a wrapped model with full world size for a few iterations + model1, optim1, optim_input1 = initializer( + wrap=True, use_multiple_param_groups=use_multiple_param_groups, + ) + self._step_model(model1, optim1, num_iters=NUM_ITERS) + full_osd1 = FSDP.full_optim_state_dict(model1, optim1, optim_input1) + # Broadcast instead of `torch.save()`/`torch.load()` so that all ranks + # have the full state dict + full_osd1 = self._broadcast_full_osd(full_osd1) + if halve_world_size: + # Create a new process group with halved world size + new_group_ranks = [r for r in range(self.world_size) if r % 2 == 0] + new_group = dist.new_group(ranks=new_group_ranks) + if self.rank not in new_group_ranks: + return + else: + new_group = dist.distributed_c10d._get_default_group() + # Run a wrapped model with halved world size (from scratch) + model2, optim2, optim_input2 = initializer( + wrap=True, group=new_group, + use_multiple_param_groups=use_multiple_param_groups, + **new_model_kwargs, # specify `wrap_alt` to change wrapping + ) + self._step_model(model2, optim2, num_iters=NUM_ITERS) + full_osd2 = FSDP.full_optim_state_dict(model2, optim2, optim_input2) + full_osd2 = self._broadcast_full_osd(full_osd2, group=new_group) + # As a sanity check, check that sharding the halved-world-size model's + # full optimizer state dict according to itself is equivalent to its + # local optimizer's state dict + local_osd2 = optim2.state_dict() + sharded_osd2 = FSDP.shard_full_optim_state_dict( + full_osd2, model2, optim_input2, + ) + check_same_param_keys = True # should all have matching parameter IDs + self._check_same_param_groups( + sharded_osd2, local_osd2, + check_same_param_keys=check_same_param_keys, + ) + self._check_same_state( + sharded_osd2, local_osd2, + check_same_param_keys=check_same_param_keys, + ) + # Check that sharding the full-world-size model's full optimizer state + # dict according to the halved-world-size model is equivalent to the + # halved-world-size model's local optimizer state dict + sharded_osd1 = FSDP.shard_full_optim_state_dict( + full_osd1, model2, optim_input2, + ) + self._check_same_param_groups( + sharded_osd1, local_osd2, + check_same_param_keys=check_same_param_keys, + ) + self._check_same_state( + sharded_osd1, local_osd2, + check_same_param_keys=check_same_param_keys, + ) + # As a sanity check, check that we can load and run a few iterations + optim2.load_state_dict(sharded_osd1) + self._step_model(model2, optim2, num_iters=NUM_ITERS) + + @skip_if_lt_x_gpu(2) + @parametrize("use_multiple_param_groups", [False, True]) + def test_rekey_optim_state_dict_to_ids( + self, + use_multiple_param_groups: bool, + ): + """Tests :meth:`rekey_optim_state_dict` with the new keys being + parameter IDs by checking that a wrapped model (i.e. with FSDP modules) + can rekey its optimizer state dict to match that of an equivalent + non-wrapped model (i.e. 
without FSDP modules).""" + NUM_ITERS = 3 + # Run a wrapped model for a few iterations + model1, optim1, optim_input1 = self._init_nested_model( + wrap=True, use_multiple_param_groups=use_multiple_param_groups, + ) + self._step_model(model1, optim1, num_iters=NUM_ITERS) + full_osd = FSDP.full_optim_state_dict(model1, optim1, optim_input1) + # Broadcast instead of `torch.save()`/`torch.load()` so that all ranks + # have the full state dict + full_osd = self._broadcast_full_osd(full_osd) + # Run a non-wrapped model for a few iterations + model2, optim2, optim_input2 = self._init_nested_model( + wrap=False, use_multiple_param_groups=use_multiple_param_groups, + ) + self._step_model(model2, optim2, num_iters=NUM_ITERS) + # Re-key the wrapped model's optimizer state dict using parameter IDs + # according to the non-wrapped model + rekeyed_osd = FSDP.rekey_optim_state_dict( + full_osd, OptimStateKeyType.PARAM_ID, model2, optim_input2, + ) + # Check that the re-keyed dict and actual dict are the same + osd = optim2.state_dict() + check_same_param_keys = True + self._check_same_param_groups( + rekeyed_osd, osd, check_same_param_keys=check_same_param_keys, + ) + self._check_same_state( + rekeyed_osd, osd, check_same_param_keys=check_same_param_keys, + ) + # As a sanity check, check that we can load and run a few iterations + optim2.load_state_dict(rekeyed_osd) + self._step_model(model2, optim2, num_iters=NUM_ITERS) + + @skip_if_lt_x_gpu(2) + @parametrize("use_multiple_param_groups", [False]) + def test_rekey_optim_state_dict_to_names( + self, + use_multiple_param_groups: bool, + ): + """Tests :meth:`rekey_optim_state_dict` with the new keys being + parameter names by checking that a non-wrapped model (i.e. without FSDP + modules) can rekey its optimizer state dict to match the expected + output of :meth:`full_optim_state_dict`, hence be sharded using + :meth:`shard_full_optim_state_dict`, and finally match the per-rank + optimizer state dict of a wrapped model (i.e. 
with FSDP modules).""" + NUM_ITERS = 3 + # Run a wrapped model for a few iterations + model1, optim1, optim_input1 = self._init_nested_model( + wrap=True, use_multiple_param_groups=use_multiple_param_groups, + ) + self._step_model(model1, optim1, num_iters=NUM_ITERS) + # Run a non-wrapped model for a few iterations + model2, optim2, optim_input2 = self._init_nested_model( + wrap=False, use_multiple_param_groups=use_multiple_param_groups, + ) + self._step_model(model2, optim2, num_iters=NUM_ITERS) + # Re-key the non-wrapped model's optimizer state dict using parameter + # names (still according to itself) + osd2 = optim2.state_dict() + rekeyed_osd = FSDP.rekey_optim_state_dict( + osd2, OptimStateKeyType.PARAM_NAME, model2, optim_input2, + ) + # Shard the non-wrapped model's re-keyed optimizer state dict, which + # maps back to (flattened) parameter IDs + sharded_osd = FSDP.shard_full_optim_state_dict( + rekeyed_osd, model1, optim_input1, + ) + # Check that this sharded optimizer state dict matches the wrapped + # model's per-rank optimizer state dict + osd1 = optim1.state_dict() + check_same_param_keys = True + self._check_same_param_groups( + sharded_osd, osd1, check_same_param_keys=check_same_param_keys, + ) + self._check_same_state( + sharded_osd, osd1, check_same_param_keys=check_same_param_keys, + ) + # As a sanity check, check that we can load and run a few iterations + optim1.load_state_dict(sharded_osd) + self._step_model(model1, optim1, num_iters=NUM_ITERS) + + +instantiate_parametrized_tests(TestFSDPOptimState) + +if __name__ == "__main__": + run_tests() diff --git a/test/distributed/fsdp/test_fsdp_state_dict.py b/test/distributed/fsdp/test_fsdp_state_dict.py index 86734a1c794754..bd854155620b2e 100644 --- a/test/distributed/fsdp/test_fsdp_state_dict.py +++ b/test/distributed/fsdp/test_fsdp_state_dict.py @@ -1,6 +1,7 @@ # Owner(s): ["oncall: distributed"] import sys +from contextlib import suppress from copy import deepcopy from functools import partial from typing import Any, Dict @@ -10,8 +11,10 @@ from torch.distributed.fsdp import ( FullyShardedDataParallel as FSDP, StateDictType, - CPUOffload + CPUOffload, + MixedPrecision, ) +from torch.distributed.fsdp.wrap import enable_wrap, wrap from torch.nn import Linear, Module import torch.nn as nn from torch.nn.parallel import DistributedDataParallel @@ -21,8 +24,9 @@ FSDPTest, get_full_params, _get_full_detached_param, - _zero_model, _get_state_dict, + SkipModel, + _zero_model, ) from torch.testing._internal.common_utils import ( instantiate_parametrized_tests, @@ -78,8 +82,8 @@ def world_size(self): def _get_simple_nested_model(self, *fsdp_args, **fsdp_kwargs): model = FSDP( nn.Sequential( - FSDP(nn.Linear(10, 10, bias=False), *fsdp_args, **fsdp_kwargs), - nn.Linear(10, 10, bias=False), + FSDP(nn.Linear(10, 10, bias=False).cuda(), *fsdp_args, **fsdp_kwargs), + nn.Linear(10, 10, bias=False).cuda(), ), *fsdp_args, **fsdp_kwargs, @@ -87,7 +91,7 @@ def _get_simple_nested_model(self, *fsdp_args, **fsdp_kwargs): return model def _get_simple_model(self, *fsdp_args, **fsdp_kwargs): - model = FSDP(nn.Linear(10, 10, bias=False), *fsdp_args, **fsdp_kwargs) + model = FSDP(nn.Linear(10, 10, bias=False).cuda(), *fsdp_args, **fsdp_kwargs) return model @skip_if_lt_x_gpu(2) @@ -139,20 +143,24 @@ def test_basic_save_and_load_state_dict(self, cpu_offload, fp16): self.assertEqual(tensor.dtype, torch.float16) @skip_if_lt_x_gpu(2) - def test_save_and_load_after_forward_state_dict(self): + @parametrize("mixed_precision", [True, False]) + def 
test_save_and_load_after_forward_state_dict(self, mixed_precision): """ Test that saving after some training results in params being updated as expected. """ torch.cuda.set_device(self.rank) - model = self._get_wrapped_model(group=torch.distributed.distributed_c10d._get_default_group()) + mixed_precision = MixedPrecision() if mixed_precision else None + model = self._get_simple_nested_model(mixed_precision=mixed_precision) optim = torch.optim.SGD(model.parameters(), lr=0.1) initial_params = _get_full_detached_param(model) for _ in range(6): - inp = model.module.get_input(torch.device("cuda")) + inp = torch.randn(1, 10, device=torch.cuda.current_device()) output = model(*inp) - loss = model.module.get_loss(inp, output).cuda() - model.module.run_backward(loss) + loss = output.sum() + expected_dtype = torch.float32 if mixed_precision is None else torch.float16 + self.assertEqual(expected_dtype, loss.dtype) + loss.backward() optim.step() trained_params = _get_full_detached_param(model) @@ -162,6 +170,10 @@ def test_save_and_load_after_forward_state_dict(self): state_dict = {k: v.clone() for k, v in model.state_dict().items()} _zero_model(model) + # Ensure checkpointed params have the full param dtype + for tensor in state_dict.values(): + self.assertEqual(tensor.dtype, torch.float32) + # Load state_dict into zeroed model model.load_state_dict(state_dict) loaded_params = _get_full_detached_param(model) @@ -185,7 +197,7 @@ def _state_dict(model: Module, state_dict_type: str): except KeyError: raise ValueError(f"No state_dict type for {state_dict_type}") - with model.state_dict_type(enum_val): + with FSDP.state_dict_type(model, enum_val): return model.state_dict() @staticmethod @@ -197,7 +209,7 @@ def _load_state_dict( except KeyError: raise ValueError(f"No state_dict for {state_dict_type}") - with model.state_dict_type(enum_val): + with FSDP.state_dict_type(model, enum_val): return model.load_state_dict(state_dict) def _dist_train(self, wrap_fsdp: bool, state_dict_type: str = ""): @@ -274,6 +286,70 @@ def test_state_dict_load_into_local_module(self): for fsdp_param, local_param in zip(fsdp_params, local_params): self.assertEqual(fsdp_param, local_param) + @skip_if_lt_x_gpu(2) + @parametrize("double_nest", [True]) + def test_state_dict_skip_module(self, double_nest): + torch.cuda.set_device(self.rank) + + def _create_module(wrap_fsdp=True): + LINEAR_SKIP = "linear_skip" + ctx = enable_wrap(wrapper_cls=FSDP) if wrap_fsdp else suppress() + with ctx: + module = SkipModel(double_nest=double_nest) + # Full name of linear_skip param tensors in SkipModel, as would be + # stored in checkpoint. + linear_skip_tensor_names = [ + k for k in dict(module.named_parameters()).keys() + if LINEAR_SKIP in k + ] + # skip SkipModule + linear_skip = getattr(module, LINEAR_SKIP) + delattr(module, LINEAR_SKIP) + # Wrap FSDP + fsdp = wrap(module) + # reattach + setattr(module, LINEAR_SKIP, linear_skip) + return fsdp, linear_skip_tensor_names + + fsdp, linear_skip_tensor_names = _create_module() + # Run a forward pass + inp = torch.randn((1, 10), device=torch.cuda.current_device()) + loss = fsdp(inp) + loss.sum().backward() + + state_dict = fsdp.state_dict() + if self.rank == 0: + sd_keys = list(state_dict.keys()) + expected = list(SkipModel(double_nest=False).state_dict().keys()) + self.assertEqual(sorted(sd_keys), sorted(expected)) + # TODO: parameters in linear_skip_tensor_names should not be handled + # by FSDP.state_dict(). Have a check once this is implemented in + # FSDP.state_dict(). 
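# Illustrative sketch, not part of the diff: the save/zero/load round trip these
# state_dict tests revolve around, using the static FSDP.state_dict_type()
# context manager that the hunks above switch to.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

def state_dict_roundtrip(model: FSDP) -> None:
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
        state = {k: v.clone() for k, v in model.state_dict().items()}
    with torch.no_grad():
        for p in model.parameters():
            p.zero_()
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
        model.load_state_dict(state)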
+ + # Check that it can be loaded into FSDP. + new_fsdp, _ = _create_module() + _zero_model(new_fsdp) + for (p1, p2) in zip(fsdp.parameters(), new_fsdp.parameters()): + self.assertNotEqual(p1, p2) + new_fsdp.load_state_dict(deepcopy(state_dict)) + for (p1, p2) in zip(fsdp.parameters(), new_fsdp.parameters()): + self.assertEqual(p1, p2) + + # Test that the checkpoint can be loaded into a local model. + local, _ = _create_module(wrap_fsdp=False) + for param in local.parameters(): + with torch.no_grad(): + param.zero_() + + with fsdp.summon_full_params(): + for (p1, p2) in zip(fsdp.parameters(), local.parameters()): + self.assertNotEqual(p1, p2) + + local.load_state_dict(deepcopy(state_dict)) + with fsdp.summon_full_params(): + for (p1, p2) in zip(fsdp.parameters(), local.parameters()): + self.assertEqual(p1, p2) + instantiate_parametrized_tests(TestFSDPStateDict) diff --git a/test/distributed/fsdp/test_fsdp_summon_full_params.py b/test/distributed/fsdp/test_fsdp_summon_full_params.py index f0632e64cf4bab..42ad9354ba3b68 100644 --- a/test/distributed/fsdp/test_fsdp_summon_full_params.py +++ b/test/distributed/fsdp/test_fsdp_summon_full_params.py @@ -7,8 +7,9 @@ import torch import torch.nn as nn from torch import distributed as dist -from torch.distributed.fsdp import CPUOffload +from torch.distributed.fsdp import CPUOffload, MixedPrecision from torch.distributed.fsdp import FlatParameter +from torch.distributed.fsdp.wrap import wrap, enable_wrap from torch.distributed.fsdp import FullyShardedDataParallel as FSDP from torch.testing._internal.common_distributed import skip_if_lt_x_gpu from torch.testing._internal.common_fsdp import ( @@ -37,10 +38,12 @@ sys.exit(0) -def _run_test_summon_full_param_writeback(cls, writeback, cpu_offload, modify_outer): - model = FSDP( - nn.Sequential(FSDP(nn.Linear(5, 5, bias=False)), nn.Linear(5, 3, bias=False)) - ).cuda(cls.rank) +def _run_test_summon_full_param_writeback(cls, writeback, modify_outer, *fsdp_args, **fsdp_kwargs): + with enable_wrap(wrapper_cls=FSDP, *fsdp_args, **fsdp_kwargs): + lin1 = wrap(nn.Linear(5, 5, bias=False).cuda(cls.rank)) + lin2 = nn.Linear(5, 3, bias=False).cuda(cls.rank) + model = wrap(nn.Sequential(lin1, lin2)) + # set the value outer_param = model.get_parameter("_fsdp_wrapped_module.flat_param") @@ -72,17 +75,19 @@ def world_size(self): @skip_if_lt_x_gpu(2) @parametrize("writeback", [True, False]) - @parametrize( - "cpu_offload", - [CPUOffload(offload_params=True), CPUOffload(offload_params=False)], - ) @parametrize("modify_outer", [True, False]) - def test_summon_full_param_writeback(self, writeback, cpu_offload, modify_outer): + @parametrize("mixed_precision", [True, False]) + # TODO: CPUOffload summon + writeback does not + # work when param is not sharded + # (currently when world_size == 1) + def test_summon_full_param_writeback(self, writeback, modify_outer, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None return _run_test_summon_full_param_writeback( self, writeback, - cpu_offload, - modify_outer, + modify_outer=modify_outer, + cpu_offload=CPUOffload(offload_params=False), + mixed_precision=mixed_precision, ) @@ -104,20 +109,27 @@ def get_expected_sharded_size(self, global_size): "cpu_offload", [CPUOffload(offload_params=True), CPUOffload(offload_params=False)], ) + @parametrize("mixed_precision", [True, False]) @parametrize("modify_outer", [True, False]) - def test_summon_full_param_writeback(self, writeback, cpu_offload, modify_outer): + def test_summon_full_param_writeback(self, 
writeback, cpu_offload, mixed_precision, modify_outer): + mixed_precision = MixedPrecision() if mixed_precision else None return _run_test_summon_full_param_writeback( - self, writeback, cpu_offload, modify_outer + self, + writeback, + modify_outer, + cpu_offload=cpu_offload, + mixed_precision=mixed_precision, ) @skip_if_lt_x_gpu(2) - def test_summon_full_param_shard_value(self): - + @parametrize("mixed_precision", [True, False]) + def test_summon_full_param_shard_value(self, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None raw_model = nn.Linear(10, 11) raw_model_size = self.get_model_param_count(raw_model) expected_shard_size = self.get_expected_sharded_size(raw_model_size) - model = FSDP(raw_model.cuda(self.rank)) + model = FSDP(raw_model.cuda(self.rank), mixed_precision=mixed_precision) self.assertEqual(expected_shard_size, self.get_model_param_count(model)) # we're assuming a single flatenned param @@ -140,11 +152,15 @@ def test_summon_full_param_shard_value(self): @skip_if_lt_x_gpu(2) @parametrize("recurse", [True, False]) @parametrize("summon_outer", [True, False]) - def test_summon_full_param_recursive(self, recurse, summon_outer): + @parametrize("mixed_precision", [True, False]) + def test_summon_full_param_recursive(self, recurse, summon_outer, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None model = FSDP( nn.Sequential( - FSDP(nn.Linear(5, 5, bias=False)), nn.Linear(5, 3, bias=False) - ) + FSDP(nn.Linear(5, 5, bias=False), mixed_precision=mixed_precision), + nn.Linear(5, 3, bias=False) + ), + mixed_precision=mixed_precision, ).cuda(self.rank) global_inner_numel = self.get_model_param_count(nn.Linear(5, 5, bias=False)) @@ -210,11 +226,15 @@ def bad_backwards_hook(tensor): output.backward() @skip_if_lt_x_gpu(2) - def test_summon_full_params_respects_reshard_after_forward(self): + @parametrize("mixed_precision", [True, False]) + def test_summon_full_params_respects_reshard_after_forward(self, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None model = FSDP( nn.Sequential( - FSDP(nn.Linear(5, 5, bias=False)), nn.Linear(5, 3, bias=False) - ) + FSDP(nn.Linear(5, 5, bias=False), mixed_precision=mixed_precision), + nn.Linear(5, 3, bias=False) + ), + mixed_precision=mixed_precision, ).cuda(self.rank) outer_param = model.get_parameter("_fsdp_wrapped_module.flat_param") @@ -225,7 +245,6 @@ def test_summon_full_params_respects_reshard_after_forward(self): # trigger lazy init model(torch.zeros(5).cuda(self.rank)) - # the root FSDP module keeps all params around self.assertEqual( outer_full_param_size, outer_param._full_param_padded.storage().size() @@ -263,7 +282,9 @@ def test_summon_single_param(self): self.assertEqual(self.rank + 2, p[0]) @skip_if_lt_x_gpu(2) - def test_summon_full_params_equivalence(self): + @parametrize("rank0_only", [True, False]) + @parametrize("offload_to_cpu", [True, False]) + def test_summon_full_params_equivalence(self, rank0_only, offload_to_cpu): offload = CPUOffload(offload_params=True) model = FSDP( DeterministicModel(wrap_fsdp=True, cpu_offload=offload), @@ -271,20 +292,34 @@ def test_summon_full_params_equivalence(self): ) local_model = DeterministicModel(wrap_fsdp=False) - with model.summon_full_params(recurse=True): + dev = torch.device("cpu") if offload_to_cpu else torch.device("cuda", torch.cuda.current_device()) + + params_to_compare = ( + [p.clone() for p in model.parameters()] if rank0_only and self.rank != 0 + else list(local_model.parameters()) + ) + 
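# Illustrative sketch, not part of the diff, of the summon_full_params() options
# used just below: rank0_only gathers the full parameters only on rank 0,
# offload_to_cpu keeps the gathered copy on CPU, and writeback (unsupported
# together with rank0_only=True) controls whether edits are written back to the shards.
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(nn.Linear(5, 5).cuda())
with fsdp_model.summon_full_params(
    recurse=True, rank0_only=True, writeback=False, offload_to_cpu=True,
):
    if dist.get_rank() == 0:
        print(fsdp_model.weight.shape)  # full, unflattened parameter view on rank 0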
+ with model.summon_full_params(recurse=True, rank0_only=rank0_only, writeback=not rank0_only, offload_to_cpu=offload_to_cpu): # Below sleep causes failures without stream synchronization in # summon_full_params fix. torch.cuda._sleep(1000000) - fsdp_params = deepcopy(list(model.parameters())) + # FSDP param deepcopy() of params has issues + fsdp_params = [p.clone() for p in model.parameters()] - self.assertEqual(fsdp_params, list(local_model.parameters())) + self.assertEqual(fsdp_params, params_to_compare) @skip_if_lt_x_gpu(2) - def test_reshard_outside_forward_backward_iteration(self): + @parametrize("rank0_only", [True, False]) + @parametrize("offload_to_cpu", [True, False]) + @parametrize("mixed_precision", [True, False]) + def test_reshard_outside_forward_backward_iteration(self, rank0_only, offload_to_cpu, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None model = FSDP( nn.Sequential( - FSDP(nn.Linear(5, 5, bias=False)), nn.Linear(5, 1, bias=False) - ) + FSDP(nn.Linear(5, 5, bias=False), mixed_precision=mixed_precision), + nn.Linear(5, 1, bias=False) + ), + mixed_precision=mixed_precision, ).cuda(self.rank) outer_param = model.get_parameter("_fsdp_wrapped_module.flat_param") @@ -310,7 +345,11 @@ def test_reshard_outside_forward_backward_iteration(self): # now lets repeat it with summon done in between output = model(torch.zeros(5).cuda(self.rank)) - with model.summon_full_params(): + self.assertEqual( + outer_full_param_size, outer_param._full_param_padded.storage().size() + ) + self.assertEqual(0, inner_param._full_param_padded.storage().size()) + with model.summon_full_params(rank0_only=rank0_only, writeback=not rank0_only, offload_to_cpu=offload_to_cpu): pass self.assertEqual( outer_full_param_size, outer_param._full_param_padded.storage().size() @@ -318,43 +357,128 @@ def test_reshard_outside_forward_backward_iteration(self): self.assertEqual(0, inner_param._full_param_padded.storage().size()) output.backward() - with model.summon_full_params(): + with model.summon_full_params(rank0_only=rank0_only, writeback=not rank0_only, offload_to_cpu=offload_to_cpu): pass self.assertEqual(0, outer_param._full_param_padded.storage().size()) self.assertEqual(0, inner_param._full_param_padded.storage().size()) @skip_if_lt_x_gpu(2) - def test_params_are_unflattenned(self): + @parametrize("rank0_only", [True, False]) + @parametrize("offload_to_cpu", [True, False]) + @parametrize("mixed_precision", [True, False]) + def test_params_are_unflattenned(self, rank0_only, offload_to_cpu, mixed_precision): layer_shape = (10, 12) model = nn.Linear(*layer_shape, bias=False).cuda(self.rank) - fsdp_model = FSDP(deepcopy(model)).cuda(self.rank) + mixed_precision = MixedPrecision() if mixed_precision else None + fsdp_model = FSDP(deepcopy(model), mixed_precision=mixed_precision).cuda(self.rank) - flattened_param = fsdp_model.get_parameter("_fsdp_wrapped_module.flat_param") + def _get_flat_param(): + return fsdp_model.get_parameter("_fsdp_wrapped_module.flat_param") + + flattened_param = _get_flat_param() self.assertEqual(layer_shape[0] * layer_shape[1] / 2, flattened_param.numel()) - with fsdp_model.summon_full_params(): - self.assertEqual(fsdp_model.weight.shape, model.weight.shape) + with fsdp_model.summon_full_params(rank0_only=rank0_only, writeback=not rank0_only, offload_to_cpu=offload_to_cpu): + if self.rank == 0 or not rank0_only: + self.assertEqual(fsdp_model.weight.shape, model.weight.shape) + expected_device = ( + torch.device("cpu") if offload_to_cpu else 
torch.device("cuda", torch.cuda.current_device()) + ) + self.assertTrue(expected_device == fsdp_model.weight.device) + else: + # Nonzero rank with rank0_only maintains original params. + flat_within_ctx = _get_flat_param() + self.assertEqual(flat_within_ctx, flattened_param) + self.assertEqual(flat_within_ctx.device, torch.device(torch.cuda.current_device())) + + # CPU offload should restore the param device + param = next(fsdp_model.parameters()) + self.assertTrue(param.device == torch.device("cuda", torch.cuda.current_device())) @skip_if_lt_x_gpu(2) - def test_params_count_and_value(self): + @parametrize("rank0_only", [True, False]) + @parametrize("offload_to_cpu", [True, False]) + @parametrize("mixed_precision", [True, False]) + def test_params_count_and_value(self, rank0_only, offload_to_cpu, mixed_precision): + mixed_precision = MixedPrecision() if mixed_precision else None fsdp_model = FSDP( NestedWrappedModule( group=dist.distributed_c10d._get_default_group(), wrap_fsdp=True, fsdp_init_mode=FSDPInitMode.CUDA_BEFORE, - ) + mixed_precision=mixed_precision, + ), + mixed_precision=mixed_precision, ) model = NestedWrappedModule( group=dist.distributed_c10d._get_default_group(), wrap_fsdp=False, fsdp_init_mode=FSDPInitMode.CUDA_BEFORE, ) - with fsdp_model.summon_full_params(): + + dev = ( + torch.device("cpu") if offload_to_cpu + else torch.device("cuda", torch.cuda.current_device()) + ) + + params_to_compare = ( + [p.to(dev) for p in model.module.parameters()] + if not rank0_only or self.rank == 0 else + list(p.clone() for p in fsdp_model.parameters()) + ) + with fsdp_model.summon_full_params(rank0_only=rank0_only, writeback=not rank0_only): for p1, p2 in itertools.zip_longest( - fsdp_model.parameters(), model.module.parameters() + fsdp_model.parameters(), params_to_compare ): self.assertEqual(p1, p2) + # CPU offload should restore the param device + param = next(fsdp_model.parameters()) + self.assertTrue( + param.device == torch.device("cuda", torch.cuda.current_device()) + ) + + @skip_if_lt_x_gpu(2) + def test_raises_rank0_with_writeback(self): + fsdp_model = FSDP( + NestedWrappedModule( + group=dist.distributed_c10d._get_default_group(), + wrap_fsdp=True, + fsdp_init_mode=FSDPInitMode.CUDA_BEFORE, + ) + ) + + with self.assertRaisesRegex(ValueError, "is not supported"): + with fsdp_model.summon_full_params(rank0_only=True, writeback=True): + pass + + @skip_if_lt_x_gpu(2) + @parametrize("prefix", ["", "test_prefix"]) + @parametrize("recurse", [False, True]) + def test_named_parameters_buffers(self, prefix: str, recurse: bool): + fsdp_model = FSDP( + NestedWrappedModule( + group=dist.distributed_c10d._get_default_group(), + wrap_fsdp=True, + fsdp_init_mode=FSDPInitMode.CUDA_BEFORE, + ) + ) + fsdp_model.register_buffer("buffer", torch.ones(1)) + model = NestedWrappedModule( + group=dist.distributed_c10d._get_default_group(), + wrap_fsdp=False, + fsdp_init_mode=FSDPInitMode.CUDA_BEFORE, + ) + model.register_buffer("buffer", torch.ones(1)) + with fsdp_model.summon_full_params(): + for call in ["named_parameters", "named_buffers"]: + for (n1, p1), (n2, p2) in itertools.zip_longest( + getattr(fsdp_model, call)(prefix=prefix, recurse=recurse), + getattr(model, call)(prefix=prefix, recurse=recurse), + ): + self.assertEqual(n1, n2) + self.assertEqual(p1, p2) + instantiate_parametrized_tests(TestSummonFullParams) instantiate_parametrized_tests(TestSummonFullParamsNoShard) diff --git a/test/distributed/fsdp/test_fsdp_traversal.py b/test/distributed/fsdp/test_fsdp_traversal.py new file mode 100644 
index 00000000000000..69ceca082441bf --- /dev/null +++ b/test/distributed/fsdp/test_fsdp_traversal.py @@ -0,0 +1,57 @@ +# Owner(s): ["oncall: distributed"] + +import sys + +from torch import distributed as dist +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.testing._internal.common_distributed import skip_if_lt_x_gpu +from torch.testing._internal.common_fsdp import ( + FSDPTest, + NestedWrappedModule, +) +from torch.testing._internal.common_utils import ( + TEST_WITH_DEV_DBG_ASAN, + run_tests, +) + + +if not dist.is_available(): + print("Distributed not available, skipping tests", file=sys.stderr) + sys.exit(0) + +if TEST_WITH_DEV_DBG_ASAN: + print( + "Skip dev-asan as torch + multiprocessing spawn have known issues", + file=sys.stderr, + ) + sys.exit(0) + + +class TestTraversal(FSDPTest): + @property + def world_size(self): + return 2 + + @skip_if_lt_x_gpu(2) + def test_fsdp_modules(self): + group = dist.distributed_c10d._get_default_group() + model = NestedWrappedModule(group, wrap_fsdp=True) + modules = FSDP.fsdp_modules(model) + self.assertEquals( + modules, [ + model.module.get_submodule("1"), + model.module.get_submodule("1").get_submodule("0"), + model.module.get_submodule("2"), + ] + ) + modules = FSDP.fsdp_modules(model, root_only=True) + self.assertEqual( + modules, [ + model.module.get_submodule("1"), + model.module.get_submodule("2"), + ] + ) + + +if __name__ == "__main__": + run_tests() diff --git a/test/distributed/fsdp/test_utils.py b/test/distributed/fsdp/test_utils.py index 5ac13eefa7e970..99a17a26d2d586 100644 --- a/test/distributed/fsdp/test_utils.py +++ b/test/distributed/fsdp/test_utils.py @@ -1,5 +1,6 @@ # Owner(s): ["oncall: distributed"] +from collections import OrderedDict import random import sys import unittest @@ -58,7 +59,7 @@ def get_a_tensor(): data.append({"key1": get_a_tensor(), "key2": {1: get_a_tensor()}, "key3": 3}) data.insert(0, set(["x", get_a_tensor(), get_a_tensor()])) data.append(([1], get_a_tensor(), (1), [get_a_tensor()], set((1, 2)))) - od = dict() + od = OrderedDict() od["k"] = "value" data.append(od) diff --git a/test/distributed/fsdp/test_wrap.py b/test/distributed/fsdp/test_wrap.py index 0b4c1f8acc6cc7..d181ca23235aef 100644 --- a/test/distributed/fsdp/test_wrap.py +++ b/test/distributed/fsdp/test_wrap.py @@ -5,7 +5,6 @@ import os import tempfile import unittest - import torch import torch.nn as nn import torch.nn.functional as F @@ -15,6 +14,7 @@ BackwardPrefetch, ) from torch.distributed.fsdp.wrap import ( + always_wrap_policy, default_auto_wrap_policy, enable_wrap, wrap, @@ -67,6 +67,15 @@ def get_model(cuda=True): sequential = sequential.cuda() return sequential + @staticmethod + def verify_model_all_wrapped(cls, model): + cls.assertTrue(isinstance(model, FSDP)) + cls.assertTrue(isinstance(model.module[0], FSDP)) + cls.assertTrue(isinstance(model.module[1], FSDP)) + cls.assertTrue(isinstance(model.module[2], FSDP)) + cls.assertTrue(isinstance(model.module[2].module[0], FSDP)) + cls.assertTrue(isinstance(model.module[2].module[1], FSDP)) + @staticmethod def verify_model(cls, model): cls.assertTrue(isinstance(model, FSDP)) @@ -123,7 +132,7 @@ def test_error_already_wrapped(self, nested, fsdp_init_mode): wrapped_fsdp = wrapped_fsdp.cuda() with self.assertRaisesRegex(ValueError, "to NOT be FullyShardedDataParallel"): - mod = FSDP(wrapped_fsdp, fsdp_auto_wrap_policy=default_auto_wrap_policy) + mod = FSDP(wrapped_fsdp, auto_wrap_policy=default_auto_wrap_policy) @skip_if_lt_x_gpu(2) @parametrize( @@ -168,7 
+177,7 @@ def forward(self, input): model = MyModel() wrapped_model = FSDP( model, - fsdp_auto_wrap_policy=functools.partial( + auto_wrap_policy=functools.partial( default_auto_wrap_policy, min_num_params=0, # wrap all modules ), @@ -226,7 +235,7 @@ def test_wrap(self, wrap_method): layer = FSDP( nn.Linear(5, 5), process_group=self.process_group, - fsdp_auto_wrap_policy=functools.partial(default_auto_wrap_policy, min_num_params=1) + auto_wrap_policy=functools.partial(default_auto_wrap_policy, min_num_params=1) ) self.assertTrue(isinstance(layer, FSDP)) self.assertEqual(layer.rank, self.process_group.rank()) @@ -257,6 +266,16 @@ def test_wrap_override_defaults(self): self.assertEqual(layer.rank, 0) self.assertEqual(layer.world_size, 2) + @unittest.skipIf(not torch.cuda.is_available(), "Test Requires CUDA") + def test_always_wrap(self): + """ + Test to ensure that if `always_wrap_policy` is + passed into FSDP, all submodules are wrapped. + """ + seq = TestFSDPWrap.NestedSequentialModel.get_model(cuda=True) + model = FSDP(seq, process_group=self.process_group, auto_wrap_policy=always_wrap_policy) + TestFSDPWrap.NestedSequentialModel.verify_model_all_wrapped(self, model) + def test_auto_wrap_api(self): """ Test to ensure with auto wrap, we wrap child modules correctly based on the min_num_params. @@ -269,7 +288,7 @@ def test_auto_wrap_api(self): model = FSDP( sequential, process_group=self.process_group, - fsdp_auto_wrap_policy=my_auto_wrap_policy + auto_wrap_policy=my_auto_wrap_policy ) TestFSDPWrap.NestedSequentialModel.verify_model(self, model) @@ -288,7 +307,7 @@ def test_auto_wrap_preset_exclude_wrap(self): model = FSDP( sequential, process_group=self.process_group, - fsdp_auto_wrap_policy=my_auto_wrap_policy + auto_wrap_policy=my_auto_wrap_policy ) self.assertTrue(isinstance(model, FSDP)) @@ -304,7 +323,7 @@ def test_auto_wrap_preset_exclude_wrap_include_children(self): my_auto_wrap_policy = functools.partial( default_auto_wrap_policy, min_num_params=40 ) - model = FSDP(sequential, process_group=self.process_group, fsdp_auto_wrap_policy=my_auto_wrap_policy) + model = FSDP(sequential, process_group=self.process_group, auto_wrap_policy=my_auto_wrap_policy) self.assertTrue(isinstance(model, FSDP)) self.assertTrue(isinstance(model[0], FSDP)) @@ -318,7 +337,7 @@ def test_auto_wrap_preset_force_leaf(self): my_auto_wrap_policy = functools.partial( default_auto_wrap_policy, min_num_params=40 ) - model = FSDP(sequential, process_group=self.process_group, fsdp_auto_wrap_policy=my_auto_wrap_policy) + model = FSDP(sequential, process_group=self.process_group, auto_wrap_policy=my_auto_wrap_policy) self.assertTrue(isinstance(model.module[0], FSDP)) # Assert children of multihead attention are not wrapped self.assertTrue(isinstance(model.module[1], nn.MultiheadAttention)) @@ -338,7 +357,7 @@ def test_auto_wrap_preset_force_leaf_custom(self): sequential = nn.Sequential( nn.Linear(10, 10), nn.ModuleList([nn.Linear(10, 10)]) ) - model = FSDP(sequential, process_group=self.process_group, fsdp_auto_wrap_policy=my_auto_wrap_policy) + model = FSDP(sequential, process_group=self.process_group, auto_wrap_policy=my_auto_wrap_policy) # Model was wrapped in FSDP as no inner modules were wrapped. 
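# Illustrative sketch, not part of the diff, of the renamed argument these hunks
# switch to: auto_wrap_policy (formerly fsdp_auto_wrap_policy). A size-based
# policy wraps each submodule holding at least min_num_params parameters in its
# own FSDP unit; always_wrap_policy wraps every submodule unconditionally.
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import always_wrap_policy, default_auto_wrap_policy

seq = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10)).cuda()
size_based_model = FSDP(
    seq, auto_wrap_policy=functools.partial(default_auto_wrap_policy, min_num_params=40),
)
# Alternatively: FSDP(seq, auto_wrap_policy=always_wrap_policy)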
self.assertTrue(isinstance(model, FSDP)) self.assertTrue(isinstance(model.module[0], nn.Linear)) @@ -380,7 +399,7 @@ def test_auto_wrap_smoke_test(self, fsdp_init_mode, cpu_offload): my_auto_wrap_policy = functools.partial( default_auto_wrap_policy, min_num_params=40 ) - model = FSDP(sequential, cpu_offload=cpu_offload, fsdp_auto_wrap_policy=my_auto_wrap_policy) + model = FSDP(sequential, cpu_offload=cpu_offload, auto_wrap_policy=my_auto_wrap_policy) TestFSDPWrap.NestedSequentialModel.verify_model(self, model) if cuda_after_init: model = model.cuda() diff --git a/test/distributed/optim/test_zero_redundancy_optimizer.py b/test/distributed/optim/test_zero_redundancy_optimizer.py index 67c274575d4468..6f8639395a8c5b 100644 --- a/test/distributed/optim/test_zero_redundancy_optimizer.py +++ b/test/distributed/optim/test_zero_redundancy_optimizer.py @@ -6,19 +6,17 @@ # LICENSE file in the root directory of this source tree. import copy -import itertools import os import sys +import unittest from contextlib import suppress -from typing import Any, List, Type, cast +from typing import Any, List, cast import numpy as np import torch import torch.distributed as dist -import unittest - if not dist.is_available(): print("Distributed not available, skipping tests", file=sys.stderr) sys.exit(0) @@ -34,15 +32,16 @@ from torch.distributed.optim.zero_redundancy_optimizer import _broadcast_object from torch.nn.parallel import DistributedDataParallel as DDP from torch.optim import SGD, AdamW -from torch.testing._internal import common_distributed, common_utils +from torch.testing._internal import common_distributed from torch.testing._internal.common_utils import ( + IS_WINDOWS, TEST_WITH_ASAN, TEST_WITH_DEV_DBG_ASAN, - sandcastle_skip_if, + instantiate_parametrized_tests, + parametrize, + run_tests, ) -from torch.testing._internal.common_utils import IS_WINDOWS - try: import torchvision HAS_TORCHVISION = True @@ -60,30 +59,19 @@ def _get_backend_for_tests(): BACKEND = _get_backend_for_tests() -DEVICE = "cuda" if torch.cuda.is_available() else "cpu" - - -def check_same_model_params(model_a: torch.nn.Module, model_b: torch.nn.Module, message: str = "") -> None: - for p_a, p_b in zip(model_a.parameters(), model_b.parameters()): - assert torch.allclose(p_a, p_b, atol=1e-3), f"Model parameters differ\n{p_a} {p_b}\n" + message - - for b_a, b_b in zip(model_a.buffers(), model_b.buffers()): - assert torch.allclose(b_a, b_b), f"Model buffers differ {b_a} - {b_b}\n" + message - - @unittest.skipIf( - TEST_WITH_ASAN or TEST_WITH_DEV_DBG_ASAN, "CUDA + ASAN doesnt work." + TEST_WITH_ASAN or TEST_WITH_DEV_DBG_ASAN, "CUDA + ASAN does not work." 
) class TestZeroRedundancyOptimizer(common_distributed.MultiProcessTestCase): def setUp(self): super(TestZeroRedundancyOptimizer, self).setUp() os.environ["WORLD_SIZE"] = str(self.world_size) - self._spawn_processes() @property def device(self): - return torch.device(self.rank) if torch.cuda.is_available() else torch.device("cpu") + return torch.device("cuda") if torch.cuda.is_available() \ + else torch.device("cpu") @property def world_size(self): @@ -94,7 +82,6 @@ def tearDown(self): torch.distributed.destroy_process_group() except AssertionError: pass - try: os.remove(self.file_name) except OSError: @@ -104,75 +91,94 @@ def dist_init(self, rank, world_size=-1, backend=BACKEND): if (world_size < 1): world_size = self.world_size store = dist.FileStore(self.file_name, world_size) - return dist.init_process_group(backend=backend, store=store, rank=rank, world_size=world_size) + return dist.init_process_group( + backend=backend, store=store, rank=rank, world_size=world_size, + ) # TODO: sandcastle_skip_if does not work here. @unittest.skipIf( - TEST_WITH_ASAN or TEST_WITH_DEV_DBG_ASAN, "CUDA + ASAN doesnt work." + TEST_WITH_ASAN or TEST_WITH_DEV_DBG_ASAN, "CUDA + ASAN does not work." ) class TestZeroRedundancyOptimizerSingleRank(TestZeroRedundancyOptimizer): def test_state_dict(self): - """Check that the ZeroRedundancyOptimizer exposes the expected state dict interface, - irrespective of the sharding. - """ + """Check that ZeroRedundancyOptimizer exposes the expected state dict + interface, irrespective of the sharding.""" self.dist_init(self.rank) - x = torch.tensor([1.0], device=DEVICE, requires_grad=True) - o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=0.1, momentum=0.9) + LR1 = 0.1 + LR2 = 0.01 + MOMENTUM = 0.9 + RECIPIENT_RANK = 0 # rank 0 is the only rank since the world size is 1 + x = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer( + [x], optimizer_class=SGD, lr=LR1, momentum=MOMENTUM, + ) x.backward() o.step() - self.assertEqual(x, torch.tensor([0.9], device=DEVICE)) - self.assertEqual(o.optim.state[x]["momentum_buffer"], torch.tensor([1.0], device=DEVICE)) + self.assertEqual(x, torch.tensor([0.9], device=self.device)) + self.assertEqual( + o.optim.state[x]["momentum_buffer"], + torch.tensor([1.0], device=self.device), + ) o.zero_grad() - o.consolidate_state_dict() # Sync state dict in between replicas - even if there are none + o.consolidate_state_dict(to=RECIPIENT_RANK) state_dict = o.state_dict() - # Check that the state dict is pytorch-compliant key wise + # Check that the state dict has keys compliant with PyTorch self.assertIn("param_groups", state_dict.keys()) self.assertIn("state", state_dict.keys()) - # Check that the pulled state is what we expect, and that we have all the expected keys + # Check that the state has the expected keys self.assertEqual(state_dict["param_groups"][0]["lr"], 0.1) self.assertEqual(state_dict["param_groups"][0]["momentum"], 0.9) self.assertFalse(state_dict["param_groups"][0]["nesterov"]) self.assertEqual(state_dict["param_groups"][0]["weight_decay"], 0.0) self.assertEqual(state_dict["param_groups"][0]["dampening"], 0.0) - # Check that the pulled state and the .param_groups attribute are in sync - for k in state_dict["param_groups"][0].keys(): + # Check that the state and the `param_groups` attribute are in sync + for k in state_dict["param_groups"][0]: if k != "params": - self.assertEqual(state_dict["param_groups"][0][k], o.param_groups[0][k]) + self.assertEqual( + 
state_dict["param_groups"][0][k], + o.param_groups[0][k], + ) - # Check that it's correctly loaded - o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=0.01) + # Check that the state is reloaded with the correct values and device + o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=LR2) o.load_state_dict(state_dict) + self.assertEqual( + o.optim.state[x]["momentum_buffer"], + torch.tensor([1.0], device=self.device), + ) - # Check that state is correct and on proper device - self.assertEqual(o.optim.state[x]["momentum_buffer"], torch.tensor([1.0], device=DEVICE)) - - # We should now be using a lr of 0.1, both within the optimizer - # and as exposed by the .param_groups attribute - assert o.param_groups[0]["lr"] == 0.1 + # We should we using `LR1` and not `LR2` after reloading, both within + # the optimizer and as exposed by the `param_groups` attribute + self.assertEqual(o.param_groups[0]["lr"], LR1) x.backward() o.step() - self.assertEqual(x, torch.tensor([0.71], device=DEVICE)) - self.assertEqual(o.optim.state[x]["momentum_buffer"], torch.tensor([1.9], device=DEVICE)) + self.assertEqual(x, torch.tensor([0.71], device=self.device)) + self.assertEqual( + o.optim.state[x]["momentum_buffer"], + torch.tensor([1.9], device=self.device), + ) - # Check that the exposed param_groups are on the proper device + # Check that the exposed `param_groups`` are on the proper device self.assertEqual(o.param_groups[0]["params"][0].device, x.device) def test_lr_scheduler(self): - """ Check that a normal torch lr_scheduler is usable with ZeroRedundancyOptimizer""" - + """Check that a normal PyTorch ``lr_scheduler`` is usable with + ZeroRedundancyOptimizer.""" self.dist_init(self.rank) - x = torch.tensor([1.0], device=DEVICE, requires_grad=True) - x2 = torch.tensor([1.0], device=DEVICE, requires_grad=True) - o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=0.01) - o2 = torch.optim.SGD([x2], lr=0.01) + NUM_ITERS = 5 + LR = 0.01 + x = torch.tensor([1.0], device=self.device, requires_grad=True) + x2 = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=LR) + o2 = torch.optim.SGD([x2], lr=LR) s = torch.optim.lr_scheduler.StepLR(o, 1) s2 = torch.optim.lr_scheduler.StepLR(o2, 1) - for _ in range(5): + for _ in range(NUM_ITERS): x.backward() o.zero_grad() o.step() @@ -184,8 +190,9 @@ def test_lr_scheduler(self): self.assertEqual(x, x2) def test_step_with_kwargs(self): - """ Check that the `step(**kwargs)` interface is properly exposed""" + """Check that the ``step(**kwargs)`` interface is properly exposed.""" self.dist_init(self.rank) + LR = 0.1 class SGDWithStepKWArg(torch.optim.SGD): def step(self, closure=None, kwarg=None): @@ -193,18 +200,21 @@ def step(self, closure=None, kwarg=None): kwarg.append(5) kwarg: List[Any] = [] - x = torch.tensor([1.0], device=DEVICE, requires_grad=True) - o = ZeroRedundancyOptimizer([x], optimizer_class=SGDWithStepKWArg, lr=0.1) + x = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer( + [x], optimizer_class=SGDWithStepKWArg, lr=LR, + ) x.backward() o.step(0, kwarg=kwarg) self.assertEqual(kwarg, [5]) - self.assertEqual(x, torch.tensor([0.9], device=DEVICE)) + self.assertEqual(x, torch.tensor([0.9], device=self.device)) def test_step_with_extra_inner_key(self): - """Check that an optimizer adding extra keys to the param_groups - is properly handled, in that the new key is exposed to the user - """ + """Check that ZeroRedundancyOptimizer wrapping an optimizer 
that adds + extra keys to ``param_groups`` exposes those keys through ZeRO's own + ``param_groups``.""" self.dist_init(self.rank) + LR = 0.1 class SGDWithNewKey(torch.optim.SGD): # Dummy optimizer which adds a new key to the param groups @@ -212,33 +222,38 @@ def step(self, closure=None): super().step() self.param_groups[0]["new_key"] = 0.1 - x = torch.tensor([1.0], device=DEVICE, requires_grad=True) - o = ZeroRedundancyOptimizer([x], optimizer_class=SGDWithNewKey, lr=0.1) + x = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer([x], optimizer_class=SGDWithNewKey, lr=LR) x.backward() o.step() self.assertEqual(o.param_groups[0]["new_key"], 0.1) - self.assertEqual(x, torch.tensor([0.9], device=DEVICE)) + self.assertEqual(x, torch.tensor([0.9], device=self.device)) def test_step_without_closure(self): - """Check that the step() method (without closure) is handlded as expected""" + """Check that the ``step()`` method (without closure) is handled as + expected.""" self.dist_init(self.rank) + LR = 0.1 class SGDWithoutClosure(torch.optim.SGD): def step(self): return super().step() - x = torch.tensor([1.0], device=DEVICE, requires_grad=True) - o = ZeroRedundancyOptimizer([x], optimizer_class=SGDWithoutClosure, lr=0.1) + x = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer( + [x], optimizer_class=SGDWithoutClosure, lr=LR, + ) x.backward() o.step() - self.assertEqual(x, torch.tensor([0.9], device=DEVICE)) + self.assertEqual(x, torch.tensor([0.9], device=self.device)) def test_zero_grad(self): - """Check that the zero_grad attribute is properly handled""" + """Check that the ``zero_grad`` method is properly handled.""" self.dist_init(self.rank) + LR = 0.01 x = torch.rand(1) m = torch.nn.Linear(1, 1) - o = ZeroRedundancyOptimizer(m.parameters(), optimizer_class=SGD, lr=0.1) + o = ZeroRedundancyOptimizer(m.parameters(), optimizer_class=SGD, lr=LR) y = m(x) y.backward(x) self.assertNotEqual(m.weight.grad, torch.zeros_like(m.weight)) @@ -251,46 +266,43 @@ def test_constructor(self): """Check the robustness of the ZeroRedundancyOptimizer constructor by passing different values for the ``params`` argument.""" self.dist_init(self.rank) - + LR = 0.01 m = torch.nn.Sequential( torch.nn.Linear(5, 10), torch.nn.Linear(10, 10), torch.nn.Linear(10, 10), ) - # Test various constructor inputs in the form: (input, expected error) ctor_inputs = [ - ([], ValueError), # empty parameter list - (torch.randn(1), TypeError), # non-iterable: `torch.Tensor` - (1.2, TypeError), # non-iterable: `float` + ([], ValueError), # empty parameter list + (torch.randn(1), TypeError), # non-iterable: `torch.Tensor` + (1.2, TypeError), # non-iterable: `float` ([ {"params": [l.weight for l in m]}, {"params": [l.bias for l in m]}, - ], None), # iterable of dict - (list(m.parameters()) + [42], TypeError), # iterable containing invalid type - (m.parameters(), None), # `params` as a generator - (list(m.parameters()), None) # `params` as a list + ], None), # iterable of dict + (list(m.parameters()) + [42], TypeError), # iterable containing invalid type + (m.parameters(), None), # `params` as a generator + (list(m.parameters()), None) # `params` as a list ] - for ctor_input, error in ctor_inputs: - if error: - with self.assertRaises(error): - ZeroRedundancyOptimizer(ctor_input, optimizer_class=SGD, lr=0.01) - else: - ZeroRedundancyOptimizer(ctor_input, optimizer_class=SGD, lr=0.01) + context = self.assertRaises(error) if error else suppress() + with context: + 
ZeroRedundancyOptimizer( + ctor_input, optimizer_class=SGD, lr=LR, + ) # Test constructing with multiple parameter groups more thoroughly - weight_decay = 0.01 - lr = 0.01 - betas = (0.9, 0.999) - eps = 1e-8 + WD = 0.01 + BETAS = (0.9, 0.999) + EPS = 1e-8 params = [ {"params": [l.weight for l in m], "weight_decay": 0.}, - {"params": [l.bias for l in m], "weight_decay": weight_decay}, + {"params": [l.bias for l in m], "weight_decay": WD}, ] o = ZeroRedundancyOptimizer( params, optimizer_class=AdamW, - lr=lr, betas=betas, eps=eps, + lr=LR, betas=BETAS, eps=EPS, ) assert len(o.param_groups) == 2, \ f"Expected 2 ZeRO param groups, but got {len(o.param_groups)}" @@ -306,7 +318,7 @@ def test_same_dense_param_type(self): and varying parameter types is added. """ self.dist_init(self.rank) - + LR = 0.01 inputs = [ [torch.sparse_coo_tensor(size=(2, 3))], [torch.FloatTensor(1), torch.DoubleTensor(1)], @@ -315,37 +327,63 @@ def test_same_dense_param_type(self): ] for input in inputs: with self.assertRaises(ValueError): - ZeroRedundancyOptimizer(input, optimizer_class=SGD, lr=0.1) + ZeroRedundancyOptimizer(input, optimizer_class=SGD, lr=LR) class TestZeroRedundancyOptimizerDistributed(TestZeroRedundancyOptimizer): + @property + def device(self): + return torch.device(self.rank) if torch.cuda.is_available() \ + else torch.device("cpu") + @property def world_size(self): return min(4, max(2, torch.cuda.device_count())) - @common_distributed.skip_if_rocm - def test_step(self): - """ Check that the ZeroRedundancyOptimizer wrapper properly exposes the `.step()` interface""" + @property + def context(self): + return suppress() if not torch.cuda.is_available() \ + else torch.cuda.device(self.rank) - if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2): - return + def _check_same_model_params( + self, + model_a: torch.nn.Module, + model_b: torch.nn.Module, + message: str = "", + ) -> None: + # Check that model parameters match + for p_a, p_b in zip(model_a.parameters(), model_b.parameters()): + torch.testing.assert_close( + p_a, p_b, atol=1e-3, rtol=1e-5, + msg=f"Model parameters differ:\n{p_a} {p_b}\n" + message, + ) + # Check that model buffers match + for b_a, b_b in zip(model_a.buffers(), model_b.buffers()): + torch.testing.assert_close( + b_a, b_b, + msg=f"Model buffers differ:\n{b_a} {b_b}\n" + message, + ) + @common_distributed.skip_if_no_gpu + @common_distributed.skip_if_rocm + def test_step(self): + """Check that ZeroRedundancyOptimizer properly exposes the ``step()`` + interface.""" self.dist_init(self.rank, world_size=self.world_size) + LR = 0.01 - context = suppress() if not torch.cuda.is_available() else torch.cuda.device(self.rank) - - with context: + with self.context: x = torch.tensor([float(self.rank + 1)], device=self.device) m = torch.nn.Linear(1, 1) m.weight.data = torch.tensor([[1.0]]) m.bias.data = torch.tensor([2.0]) - m_zero = copy.deepcopy(m) - m.to(self.device) - m_zero.to(self.device) + m = m.to(self.device) + m_zero = copy.deepcopy(m).to(self.device) - lr = 0.1 - o = SGD(m.parameters(), lr=lr) - o_zero = ZeroRedundancyOptimizer(m_zero.parameters(), optimizer_class=SGD, lr=lr) + o = SGD(m.parameters(), lr=LR) + o_zero = ZeroRedundancyOptimizer( + m_zero.parameters(), optimizer_class=SGD, lr=LR, + ) y = m(x) y.backward(x) @@ -364,24 +402,23 @@ def test_step(self): self.assertEqual(m.weight, m_zero.weight) self.assertEqual(m.bias, m_zero.bias) + @common_distributed.skip_if_no_gpu @common_distributed.skip_if_rocm def 
test_step_with_closure(self): - """ Check that the ZeroRedundancyOptimizer wrapper properly exposes the `.step(closure)` interface""" - - if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2): - return - + """Check that ZeroRedundancyOptimizer properly exposes the + ``step(closure)`` interface.""" self.dist_init(self.rank, world_size=self.world_size) - context = suppress() if not torch.cuda.is_available() else torch.cuda.device(self.rank) - - with context: + with self.context: for bucket_view in [False, True]: x_val = self.rank + 1 weight = 1.0 bias = 2.0 error = 1.0 - target = torch.tensor([x_val * weight + bias + error], device=self.device) + target = torch.tensor( + [x_val * weight + bias + error], + device=self.device, + ) loss_fn = torch.nn.L1Loss() x = torch.tensor([float(x_val)], device=self.device) @@ -416,32 +453,62 @@ def closure(): self.assertEqual(m.weight, torch.tensor([[1.1]])) self.assertEqual(m.bias, torch.tensor([2.1])) + @common_distributed.skip_if_no_gpu + def test_lr_scheduler(self): + """Check that a normal PyTorch ``lr_scheduler`` is usable with + ZeroRedundancyOptimizer.""" + self.dist_init(self.rank) + x = torch.tensor([1.0], device=self.device, requires_grad=True) + x2 = torch.tensor([1.0], device=self.device, requires_grad=True) + o = ZeroRedundancyOptimizer([x], optimizer_class=SGD, lr=0.01) + o2 = torch.optim.SGD([x2], lr=0.01) + s = torch.optim.lr_scheduler.StepLR(o, 1) + s2 = torch.optim.lr_scheduler.StepLR(o2, 1) + for _ in range(5): + x.backward() + o.zero_grad() + o.step() + s.step() + x2.backward() + o2.zero_grad() + o2.step() + s2.step() + self.assertEqual(x, x2) + def test_sharding(self): - """ Check the sharding at construction time + """ + Check ZeroRedundancyOptimizer's parameter sharding at construction + time. NOTE: The correctness of this test depends on the ZeRO implementation using the sorted-greedy partitioning algorithm. For details, see - `ZeroRedundancyOptimizer._partition_parameters()` in - `zero_redundancy_optimizer.py`. + ``ZeroRedundancyOptimizer._partition_parameters()`` in + zero_redundancy_optimizer.py. """ self.dist_init(self.rank) + LR = 0.01 sizes = [9, 7, 5, 3] params = [] for size in sizes * self.world_size: params.append(torch.rand(size, 1)) - o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=0.1) - self.assertEqual(sum([x.numel() for x in o.optim.param_groups[0]["params"]]), sum(sizes)) + o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=LR) + self.assertEqual( + sum([x.numel() for x in o.optim.param_groups[0]["params"]]), + sum(sizes), + ) def test_add_param_group(self): - """Check that ZeroRedundancyOptimizer properly handles adding a new param_group a posteriori, - and that all ranks get a shard + """Check that ZeroRedundancyOptimizer properly handles adding a new + parameter group a posteriori and that all ranks get a shard of the + contained parameters. NOTE: The correctness of this test depends on the ZeRO implementation using the sorted-greedy partitioning algorithm. For details, see - `ZeroRedundancyOptimizer._partition_parameters()` in - `zero_redundancy_optimizer.py`. + ``ZeroRedundancyOptimizer._partition_parameters()`` in + zero_redundancy_optimizer.py. 
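# The "sorted-greedy" partitioning referenced in the docstrings above can be
# summarized by the following sketch (an illustration only, not the actual
# implementation in zero_redundancy_optimizer.py): sort the parameters by
# size, largest first, and repeatedly assign the next one to the rank with the
# smallest total so far.
def _sorted_greedy_partition(param_sizes, world_size):
    totals = [0] * world_size
    partition = [[] for _ in range(world_size)]
    for size in sorted(param_sizes, reverse=True):
        rank = totals.index(min(totals))
        partition[rank].append(size)
        totals[rank] += size
    return partition
# With `sizes = [9, 7, 5, 3]` replicated `world_size` times, as in
# `test_sharding` above, this assigns exactly sum(sizes) = 24 elements to
# every rank, which is what that test asserts for its own shard.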
""" self.dist_init(self.rank) + LR = 0.01 # Test with all parameters trainable to begin with def all_trainable(): @@ -451,19 +518,26 @@ def all_trainable(): for size in sizes_world[:-1]: params.append(torch.rand(size, 1)) - # Make sure that the params are trainable, enforces size-based partitioning + # Make sure that the params are trainable so that they are factored + # into the size-based parameter partitioning for p in params: p.requires_grad = True - o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=0.1) - - assert len(o.param_groups) == 1 + o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=LR) + self.assertEqual(len(o.param_groups), 1) o.add_param_group({"params": [torch.rand(3, 1)]}) - - assert len(o.param_groups) == 2 - # Verify that added group is added to the correct partition making all have the same elements. - assert sum([x.numel() for g in o.optim.param_groups for x in g["params"]]) == sum(sizes) - assert len(o.optim.param_groups) == 2 + # Verify that new group is added to the correct partition, making + # all partitions have the same elements + self.assertEqual(len(o.param_groups), 2) + self.assertEqual( + sum([ + x.numel() + for g in o.optim.param_groups + for x in g["params"] + ]), + sum(sizes), + ) + self.assertEqual(len(o.optim.param_groups), 2) # Test a pathological config with a first big non-trainable param def some_trainable(): @@ -471,17 +545,16 @@ def some_trainable(): for size in [100, 3, 5, 2, 6, 4]: params.append(torch.rand(size, 1)) - # Make sure that the params are trainable, enforces size-based partitioning + # Make sure that all but the first param are trainable so that they + # are factored into the size-based parameter partitioning for p in params[1:]: p.requires_grad = True - o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=0.1) - - assert len(o.param_groups) == 1 + o = ZeroRedundancyOptimizer(params, optimizer_class=SGD, lr=LR) + self.assertEqual(len(o.param_groups), 1) o.add_param_group({"params": [torch.rand(3, 1)]}) - - assert len(o.param_groups) == 2 - assert len(o.optim.param_groups) == 2 + self.assertEqual(len(o.param_groups), 2) + self.assertEqual(len(o.optim.param_groups), 2) all_trainable() some_trainable() @@ -489,91 +562,91 @@ def some_trainable(): @common_distributed.skip_if_no_gpu def test_multiple_param_groups(self): """ - Tests parity between constructing ZeRO with multiple parameter groups + Check parity between constructing ZeRO with multiple parameter groups upfront versus adding parameter groups to ZeRO after construction versus a non-sharded optimizer. 
""" self.dist_init(self.rank) - + BATCH_SIZE, NUM_ITERS = 8, 3 + INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM = 5, 10, 5 + WD, LR = 0.01, 0.01 model1 = torch.nn.Sequential( - torch.nn.Linear(5, 10), - torch.nn.Linear(10, 10), - torch.nn.Linear(10, 5), + torch.nn.Linear(INPUT_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, OUTPUT_DIM), ) model2 = copy.deepcopy(model1) model3 = copy.deepcopy(model1) model1 = model1.to(self.device) model2 = model2.to(self.device) model3 = model3.to(self.device) - - batch_size = 8 - num_iters = 3 inputs = [ - torch.randn(batch_size, 5).to(self.device) for _ in range(num_iters) + torch.randn(BATCH_SIZE, INPUT_DIM).to(self.device) + for _ in range(NUM_ITERS) ] - wd = 0.01 - lr = 0.01 # Construct `optim1` with both parameter groups upfront optim1 = ZeroRedundancyOptimizer( [ {"params": [l.weight for l in model1], "weight_decay": 0.}, - {"params": [l.bias for l in model1], "weight_decay": wd}, + {"params": [l.bias for l in model1], "weight_decay": WD}, ], - optimizer_class=AdamW, lr=lr, + optimizer_class=AdamW, lr=LR, ) # Construct `optim2` by adding the second parameter after optim2 = ZeroRedundancyOptimizer( [l.weight for l in model2], - optimizer_class=AdamW, lr=lr, weight_decay=0., + optimizer_class=AdamW, lr=LR, weight_decay=0., ) optim2.add_param_group( - {"params": [l.bias for l in model2], "weight_decay": wd} + {"params": [l.bias for l in model2], "weight_decay": WD} ) # Construct `optim3` as a non-sharded optimizer optim3 = AdamW( [ {"params": [l.weight for l in model3], "weight_decay": 0.}, - {"params": [l.bias for l in model3], "weight_decay": wd}, - ], lr=lr, + {"params": [l.bias for l in model3], "weight_decay": WD}, + ], lr=LR, ) - # Check parity over a few iterations - for iter in range(num_iters): + for input in inputs: for model, optim in ( (model1, optim1), (model2, optim2), (model3, optim3), ): optim.zero_grad() - out = model(inputs[iter]) + out = model(input) loss = out.sum() loss.backward() optim.step() - for layer1, layer2, layer3 in zip(model1, model2, model3): - assert torch.allclose(layer1.weight, layer2.weight) - assert torch.allclose(layer1.weight, layer3.weight) - assert torch.allclose(layer1.bias, layer2.bias) - assert torch.allclose(layer1.bias, layer3.bias) + torch.testing.assert_close(layer1.weight, layer2.weight) + torch.testing.assert_close(layer1.weight, layer3.weight) + torch.testing.assert_close(layer1.bias, layer2.bias) + torch.testing.assert_close(layer1.bias, layer3.bias) - @common_distributed.skip_if_lt_x_gpu(2) + @common_distributed.skip_if_no_gpu @common_distributed.skip_if_rocm def test_collect_shards(self): - """ Check the state consolidation mechanism, and the state dict exposed by ZeroRedundancyOptimizer""" + """Check the state consolidation mechanism and the state dict exposed + by ZeroRedundancyOptimizer.""" self.dist_init(self.rank) - RECIPIENT_RANK = 0 - - # Run a dummy step so that the optimizer state dict exists - batch, input_width, hidden, target_width = 3, 20, 10, 5 - target = torch.rand((batch, target_width), device=self.device) - inputs = torch.rand((batch, input_width), device=self.device) - - model = torch.nn.Sequential(torch.nn.Linear(input_width, hidden), torch.nn.Linear(hidden, target_width)) - model.to(self.device) - + LR = 1e-3 + MOMENTUM = 0.99 + BATCH_SIZE, INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM = 3, 20, 10, 5 + REFERENCE_RANK = 0 + target = torch.rand((BATCH_SIZE, OUTPUT_DIM), device=self.device) + inputs = torch.rand((BATCH_SIZE, INPUT_DIM), device=self.device) + model 
= torch.nn.Sequential( + torch.nn.Linear(INPUT_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, OUTPUT_DIM), + ).to(self.device) loss_fn = torch.nn.L1Loss() loss_fn.to(self.device) - - # With SGD, Momentum is required to get a state to shard - optimizer = ZeroRedundancyOptimizer(model.parameters(), optimizer_class=SGD, lr=0.1, momentum=0.99) + optimizer = ZeroRedundancyOptimizer( + model.parameters(), + optimizer_class=SGD, + lr=LR, + momentum=MOMENTUM, # ensure there exists state to shard + ) def closure(): optimizer.zero_grad() @@ -582,56 +655,78 @@ def closure(): loss.backward() return loss + # Run a dummy step so that the optimizer state dict exists _ = optimizer.step(closure=closure) - # Update the optimizer state on the reference rank - optimizer.consolidate_state_dict(to=RECIPIENT_RANK) - - # Fetch the state on the reference rank - # - check that it has the correct size - # - load it again - if self.rank == RECIPIENT_RANK: + # Get the optimizer state on the reference rank + optimizer.consolidate_state_dict(to=REFERENCE_RANK) + if self.rank == REFERENCE_RANK: + # Check that the state has the correct size optimizer_state_dict = optimizer.state_dict() - self.assertEqual(len(optimizer_state_dict["state"]), len(list(model.parameters()))) + self.assertEqual( + len(optimizer_state_dict["state"]), + len(list(model.parameters())), + ) else: optimizer_state_dict = {} + # Load the optimizer state on all ranks without any exceptions optimizer_state_dict = _broadcast_object( optimizer_state_dict, - src_rank=RECIPIENT_RANK, + src_rank=REFERENCE_RANK, group=dist.group.WORLD, device=self.device, ) - - # Load the optimizer state dict, check that no exception is raised optimizer.load_state_dict(optimizer_state_dict) - @sandcastle_skip_if( - IS_WINDOWS, - "Test is flaky on windows: https://github.com/pytorch/pytorch/issues/66059" - ) - def test_multiple_groups(self): - """ Check that the ZeroRedundancyOptimizer handles working with multiple process groups""" - self.dist_init(self.rank, self.world_size, dist.Backend.GLOO) - - # Only work with the even ranks, to check that the global_rank indexing is properly used - sub_group_ranks = list(filter(lambda x: x % 2 == 0, range(self.world_size))) - process_group = torch.distributed.new_group(ranks=sub_group_ranks, backend="gloo") + def test_nondefault_process_group(self): + """Check that ZeroRedundancyOptimizer works with a non-default process + group consisting only of even ranks.""" + # Skip the test if below the minimum world size since then the test is + # trivial + MIN_WORLD_SIZE = 4 + if self.world_size < MIN_WORLD_SIZE: + common_distributed.logger.info( + "Skipping `test_nondefault_process_group()` since world size " + f"of {self.world_size} is less than {MIN_WORLD_SIZE}" + ) + return + BACKEND = dist.Backend.GLOO + self.dist_init(self.rank, self.world_size, BACKEND) + # Use GPU if enough are available, or fall back to CPU otherwise, which + # is fine since Gloo backend supports both + if torch.cuda.is_available() and \ + torch.cuda.device_count() >= self.world_size: + device = torch.device(self.rank) + else: + device = torch.device("cpu") + # Create a new process group consisting of the even ranks to exercise + # the case where the global and local ranks do not necessarily match + subgroup_ranks = [r for r in range(self.world_size) if r % 2 == 0] + process_group = dist.new_group( + ranks=subgroup_ranks, backend=BACKEND, + ) + # Ranks not participating in the new process group are no longer needed + if self.rank not in subgroup_ranks: + return - # Make 
sure that all the ranks get different training data - # So that the sync check in between their models is meaningful + # Set different seeds across ranks so that each rank gets different + # training data and hence the model sync check is meaningful torch.manual_seed(self.rank) np.random.seed(self.rank) - # Standard deep learning setup - epochs, batch, input_width, hidden, target_width = 5, 3, 20, 10, 5 - loss_fn = torch.nn.L1Loss().to(self.device) + EPOCHS, BATCH_SIZE, INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM = 5, 3, 20, 10, 5 + LR = 1e-3 + MOMENTUM = 0.99 + REFERENCE_RANK = 0 + assert REFERENCE_RANK in subgroup_ranks, \ + "Reference rank must be in the new process group" + loss_fn = torch.nn.L1Loss().to(device) def check(optimizer): - # Just run a couple of epochs, check that the model is properly updated - for _ in range(epochs): - target = torch.rand((batch, target_width), device=self.device) - inputs = torch.rand((batch, input_width), device=self.device) + for _ in range(EPOCHS): + target = torch.rand((BATCH_SIZE, OUTPUT_DIM), device=device) + inputs = torch.rand((BATCH_SIZE, INPUT_DIM), device=device) def closure(): optimizer.zero_grad() @@ -639,167 +734,189 @@ def closure(): loss = loss_fn(output, target) loss /= self.world_size loss.backward() - dist.all_reduce(loss, group=process_group) # Not strictly needed for the test below - + dist.all_reduce(loss, group=process_group) return loss _ = optimizer.step(closure=closure) - # Check that all the params are the same on all ranks + # Check that the parameters match across ranks after a step for pg in optimizer.param_groups: for p in pg["params"]: - receptacle = [p.clone() for _ in sub_group_ranks] if self.rank == 0 else [] - dist.gather(p, receptacle, dst=0, group=process_group) - if self.rank == 0: - for sync_p in receptacle[1:]: - assert torch.all(torch.eq(receptacle[0], sync_p)), "Models differ in between ranks" - - if self.rank in sub_group_ranks: - # Model fitting in the broadcast bucket - model = torch.nn.Sequential( - torch.nn.Linear(input_width, hidden), - torch.nn.Linear(hidden, target_width), - ).to(self.device) + receptacle = [ + p.clone() for _ in subgroup_ranks + ] if self.rank == REFERENCE_RANK else [] + dist.gather( + p, receptacle, dst=REFERENCE_RANK, + group=process_group, + ) + if self.rank == REFERENCE_RANK: + reference_param = receptacle[0] + for param in receptacle[1:]: + torch.testing.assert_close( + reference_param, + param, + msg="Models differ between ranks", + ) - # With SGD, Momentum is required to get a state to shard - optimizer = ZeroRedundancyOptimizer( - model.parameters(), optimizer_class=SGD, lr=0.1, momentum=0.99, process_group=process_group - ) - check(optimizer) + model = torch.nn.Sequential( + torch.nn.Linear(INPUT_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, OUTPUT_DIM), + ).to(device) + optimizer = ZeroRedundancyOptimizer( + model.parameters(), + optimizer_class=SGD, + lr=LR, + momentum=MOMENTUM, # ensure there exists state to shard + process_group=process_group, + ) + check(optimizer) - # Model not-fitting in the broadcast bucket + @common_distributed.skip_if_no_gpu + @parametrize( + "optimizer_class_str", + ["Adam", "AdamW", "SGD"], + # Use string to appease the internal test name parser + ) + @parametrize( + "maximize", + [False, True], + ) + def test_local_optimizer_parity( + self, + optimizer_class_str: str, + maximize: bool, + ): + """When combined with DDP, check that a local optimizer gives the same + results as wrapping that optimizer with ZeroRedundancyOptimizer.""" + 
self.dist_init(self.rank) + BATCHES = 20 + BATCH_SIZE = 64 + LR = 1e-3 + INPUT_DIM = 2 + HIDDEN_DIM = 3 + OUTPUT_DIM = 3 + torch.manual_seed(self.rank) + np.random.seed(self.rank) + if optimizer_class_str == "Adam": + optimizer_class = torch.optim.Adam + elif optimizer_class_str == "AdamW": + optimizer_class = torch.optim.AdamW + elif optimizer_class_str == "SGD": + optimizer_class = torch.optim.SGD + else: + assert 0, f"Unsupported optimizer class: {optimizer_class_str}" + + with self.context: + # Define a base model with a different buffer for each rank model = torch.nn.Sequential( - torch.nn.Linear(input_width, hidden), - torch.nn.Linear(hidden, target_width), + torch.nn.Linear(INPUT_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, HIDDEN_DIM), + torch.nn.Linear(HIDDEN_DIM, OUTPUT_DIM), ).to(self.device) - - # With SGD, Momentum is required to get a state to shard - optimizer = ZeroRedundancyOptimizer( - model.parameters(), - optimizer_class=SGD, - lr=0.1, - momentum=0.99, - process_group=process_group, + model.register_buffer( + "test_buffer", torch.ones((1), device=self.device) * self.rank, + ) + # Define models/optimizers for DDP with ZeRO and DDP with local + # optimizer + defaults = {"maximize": True} if maximize else {} + sharded_optimizer = ZeroRedundancyOptimizer( + params=model.parameters(), optimizer_class=optimizer_class, + lr=LR, **defaults, + ) + sharded_ddp_model = DDP( + module=model, device_ids=[self.rank], + broadcast_buffers=True, find_unused_parameters=True, + ) + local_model = copy.deepcopy(model).to(self.device) + ddp_optimizer = optimizer_class( + local_model.parameters(), lr=LR, **defaults, + ) + ddp_model = DDP( + local_model, device_ids=[self.rank], + broadcast_buffers=True, find_unused_parameters=True, + ) + # Check that the model is properly synchronized between ranks + # at construction time + self._check_same_model_params( + sharded_ddp_model, ddp_model, + "Models differ from the start", ) - check(optimizer) - - @common_distributed.skip_if_no_gpu - def test_local_optimizer_parity(self): - """When combined with DDP, check that ZeroRedundancyOptimizer(optimizer) and the same monolithic optimizer - give the exact same results - """ - self.dist_init(self.rank) - BATCHS = 20 - - with torch.cuda.device(self.rank): - torch.manual_seed(self.rank) - np.random.seed(self.rank) - - def check_optimizer_equivalence(optimizer: Type[torch.optim.Optimizer], maximize: bool = False): - # Any model works. 
Add one different buffer per rank - model = torch.nn.Sequential( - torch.nn.Linear(2, 3), - torch.nn.Linear(3, 3), - torch.nn.Linear(3, 3), - ) - model.register_buffer("test_buffer", torch.ones((1)) * self.rank) - model.to(self.device) + def check_step(): + input_tensor = torch.rand((BATCH_SIZE, INPUT_DIM)) - defaults = dict() + def closure_ddp(input_tensor=input_tensor): + ddp_optimizer.zero_grad() + ddp_loss = ddp_model(input_tensor).abs().sum() + ddp_loss.backward() + return ddp_loss - if maximize: - defaults['maximize'] = True + def closure_sharded(input_tensor=input_tensor): + sharded_optimizer.zero_grad() + sharded_loss = sharded_ddp_model(input_tensor).abs().sum() + sharded_loss.backward() + return sharded_loss - sharded_optimizer = ZeroRedundancyOptimizer( - params=model.parameters(), optimizer_class=optimizer, lr=1e-3, **defaults + loss_ddp = cast( + torch.Tensor, ddp_optimizer.step(closure=closure_ddp), ) - sharded_ddp_model = DDP( - module=model, device_ids=[self.rank], broadcast_buffers=True, find_unused_parameters=True + loss_sharded_optim = cast( + torch.Tensor, + sharded_optimizer.step(closure=closure_sharded), ) - - ddp_model_single = copy.deepcopy(model) - ddp_model_single.to(self.device) - - ddp_optimizer = optimizer(ddp_model_single.parameters(), lr=1e-3, **defaults) - ddp_model = DDP( - ddp_model_single, device_ids=[self.rank], broadcast_buffers=True, find_unused_parameters=True + torch.testing.assert_close( + loss_ddp, loss_sharded_optim, + msg="Losses differ between local optimizer and ZeRO", + ) + self._check_same_model_params( + sharded_ddp_model, ddp_model, + "Models differ after a step", ) - # The model should be synchronized in between the ranks at construction time, check that - check_same_model_params(sharded_ddp_model, ddp_model, "Models differ from the start") - - def check_step(): - input_tensor = torch.rand((64, 2)) - - def closure_ddp(input_tensor=input_tensor): - ddp_optimizer.zero_grad() - ddp_loss = ddp_model(input_tensor).abs().sum() - ddp_loss.backward() - return ddp_loss - - def closure_sharded(input_tensor=input_tensor): - sharded_optimizer.zero_grad() - sharded_loss = sharded_ddp_model(input_tensor).abs().sum() - sharded_loss.backward() - return sharded_loss - - loss_ddp = cast(torch.Tensor, ddp_optimizer.step(closure=closure_ddp)) - loss_sharded_optim = cast(torch.Tensor, sharded_optimizer.step(closure=closure_sharded)) - - assert torch.allclose( - loss_ddp, loss_sharded_optim - ), "Losses differ in between Pytorch optim and ZeroRedundancyOptimizer" - - check_same_model_params(sharded_ddp_model, ddp_model, "Models differ after a step") - - # The models should stay the same in between the ranks - for i in range(BATCHS): - check_step() - - # Change the models trainability, check that parity is maintained - # only check after a couple of constant batchs to go through both regimes - if i > BATCHS // 2: - next(ddp_model.parameters()).requires_grad = bool(i % 2) - next(sharded_ddp_model.parameters()).requires_grad = bool(i % 2) - - # Check that the checkpoints are compatible - reference_rank = 0 - # - get states - ddp_state_dict = ddp_optimizer.state_dict() - sharded_optimizer.consolidate_state_dict(to=reference_rank) - sharded_optim_state_dict = [sharded_optimizer.state_dict() if self.rank == reference_rank else {}] - dist.broadcast_object_list(sharded_optim_state_dict, src=reference_rank, group=dist.group.WORLD) - sharded_optim_state_dict = sharded_optim_state_dict[0] - - # - cross load the states - # run one step and check that the models are still 
the same - ddp_state_dict_ref = copy.deepcopy(ddp_state_dict) # OSS will remove some states - ddp_optimizer.load_state_dict(sharded_optim_state_dict) # mixup on purpose ! - sharded_optimizer.load_state_dict(ddp_state_dict) - check_step() - - # - self load, rewind, check no problem - # run one step and check that the models are still the same - ddp_optimizer.load_state_dict(ddp_state_dict_ref) - sharded_optimizer.load_state_dict(sharded_optim_state_dict) + # Check that parity is maintained + for i in range(BATCHES): check_step() + # For the second half of batches, change the parameter + # trainability to further test parity + if i > BATCHES // 2: + next(ddp_model.parameters()).requires_grad = bool(i % 2) + next(sharded_ddp_model.parameters()).requires_grad = bool(i % 2) + + # Check that the `state_dict` checkpoints are compatible between + # the local optimizer and ZeRO + REFERENCE_RANK = 0 + # - Get states + ddp_state_dict = ddp_optimizer.state_dict() + sharded_optimizer.consolidate_state_dict(to=REFERENCE_RANK) + sharded_optim_state_dict = [ + sharded_optimizer.state_dict() + if self.rank == REFERENCE_RANK else {} + ] + dist.broadcast_object_list( + sharded_optim_state_dict, src=REFERENCE_RANK, + group=dist.group.WORLD, + ) + sharded_optim_state_dict = sharded_optim_state_dict[0] - for opt in [torch.optim.Adam, torch.optim.AdamW, torch.optim.SGD]: - for maximize in (True, False): - check_optimizer_equivalence(opt, maximize=maximize) + # - Cross-load the states + # Run one step and check that the models are still the same + ddp_state_dict_ref = copy.deepcopy(ddp_state_dict) + ddp_optimizer.load_state_dict(sharded_optim_state_dict) + sharded_optimizer.load_state_dict(ddp_state_dict) + check_step() + # - Reload their respective states + # Run one step and check that the models are still the same + ddp_optimizer.load_state_dict(ddp_state_dict_ref) + sharded_optimizer.load_state_dict(sharded_optim_state_dict) + check_step() def _test_zero_join(self, device): - r""" - Check that the ZeRO join hook allows training with uneven inputs when using the given device. - - Arguments: - device (torch.device): device used to store parameters and perform - collective communications. 
- """ + """Check that the ZeRO join hook allows training with uneven inputs + when using the given device.""" NUM_INPUTS = 3 NUM_EPOCHS = 2 + LR = 0.01 torch.manual_seed(0) torch.cuda.manual_seed(0) @@ -808,8 +925,6 @@ def _test_zero_join(self, device): is_gpu = device.type == "cuda" backend = _get_backend_for_tests() if is_gpu else dist.Backend.GLOO self.dist_init(rank, world_size, backend) - if is_gpu: - torch.cuda.set_device(self.device) model = torch.nn.Sequential( torch.nn.Linear(2, 3), @@ -822,14 +937,18 @@ def _test_zero_join(self, device): # local optimizers on uneven inputs should be equivalent to ZeRO on # uneven inputs with gradients being manually set ddp_model = DDP(model, device_ids=[rank]) if is_gpu else DDP(model) - local_optim = torch.optim.Adam(ddp_model.parameters(), lr=0.01) + local_optim = torch.optim.Adam(ddp_model.parameters(), lr=LR) zero_model = copy.deepcopy(model) zero_model.to(device) - zero_optim = ZeroRedundancyOptimizer(zero_model.parameters(), torch.optim.Adam, lr=0.01) + zero_optim = ZeroRedundancyOptimizer( + zero_model.parameters(), torch.optim.Adam, lr=LR, + ) loss_fn = torch.nn.MSELoss() # Use uneven inputs: rank i has i extra inputs - inputs = [torch.randn(20, 2).to(device) for _ in range(NUM_INPUTS + rank)] + inputs = [ + torch.randn(20, 2).to(device) for _ in range(NUM_INPUTS + rank) + ] labels = torch.randn(20, 3).to(device) # Save the gradients and parameters from DDP as the ground truth; do @@ -856,7 +975,10 @@ def _test_zero_join(self, device): # Broadcast the saved gradients and parameters to all of the other # ranks (which joined early) grads_and_params = [grads_at_each_iter, params_at_each_iter] - grads_and_params = _broadcast_object(grads_and_params, src_rank=world_size - 1, group=dist.group.WORLD, device=device) + grads_and_params = _broadcast_object( + grads_and_params, src_rank=world_size - 1, group=dist.group.WORLD, + device=device, + ) grads_at_each_iter = grads_and_params[0] params_at_each_iter = grads_and_params[1] # TODO: Replace this `_broadcast_object` with `broadcast_object_list` @@ -877,8 +999,9 @@ def __init__(self, zero_optim, grads): super().__init__() def main_hook(self): - grads = self.zero._join_grad_info.grads[self.zero._join_grad_info.index] - self.zero._join_grad_info.index += 1 + join_grad_info = self.zero._join_grad_info + grads = self.zero._join_grad_info.grads[join_grad_info.index] + join_grad_info.index += 1 for p, grad in zip(self.zero._all_params, grads): p.grad = grad.detach().clone().to(device) @@ -905,39 +1028,48 @@ def join_process_group(self): grads = grads_at_each_iter[-num_grads_after_joining:] gradient_setter = _GradientSetter() iter = 0 - with Join([gradient_setter, zero_optim], zero_optim=zero_optim, grads=grads): + with Join( + [gradient_setter, zero_optim], zero_optim=zero_optim, grads=grads, + ): for _ in range(NUM_EPOCHS): for input in inputs: # Notify join context that this process has not joined Join.notify_join_context(gradient_setter) - # Set gradients manually - for p, grad in zip(zero_model.parameters(), grads_at_each_iter[iter]): + for p, grad in zip( + zero_model.parameters(), grads_at_each_iter[iter], + ): p.grad = grad.detach().clone().to(device) - # Perform optimizer step and check parity zero_optim.step() - for p, ddp_p in zip(zero_model.parameters(), params_at_each_iter[iter]): - assert torch.allclose(p, ddp_p), \ - "Parameters differ between using ZeRO and local optimizer" + for p, ddp_p in zip( + zero_model.parameters(), params_at_each_iter[iter], + ): + torch.testing.assert_close( + p, 
ddp_p, + msg="Parameters differ between using ZeRO and " + "local optimizer", + ) iter += 1 @common_distributed.requires_nccl() - @common_distributed.skip_if_lt_x_gpu(2) + @common_distributed.skip_if_no_gpu def test_zero_join_gpu(self): - """Check that the ZeRO join hook allows training with uneven inputs on GPU.""" + """Check that the ZeRO join hook allows training with uneven inputs + on GPU.""" self._test_zero_join(self.device) @common_distributed.requires_gloo() def test_zero_join_cpu(self): - """Check that the ZeRO join hook allows training with uneven inputs on CPU.""" + """Check that the ZeRO join hook allows training with uneven inputs + on CPU.""" self._test_zero_join(torch.device("cpu")) def _test_zero_model_parallel(self, parameters_as_bucket_view: bool): # Use two processes each with two GPUs assert self.rank < 2 - NUM_EPOCHS = 3 - NUM_INPUTS = 5 + NUM_EPOCHS = 2 + NUM_INPUTS = 4 LR = 0.01 torch.manual_seed(0) torch.cuda.manual_seed(0) @@ -967,17 +1099,20 @@ def __init__(self): def forward(self, x): return self.net1(self.relu(self.net0(x))) - dev0 = 2 * self.rank - dev1 = 2 * self.rank + 1 + dev0 = torch.device(2 * self.rank) + dev1 = torch.device(2 * self.rank + 1) mp_model = ModelParallelModel(dev0, dev1) ddp_model = DDP(mp_model) - local_model = LocalModel() - cpu_device = torch.device("cpu") + local_model = LocalModel().to(dev0) + # Ensure the parameters are the same across the two models - local_model.net0.weight = torch.nn.Parameter(mp_model.net0.weight.detach().clone().to(cpu_device)) - local_model.net0.bias = torch.nn.Parameter(mp_model.net0.bias.detach().clone().to(cpu_device)) - local_model.net1.weight = torch.nn.Parameter(mp_model.net1.weight.detach().clone().to(cpu_device)) - local_model.net1.bias = torch.nn.Parameter(mp_model.net1.bias.detach().clone().to(cpu_device)) + def copy_param(p): + return torch.nn.Parameter(p.detach().clone().to(dev0)) + + local_model.net0.weight = copy_param(mp_model.net0.weight) + local_model.net0.bias = copy_param(mp_model.net0.bias) + local_model.net1.weight = copy_param(mp_model.net1.weight) + local_model.net1.bias = copy_param(mp_model.net1.bias) # Compare parity between DDP with model parallelism using ZeRO and # a local model using a local optimizer @@ -985,10 +1120,10 @@ def forward(self, x): ddp_model.parameters(), optimizer_class=torch.optim.Adam, parameters_as_bucket_view=parameters_as_bucket_view, - lr=LR + lr=LR, ) local_optim = torch.optim.Adam(local_model.parameters(), lr=LR) - inputs = [torch.randn(20, 10) for _ in range(NUM_INPUTS)] + inputs = [torch.randn(20, 10).to(dev0) for _ in range(NUM_INPUTS)] for _ in range(NUM_EPOCHS): for input in inputs: @@ -1004,40 +1139,42 @@ def closure_ddp(): ddp_loss.backward() return ddp_loss - local_loss = cast(torch.Tensor, local_optim.step(closure=closure_local)) - ddp_loss = cast(torch.Tensor, zero_optim.step(closure=closure_ddp)).to(cpu_device) - - # Increased tolerances are needed to pass test when using TensorFloat32 - # see https://github.com/pytorch/pytorch/issues/67764 - assert torch.allclose( - local_loss, ddp_loss, rtol=1e-03 - ), "Losses differ between local optim and ZeroRedundancyOptimizer" + local_loss = cast( + torch.Tensor, local_optim.step(closure=closure_local) + ) + ddp_loss = cast( + torch.Tensor, zero_optim.step(closure=closure_ddp) + ) - for local_p, ddp_p in zip(local_model.parameters(), ddp_model.parameters()): - ddp_p = ddp_p.to(cpu_device) - assert torch.allclose(local_p, ddp_p, rtol=1e-03, atol=1e-04), "Models differ after a step" + # Increased tolerances are 
needed to pass when using TF32 + # See: https://github.com/pytorch/pytorch/issues/67764 + torch.testing.assert_close( + local_loss.cpu(), ddp_loss.cpu(), rtol=1e-03, atol=1e-08, + ), "Losses differ between local optimizer and ZeRO" - @common_distributed.skip_if_lt_x_gpu(4) - def test_zero_model_parallel_with_bucket_view(self): - """ - Check that ZeRO works with model parallelism where layers are sharded - across devices when ``parameters_as_bucket_view=True``. - """ - if self.rank >= 2: - return - self.dist_init(self.rank, world_size=2) - self._test_zero_model_parallel(parameters_as_bucket_view=True) + for local_p, ddp_p in zip( + local_model.parameters(), + ddp_model.parameters() + ): + torch.testing.assert_close( + local_p.cpu(), ddp_p.cpu(), rtol=1e-03, atol=1e-04, + ), "Models differ after a step" @common_distributed.skip_if_lt_x_gpu(4) - def test_zero_model_parallel_without_bucket_view(self): - """ - Check that ZeRO works with model parallelism where layers are sharded - across devices when ``parameters_as_bucket_view=False``. - """ + @parametrize( + "parameters_as_bucket_view", + [False, True], + ) + def test_zero_model_parallel( + self, + parameters_as_bucket_view: bool, + ): + """Check that ZeRO works with model parallelism where the model's + layers are assigned to different devices.""" if self.rank >= 2: return self.dist_init(self.rank, world_size=2) - self._test_zero_model_parallel(parameters_as_bucket_view=False) + self._test_zero_model_parallel(parameters_as_bucket_view) def _test_ddp_zero_overlap( self, @@ -1058,22 +1195,21 @@ def _test_ddp_zero_overlap( is_gpu = device.type == "cuda" if is_gpu: torch.cuda.set_device(device) - models_to_test = [ - ( - torch.nn.Sequential( - torch.nn.Linear(1000, 2000), - torch.nn.Linear(2000, 500) - ), - [torch.randn(1, 1000).to(device) for _ in range(NUM_INPUTS)] + models_to_test = [( + torch.nn.Sequential( + torch.nn.Linear(1000, 2000), + torch.nn.Linear(2000, 500), ), - ] + [torch.randn(1, 1000).to(device) for _ in range(NUM_INPUTS)], + )] if HAS_TORCHVISION: - models_to_test.append( - ( - torchvision.models.resnet50(), - [torch.randn(1, 3, 3, 1000).to(device) for _ in range(NUM_INPUTS)] - ) - ) + models_to_test.append(( + torchvision.models.resnet50(), + [ + torch.randn(1, 3, 3, 1000).to(device) + for _ in range(NUM_INPUTS) + ] + )) for (model, inputs) in models_to_test: # Enable determinism in cudnn operators with torch.backends.cudnn.flags( @@ -1098,7 +1234,10 @@ def _test_ddp_zero_overlap( ) ddp_model_overlap.register_comm_hook( None, - hook_constructor(allreduce_hook, ddp_model_overlap, zero_optim, **kwargs) + hook_constructor( + allreduce_hook, ddp_model_overlap, zero_optim, + **kwargs, + ) ) # Set up the DDP model with local optimizer @@ -1163,120 +1302,68 @@ def _test_ddp_zero_overlap( self.assertEqual(p1, p2) # Check that the parameters were updated - self.assertNotEqual(init_params_overlap, list(ddp_model_overlap.parameters())) + self.assertNotEqual( + init_params_overlap, list(ddp_model_overlap.parameters()), + ) # Ensure that this test runs independently dist.barrier() + # NOTE: The test is skipped if using Windows since functional optimizers + # are not currently supported. @common_distributed.skip_if_win32() @common_distributed.requires_nccl() @common_distributed.skip_if_no_gpu @common_distributed.skip_if_rocm - def test_ddp_with_zero_step_parity_gpu(self): - r""" - Check that overlapping DDP with ZeRO using ``hook_with_zero_step()`` - achieves parity with DDP using a local optimizer when running on GPU. 
- - NOTE: The test is skipped if using Windows since functional optimizers - are not currently supported. + @parametrize( + "use_gpu", + [True], + # Add `False` once the Gloo sync issue causing hangs is fixed + # See: https://github.com/pytorch/pytorch/issues/62300 + ) + @parametrize( + "use_interleaved_hook", + [False, True], + ) + @parametrize( + "gradient_as_bucket_view", + [False, True], + ) + @parametrize( + "static_graph", + [False, True], + ) + @parametrize( + "shard_buckets", + [False, True], + ) + def test_ddp_zero_overlap( + self, + use_gpu: bool, + use_interleaved_hook: bool, + gradient_as_bucket_view: bool, + static_graph: bool, + shard_buckets: bool, + ): """ - self.dist_init(self.rank, self.world_size, dist.Backend.NCCL) - for gradient_as_bucket_view, static_graph in itertools.product( - [True, False], - [True, False] - ): - self._test_ddp_zero_overlap( - torch.device(self.rank), - hook_with_zero_step, - gradient_as_bucket_view, - static_graph - ) - # TODO: Add `test_ddp_with_zero_step_parity_cpu()` once the Gloo - # synchronization issue causing hangs is fixed. - - @common_distributed.skip_if_win32() - @common_distributed.requires_nccl() - @common_distributed.skip_if_no_gpu - @common_distributed.skip_if_rocm - def test_ddp_with_zero_step_interleaved_parity_gpu(self): - r""" - Check that overlapping DDP with ZeRO using - ``hook_with_zero_step_interleaved()`` achieves parity with DDP using a - local optimizer when running on GPU. - - NOTE: The test is skipped if using Windows since functional optimizers - are not currently supported. + Check that overlapping DDP with ZeRO using the given method determined + by ``hook_constructor`` and ``shard_buckets`` and using the given ZeRO + and DDP arguments achieves parity with DDP using a local optimizer. """ - self.dist_init(self.rank, self.world_size, dist.Backend.NCCL) - for gradient_as_bucket_view, static_graph in itertools.product( - [True, False], - [True, False] - ): - self._test_ddp_zero_overlap( - torch.device(self.rank), - hook_with_zero_step_interleaved, - gradient_as_bucket_view, - static_graph - ) - # TODO: Add `test_ddp_with_zero_step_interleaved_parity_cpu()` once the - # Gloo synchronization issue causing hangs is fixed. + device = torch.device(self.rank) if use_gpu else torch.device("cpu") + backend = _get_backend_for_tests() + self.dist_init(self.rank, self.world_size, backend) + hook_constructor = hook_with_zero_step if not use_interleaved_hook \ + else hook_with_zero_step_interleaved + self._test_ddp_zero_overlap( + device, hook_constructor, gradient_as_bucket_view, static_graph, + shard_buckets=shard_buckets, + ) - @common_distributed.skip_if_win32() - @common_distributed.requires_nccl() - @common_distributed.skip_if_no_gpu - @common_distributed.skip_if_rocm - def test_ddp_with_zero_step_uniform_parity_gpu(self): - r""" - Check that overlapping DDP with ZeRO using - ``hook_with_zero_step()`` with ``shard_buckets=True`` - achieves parity with DDP using a local optimizer when running on GPU. - - NOTE: The test is skipped if using Windows since functional optimizers - are not currently supported. 
- """ - self.dist_init(self.rank, self.world_size, dist.Backend.NCCL) - for gradient_as_bucket_view, static_graph in itertools.product( - [True, False], - [True, False] - ): - self._test_ddp_zero_overlap( - torch.device(self.rank), - hook_with_zero_step, - gradient_as_bucket_view, - static_graph, - shard_buckets=True, - ) - # TODO: Add `test_ddp_with_zero_step_uniform_parity_cpu()` once the Gloo - # synchronization issue causing hangs is fixed. - @common_distributed.skip_if_win32() - @common_distributed.requires_nccl() - @common_distributed.skip_if_no_gpu - @common_distributed.skip_if_rocm - def test_ddp_with_zero_step_interleaved_uniform_parity_gpu(self): - r""" - Check that overlapping DDP with ZeRO using - ``hook_with_zero_step()`` with ``shard_buckets=True`` - achieves parity with DDP using a local optimizer when running on GPU. - - NOTE: The test is skipped if using Windows since functional optimizers - are not currently supported. - """ - self.dist_init(self.rank, self.world_size, dist.Backend.NCCL) - for gradient_as_bucket_view, static_graph in itertools.product( - [True, False], - [True, False] - ): - self._test_ddp_zero_overlap( - torch.device(self.rank), - hook_with_zero_step_interleaved, - gradient_as_bucket_view, - static_graph, - shard_buckets=True, - ) - # TODO: Add `test_ddp_with_zero_step_interleaved_uniform_parity_cpu()` once - # the Gloo synchronization issue causing hangs is fixed. +instantiate_parametrized_tests(TestZeroRedundancyOptimizerSingleRank) +instantiate_parametrized_tests(TestZeroRedundancyOptimizerDistributed) if __name__ == "__main__": # ! unittest should not be used here, else the tests are not properly registered - common_utils.run_tests() + run_tests() diff --git a/test/distributed/test_c10d_common.py b/test/distributed/test_c10d_common.py index 3bdb0fc15e0411..5c29f1fd448d84 100644 --- a/test/distributed/test_c10d_common.py +++ b/test/distributed/test_c10d_common.py @@ -9,6 +9,7 @@ from datetime import timedelta from itertools import product from sys import platform +from contextlib import suppress import torch import torch.distributed as dist @@ -18,6 +19,7 @@ sys.exit(0) import torch.distributed.distributed_c10d as c10d +from torch.utils.checkpoint import checkpoint import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD import torch.nn.functional as F import torch.testing._internal.common_utils as common @@ -25,12 +27,16 @@ from torch.nn.parallel import DistributedDataParallel from torch.testing._internal.common_distributed import ( MultiProcessTestCase, + skip_if_lt_x_gpu, ) + from torch.testing._internal.common_utils import ( TestCase, load_tests, run_tests, TEST_WITH_DEV_DBG_ASAN, + instantiate_parametrized_tests, + parametrize ) if TEST_WITH_DEV_DBG_ASAN: @@ -238,7 +244,7 @@ def forward(self, x): return F.softmax(self.embedding(x), dim=1) -class AbstractDistributedDataParallelTest(object): +class CommonDistributedDataParallelTest(object): def tearDown(self): # DistributedDataParallel test doesn't seem to call FileStore destructor # TODO: investigate this test and the test is known to have issues @@ -307,6 +313,363 @@ def _prepare_multi_device_module( return model, ddp_model, input, target + def _get_store(self): + return dist.FileStore(self.file_name, self.world_size) + + def _get_process_group(self): + raise NotImplementedError("To be implemented by child class") + + def _train_model(self, model, input_var, target, loss, run_checkpoint=False, use_reentrant=True): + model.train() + if run_checkpoint: + output = 
checkpoint(model, input_var, use_reentrant=use_reentrant) + else: + output = model(input_var) + l = loss(output, target) + l.backward() + + def _test_ddp_checkpointing( + self, + input_model, + process_group, + use_bucket_view, + find_unused_parameters=False, + static_graph=False, + run_checkpoint=False, + use_reentrant=True, + allow_none_grads=False, + ): + # to reproduce the same training results + torch.cuda.set_device(self.rank) + torch.manual_seed(31415) + model = copy.deepcopy(input_model).cuda() + ddp_model = copy.deepcopy(input_model).cuda() + ddp_model = nn.parallel.DistributedDataParallel( + ddp_model, + bucket_cap_mb=1, + gradient_as_bucket_view=use_bucket_view, + device_ids=[self.rank], + process_group=process_group, + find_unused_parameters=find_unused_parameters, + static_graph=static_graph, + ) + self.assertEqual( + ddp_model._get_ddp_logging_data().get("static_graph", 0), static_graph + ) + input, ddp_input, target, ddp_target = self._prepare_dummy_data() + loss = nn.MSELoss() + n_iters = 5 + for i in range(n_iters): + model.zero_grad(set_to_none=False) + ddp_model.zero_grad(set_to_none=False) + self._train_model(model, input, target, loss, run_checkpoint=run_checkpoint, use_reentrant=use_reentrant) + self._train_model( + ddp_model, ddp_input, ddp_target, loss, run_checkpoint=run_checkpoint, use_reentrant=use_reentrant + ) + for i, j in zip(model.parameters(), ddp_model.parameters()): + if not allow_none_grads: + self.assertTrue(i.grad is not None) + self.assertTrue(j.grad is not None) + self.assertEqual(i.grad, j.grad, rtol=1.3e-06, atol=5e-5) + + # A list of tests for ddp with activation checkpointing + # when gradient_as_bucket_view=True, False. + # Most of the tests are referred to + # https://github.com/facebookresearch/fairscale/blob/main/tests/nn/pipe/test_checkpoint_ddp.py + class CheckpointOnceModule(nn.Module): + """ + Runs checkpoint for a single layer in the model. + """ + def __init__(self, use_reentrant=True): + super().__init__() + self.l1 = nn.Linear(20, 20) + self.l2 = nn.Linear(20, 20) + self.use_reentrant = use_reentrant + + def forward(self, inp): + x = self.l1(inp) + x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) + return x + + class CheckpointTwiceModule(CheckpointOnceModule): + """ + Runs checkpoint for the same layer twice in a model. This simulates use + cases such as pipeline parallel where the same layer can be checkpointed + more than one time. + """ + def __init__(self, use_reentrant=True): + super().__init__(use_reentrant=use_reentrant) + + def forward(self, inp): + x = self.l1(inp) + x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) + x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) + return x + + class CheckpointTwiceModuleWeightSharing(CheckpointTwiceModule): + """ + Similar to CheckpointTwiceModule but the weights are shared. 
+ """ + def __init__(self, use_reentrant=True): + super().__init__(use_reentrant=use_reentrant) + # Share weights + self.l1.weight = self.l2.weight + + def forward(self, inp): + x = self.l1(inp) + x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) + x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) + return x + + + class DynamicCheckpointTwiceModule(CheckpointTwiceModule): + def __init__(self, use_reentrant=True): + super().__init__(use_reentrant=use_reentrant) + self.count = 0 + + def forward(self, inp): + if self.count % 2: + x = checkpoint(self.l1, inp, use_reentrant=self.use_reentrant) + else: + x = checkpoint(self.l2, inp, use_reentrant=self.use_reentrant) + + self.count += 1 + return x + + class DynamicCheckpointTwiceModuleWeightSharing(DynamicCheckpointTwiceModule): + def __init__(self, use_reentrant=True): + super().__init__(use_reentrant=use_reentrant) + # Share weights + self.l1.weight = self.l2.weight + + + def _prepare_dummy_data(self): + ddp_bs = 16 + bs = ddp_bs * self.world_size + input = torch.rand((bs, 20), device="cuda", requires_grad=True) + target = torch.randn((bs, 20), device="cuda") + offset = self.rank * ddp_bs + ddp_input = input[offset : offset + ddp_bs] + ddp_target = target[offset : offset + ddp_bs] + return input, ddp_input, target, ddp_target + + + @skip_if_lt_x_gpu(2) + @parametrize("use_reentrant", [True, False]) + def test_ddp_checkpointing_once(self, use_reentrant): + """ + DDP works as expected when layer is checkpointed only once. + """ + process_group = self._get_process_group() + for use_bucket_view, static_graph in product((False, True), (False, True)): + self._test_ddp_checkpointing( + self.CheckpointOnceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=static_graph, + ) + if static_graph: + # find_unused_parameters does not make a difference, since it is + # ignored for static graph. + self._test_ddp_checkpointing( + self.CheckpointOnceModule(), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=static_graph, + find_unused_parameters=True, + ) + + @skip_if_lt_x_gpu(2) + @parametrize("use_reentrant", [True, False]) + def test_ddp_checkpointing_unused_params(self, use_reentrant): + """ + With reentrant autograd checkpointing impl, DDP will fail when there are + unused params in the model and no static graph training. With + non-reentrant checkpointing implementation, this works as expected. + """ + process_group = self._get_process_group() + for use_bucket_view in (True, False): + err_ctx = ( + suppress() if not use_reentrant else + self.assertRaisesRegex( + RuntimeError, + "Expected to mark a variable ready only once." + ) + ) + with err_ctx: + model = self._test_ddp_checkpointing( + self.CheckpointOnceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + find_unused_parameters=True, + ) + # test passes when static_graph is true + model = self._test_ddp_checkpointing( + self.CheckpointOnceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + find_unused_parameters=True, + static_graph=True, + ) + + @skip_if_lt_x_gpu(2) + @parametrize("use_reentrant", [True, False]) + def test_ddp_checkpointing_twice(self, use_reentrant): + """ + Checkpoitning twice fails for non-static graph with reentrant checkpoint + implementation, succeeds with non-reentrant checkpoint implementation. 
+ """ + process_group = self._get_process_group() + for use_bucket_view in (True, False): + err_ctx = ( + suppress() if not use_reentrant else + self.assertRaisesRegex( + RuntimeError, + "Expected to mark a variable ready only once." + ) + ) + with err_ctx: + model = self._test_ddp_checkpointing( + self.CheckpointTwiceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=False, + ) + + with err_ctx: + model = self._test_ddp_checkpointing( + self.CheckpointTwiceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=False, + find_unused_parameters=True, + ) + + @skip_if_lt_x_gpu(2) + @parametrize("use_reentrant", [True, False]) + def test_ddp_checkpointing_twice_static_graph(self, use_reentrant): + """ + Regardless of reentrant or non-reentrant checkpointing impl, + checkpointing twice works with static graph enabled. + """ + process_group = self._get_process_group() + for use_bucket_view in (True, False): + # Test passes when static_graph=True. + model = self._test_ddp_checkpointing( + self.CheckpointTwiceModule(use_reentrant=use_reentrant), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=True, + ) + + @skip_if_lt_x_gpu(2) + def test_ddp_checkpointing_dynamic_module(self): + """ + Dynamic module can be checkpointed, multiple times, with non-reentrant + checkpointing implementation. + """ + process_group = self._get_process_group() + for use_bucket_view in (True, False): + model = self._test_ddp_checkpointing( + self.DynamicCheckpointTwiceModule(use_reentrant=False), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=False, + find_unused_parameters=True, + # Grads can be none sometimes due to dynamic module not using + # all params. + allow_none_grads=True + ) + + @skip_if_lt_x_gpu(2) + def test_ddp_checkpointing_dynamic_weight_sharing(self): + """ + Dynamic module can be checkpointed multiple times with weight sharing + using non-reentrant checkpointing implementation. + """ + process_group = self._get_process_group() + for use_bucket_view in (True, False): + model = self._test_ddp_checkpointing( + self.DynamicCheckpointTwiceModuleWeightSharing(use_reentrant=False), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=False, + find_unused_parameters=True, + # Grads can be none sometimes due to dynamic module not using + # all params. + allow_none_grads=True + ) + + # DDP works as expected if there is weight sharing among layers + @skip_if_lt_x_gpu(2) + @parametrize("use_reentrant", [True, False]) + def test_ddp_checkpointing_weight_sharing(self, use_reentrant): + """ + Test that checkpointing with weight sharing works. + """ + process_group = self._get_process_group() + torch.cuda.set_device(self.rank) + for use_bucket_view, static_graph in product((False, True), (False, True)): + torch.manual_seed(31415) + l1 = nn.Linear(20, 20) + l2 = nn.Linear(20, 20) + l1.weight = l2.weight + model = nn.Sequential(l1, l2) + # TODO: non-reentrant based checkpointing of DDP module with + # static_graph runs into the below issue, see + # https://github.com/pytorch/pytorch/issues/70865 and + # https://github.com/pytorch/pytorch/issues/58111 for details. 
+ err_ctx = ( + self.assertRaisesRegex( + RuntimeError, + "Your training graph has changed in this iteration" + ) if static_graph and not use_reentrant else suppress() + ) + with err_ctx: + self._test_ddp_checkpointing( + model, + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=static_graph, + run_checkpoint=True, + use_reentrant=use_reentrant, + ) + + @skip_if_lt_x_gpu(2) + def test_ddp_checkpointing_twice_weight_sharing(self): + """ + Checkpointing should work with static graph in the case of checkpointing + same layer twice and having weights shared acrosss layers. + """ + process_group = self._get_process_group() + torch.cuda.set_device(self.rank) + for use_bucket_view in (True, False): + model = self._test_ddp_checkpointing( + self.CheckpointTwiceModuleWeightSharing(), + process_group=process_group, + use_bucket_view=use_bucket_view, + static_graph=True, + ) + + def test_invalid_powerSGD_state(self): + for start_powerSGD_iter, use_error_feedback, warm_start in product( + [0, 1], [True, False], [True, False] + ): + if not use_error_feedback and not warm_start: + continue + with self.assertRaisesRegex( + ValueError, + "Expect `start_powerSGD_iter` > 1 if `use_error_feedback` or `warm_start` is enabled, " + "because PowerSGD can only be applied after the first two iterations in DDP.", + ): + state = powerSGD.PowerSGDState( + process_group=None, + matrix_approximation_rank=1, + start_powerSGD_iter=start_powerSGD_iter, + use_error_feedback=use_error_feedback, + warm_start=warm_start, + ) + def _test_ddp_with_process_group( self, process_group, @@ -443,33 +806,101 @@ def fut_then(fut): return fut.then(fut_then) + def _test_not_nan(self, model, x): + y = model(x) + self.assertFalse(y.isnan().any().item()) + y.sum().backward() + for p in model.parameters(): + self.assertFalse(p.grad.isnan().any().item()) + + @skip_if_lt_x_gpu(2) + def test_sync_batch_norm_only_empty_input(self): + pg = self._get_process_group() + + model = torch.nn.Sequential( + nn.BatchNorm2d(2), + ).to(device=self.rank) + model = DistributedDataParallel( + model, + device_ids=[self.rank], + process_group=pg, + ) + model = nn.SyncBatchNorm.convert_sync_batchnorm( + model, + process_group=pg, + ) -class DistributedDataParallelTest( - AbstractDistributedDataParallelTest, MultiProcessTestCase -): - def setUp(self): - super(DistributedDataParallelTest, self).setUp() - self._spawn_processes() + model.train() - def test_invalid_powerSGD_state(self): - for start_powerSGD_iter, use_error_feedback, warm_start in product( - [0, 1], [True, False], [True, False] - ): - if not use_error_feedback and not warm_start: - continue - with self.assertRaisesRegex( - ValueError, - "Expect `start_powerSGD_iter` > 1 if `use_error_feedback` or `warm_start` is enabled, " - "because PowerSGD can only be applied after the first two iterations in DDP.", - ): - state = powerSGD.PowerSGDState( - process_group=None, - matrix_approximation_rank=1, - start_powerSGD_iter=start_powerSGD_iter, - use_error_feedback=use_error_feedback, - warm_start=warm_start, - ) + # only rank 0 receives empty inputs + x = torch.zeros( + (1 if self.rank != 0 else 0, 2, 11, 13), + dtype=torch.float32, + device=self.rank + ) + + # input requires grad, this will trigger the collective communication + # in the backward pass + x.requires_grad = True + self._test_not_nan(model, x) + + # input does not requires grad + x.requires_grad = False + self._test_not_nan(model, x) + + # all ranks receive empty inputs + x = torch.zeros( + (0, 2, 11, 13), + 
dtype=torch.float32, + device=self.rank + ) + + # input requires grad, this will trigger the collective communication + # in the backward pass + x.requires_grad = True + self._test_not_nan(model, x) + + # input does not requires grad + x.requires_grad = False + self._test_not_nan(model, x) + + @skip_if_lt_x_gpu(2) + def test_sync_batch_norm_empty_input(self): + pg = self._get_process_group() + + model = torch.nn.Sequential( + nn.Conv2d(2, 2, 3), + nn.BatchNorm2d(2), + nn.Linear(28, 2), + ).to(device=self.rank) + model = DistributedDataParallel( + model, + device_ids=[self.rank], + process_group=pg, + ) + model = nn.SyncBatchNorm.convert_sync_batchnorm( + model, + process_group=pg, + ) + + model.train() + # only rank 0 receives empty inputs + x = torch.zeros( + (3 if self.rank != 0 else 0, 2, 30, 30), + dtype=torch.float32, + device=self.rank + ) + self._test_not_nan(model, x) + + # all ranks receive empty inputs + x = torch.zeros( + (0, 2, 30, 30), + dtype=torch.float32, + device=self.rank + ) + + self._test_not_nan(model, x) class ComputeBucketAssignmentTest(TestCase): def test_single_limit_single_dtype(self): @@ -892,6 +1323,8 @@ def test_send_recv(self): # user applications would explicitly that. +instantiate_parametrized_tests(CommonDistributedDataParallelTest) + if __name__ == "__main__": assert ( diff --git a/test/distributed/test_c10d_gloo.py b/test/distributed/test_c10d_gloo.py index 9cd515fb05cbf7..22b5d7a98f6cf7 100644 --- a/test/distributed/test_c10d_gloo.py +++ b/test/distributed/test_c10d_gloo.py @@ -1136,8 +1136,14 @@ def _test_allgather_stress(self, inputs, fn): [[torch.tensor([i + j]) for j in range(self.world_size)]] for i in range(len(inputs)) ] + input_holder = {} for i in range(len(inputs)): - fut = pg.allgather(outputs[i], [fn(inputs[i])]).get_future() + # Note that this works around the data race discussed in + # https://github.com/pytorch/pytorch/issues/75529, but we should + # actually be able to pass the list directly into allgather when + # that race is fixed. 
+ input_holder[i] = [fn(inputs[i])] + fut = pg.allgather(outputs[i], input_holder[i]).get_future() future_handles.append(fut) for i, future_handle in enumerate(future_handles): @@ -1457,12 +1463,16 @@ def create(num, prefix): class DistributedDataParallelTest( - test_c10d_common.AbstractDistributedDataParallelTest, MultiProcessTestCase + test_c10d_common.CommonDistributedDataParallelTest, MultiProcessTestCase ): def setUp(self): super(DistributedDataParallelTest, self).setUp() self._spawn_processes() + def _get_process_group(self): + store = self._get_store() + return c10d.ProcessGroupGloo(store, self.rank, self.world_size) + def _test_gloo_backend( self, devices, device_ids, multi_device=False, gradient_as_bucket_view=False ): diff --git a/test/distributed/test_c10d_nccl.py b/test/distributed/test_c10d_nccl.py index afe3a7cc19a374..e9eca078960aaa 100644 --- a/test/distributed/test_c10d_nccl.py +++ b/test/distributed/test_c10d_nccl.py @@ -9,7 +9,7 @@ import tempfile import threading import time -from contextlib import contextmanager, suppress +from contextlib import contextmanager from datetime import timedelta from itertools import product from unittest import mock @@ -49,11 +49,8 @@ TEST_WITH_DEV_DBG_ASAN, TEST_WITH_ROCM, sandcastle_skip, - instantiate_parametrized_tests, - parametrize, sandcastle_skip_if, ) -from torch.utils.checkpoint import checkpoint if TEST_WITH_DEV_DBG_ASAN: print( @@ -949,7 +946,7 @@ def allreduce(tensors): class DistributedDataParallelTest( - test_c10d_common.AbstractDistributedDataParallelTest, MultiProcessTestCase + test_c10d_common.CommonDistributedDataParallelTest, MultiProcessTestCase ): def setUp(self): super(DistributedDataParallelTest, self).setUp() @@ -958,6 +955,10 @@ def setUp(self): os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1" self._spawn_processes() + def _get_process_group(self): + store = self._get_store() + return c10d.ProcessGroupNCCL(store, self.rank, self.world_size) + def _test_nccl_backend( self, devices, device_ids, multi_device=False, gradient_as_bucket_view=False ): @@ -2216,349 +2217,6 @@ def test_ddp_weight_sharing(self): ), ) - # A list of tests for ddp with activation checkpointing - # when gradient_as_bucket_view=True, False. - # Most of the tests are referred to - # https://github.com/facebookresearch/fairscale/blob/main/tests/nn/pipe/test_checkpoint_ddp.py - class CheckpointOnceModule(nn.Module): - """ - Runs checkpoint for a single layer in the model. - """ - def __init__(self, use_reentrant=True): - super().__init__() - self.l1 = nn.Linear(20, 20) - self.l2 = nn.Linear(20, 20) - self.use_reentrant = use_reentrant - - def forward(self, inp): - x = self.l1(inp) - x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) - return x - - class CheckpointTwiceModule(CheckpointOnceModule): - """ - Runs checkpoint for the same layer twice in a model. This simulates use - cases such as pipeline parallel where the same layer can be checkpointed - more than one time. - """ - def __init__(self, use_reentrant=True): - super().__init__(use_reentrant=use_reentrant) - - def forward(self, inp): - x = self.l1(inp) - x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) - x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) - return x - - class CheckpointTwiceModuleWeightSharing(CheckpointTwiceModule): - """ - Similar to CheckpointTwiceModule but the weights are shared. 
- """ - def __init__(self, use_reentrant=True): - super().__init__(use_reentrant=use_reentrant) - - def forward(self, inp): - x = self.l1(inp) - x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) - x = checkpoint(self.l2, x, use_reentrant=self.use_reentrant) - return x - - - class DynamicCheckpointTwiceModule(CheckpointTwiceModule): - def __init__(self, use_reentrant=True): - super().__init__(use_reentrant=use_reentrant) - self.count = 0 - - def forward(self, inp): - if self.count % 2: - x = checkpoint(self.l1, inp, use_reentrant=self.use_reentrant) - else: - x = checkpoint(self.l2, inp, use_reentrant=self.use_reentrant) - - self.count += 1 - return x - - class DynamicCheckpointTwiceModuleWeightSharing(DynamicCheckpointTwiceModule): - def __init__(self, use_reentrant=True): - super().__init__(use_reentrant=use_reentrant) - self.l1.weight = self.l2.weight - - - def _prepare_dummy_data(self): - ddp_bs = 16 - bs = ddp_bs * self.world_size - input = torch.rand((bs, 20), device="cuda", requires_grad=True) - target = torch.randn((bs, 20), device="cuda") - offset = self.rank * ddp_bs - ddp_input = input[offset : offset + ddp_bs] - ddp_target = target[offset : offset + ddp_bs] - return input, ddp_input, target, ddp_target - - def _train_model(self, model, input_var, target, loss, run_checkpoint=False, use_reentrant=True): - model.train() - if run_checkpoint: - output = checkpoint(model, input_var, use_reentrant=use_reentrant) - else: - output = model(input_var) - l = loss(output, target) - l.backward() - - def _test_ddp_checkpointing( - self, - input_model, - process_group, - use_bucket_view, - find_unused_parameters=False, - static_graph=False, - run_checkpoint=False, - use_reentrant=True, - allow_none_grads=False, - ): - # to reproduce the same training results - torch.cuda.set_device(self.rank) - torch.manual_seed(31415) - model = copy.deepcopy(input_model).cuda() - ddp_model = copy.deepcopy(input_model).cuda() - ddp_model = nn.parallel.DistributedDataParallel( - ddp_model, - bucket_cap_mb=1, - gradient_as_bucket_view=use_bucket_view, - device_ids=[self.rank], - process_group=process_group, - find_unused_parameters=find_unused_parameters, - static_graph=static_graph, - ) - self.assertEqual( - ddp_model._get_ddp_logging_data().get("static_graph", 0), static_graph - ) - input, ddp_input, target, ddp_target = self._prepare_dummy_data() - loss = nn.MSELoss() - n_iters = 5 - for i in range(n_iters): - model.zero_grad(set_to_none=False) - ddp_model.zero_grad(set_to_none=False) - self._train_model(model, input, target, loss, run_checkpoint=run_checkpoint, use_reentrant=use_reentrant) - self._train_model( - ddp_model, ddp_input, ddp_target, loss, run_checkpoint=run_checkpoint, use_reentrant=use_reentrant - ) - for i, j in zip(model.parameters(), ddp_model.parameters()): - if not allow_none_grads: - self.assertTrue(i.grad is not None) - self.assertTrue(j.grad is not None) - self.assertEqual(i.grad, j.grad, rtol=1.3e-06, atol=5e-5) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - @parametrize("use_reentrant", [True, False]) - def test_ddp_checkpointing_once(self, use_reentrant): - """ - DDP works as expected when layer is checkpointed only once. 
- """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view, static_graph in product((False, True), (False, True)): - self._test_ddp_checkpointing( - self.CheckpointOnceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=static_graph, - ) - if static_graph: - # find_unused_parameters does not make a difference, since it is - # ignored for static graph. - self._test_ddp_checkpointing( - self.CheckpointOnceModule(), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=static_graph, - find_unused_parameters=True, - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - @parametrize("use_reentrant", [True, False]) - def test_ddp_checkpointing_unused_params(self, use_reentrant): - """ - With reentrant autograd checkpointing impl, DDP will fail when there are - unused params in the model and no static graph training. With - non-reentrant checkpointing implementation, this works as expected. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view in (True, False): - err_ctx = ( - suppress() if not use_reentrant else - self.assertRaisesRegex( - RuntimeError, - "Expected to mark a variable ready only once." - ) - ) - with err_ctx: - model = self._test_ddp_checkpointing( - self.CheckpointOnceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - find_unused_parameters=True, - ) - # test passes when static_graph is true - model = self._test_ddp_checkpointing( - self.CheckpointOnceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - find_unused_parameters=True, - static_graph=True, - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - @parametrize("use_reentrant", [True, False]) - def test_ddp_checkpointing_twice(self, use_reentrant): - """ - Checkpoitning twice fails for non-static graph with reentrant checkpoint - implementation, succeeds with non-reentrant checkpoint implementation. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view in (True, False): - err_ctx = ( - suppress() if not use_reentrant else - self.assertRaisesRegex( - RuntimeError, - "Expected to mark a variable ready only once." - ) - ) - with err_ctx: - model = self._test_ddp_checkpointing( - self.CheckpointTwiceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=False, - ) - - with err_ctx: - model = self._test_ddp_checkpointing( - self.CheckpointTwiceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=False, - find_unused_parameters=True, - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - @parametrize("use_reentrant", [True, False]) - def test_ddp_checkpointing_twice_static_graph(self, use_reentrant): - """ - Regardless of reentrant or non-reentrant checkpointing impl, - checkpointing twice works with static graph enabled. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view in (True, False): - # Test passes when static_graph=True. 
- model = self._test_ddp_checkpointing( - self.CheckpointTwiceModule(use_reentrant=use_reentrant), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=True, - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - def test_ddp_checkpointing_dynamic_module(self): - """ - Dynamic module can be checkpointed, multiple times, with non-reentrant - checkpointing implementation. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view in (True, False): - model = self._test_ddp_checkpointing( - self.DynamicCheckpointTwiceModule(use_reentrant=False), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=False, - find_unused_parameters=True, - # Grads can be none sometimes due to dynamic module not using - # all params. - allow_none_grads=True - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - def test_ddp_checkpointing_dynamic_weight_sharing(self): - """ - Dynamic module can be checkpointed multiple times with weight sharing - using non-reentrant checkpointing implementation. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - for use_bucket_view in (True, False): - model = self._test_ddp_checkpointing( - self.DynamicCheckpointTwiceModuleWeightSharing(use_reentrant=False), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=False, - find_unused_parameters=True, - # Grads can be none sometimes due to dynamic module not using - # all params. - allow_none_grads=True - ) - - # DDP works as expected if there is weight sharing among layers - @requires_nccl() - @skip_if_lt_x_gpu(2) - @parametrize("use_reentrant", [True, False]) - def test_ddp_checkpointing_weight_sharing(self, use_reentrant): - """ - Test that checkpointing with weight sharing works. - """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - torch.cuda.set_device(self.rank) - for use_bucket_view, static_graph in product((False, True), (False, True)): - torch.manual_seed(31415) - l1 = nn.Linear(20, 20) - l2 = nn.Linear(20, 20) - l1.weight = l2.weight - model = nn.Sequential(l1, l2) - # TODO: non-reentrant based checkpointing of DDP module with - # static_graph runs into the below issue, see - # https://github.com/pytorch/pytorch/issues/70865 and - # https://github.com/pytorch/pytorch/issues/58111 for details. - err_ctx = ( - self.assertRaisesRegex( - RuntimeError, - "Your training graph has changed in this iteration" - ) if static_graph and not use_reentrant else suppress() - ) - with err_ctx: - self._test_ddp_checkpointing( - model, - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=static_graph, - run_checkpoint=True, - use_reentrant=use_reentrant, - ) - - @requires_nccl() - @skip_if_lt_x_gpu(2) - def test_ddp_checkpointing_twice_weight_sharing(self): - """ - Checkpointing should work with static graph in the case of checkpointing - same layer twice and having weights shared acrosss layers. 
- """ - store = c10d.FileStore(self.file_name, self.world_size) - process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size) - torch.cuda.set_device(self.rank) - for use_bucket_view in (True, False): - model = self._test_ddp_checkpointing( - self.CheckpointTwiceModuleWeightSharing(), - process_group=process_group, - use_bucket_view=use_bucket_view, - static_graph=True, - ) class NcclErrorHandlingTest(MultiProcessTestCase): @@ -3053,8 +2711,6 @@ def test_nccl_warn_not_in_group_debug_info(self): def test_nccl_warn_not_in_group_debug_off(self): self._test_warn_not_in_group(backend="nccl") -instantiate_parametrized_tests(DistributedDataParallelTest) - if __name__ == "__main__": assert ( not torch.cuda._initialized diff --git a/test/distributed/test_data_parallel.py b/test/distributed/test_data_parallel.py index c1720344e49dc3..3aeff9062909ae 100644 --- a/test/distributed/test_data_parallel.py +++ b/test/distributed/test_data_parallel.py @@ -17,6 +17,7 @@ from torch.testing._internal.common_utils import _assertGradAndGradgradChecks, gradcheck from torch.testing._internal.common_utils import dtype2prec_DONTUSE from torch.testing._internal.common_utils import sandcastle_skip_if +from torch.testing._internal.common_utils import TEST_WITH_ROCM import torch.nn.functional as F torch.set_default_dtype(torch.double) @@ -784,6 +785,7 @@ class TestDataParallelDeviceType(TestCase): @onlyCUDA @skipMeta + @sandcastle_skip_if(TEST_WITH_ROCM, "Failing on few archs, temporarily skipped") @dtypes(torch.float, torch.double, torch.half) def test_data_parallel_module(self, device, dtype): l = nn.Linear(10, 5).to(device, dtype) @@ -796,6 +798,7 @@ def test_data_parallel_module(self, device, dtype): @onlyCUDA @skipMeta + @sandcastle_skip_if(TEST_WITH_ROCM, "Failing on few archs, temporarily skipped") @dtypes(torch.float, torch.double, torch.half) def test_data_parallel_module_kwargs_only(self, device, dtype): class Net(nn.Module): @@ -816,6 +819,7 @@ def forward(self, input): @onlyCUDA @skipMeta + @sandcastle_skip_if(TEST_WITH_ROCM, "Failing on few archs, temporarily skipped") @dtypes(torch.float, torch.double, torch.half) def test_data_parallel_module_kwargs_only_empty_list(self, device, dtype): class Net(nn.Module): @@ -836,6 +840,7 @@ def forward(self, input): @onlyCUDA @skipMeta + @sandcastle_skip_if(TEST_WITH_ROCM, "Failing on few archs, temporarily skipped") @dtypes(torch.float, torch.double, torch.half) def test_data_parallel_module_kwargs_only_empty_dict(self, device, dtype): class Net(nn.Module): @@ -856,6 +861,7 @@ def forward(self, input): @onlyCUDA @skipMeta + @sandcastle_skip_if(TEST_WITH_ROCM, "Failing on few archs, temporarily skipped") @dtypes(torch.float, torch.double, torch.half) def test_data_parallel_module_kwargs_only_empty_tuple(self, device, dtype): class Net(nn.Module): diff --git a/test/distributed/test_store.py b/test/distributed/test_store.py index 02484585c68e2e..6744ab16995d24 100644 --- a/test/distributed/test_store.py +++ b/test/distributed/test_store.py @@ -404,6 +404,14 @@ def test_common_errors(self): gen = dist.rendezvous("tcp://127.0.0.1:23456?rank=0") next(gen) + def test_dns_timeout(self): + with self.assertRaisesRegex(TimeoutError, "client socket has timed out after.*dnsnotexist"): + gen = dist.rendezvous( + "tcp://dnsnotexist:23456?world_size=2&rank=0", + timeout=timedelta(seconds=1), + ) + next(gen) + @retry_on_connect_failures def test_nominal(self): url = self.create_tcp_url() diff --git a/test/distributions/test_distributions.py 
b/test/distributions/test_distributions.py index 37128792ae28c3..55a227d0327054 100644 --- a/test/distributions/test_distributions.py +++ b/test/distributions/test_distributions.py @@ -34,6 +34,7 @@ from collections import namedtuple from itertools import product from random import shuffle +from packaging import version import torch @@ -2220,39 +2221,41 @@ def test_multivariate_normal_moments(self): # We applied same tests in Multivariate Normal distribution for Wishart distribution def test_wishart_shape(self): - df = (torch.rand(5, requires_grad=True) + 1) * 10 - df_no_batch = (torch.rand([], requires_grad=True) + 1) * 10 - df_multi_batch = (torch.rand(6, 5, requires_grad=True) + 1) * 10 + ndim = 3 + + df = torch.rand(5, requires_grad=True) + ndim + df_no_batch = torch.rand([], requires_grad=True) + ndim + df_multi_batch = torch.rand(6, 5, requires_grad=True) + ndim # construct PSD covariance - tmp = torch.randn(3, 10) + tmp = torch.randn(ndim, 10) cov = (torch.matmul(tmp, tmp.t()) / tmp.size(-1)).requires_grad_() prec = cov.inverse().requires_grad_() scale_tril = torch.linalg.cholesky(cov).requires_grad_() # construct batch of PSD covariances - tmp = torch.randn(6, 5, 3, 10) + tmp = torch.randn(6, 5, ndim, 10) cov_batched = (tmp.unsqueeze(-2) * tmp.unsqueeze(-3)).mean(-1).requires_grad_() prec_batched = cov_batched.inverse() scale_tril_batched = torch.linalg.cholesky(cov_batched) # ensure that sample, batch, event shapes all handled correctly - self.assertEqual(Wishart(df, cov).sample().size(), (5, 3, 3)) - self.assertEqual(Wishart(df_no_batch, cov).sample().size(), (3, 3)) - self.assertEqual(Wishart(df_multi_batch, cov).sample().size(), (6, 5, 3, 3)) - self.assertEqual(Wishart(df, cov).sample((2,)).size(), (2, 5, 3, 3)) - self.assertEqual(Wishart(df_no_batch, cov).sample((2,)).size(), (2, 3, 3)) - self.assertEqual(Wishart(df_multi_batch, cov).sample((2,)).size(), (2, 6, 5, 3, 3)) - self.assertEqual(Wishart(df, cov).sample((2, 7)).size(), (2, 7, 5, 3, 3)) - self.assertEqual(Wishart(df_no_batch, cov).sample((2, 7)).size(), (2, 7, 3, 3)) - self.assertEqual(Wishart(df_multi_batch, cov).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) - self.assertEqual(Wishart(df, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) - self.assertEqual(Wishart(df_no_batch, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) - self.assertEqual(Wishart(df_multi_batch, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) - self.assertEqual(Wishart(df, precision_matrix=prec).sample((2, 7)).size(), (2, 7, 5, 3, 3)) - self.assertEqual(Wishart(df, precision_matrix=prec_batched).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) - self.assertEqual(Wishart(df, scale_tril=scale_tril).sample((2, 7)).size(), (2, 7, 5, 3, 3)) - self.assertEqual(Wishart(df, scale_tril=scale_tril_batched).sample((2, 7)).size(), (2, 7, 6, 5, 3, 3)) + self.assertEqual(Wishart(df, cov).sample().size(), (5, ndim, ndim)) + self.assertEqual(Wishart(df_no_batch, cov).sample().size(), (ndim, ndim)) + self.assertEqual(Wishart(df_multi_batch, cov).sample().size(), (6, 5, ndim, ndim)) + self.assertEqual(Wishart(df, cov).sample((2,)).size(), (2, 5, ndim, ndim)) + self.assertEqual(Wishart(df_no_batch, cov).sample((2,)).size(), (2, ndim, ndim)) + self.assertEqual(Wishart(df_multi_batch, cov).sample((2,)).size(), (2, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df, cov).sample((2, 7)).size(), (2, 7, 5, ndim, ndim)) + self.assertEqual(Wishart(df_no_batch, cov).sample((2, 7)).size(), (2, 7, ndim, ndim)) + self.assertEqual(Wishart(df_multi_batch, 
cov).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df_no_batch, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df_multi_batch, cov_batched).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df, precision_matrix=prec).sample((2, 7)).size(), (2, 7, 5, ndim, ndim)) + self.assertEqual(Wishart(df, precision_matrix=prec_batched).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) + self.assertEqual(Wishart(df, scale_tril=scale_tril).sample((2, 7)).size(), (2, 7, 5, ndim, ndim)) + self.assertEqual(Wishart(df, scale_tril=scale_tril_batched).sample((2, 7)).size(), (2, 7, 6, 5, ndim, ndim)) # check gradients # Modified and applied the same tests for multivariate_normal @@ -2278,14 +2281,19 @@ def gradcheck_func(samples, nu, sigma, prec, scale_tril): wishart_log_prob_gradcheck(df_no_batch, None, None, scale_tril_batched) def test_wishart_stable_with_precision_matrix(self): - x = torch.randn(10) + ndim = 10 + x = torch.randn(ndim) P = torch.exp(-(x - x.unsqueeze(-1)) ** 2) # RBF kernel - Wishart(torch.tensor(10), precision_matrix=P) + Wishart(torch.tensor(ndim), precision_matrix=P) @unittest.skipIf(not TEST_NUMPY, "Numpy not found") def test_wishart_log_prob(self): - df = (torch.rand([], requires_grad=True) + 1) * 10 - tmp = torch.randn(3, 10) + ndim = 3 + df = torch.rand([], requires_grad=True) + ndim - 1 + # SciPy allowed ndim -1 < df < ndim for Wishar distribution after version 1.7.0 + if version.parse(scipy.__version__) < version.parse("1.7.0"): + df += 1. + tmp = torch.randn(ndim, 10) cov = (torch.matmul(tmp, tmp.t()) / tmp.size(-1)).requires_grad_() prec = cov.inverse().requires_grad_() scale_tril = torch.linalg.cholesky(cov).requires_grad_() @@ -2297,7 +2305,7 @@ def test_wishart_log_prob(self): dist3 = Wishart(df, scale_tril=scale_tril) ref_dist = scipy.stats.wishart(df.item(), cov.detach().numpy()) - x = dist1.sample((10,)) + x = dist1.sample((1000,)) expected = ref_dist.logpdf(x.transpose(0, 2).numpy()) self.assertEqual(0.0, np.mean((dist1.log_prob(x).detach().numpy() - expected)**2), atol=1e-3, rtol=0) @@ -2305,14 +2313,17 @@ def test_wishart_log_prob(self): self.assertEqual(0.0, np.mean((dist3.log_prob(x).detach().numpy() - expected)**2), atol=1e-3, rtol=0) # Double-check that batched versions behave the same as unbatched - df = (torch.rand(5, requires_grad=True) + 1) * 3 - tmp = torch.randn(5, 3, 10) + df = torch.rand(5, requires_grad=True) + ndim - 1 + # SciPy allowed ndim -1 < df < ndim for Wishar distribution after version 1.7.0 + if version.parse(scipy.__version__) < version.parse("1.7.0"): + df += 1. 
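# Hypothetical standalone sketch, not part of the test file above: it only
# restates the df guard used there. torch.distributions.Wishart requires
# df > ndim - 1, while, per the comment above, SciPy accepts
# ndim - 1 < df < ndim for the Wishart distribution only from 1.7.0 onward,
# so older SciPy releases get df bumped past ndim before building the
# reference distribution.
import scipy
import scipy.stats
import torch
from packaging import version

ndim = 3
df = torch.rand([]) + ndim - 1                       # 0-dim tensor in (ndim - 1, ndim)
if version.parse(scipy.__version__) < version.parse("1.7.0"):
    df = df + 1.0                                    # pre-1.7.0 SciPy rejects df < ndim

tmp = torch.randn(ndim, 10)
cov = torch.matmul(tmp, tmp.t()) / tmp.size(-1)      # PSD covariance matrix
d = torch.distributions.Wishart(df, cov)
ref = scipy.stats.wishart(df.item(), cov.numpy())

x = d.sample()                                       # a single (ndim, ndim) draw
print(d.log_prob(x).item(), ref.logpdf(x.numpy()))   # the two values should roughly agree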
+ tmp = torch.randn(5, ndim, 10) cov = (tmp.unsqueeze(-2) * tmp.unsqueeze(-3)).mean(-1).requires_grad_() dist_batched = Wishart(df, cov) dist_unbatched = [Wishart(df[i], cov[i]) for i in range(df.size(0))] - x = dist_batched.sample((10,)) + x = dist_batched.sample((1000,)) batched_prob = dist_batched.log_prob(x) unbatched_prob = torch.stack([dist_unbatched[i].log_prob(x[:, i]) for i in range(5)]).t() @@ -2322,28 +2333,35 @@ def test_wishart_log_prob(self): @unittest.skipIf(not TEST_NUMPY, "NumPy not found") def test_wishart_sample(self): set_rng_seed(0) # see Note [Randomized statistical tests] - df = (torch.rand([], requires_grad=True) + 1) * 3 - tmp = torch.randn(3, 10) + ndim = 3 + df = torch.rand([], requires_grad=True) + ndim - 1 + # SciPy allowed ndim -1 < df < ndim for Wishar distribution after version 1.7.0 + if version.parse(scipy.__version__) < version.parse("1.7.0"): + df += 1. + tmp = torch.randn(ndim, 10) cov = (torch.matmul(tmp, tmp.t()) / tmp.size(-1)).requires_grad_() prec = cov.inverse().requires_grad_() scale_tril = torch.linalg.cholesky(cov).requires_grad_() + ref_dist = scipy.stats.wishart(df.item(), cov.detach().numpy()) + self._check_sampler_sampler(Wishart(df, cov), - scipy.stats.wishart(df.item(), cov.detach().numpy()), + ref_dist, 'Wishart(df={}, covariance_matrix={})'.format(df, cov), multivariate=True) self._check_sampler_sampler(Wishart(df, precision_matrix=prec), - scipy.stats.wishart(df.item(), cov.detach().numpy()), + ref_dist, 'Wishart(df={}, precision_matrix={})'.format(df, prec), multivariate=True) self._check_sampler_sampler(Wishart(df, scale_tril=scale_tril), - scipy.stats.wishart(df.item(), cov.detach().numpy()), + ref_dist, 'Wishart(df={}, scale_tril={})'.format(df, scale_tril), multivariate=True) def test_wishart_properties(self): - df = (torch.rand([]) + 1) * 5 - scale_tril = transform_to(constraints.lower_cholesky)(torch.randn(5, 5)) + ndim = 5 + df = torch.rand([]) + ndim - 1 + scale_tril = transform_to(constraints.lower_cholesky)(torch.randn(ndim, ndim)) m = Wishart(df=df, scale_tril=scale_tril) self.assertEqual(m.covariance_matrix, m.scale_tril.mm(m.scale_tril.t())) self.assertEqual(m.covariance_matrix.mm(m.precision_matrix), torch.eye(m.event_shape[0])) @@ -2351,14 +2369,15 @@ def test_wishart_properties(self): def test_wishart_moments(self): set_rng_seed(0) # see Note [Randomized statistical tests] - df = (torch.rand([]) + 1) * 3 - scale_tril = transform_to(constraints.lower_cholesky)(torch.randn(3, 3)) + ndim = 3 + df = torch.rand([]) + ndim - 1 + scale_tril = transform_to(constraints.lower_cholesky)(torch.randn(ndim, ndim)) d = Wishart(df=df, scale_tril=scale_tril) - samples = d.rsample((100000,)) + samples = d.rsample((ndim * ndim * 100000,)) empirical_mean = samples.mean(0) - self.assertEqual(d.mean, empirical_mean, atol=5, rtol=0) + self.assertEqual(d.mean, empirical_mean, atol=0.5, rtol=0) empirical_var = samples.var(0) - self.assertEqual(d.variance, empirical_var, atol=5, rtol=0) + self.assertEqual(d.variance, empirical_var, atol=0.5, rtol=0) def test_exponential(self): rate = torch.randn(5, 5).abs().requires_grad_() @@ -3111,12 +3130,12 @@ def test_invalid_parameter_broadcasting(self): 'alpha': torch.tensor([1, 1, 1]) }), (StudentT, { - 'df': torch.tensor([1, 1]), - 'scale': torch.tensor([1, 1, 1]) + 'df': torch.tensor([1., 1.]), + 'scale': torch.tensor([1., 1., 1.]) }), (StudentT, { - 'df': torch.tensor([1, 1]), - 'loc': torch.tensor([1, 1, 1]) + 'df': torch.tensor([1., 1.]), + 'loc': torch.tensor([1., 1., 1.]) }) ] @@ -4623,8 +4642,16 
@@ def setUp(self): scipy.stats.weibull_min(c=positive_var2[0], scale=positive_var[0]) ), ( - Wishart(20 + positive_var[0], cov_tensor), # scipy var for Wishart only supports scalars - scipy.stats.wishart(20 + positive_var[0].item(), cov_tensor), + # scipy var for Wishart only supports scalars + # SciPy allowed ndim -1 < df < ndim for Wishar distribution after version 1.7.0 + Wishart( + (20 if version.parse(scipy.__version__) < version.parse("1.7.0") else 19) + positive_var[0], + cov_tensor, + ), + scipy.stats.wishart( + (20 if version.parse(scipy.__version__) < version.parse("1.7.0") else 19) + positive_var[0].item(), + cov_tensor, + ), ), ] diff --git a/test/expect/TestFXAPIBackwardCompatibility.test_function_back_compat-fx_backcompat_function_signatures.expect b/test/expect/TestFXAPIBackwardCompatibility.test_function_back_compat-fx_backcompat_function_signatures.expect index 17e38e6c9fcd44..fcbf9ec18deb16 100644 --- a/test/expect/TestFXAPIBackwardCompatibility.test_function_back_compat-fx_backcompat_function_signatures.expect +++ b/test/expect/TestFXAPIBackwardCompatibility.test_function_back_compat-fx_backcompat_function_signatures.expect @@ -41,7 +41,7 @@ torch.fx.interpreter.Interpreter.get_attr(self, target: 'Target', args: Tuple[to torch.fx.interpreter.Interpreter.map_nodes_to_values(self, args: torch.fx.node.Argument, n: torch.fx.node.Node) -> torch.fx.node.Argument torch.fx.interpreter.Interpreter.output(self, target: 'Target', args: Tuple[torch.fx.node.Argument, ...], kwargs: Dict[str, Any]) -> Any torch.fx.interpreter.Interpreter.placeholder(self, target: 'Target', args: Tuple[torch.fx.node.Argument, ...], kwargs: Dict[str, Any]) -> Any -torch.fx.interpreter.Interpreter.run(self, *args, initial_env: Optional[Dict[torch.fx.node.Node, Any]] = None) -> Any +torch.fx.interpreter.Interpreter.run(self, *args, initial_env: Optional[Dict[torch.fx.node.Node, Any]] = None, enable_io_processing: bool = True) -> Any torch.fx.interpreter.Interpreter.run_node(self, n: torch.fx.node.Node) -> Any torch.fx.interpreter.Transformer.__init__(self, module) torch.fx.interpreter.Transformer.call_function(self, target: 'Target', args: Tuple[torch.fx.node.Argument, ...], kwargs: Dict[str, Any]) -> Any diff --git a/test/expect/TestPytorchExportModes.test_aten_fallback.expect b/test/expect/TestPytorchExportModes.test_aten_fallback.expect index 41059587af0b37..d5cfb31cfeefc8 100644 --- a/test/expect/TestPytorchExportModes.test_aten_fallback.expect +++ b/test/expect/TestPytorchExportModes.test_aten_fallback.expect @@ -11,7 +11,7 @@ ModelProto { nodes: [ Node {type: "Add", inputs: [0,1], outputs: [2], attributes: []}, Node {type: "Constant", inputs: [], outputs: [3], attributes: [{ name: 'value', type: tensor, value:TensorProto shape: []}]}, - Node {type: "ATen", inputs: [2,3], outputs: [4,5], attributes: [{ name: 'operator', type: string, value: 'qr'}]} + Node {type: "ATen", inputs: [2,3], outputs: [4,5], attributes: [{ name: 'operator', type: string, value: 'qr'}, { name: 'overload_name', type: string, value: ''}]} ] } opset_import: [OperatorSetIdProto { domain: }OperatorSetIdProto { domain: org.pytorch.aten}], diff --git a/test/expect/TestPytorchExportModes.test_onnx_aten.expect b/test/expect/TestPytorchExportModes.test_onnx_aten.expect index 22f1c57f95706a..85f4f8573d1c44 100644 --- a/test/expect/TestPytorchExportModes.test_onnx_aten.expect +++ b/test/expect/TestPytorchExportModes.test_onnx_aten.expect @@ -9,7 +9,7 @@ ModelProto { outputs: [{name: "2", type:Tensor dims: 3 4}] initializers: [] nodes: [ 
- Node {type: "ATen", inputs: [0,1], outputs: [2], attributes: [{ name: 'operator', type: string, value: 'fmod'}]} + Node {type: "ATen", inputs: [0,1], outputs: [2], attributes: [{ name: 'operator', type: string, value: 'fmod'}, { name: 'overload_name', type: string, value: ''}]} ] } opset_import: [OperatorSetIdProto { domain: }OperatorSetIdProto { domain: org.pytorch.aten}], diff --git a/test/expect/TestScript.test_listconstruct_erasure.expect b/test/expect/TestScript.test_listconstruct_erasure.expect index 0f7d470b0709e1..7d4bb8d97fc0f1 100644 --- a/test/expect/TestScript.test_listconstruct_erasure.expect +++ b/test/expect/TestScript.test_listconstruct_erasure.expect @@ -13,7 +13,7 @@ ModelProto { Node {type: "Less", inputs: [0,1], outputs: [2], attributes: []}, Node {type: "Cast", inputs: [2], outputs: [3], attributes: [{ name: 'to', type: int, value: 2}]}, Node {type: "Cast", inputs: [3], outputs: [4], attributes: [{ name: 'to', type: int, value: 9}]}, - Node {type: "ATen", inputs: [0,4], outputs: [5], attributes: [{ name: 'operator', type: string, value: 'index'}]} + Node {type: "ATen", inputs: [0,4], outputs: [5], attributes: [{ name: 'operator', type: string, value: 'index'}, { name: 'overload_name', type: string, value: ''}]} ] } opset_import: [OperatorSetIdProto { domain: }OperatorSetIdProto { domain: org.pytorch.aten}], diff --git a/test/forward_backward_compatibility/check_forward_backward_compatibility.py b/test/forward_backward_compatibility/check_forward_backward_compatibility.py index b7dc0d579c3467..9317e238244b7b 100644 --- a/test/forward_backward_compatibility/check_forward_backward_compatibility.py +++ b/test/forward_backward_compatibility/check_forward_backward_compatibility.py @@ -92,6 +92,7 @@ ("aten::miopen_depthwise_convolution_backward", datetime.date(9999, 1, 1)), ("aten::miopen_depthwise_convolution_backward_input", datetime.date(9999, 1, 1)), ("aten::miopen_depthwise_convolution_backward_weight", datetime.date(9999, 1, 1)), + ("aten::_nested_tensor", datetime.date(9999, 1, 1)), ("caffe2::", datetime.date(2021, 10, 23)), ("prepacked::unpack_prepacked_sizes_conv2d", datetime.date(9999, 1, 1)), ("prepacked::unpack_prepacked_sizes_linear", datetime.date(9999, 1, 1)), @@ -106,10 +107,15 @@ ("aten::_scatter_reduce", datetime.date(2022, 1, 31)), ("aten::native_multi_head_self_attention", datetime.date(9999, 1, 1)), ("aten::_native_multi_head_self_attention", datetime.date(9999, 1, 1)), - ("aten::scatter_reduce.two", datetime.date(2022, 3, 15)), ("aten::grid_sampler_3d_backward", datetime.date(9999, 1, 1)), ("aten::_transform_bias_rescale_qkv", datetime.date(9999, 1, 1)), - ("aten::_scatter_reduce.two", datetime.date(9999, 1, 1)), + ("aten::scatter_reduce.two", datetime.date(2022, 4, 15)), + ("aten::_s_where", datetime.date(2022, 9, 30)), + ("quantized::conv2d_cudnn", datetime.date(2022, 3, 22)), + ("quantized::conv2d_relu_cudnn", datetime.date(2022, 3, 22)), + ("quantized::softmax", datetime.date(2022, 4, 15)), + ("prim::infer_squeeze_size.dim", datetime.date(9999, 1, 1)), + ("prim::infer_squeeze_size", datetime.date(9999, 1, 1)), ] ALLOW_LIST_COMPILED = [ diff --git a/test/jit/test_autodiff.py b/test/jit/test_autodiff.py new file mode 100644 index 00000000000000..518826f602e1ab --- /dev/null +++ b/test/jit/test_autodiff.py @@ -0,0 +1,51 @@ +# Owner(s): ["oncall: jit"] + +import torch + +from torch.testing._internal.jit_utils import JitTestCase +from typing import List + +class TestAutodiffJit(JitTestCase): + def test_undefined_tensor_lists(self): + def fn(tensor_list: 
List[torch.Tensor], add_tensor): + cat = torch.cat(tensor_list, dim=1) + r = torch.sin(cat + add_tensor) + return r + + fn_s = torch.jit.script(fn) + + a = torch.rand((3, 6), requires_grad=True) + b = torch.rand((3, 10), requires_grad=True) + x = [a, b] + y = torch.rand((3, 16), requires_grad=True) + + ret = fn_s(x, y) + ret.sum().backward() + ret = fn_s(x, y) + ret.sum().backward() + + ret = fn_s(x, y) + s = ret.sum() + + # backward_fn expects 2 inputs: (grad_output, current_grad_r) + # current_grad_r is provided because we need to add this contribution + # to grad_r when we return it. + backward_fn = s.grad_fn.next_functions[0][0] + + # check behavior with defined tensor + grad_out = torch.rand((3, 16)) + grad_inputs = backward_fn(grad_out, None) + + # expect 3 tensors: grad_y, grad_a, grad_b + self.assertEqual(3, len(grad_inputs)) + for x in grad_inputs: + self.assertTrue(isinstance(x, torch.Tensor)) + + # now test with undefined grad_out + grad_inputs = backward_fn(None, None) + + # expect all of them to be None + self.assertEqual(3, len(grad_inputs)) + for x in grad_inputs: + if x is not None: + self.assertEqual(0, torch.max(torch.abs(x)).item()) diff --git a/test/jit/test_export_modes.py b/test/jit/test_export_modes.py index 70d2193201a3c8..300be7d9dd6908 100644 --- a/test/jit/test_export_modes.py +++ b/test/jit/test_export_modes.py @@ -82,7 +82,9 @@ def forward(self, x, y): ModelWithAtenNotONNXOp(), (x, y), add_node_names=False, do_constant_folding=False, - operator_export_type=OperatorExportTypes.ONNX_ATEN_FALLBACK) + operator_export_type=OperatorExportTypes.ONNX_ATEN_FALLBACK, + # support for linalg.qr was added in later op set versions. + opset_version=9) # torch.fmod is using to test ONNX_ATEN. # If you plan to remove fmod from aten, or found this test failed. 
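The opset_version=9 pin added above works together with ONNX_ATEN_FALLBACK: at that opset the exporter has no symbolic for the qr-style op that the test's ModelWithAtenNotONNXOp calls, so the node is emitted as an ATen fallback op (as the updated expect files show) rather than failing the export. A rough standalone sketch of that export mode follows; the QRModel module and the use of torch.linalg.qr are illustrative stand-ins rather than the test's actual helper, and it assumes a build in which the ATen fallback export path is available.

import io
import torch

# Illustrative module standing in for the test's ModelWithAtenNotONNXOp.
class QRModel(torch.nn.Module):
    def forward(self, x, y):
        return torch.linalg.qr(x + y)

x = torch.randn(3, 4)
y = torch.randn(3, 4)
buf = io.BytesIO()
torch.onnx.export(
    QRModel(), (x, y), buf,
    do_constant_folding=False,
    # Ops with no ONNX symbolic at this opset are exported as ATen nodes
    # instead of raising, which is what the expect files above check for.
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,
    opset_version=9,
)
print(len(buf.getvalue()), "bytes of ONNX written")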
diff --git a/test/jit/test_if_hoisting.py b/test/jit/test_if_hoisting.py index 939ceda3c56cfd..bda285e6e43bcb 100644 --- a/test/jit/test_if_hoisting.py +++ b/test/jit/test_if_hoisting.py @@ -3,6 +3,7 @@ import torch from torch.testing import FileCheck from torch.testing._internal.jit_utils import JitTestCase +from typing import Dict if __name__ == "__main__": raise RuntimeError( @@ -149,13 +150,13 @@ def fn(x: bool, y: torch.Tensor): self.run_pass("dce", op_graph) FileCheck().check_count("prim::If", 1, exactly=True).run(op_graph) - FileCheck().check_count("aten::add", 2, exactly=True).run(op_graph) + FileCheck().check_count("aten::add(", 2, exactly=True).run(op_graph) FileCheck().check_count("aten::add_", 1, exactly=True).run(op_graph) t1 = torch.Tensor([1]) t2 = torch.Tensor([5, 6]) - self.assertEqual(fn(True, t1), fn_script(True, t1)) - self.assertEqual(fn(False, t2), fn_script(False, t2)) + self.assertEqual(fn(True, t1.clone()), fn_script(True, t1.clone())) + self.assertEqual(fn(False, t2.clone()), fn_script(False, t2.clone())) def test_mutate_after(self): """ @@ -180,7 +181,6 @@ def fn(x: bool, y: torch.Tensor): FileCheck().check_count("prim::If", 1, exactly=True).run(op_graph) FileCheck().check_count("aten::add", 2, exactly=True).run(op_graph) - t1 = torch.Tensor([1]) t2 = torch.Tensor([5, 6]) self.assertEqual(fn(True, t1.clone()), fn_script(True, t1.clone())) @@ -212,3 +212,26 @@ def fn(x: bool, y: torch.Tensor): t2 = torch.Tensor([5, 6]) self.assertEqual(fn(True, t1), fn_script(True, t1)) self.assertEqual(fn(False, t2), fn_script(False, t2)) + + def test_hoist_mutation_2(self): + def fn(x, y, cond: bool, d: Dict[str, torch.Tensor]): + if cond: + m = x.relu() + f1 = torch.rand((2, 2)) + d["test"] = f1 + z = d["test"] + else: + m = y.gelu() + f2 = torch.rand((3, 2)) + d["test"] = f2 + z = d["test"] + return m, z + + fn_s = torch.jit.script(fn) + op_graph = fn_s.graph + self.run_pass("common_expression_hoisting", op_graph) + self.run_pass("dce", op_graph) + FileCheck().check_count("aten::__getitem__", 2, exactly=True).run(op_graph) + FileCheck().check_count("aten::_set_item", 2, exactly=True).run(op_graph) + FileCheck().check_count("aten::relu", 1, exactly=True).run(op_graph) + FileCheck().check_count("aten::gelu", 1, exactly=True).run(op_graph) diff --git a/test/jit/test_misc.py b/test/jit/test_misc.py index bf3c3c3e71c11b..20120ff8f96070 100644 --- a/test/jit/test_misc.py +++ b/test/jit/test_misc.py @@ -228,6 +228,91 @@ def use_module_interface(mod_list: List[OneTwoModule], x: torch.Tensor): self.assertTrue(set(['aten::add.Tensor', 'aten::mul.Scalar']).issubset( set(torch.jit.export_opnames(scripted_M_mod)))) + def test_math_inf(self): + from math import inf + + def foo(): + return inf + + self.checkScript(foo, ()) + + def test_list_literal_infer(self): + def expects_intlist(x: List[int]): + x.append(3) + return x + + def foo(): + return expects_intlist([]) + + self.checkScript(foo, ()) + + def annotated_list_fail(): + return expects_intlist(torch.jit.annotate([], List[Tensor])) + + with self.assertRaises(RuntimeError): + torch.jit.script(annotated_list_fail) + + def non_temporary_fail(): + a = [] + return expects_intlist(a) + + with self.assertRaises(RuntimeError): + torch.jit.script(non_temporary_fail) + + + @torch.jit.script + def test_return(): + return [] + + FileCheck().check("Tensor[] = prim::ListConstruct").run(test_return.graph) + + def test_legacy_tensor_constructor(self): + # testing PyObject overload + def test_all_dtypes(): + return ( + torch.BoolTensor([2]), + 
torch.LongTensor([3]), + torch.ByteTensor([4]), + torch.CharTensor([5]), + torch.DoubleTensor([6]), + torch.FloatTensor([7]), + torch.IntTensor([8]), + torch.ShortTensor([1]), + torch.HalfTensor([1]), + ) + + self.checkScript(test_all_dtypes, ()) + + # now test empty overload + def empty_overload(): + return torch.LongTensor(2, 3, 4) + + eager = empty_overload() + jit = torch.jit.script(empty_overload)() + eager[:] = 1 + jit[:] = 1 + self.assertEqual(eager, jit) + + def no_inputs(): + return torch.DoubleTensor() + + self.checkScript(no_inputs, ()) + + # bad schema + def multiple_args(): + return torch.LongTensor(1, [2]) + + with self.assertRaisesRegex(RuntimeError, "multiple positional arguments that were not all integers"): + torch.jit.script(multiple_args) + + # kwarg bad schema + def bad_kwarg(): + return torch.LongTensor(hello="1") + + with self.assertRaisesRegex(RuntimeError, "hello"): + torch.jit.script(bad_kwarg) + + def test_broadcasting_list(self): """ Test BroadcastingList and torch.nn._size_N_t alias diff --git a/test/jit/test_op_decompositions.py b/test/jit/test_op_decompositions.py new file mode 100644 index 00000000000000..bfd6edb2e6b824 --- /dev/null +++ b/test/jit/test_op_decompositions.py @@ -0,0 +1,23 @@ +# Owner(s): ["oncall: jit"] + +import torch +from torch.testing import FileCheck +from torch.testing._internal.jit_utils import JitTestCase + +if __name__ == '__main__': + raise RuntimeError("This test file is not meant to be run directly, use:\n\n" + "\tpython test/test_jit.py TESTNAME\n\n" + "instead.") + +class TestOpDecompositions(JitTestCase): + def test_op_decomposition(self): + def foo(x): + return torch.var(x, unbiased=True) + + # TODO: more robust testing + foo_s = torch.jit.script(foo) + FileCheck().check("aten::var").run(foo_s.graph) + torch._C._jit_pass_run_decompositions(foo_s.graph) + inp = torch.rand([10, 10]) + self.assertEqual(foo(inp), foo_s(inp)) + FileCheck().check_not("aten::var").run(foo_s.graph) diff --git a/test/jit/test_profiler.py b/test/jit/test_profiler.py index 4c9380f40471f5..81df055f55b7c8 100644 --- a/test/jit/test_profiler.py +++ b/test/jit/test_profiler.py @@ -18,7 +18,7 @@ class TestProfiler(JitTestCase): def setUp(self): self.prev_exec = torch._C._jit_set_profiling_executor(True) - self.prev_profiling = torch._C._jit_set_profiling_mode(True) + self.prev_profiling = torch._C._get_graph_executor_optimize(True) self.inline_autodiff = torch._C._debug_set_autodiff_subgraph_inlining(False) self.texpr_fuser_state = torch._C._jit_texpr_fuser_enabled() self.can_fuse_on_cpu = torch._C._jit_can_fuse_on_cpu() @@ -34,7 +34,7 @@ def setUp(self): def tearDown(self): torch._C._jit_set_profiling_executor(self.prev_exec) - torch._C._jit_set_profiling_mode(self.prev_profiling) + torch._C._get_graph_executor_optimize(self.prev_profiling) torch._C._debug_set_autodiff_subgraph_inlining(self.inline_autodiff) torch._C._jit_set_texpr_fuser_enabled(self.texpr_fuser_state) torch._C._jit_override_can_fuse_on_cpu(self.can_fuse_on_cpu) diff --git a/test/jit/test_python_bindings.py b/test/jit/test_python_bindings.py index 2f086feaa904e9..37c2ef7f85af74 100644 --- a/test/jit/test_python_bindings.py +++ b/test/jit/test_python_bindings.py @@ -1,6 +1,7 @@ # Owner(s): ["oncall: jit"] import torch +from torch.testing import FileCheck from torch.testing._internal.jit_utils import JitTestCase if __name__ == "__main__": @@ -82,3 +83,28 @@ def test_graph_create(self): gr = torch._C.Graph() with self.assertRaises(ValueError): gr.create("prim::Constant", [None]) + + def 
test_canonicalize(self): + ir = """ +graph(%p207 : Tensor, + %1 : Tensor, + %p407 : int): + %11 : Tensor = aten::view_expand_placeholder(%1) + %12 : Tensor = aten::pointwise_placeholder(%11, %p207, %p407) + %13 : Tensor = aten::view_expand_placeholder(%12) + %14 : Tensor = aten::pointwise_placeholder(%13) + return (%14) + """ + + graph1 = torch._C.parse_ir(ir) + graph1 = torch._C._jit_pass_canonicalize(graph1, True) + + graph2 = torch._C.parse_ir(ir) + graph2 = torch._C._jit_pass_canonicalize(graph2) + + self.assertEqual(str(graph1), str(graph2)) + FileCheck().check("%p207").check_not("%14").run(graph1) + + graph3 = torch._C.parse_ir(ir) + graph3 = torch._C._jit_pass_canonicalize(graph3, False) + FileCheck().check_not("%p207").run(graph3) diff --git a/test/jit/test_save_load.py b/test/jit/test_save_load.py index fbc1443024cb5a..bbe7e0a7016f6e 100644 --- a/test/jit/test_save_load.py +++ b/test/jit/test_save_load.py @@ -1,20 +1,22 @@ # Owner(s): ["oncall: jit"] -from typing import NamedTuple, Optional import io import os import pathlib import sys +import unittest +from typing import NamedTuple, Optional +import torch from torch import Tensor from torch.testing._internal.common_utils import TemporaryFileName -import torch # Make the helper files in test/ importable pytorch_test_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) sys.path.append(pytorch_test_dir) -from torch.testing._internal.jit_utils import (JitTestCase, - clear_class_registry) +from torch.testing._internal.jit_utils import JitTestCase, clear_class_registry + +ENABLE_FLATBUFFER = os.environ.get("ENABLE_FLATBUFFER", "0") == "1" if __name__ == "__main__": raise RuntimeError( @@ -23,12 +25,14 @@ "instead." ) + class TestSaveLoad(JitTestCase): def test_different_modules(self): """ Exercise the situation where we have the same qualified name in two different CompilationUnits on save/load. """ + class Foo(torch.nn.Module): def __init__(self): super(Foo, self).__init__() @@ -64,7 +68,8 @@ def forward(self, x): clear_class_registry() self.assertEqual( - first_script_module._c.qualified_name, second_script_module._c.qualified_name + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, ) class ContainsBoth(torch.nn.Module): @@ -89,6 +94,7 @@ def test_different_functions(self): Exercise the situation where we have the same qualified name in two different CompilationUnits on save/load. """ + def lol(x): return x @@ -118,7 +124,8 @@ def forward(self, x): clear_class_registry() self.assertEqual( - first_script_module._c.qualified_name, second_script_module._c.qualified_name + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, ) class ContainsBoth(torch.nn.Module): @@ -143,6 +150,7 @@ def test_different_interfaces(self): Exercise the situation where we have the same qualified name in two different CompilationUnits on save/load. 
""" + @torch.jit.interface class MyInterface(object): def bar(self, x: Tensor) -> Tensor: @@ -204,7 +212,8 @@ def forward(self, x): clear_class_registry() self.assertEqual( - first_script_module._c.qualified_name, second_script_module._c.qualified_name + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, ) class ContainsBoth(torch.nn.Module): @@ -261,7 +270,6 @@ def forward(self, x): return x, MyCoolNamedTuple(a=5) - first_script_module = torch.jit.script(Foo()) first_saved_module = io.BytesIO() torch.jit.save(first_script_module, first_saved_module) @@ -310,7 +318,8 @@ def forward(self, x): clear_class_registry() self.assertEqual( - first_script_module._c.qualified_name, second_script_module._c.qualified_name + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, ) class ContainsBoth(torch.nn.Module): @@ -340,44 +349,44 @@ def forward(self, a): value = b"bar\x00\xffbaz" expected_extra_files = {} - expected_extra_files['foo'] = value + expected_extra_files["foo"] = value # verify that str to bytes conversion also works - expected_extra_files['foo2'] = "bar" + expected_extra_files["foo2"] = "bar" m = MyMod() # Save to file. with TemporaryFileName() as fname: m.save(fname, _extra_files=expected_extra_files) # values don't matter - extra_files = {'foo': '', 'foo2': None} + extra_files = {"foo": "", "foo2": None} torch.jit.load(fname, _extra_files=extra_files) - self.assertEqual(value, extra_files['foo']) + self.assertEqual(value, extra_files["foo"]) # results come back always as bytes - self.assertEqual(b"bar", extra_files['foo2']) + self.assertEqual(b"bar", extra_files["foo2"]) # Use torch.jit API torch.jit.save(m, fname, _extra_files=expected_extra_files) - extra_files['foo'] = '' + extra_files["foo"] = "" torch.jit.load(fname, _extra_files=extra_files) - self.assertEqual(value, extra_files['foo']) + self.assertEqual(value, extra_files["foo"]) # Save to buffer. 
buffer = io.BytesIO(m.save_to_buffer(_extra_files=expected_extra_files)) - extra_files = {'foo': ''} + extra_files = {"foo": ""} torch.jit.load(buffer, _extra_files=extra_files) - self.assertEqual(value, extra_files['foo']) + self.assertEqual(value, extra_files["foo"]) # Use torch.jit API buffer = io.BytesIO() torch.jit.save(m, buffer, _extra_files=expected_extra_files) buffer.seek(0) - extra_files = {'foo': ''} + extra_files = {"foo": ""} torch.jit.load(buffer, _extra_files=extra_files) - self.assertEqual(value, extra_files['foo']) + self.assertEqual(value, extra_files["foo"]) # Non-existent file 'bar' with self.assertRaises(RuntimeError): - extra_files['bar'] = '' + extra_files["bar"] = "" torch.jit.load(buffer, _extra_files=extra_files) def test_save_load_using_pathlib(self): @@ -394,7 +403,7 @@ def forward(self, a): m.save(path) m2 = torch.jit.load(path) - x = torch.tensor([1., 2., 3., 4.]) + x = torch.tensor([1.0, 2.0, 3.0, 4.0]) self.assertTrue(torch.equal(m(x), m2(x))) def test_save_nonexit_file(self): @@ -455,7 +464,476 @@ class TestModule(torch.nn.Module): def __init__(self): super().__init__() self.add_module("submodule_a", Submodule()) - self.register_parameter("parameter_a", torch.nn.Parameter(torch.randn(4))) + self.register_parameter( + "parameter_a", torch.nn.Parameter(torch.randn(4)) + ) + self.register_buffer("buffer", torch.randn(4)) + self.t = torch.rand(4) # not buffer + + self.parameter_b = torch.nn.Parameter(torch.randn(4)) + self.submodule_b = Submodule() + + m = TestModule() + m_loaded = self.getExportImportCopy(torch.jit.script(m)) + + # Check submodules. + self.assertEqual( + len(list(m.named_modules())), len(list(m_loaded.named_modules())) + ) + for m_s, loaded_s in zip(m.named_modules(), m_loaded.named_modules()): + m_name, _ = m_s + loaded_name, _ = loaded_s + self.assertEqual(m_name, loaded_name) + + # Check parameters. + self.assertEqual(len(list(m.parameters())), len(list(m_loaded.parameters()))) + for m_p, loaded_p in zip(m.parameters(), m_loaded.parameters()): + self.assertEqual(m_p, loaded_p) + + # Check buffers. + self.assertEqual( + len(list(m.named_buffers())), len(list(m_loaded.named_buffers())) + ) + for m_b, loaded_b in zip(m.named_buffers(), m_loaded.named_buffers()): + m_name, m_buffer = m_b + loaded_name, loaded_buffer = loaded_b + self.assertEqual(m_name, loaded_name) + self.assertEqual(m_buffer, loaded_buffer) + + def test_save_load_meta_tensors(self): + """ + Check that parameters, buffers, and submodules are the same after loading + for a module with parameters and buffers that are meta tensors + """ + + class Foo(torch.nn.Module): + def __init__(self): + super(Foo, self).__init__() + self.foo = torch.nn.Linear(2, 3, device="meta") + self.bar = torch.nn.Linear(3, 4) + self.register_buffer("buffer", torch.randn(4, device="meta")) + + def forward(self, x): + x = self.foo(x) + x = self.bar(x) + return x + + m = Foo() + m_loaded = self.getExportImportCopy(torch.jit.script(m)) + # Check submodules. + self.assertEqual( + len(list(m.named_modules())), len(list(m_loaded.named_modules())) + ) + self.assertEqual( + set(name for name, _ in m.named_modules()), + set(name for name, _ in m_loaded.named_modules()), + ) + # Check parameters. + m_params = dict(m.named_parameters()) + m_loaded_params = dict(m_loaded.named_parameters()) + self.assertEqual(len(m_params), len(m_loaded_params)) + self.assertEqual(m_params, m_loaded_params) + # Check buffers. 
+ m_buffers = dict(m.named_buffers()) + m_loaded_buffers = dict(m_loaded.named_buffers()) + self.assertEqual(len(m_buffers), len(m_loaded_buffers)) + self.assertEqual(m_buffers, m_loaded_buffers) + # Check params and buffers that are/are not meta tensors + self.assertTrue(m_params["foo.weight"].is_meta) + self.assertTrue(m_loaded_params["foo.weight"].is_meta) + self.assertTrue(m_params["foo.bias"].is_meta) + self.assertTrue(m_loaded_params["foo.bias"].is_meta) + self.assertFalse(m_params["bar.weight"].is_meta) + self.assertFalse(m_loaded_params["bar.weight"].is_meta) + self.assertFalse(m_params["bar.bias"].is_meta) + self.assertFalse(m_loaded_params["bar.bias"].is_meta) + self.assertTrue(m_buffers["buffer"].is_meta) + self.assertTrue(m_loaded_buffers["buffer"].is_meta) + + +def script_module_to_buffer(script_module): + module_buffer = io.BytesIO( + script_module._save_to_buffer_for_lite_interpreter(_use_flatbuffer=True) + ) + module_buffer.seek(0) + return module_buffer + + +@unittest.skipIf( + not ENABLE_FLATBUFFER, "Need to enable flatbuffer to run the below tests" +) +class TestSaveLoadFlatbuffer(JitTestCase): + def test_different_modules(self): + """ + Exercise the situation where we have the same qualified name + in two different CompilationUnits on save/load. + """ + + class Foo(torch.nn.Module): + def __init__(self): + super(Foo, self).__init__() + self.foo = torch.nn.Linear(2, 2) + self.bar = torch.nn.Linear(2, 2) + + def forward(self, x): + x = self.foo(x) + x = self.bar(x) + return x + + first_script_module = torch.jit.script(Foo()) + first_saved_module = script_module_to_buffer(first_script_module) + + clear_class_registry() + + class Foo(torch.nn.Module): + def __init__(self): + super(Foo, self).__init__() + self.foo = torch.nn.Linear(2, 2) + + def forward(self, x): + x = self.foo(x) + return x + + second_script_module = torch.jit.script(Foo()) + second_saved_module = script_module_to_buffer(second_script_module) + + clear_class_registry() + + self.assertEqual( + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, + ) + + class ContainsBoth(torch.nn.Module): + def __init__(self): + super().__init__() + self.add_module( + "second", torch.jit.load(second_saved_module) + ) + self.add_module( + "first", torch.jit.load(first_saved_module) + ) + + def forward(self, x): + x = self.first(x) + x = self.second(x) + return x + + sm = torch.jit.script(ContainsBoth()) + contains_both = script_module_to_buffer(sm) + sm = torch.jit.load(contains_both) + + def test_different_functions(self): + """ + Exercise the situation where we have the same qualified name + in two different CompilationUnits on save/load. 
+ """ + + def lol(x): + return x + + class Foo(torch.nn.Module): + def forward(self, x): + return lol(x) + + first_script_module = torch.jit.script(Foo()) + first_saved_module = script_module_to_buffer(first_script_module) + clear_class_registry() + + def lol(x): # noqa: F811 + return "hello" + + class Foo(torch.nn.Module): + def forward(self, x): + return lol(x) + + second_script_module = torch.jit.script(Foo()) + second_saved_module = script_module_to_buffer(second_script_module) + + clear_class_registry() + + self.assertEqual( + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, + ) + + class ContainsBoth(torch.nn.Module): + def __init__(self): + super().__init__() + self.add_module( + "second", torch.jit.load(second_saved_module) + ) + self.add_module( + "first", torch.jit.load(first_saved_module) + ) + + def forward(self, x): + x = self.first(x) + x = self.second(x) + return x + + sm = torch.jit.script(ContainsBoth()) + contains_both = script_module_to_buffer(sm) + sm = torch.jit.load(contains_both) + + def test_different_interfaces(self): + """ + Exercise the situation where we have the same qualified name + in two different CompilationUnits on save/load. + """ + + @torch.jit.interface + class MyInterface(object): + def bar(self, x: Tensor) -> Tensor: + pass + + @torch.jit.script + class ImplementInterface(object): + def __init__(self): + pass + + def bar(self, x): + return x + + class Foo(torch.nn.Module): + __annotations__ = {"interface": MyInterface} + + def __init__(self): + super().__init__() + self.interface = ImplementInterface() + + def forward(self, x): + return self.interface.bar(x) + + first_script_module = torch.jit.script(Foo()) + first_saved_module = script_module_to_buffer(first_script_module) + clear_class_registry() + + @torch.jit.interface + class MyInterface(object): + def not_bar(self, x: Tensor) -> Tensor: + pass + + @torch.jit.script # noqa: F811 + class ImplementInterface(object): # noqa: F811 + def __init__(self): + pass + + def not_bar(self, x): + return x + + class Foo(torch.nn.Module): + __annotations__ = {"interface": MyInterface} + + def __init__(self): + super().__init__() + self.interface = ImplementInterface() + + def forward(self, x): + return self.interface.not_bar(x) + + second_script_module = torch.jit.script(Foo()) + second_saved_module = script_module_to_buffer(second_script_module) + + clear_class_registry() + + self.assertEqual( + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, + ) + + class ContainsBoth(torch.nn.Module): + def __init__(self): + super().__init__() + self.add_module( + "second", torch.jit.load(second_saved_module) + ) + self.add_module( + "first", torch.jit.load(first_saved_module) + ) + + def forward(self, x): + x = self.first(x) + x = self.second(x) + return x + + sm = torch.jit.script(ContainsBoth()) + contains_both = script_module_to_buffer(sm) + sm = torch.jit.load(contains_both) + + def test_many_collisions(self): + class MyCoolNamedTuple(NamedTuple): + a: int + + @torch.jit.interface + class MyInterface(object): + def bar(self, x: Tensor) -> Tensor: + pass + + @torch.jit.script + class ImplementInterface(object): + def __init__(self): + pass + + def bar(self, x): + return x + + def lol(x): + return x + + class Foo(torch.nn.Module): + interface: MyInterface + + def __init__(self): + super().__init__() + self.foo = torch.nn.Linear(2, 2) + self.bar = torch.nn.Linear(2, 2) + self.interface = ImplementInterface() + + def forward(self, x): + x = self.foo(x) + x = 
self.bar(x) + x = lol(x) + x = self.interface.bar(x) + + return x, MyCoolNamedTuple(a=5) + + first_script_module = torch.jit.script(Foo()) + first_saved_module = script_module_to_buffer(first_script_module) + + clear_class_registry() + + @torch.jit.interface + class MyInterface(object): + def not_bar(self, x: Tensor) -> Tensor: + pass + + @torch.jit.script # noqa: F811 + class ImplementInterface(object): # noqa: F811 + def __init__(self): + pass + + def not_bar(self, x): + return x + + def lol(x): # noqa: F811 + return "asdofij" + + class MyCoolNamedTuple(NamedTuple): # noqa: F811 + a: str + + class Foo(torch.nn.Module): + interface: MyInterface + + def __init__(self): + super().__init__() + self.foo = torch.nn.Linear(2, 2) + self.interface = ImplementInterface() + + def forward(self, x): + x = self.foo(x) + self.interface.not_bar(x) + x = lol(x) + return x, MyCoolNamedTuple(a="hello") + + second_script_module = torch.jit.script(Foo()) + second_saved_module = script_module_to_buffer(second_script_module) + + clear_class_registry() + + self.assertEqual( + first_script_module._c.qualified_name, + second_script_module._c.qualified_name, + ) + + class ContainsBoth(torch.nn.Module): + def __init__(self): + super().__init__() + self.add_module( + "second", torch.jit.load(second_saved_module) + ) + self.add_module( + "first", torch.jit.load(first_saved_module) + ) + + def forward(self, x): + x, named_tuple_1 = self.first(x) + x, named_tuple_2 = self.second(x) + return len(x + named_tuple_2.a) + named_tuple_1.a + + sm = torch.jit.script(ContainsBoth()) + contains_both = script_module_to_buffer(sm) + sm = torch.jit.load(contains_both) + + def test_save_load_using_pathlib(self): + class MyMod(torch.jit.ScriptModule): + @torch.jit.script_method + def forward(self, a): + return 2 * a + + m = MyMod() + + # Save then load. + with TemporaryFileName() as fname: + path = pathlib.Path(fname) + torch.jit.save_jit_module_to_flatbuffer(m, path) + m2 = torch.jit.load(path) + + x = torch.tensor([1.0, 2.0, 3.0, 4.0]) + self.assertTrue(torch.equal(m(x), m2(x))) + + def test_save_namedtuple_input_only(self): + """ + Even if a NamedTuple is only used as an input argument, saving and + loading should work correctly. + """ + global FooTuple # see [local resolution in python] + + class FooTuple(NamedTuple): + a: int + + class MyModule(torch.nn.Module): + def forward(self, x: FooTuple) -> torch.Tensor: + return torch.tensor(3) + + m_loaded = self.getExportImportCopy(torch.jit.script(MyModule())) + output = m_loaded(FooTuple(a=5)) + self.assertEqual(output, torch.tensor(3)) + + def test_save_namedtuple_output_only(self): + """ + Even if a NamedTuple is only used as an output argument, saving and + loading should work correctly. + """ + global FooTuple # see [local resolution in python] + + class FooTuple(NamedTuple): + a: int + + class MyModule(torch.nn.Module): + def forward(self) -> Optional[FooTuple]: + return None + + m_loaded = self.getExportImportCopy(torch.jit.script(MyModule())) + output = m_loaded() + self.assertEqual(output, None) + + def test_save_load_params_buffers_submodules(self): + """ + Check that parameters, buffers, and submodules are the same after loading. 
+ """ + + class Submodule(torch.nn.Module): + def __init__(self): + super().__init__() + + class TestModule(torch.nn.Module): + def __init__(self): + super().__init__() + self.add_module("submodule_a", Submodule()) + self.register_parameter( + "parameter_a", torch.nn.Parameter(torch.randn(4)) + ) self.register_buffer("buffer", torch.randn(4)) self.t = torch.rand(4) # not buffer @@ -466,7 +944,9 @@ def __init__(self): m_loaded = self.getExportImportCopy(torch.jit.script(m)) # Check submodules. - self.assertEqual(len(list(m.named_modules())), len(list(m_loaded.named_modules()))) + self.assertEqual( + len(list(m.named_modules())), len(list(m_loaded.named_modules())) + ) for m_s, loaded_s in zip(m.named_modules(), m_loaded.named_modules()): m_name, _ = m_s loaded_name, _ = loaded_s @@ -478,7 +958,9 @@ def __init__(self): self.assertEqual(m_p, loaded_p) # Check buffers. - self.assertEqual(len(list(m.named_buffers())), len(list(m_loaded.named_buffers()))) + self.assertEqual( + len(list(m.named_buffers())), len(list(m_loaded.named_buffers())) + ) for m_b, loaded_b in zip(m.named_buffers(), m_loaded.named_buffers()): m_name, m_buffer = m_b loaded_name, loaded_buffer = loaded_b diff --git a/test/jit/test_symbolic_shape_analysis.py b/test/jit/test_symbolic_shape_analysis.py index cd25caa92b2bbe..e756cdb6788982 100644 --- a/test/jit/test_symbolic_shape_analysis.py +++ b/test/jit/test_symbolic_shape_analysis.py @@ -12,6 +12,7 @@ ) from torch.testing._internal.common_utils import make_tensor from torch.testing._internal.jit_utils import JitTestCase, execWrapper +from typing import List, Any if __name__ == '__main__': raise RuntimeError("This test file is not meant to be run directly, use:\n\n" @@ -498,3 +499,37 @@ def test_shape_function_includes(self): m2_shape = [20, 10] res = torch.jit._shapes.matmul(m1_shape, m2_shape) self.assertEqual(res, [10, 10]) + + def test_register_function_error_checking(self): + # this will error before registering on global map, so + # no issue in overwriting schema mappings + @torch.jit.script + def foo(x, y): + return x + y + + node = foo.graph.findNode("aten::add") + + @torch.jit.script + def wrong_input_types(x, y): + x: List[int] = [] + return x + with self.assertRaisesRegex(RuntimeError, "Expected supertype of int"): + torch._C._jit_register_shape_compute_graph_for_node(node, wrong_input_types.graph) + + @torch.jit.script + def wrong_output_types(x: List[int], y: List[int]): + x: List[Tensor] = [] + return x + + with self.assertRaisesRegex(RuntimeError, "but got graph_type"): + torch._C._jit_register_shape_compute_graph_for_node(node, wrong_output_types.graph) + + @torch.jit.script + def too_many_inputs(x: List[int], y: List[int], z: Any, z2: Any): + x: List[int] = [] + return x + + with self.assertRaises(RuntimeError) as error: + torch._C._jit_register_shape_compute_graph_for_node(node, too_many_inputs.graph) + + self.assertTrue("fewer arguments than schema" in str(error.exception)) diff --git a/test/jit/test_tensor_methods.py b/test/jit/test_tensor_methods.py new file mode 100644 index 00000000000000..c761a3884c9238 --- /dev/null +++ b/test/jit/test_tensor_methods.py @@ -0,0 +1,39 @@ +# Owner(s): ["oncall: jit"] + +import os +import sys + +import torch + +# Make the helper files in test/ importable +pytorch_test_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) +sys.path.append(pytorch_test_dir) +from torch.testing._internal.jit_utils import JitTestCase +from torch.testing import FileCheck + +if __name__ == "__main__": + raise RuntimeError( + 
"This test file is not meant to be run directly, use:\n\n" + "\tpython test/test_jit.py TESTNAME\n\n" + "instead." + ) + +class TestTensorMethods(JitTestCase): + def test_getitem(self): + def tensor_getitem(inp: torch.Tensor): + indices = torch.tensor([0, 2], dtype=torch.long) + return inp.__getitem__(indices) + + inp = torch.rand(3, 4) + self.checkScript(tensor_getitem, (inp, )) + + scripted = torch.jit.script(tensor_getitem) + FileCheck().check("aten::index").run(scripted.graph) + + def test_getitem_invalid(self): + def tensor_getitem_invalid(inp: torch.Tensor): + return inp.__getitem__() + + with self.assertRaisesRegexWithHighlight( + RuntimeError, "expected exactly 1 argument", "inp.__getitem__"): + torch.jit.script(tensor_getitem_invalid) diff --git a/test/jit/test_types.py b/test/jit/test_types.py index 9fadbedb272bb5..ca3da3c17c8cd1 100644 --- a/test/jit/test_types.py +++ b/test/jit/test_types.py @@ -39,7 +39,7 @@ def fn(x: torch.Tensor) -> Tuple[Tuple[torch.Tensor], Dict[str, int]]: expected = fn(x) scripted = torch.jit.script(fn)(x) - self.assertEquals(expected, scripted) + self.assertEqual(expected, scripted) def test_types_as_values(self): def fn(m: torch.Tensor) -> torch.device: diff --git a/test/lazy/__init__.py b/test/lazy/__init__.py new file mode 100644 index 00000000000000..e69de29bb2d1d6 diff --git a/test/lazy/test_bindings.py b/test/lazy/test_bindings.py new file mode 100644 index 00000000000000..57151d4085602b --- /dev/null +++ b/test/lazy/test_bindings.py @@ -0,0 +1,7 @@ +# Owner(s): ["oncall: jit"] + +import torch._lazy.metrics + +def test_metrics(): + names = torch._lazy.metrics.counter_names() + assert len(names) == 0, f"Expected no counter names, but got {names}" diff --git a/test/lazy/test_extract_compiled_graph.py b/test/lazy/test_extract_compiled_graph.py new file mode 100644 index 00000000000000..f4152d0af68bf3 --- /dev/null +++ b/test/lazy/test_extract_compiled_graph.py @@ -0,0 +1,195 @@ +# Owner(s): ["oncall: jit"] + +import unittest + +from torch._lazy.ts_backend import init as init_ts_backend +init_ts_backend() +from torch._lazy import config +from torch._lazy.extract_compiled_graph import extract_compiled_graph +import torch +from torch import nn +import dis +import inspect +from torch import fx +import re +from contextlib import contextmanager +import copy + +class ModuleConstScale(nn.Module): + def __init__(self): + super(ModuleConstScale, self).__init__() + + def forward(self, a): + return a * 2 + +class ModuleSub(nn.Module): + def __init__(self): + super(ModuleSub, self).__init__() + + def forward(self, a, b): + return a - b + +class ModuleAddcmul(nn.Module): + """ + addcmul function takes a at::Scalar which results in a special TSData containing a Scalar rather than a Tensor. + """ + def __init__(self): + super(ModuleAddcmul, self).__init__() + + def forward(self, a, b, c): + return torch.addcmul(a, b, c, value=5) + +class ModuleReturnMulti(nn.Module): + def __init__(self): + super(ModuleReturnMulti, self).__init__() + + def forward(self, a, b): + return (b + 1, a - 1) + +# The default fx tracer will convert torch.randn to a constant.. We may need +# a custom tracer. +# class ModuleEagerTensor(nn.Module): +# def __init__(self): +# super(ModuleEagerTensor, self).__init__() +# +# def forward(self, a): +# b = torch.randn(2, 3, device="cpu") # eager device +# return a + b + +# The module was planned to cover the case that a Fx graph return an eager +# tensor on the default device. 
It's harder than ModuleEagerTensor because +# we cannot just override the device argument to Lazy since there is no +# explicit device argument. +# +# Unfortunately, the default fx tracer converts the return value of the forward +# method to a constant. Commented out for now. +# class ModuleReturnEagerTensorOnDefaultDevice(nn.Module): +# def __init__(self): +# super(ModuleReturnEagerTensorOnDefaultDevice, self).__init__() +# +# def forward(self): +# return torch.tensor((2, 3), dtype=torch.float32) + +class ModuleReturnDupTensor(nn.Module): + """ + Handle the corner case where the same tensor appears multiple times in the + returned tuple. torchbench models like drq hit this corner case when running + through torchdynamo. + """ + def __init__(self): + super(ModuleReturnDupTensor, self).__init__() + + def forward(self, a, b): + c = a + b + return a - b, c, a + 1, c + +class ModuleInplaceUpdate(nn.Module): + def __init__(self): + super(ModuleInplaceUpdate, self).__init__() + + def forward(self, a, b): + a.sub_(b) + return b - 1, b + 1 + +@contextmanager +def force_fallback_ctx_mgr(fallback_op): + oldconfig = config.get_force_fallback() + config.set_force_fallback(fallback_op) + try: + yield None + finally: + config.set_force_fallback(oldconfig) + +@contextmanager +def nop_ctx_mgr(): + try: + yield None + finally: + pass + +def gen_rand_args(mod): + args = [] + for _ in range(len(inspect.signature(mod.forward).parameters)): + args.append(torch.randn(2, 3)) + return args + +def allclose(expected, actual): + def unwrap(cont): + if isinstance(cont, (list, tuple)) and len(cont) == 1: + return cont[0] + return cont + expected = unwrap(expected) + actual = unwrap(actual) + + if isinstance(expected, torch.Tensor) and isinstance(actual, torch.Tensor): + return torch.allclose(expected, actual) + elif isinstance(expected, (tuple, list)) and isinstance(actual, (tuple, list)): + return len(expected) == len(actual) and all(torch.allclose(a, b) for a, b in zip(expected, actual)) + else: + raise RuntimeError("Unexpected types") + +def verify_reusing_compiled_graph(mod, exception_msg_pattern, ncase=10): + args = gen_rand_args(mod) + out = mod(*args) + + dis.dis(mod.forward) + + try: + optimized_mod = extract_compiled_graph(fx.symbolic_trace(mod), args) + except RuntimeError as e: + if exception_msg_pattern is None: + raise e # reraise the exception + exception_message = str(e) + if not re.search(exception_msg_pattern, exception_message): + raise RuntimeError(f"Exception message does not match the required pattern: {exception_message}") + else: + # We are done for the test case that expects an exception + return + + if exception_msg_pattern is not None: + raise RuntimeError(f"Expected an exception matching pattern {exception_msg_pattern}") + print("return value of optimized_mod", optimized_mod(*args)) + + # check correctness + failed_index = [] + for i in range(ncase): + rand_args = gen_rand_args(mod) + rand_args_copy = copy.deepcopy(rand_args) + expected = mod(*rand_args) + actual = optimized_mod(*rand_args_copy) + + if not allclose(expected, actual): + print(f"Incorrect results. expected {expected}, actual {actual}") + failed_index.append(i) + continue + + # make sure arguments match after calling the model forward method to handle inplace + # updates. + if not allclose(rand_args, rand_args_copy): + print(f"Incorrect updated arguments.
expected {rand_args}, actual {rand_args_copy}") + failed_index.append(i) + continue + + if len(failed_index) > 0: + raise RuntimeError(f"Failed {len(failed_index)}/{ncase} cases") + +def maketest(module_cls, exception_msg_pattern=None, ctxmgr=None): + def wrapper(self): + nonlocal ctxmgr + if not ctxmgr: + ctxmgr = nop_ctx_mgr() + with ctxmgr: + verify_reusing_compiled_graph(module_cls(), exception_msg_pattern) + + return wrapper + +class OptimizeTest(unittest.TestCase): + test_sub = maketest(ModuleSub) + # Same as test_sub but force aten::sub to fall back. + # We expect an exception to be raised because of the LTC fallback. + test_ltc_fallback = maketest(ModuleSub, exception_msg_pattern="fallback.*aten::sub", ctxmgr=force_fallback_ctx_mgr("aten::sub")) + test_const_scale = maketest(ModuleConstScale) + test_addcmul = maketest(ModuleAddcmul) + test_return_multi = maketest(ModuleReturnMulti) + test_return_dup_tensor = maketest(ModuleReturnDupTensor) + test_inplace_update = maketest(ModuleInplaceUpdate) diff --git a/test/lazy/test_ts_opinfo.py b/test/lazy/test_ts_opinfo.py new file mode 100644 index 00000000000000..87f007b93a1e26 --- /dev/null +++ b/test/lazy/test_ts_opinfo.py @@ -0,0 +1,160 @@ +# Owner(s): ["oncall: jit"] + +from typing import Sequence +import torch +import functools + +from torch.testing._internal.common_utils import run_tests, TestCase +from torch.testing._internal.jit_utils import JitTestCase +from torch.testing._internal.common_methods_invocations import op_db +from torch.testing._internal.common_device_type import ops, instantiate_device_type_tests +import torch._lazy +import torch._lazy.metrics +import torch._lazy.ts_backend +import itertools +import yaml +import os +import pathlib + +torch._lazy.ts_backend.init() + +def get_test_device(): + return 'cuda' if 'LTC_TS_CUDA' in os.environ else 'cpu' + +def remove_suffixes(l): + return [x.split(".")[0] for x in l] + +def init_lists(): + path_to_script = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) + TS_NATIVE_FUNCTIONS_PATH = path_to_script.parent.parent / "aten/src/ATen/native/ts_native_functions.yaml" + with open(TS_NATIVE_FUNCTIONS_PATH) as f: + yaml_ts = yaml.load(f, yaml.Loader) + LAZY_OPS_LIST = set(remove_suffixes(itertools.chain(yaml_ts["full_codegen"], yaml_ts["supported"], yaml_ts["autograd"]))) + FALLBACK_LIST = set(["clamp"]) + SKIP_RUNTIME_ERROR_LIST = set([ + 'index_select', # Empty output_sizes is not supported + 'clone', # is clone decomposed? + 'all', # ASAN failure https://github.com/pytorch/pytorch/issues/74519 + 'any', # ASAN failure https://github.com/pytorch/pytorch/issues/74519 + 'logdet', # ASAN failure https://github.com/pytorch/pytorch/issues/74519 + ]) + SKIP_INCORRECT_RESULTS_LIST = set([ + 'squeeze', # Value out of range + 't', # Value out of range + 'transpose', # Value out of range + 'bernoulli', # incorrect results + 'pow', # incorrect results + 'addcdiv', # incorrect results (on CI not locally?)
+ ]) + + return (LAZY_OPS_LIST, FALLBACK_LIST, SKIP_RUNTIME_ERROR_LIST, SKIP_INCORRECT_RESULTS_LIST) + +(LAZY_OPS_LIST, FALLBACK_LIST, SKIP_RUNTIME_ERROR_LIST, SKIP_INCORRECT_RESULTS_LIST) = init_lists() + +torch.manual_seed(42) + +class TestLazyTensor(JitTestCase): + def testConvolutionBackward(self): + def clone_move(t): + dev = 'lazy' + copy_t = t.detach().clone().requires_grad_(True).to(device=dev) + return copy_t + + test_device = get_test_device() + inp = torch.rand(1, 3, 128, 128, device=test_device, requires_grad=True) + inp_copy = clone_move(inp) + grad = torch.rand(1, 32, 121, 121, device=test_device) # no requires_grad + grad_copy = clone_move(grad) + weight = torch.rand(32, 3, 8, 8, device=test_device, requires_grad=True) + weight_copy = clone_move(weight) + bias = torch.rand(32, device=test_device, requires_grad=True) + bias_copy = clone_move(bias) + + # run eager + conv_out = torch.nn.functional.conv2d(inp, weight, bias) + (inp_grad, weight_grad, bias_grad) = torch.autograd.grad([conv_out], [inp, weight, bias], [grad]) + + # run lazy + conv_copy_out = torch.nn.functional.conv2d(inp_copy, weight_copy, bias_copy) + (inp_copy_grad, weight_copy_grad, bias_copy_grad) = torch.autograd.grad( + [conv_copy_out], [inp_copy, weight_copy, bias_copy], [grad_copy]) + + # check numerics + torch.testing.assert_close(bias_copy_grad.cpu(), bias_grad.cpu()) + + torch.testing.assert_close(weight_copy_grad.cpu(), weight_grad.cpu()) + torch.testing.assert_close(inp_copy_grad.cpu(), inp_grad.cpu()) + +class TestLazyOpInfo(TestCase): + + @ops([op for op in op_db if op.name in LAZY_OPS_LIST and op.name not in SKIP_RUNTIME_ERROR_LIST], allowed_dtypes=(torch.float,)) + def test_dispatched_to_lazy(self, device, dtype, op): + def get_name(op): + l = [op.name] + if op.variant_test_name != '': + l.append(op.variant_test_name) + return '.'.join(l) + + global FALLBACK_LIST + samples = op.sample_inputs("lazy", dtype, requires_grad=False) + sample = list(samples)[0] + args = [sample.input] + list(sample.args) + kwargs = sample.kwargs + torch._lazy.mark_step() + torch._lazy.wait_device_ops() + torch._lazy.metrics.reset() + + r = op(*args, **kwargs) + torch._lazy.mark_step() + torch._lazy.wait_device_ops() + prefix = "aten" if op.name in FALLBACK_LIST else "lazy" + found = f"{prefix}::{op.name}" in remove_suffixes(torch._lazy.metrics.counter_names()) + # check aliases + if not found: + for alias in op.aliases: + alias_found = f"{prefix}::{alias.name}" in remove_suffixes(torch._lazy.metrics.counter_names()) + found = found or alias_found + if found: + break + self.assertTrue(found) + + + @ops([op for op in op_db if op.name in LAZY_OPS_LIST and op.name not in SKIP_RUNTIME_ERROR_LIST | SKIP_INCORRECT_RESULTS_LIST], allowed_dtypes=(torch.float,)) # noqa: B950 + def test_correctness(self, device, dtype, op): + + test_device = get_test_device() + + def clone_to_device(input, dev): + if isinstance(input, torch.Tensor): + return input.detach().clone().to(device=dev) + if isinstance(input, Sequence) and not isinstance(input, str): + return tuple(map(functools.partial(clone_to_device, dev=dev), input)) + return input + + def assert_allclose_rec(t): + a, b = t + self.assertEqual(type(a), type(b)) + if isinstance(a, torch.Tensor): + self.assertTrue(torch.allclose(clone_to_device(a, test_device), b, atol=1e-4)) + + if isinstance(a, Sequence): + # recurse eagerly; a bare map() would be lazy and never run the assertions + for pair in zip(a, b): + assert_allclose_rec(pair) + + samples = op.sample_inputs("lazy", dtype, requires_grad=False) + for sample in samples: + args = [sample.input] + list(sample.args) + kwargs =
sample.kwargs + copy_args = clone_to_device(args, test_device) + + r_exp = op(*copy_args, **kwargs) + r_actual = op(*args, **kwargs) + + assert_allclose_rec((r_actual, r_exp)) + +# TODO: after we move to master, add Lazy as a new Device here: +# https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L532 +instantiate_device_type_tests(TestLazyOpInfo, globals(), only_for="cpu") + + +if __name__ == '__main__': + run_tests() diff --git a/test/mobile/lightweight_dispatch/test_codegen_unboxing.cpp b/test/mobile/lightweight_dispatch/test_codegen_unboxing.cpp index 2c0002505554b7..07a845d6008ba0 100644 --- a/test/mobile/lightweight_dispatch/test_codegen_unboxing.cpp +++ b/test/mobile/lightweight_dispatch/test_codegen_unboxing.cpp @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -190,6 +191,29 @@ TEST(LiteInterpreterTest, DivideTensor) { AT_ASSERT(result_1.toList().get(0).toTensor().equal(expected_1)); AT_ASSERT(result_1.toList().get(1).toTensor().equal(expected_2)); } + +TEST(LiteInterpreterTest, MultipleOps) { + // Load check in model: multiple_ops.ptl + auto testModelFile = "multiple_ops.ptl"; + + // class Model(torch.nn.Module): + // def __init__(self): + // super(Model, self).__init__() + // self.ops = torch.nn.Sequential( + // torch.nn.ReLU(), + // torch.nn.Flatten(), + // ) + // def forward(self, x): + // x[1] = -2 + // return self.ops(x) + + Module bc = _load_for_mobile(testModelFile); + auto b = at::ones({2, 2, 2, 2}); + const auto result = bc.forward({b}); + + at::Tensor expected = torch::tensor({{1, 1, 1, 1, 1, 1, 1, 1}, {0, 0, 0, 0, 0, 0, 0, 0}}, c10::TensorOptions(c10::ScalarType::Float)); + AT_ASSERT(result.toTensor().equal(expected)); +} } // namespace mobile } // namespace jit } // namespace torch diff --git a/test/mobile/lightweight_dispatch/tests_setup.py b/test/mobile/lightweight_dispatch/tests_setup.py index 8b1fd6f72998e2..91af29796b9d9e 100644 --- a/test/mobile/lightweight_dispatch/tests_setup.py +++ b/test/mobile/lightweight_dispatch/tests_setup.py @@ -150,6 +150,28 @@ def forward(self, b): script_model._save_for_lite_interpreter(self.path) +class ModelWithMultipleOps(FileSetup): + path = 'multiple_ops.ptl' + + def setup(self): + class Model(torch.nn.Module): + def __init__(self): + super(Model, self).__init__() + self.ops = torch.nn.Sequential( + torch.nn.ReLU(), + torch.nn.Flatten(), + ) + + def forward(self, x): + x[1] = -2 + return self.ops(x) + + model = Model() + # Script the model and save + script_model = torch.jit.script(model) + script_model._save_for_lite_interpreter(self.path) + + tests = [ ModelWithDTypeDeviceLayoutPinMemory(), ModelWithTensorOptional(), @@ -159,6 +181,7 @@ def forward(self, b): ModelWithArrayOfInt(), ModelWithTensors(), ModelWithStringOptional(), + ModelWithMultipleOps(), ] diff --git a/test/mobile/model_test/android_api_module.py b/test/mobile/model_test/android_api_module.py new file mode 100644 index 00000000000000..109e3aa963e8f4 --- /dev/null +++ b/test/mobile/model_test/android_api_module.py @@ -0,0 +1,128 @@ +from typing import Dict, List, Tuple, Optional + +import torch +from torch import Tensor + + +class AndroidAPIModule(torch.jit.ScriptModule): + def __init__(self): + super(AndroidAPIModule, self).__init__() + + @torch.jit.script_method + def forward(self, input): + return None + + @torch.jit.script_method + def eqBool(self, input: bool) -> bool: + return input + + @torch.jit.script_method + def eqInt(self, input: int) -> int: + return input + + @torch.jit.script_method + 
def eqFloat(self, input: float) -> float: + return input + + @torch.jit.script_method + def eqStr(self, input: str) -> str: + return input + + @torch.jit.script_method + def eqTensor(self, input: Tensor) -> Tensor: + return input + + @torch.jit.script_method + def eqDictStrKeyIntValue(self, input: Dict[str, int]) -> Dict[str, int]: + return input + + @torch.jit.script_method + def eqDictIntKeyIntValue(self, input: Dict[int, int]) -> Dict[int, int]: + return input + + @torch.jit.script_method + def eqDictFloatKeyIntValue(self, input: Dict[float, int]) -> Dict[float, int]: + return input + + @torch.jit.script_method + def listIntSumReturnTuple(self, input: List[int]) -> Tuple[List[int], int]: + sum = 0 + for x in input: + sum += x + return (input, sum) + + @torch.jit.script_method + def listBoolConjunction(self, input: List[bool]) -> bool: + res = True + for x in input: + res = res and x + return res + + @torch.jit.script_method + def listBoolDisjunction(self, input: List[bool]) -> bool: + res = False + for x in input: + res = res or x + return res + + @torch.jit.script_method + def tupleIntSumReturnTuple( + self, input: Tuple[int, int, int] + ) -> Tuple[Tuple[int, int, int], int]: + sum = 0 + for x in input: + sum += x + return (input, sum) + + @torch.jit.script_method + def optionalIntIsNone(self, input: Optional[int]) -> bool: + return input is None + + @torch.jit.script_method + def intEq0None(self, input: int) -> Optional[int]: + if input == 0: + return None + return input + + @torch.jit.script_method + def str3Concat(self, input: str) -> str: + return input + input + input + + @torch.jit.script_method + def newEmptyShapeWithItem(self, input): + return torch.tensor([int(input.item())])[0] + + @torch.jit.script_method + def testAliasWithOffset(self) -> List[Tensor]: + x = torch.tensor([100, 200]) + a = [x[0], x[1]] + return a + + @torch.jit.script_method + def testNonContiguous(self): + x = torch.tensor([100, 200, 300])[::2] + assert not x.is_contiguous() + assert x[0] == 100 + assert x[1] == 300 + return x + + @torch.jit.script_method + def conv2d(self, x: Tensor, w: Tensor, toChannelsLast: bool) -> Tensor: + r = torch.nn.functional.conv2d(x, w) + if toChannelsLast: + r = r.contiguous(memory_format=torch.channels_last) + else: + r = r.contiguous() + return r + + @torch.jit.script_method + def contiguous(self, x: Tensor) -> Tensor: + return x.contiguous() + + @torch.jit.script_method + def contiguousChannelsLast(self, x: Tensor) -> Tensor: + return x.contiguous(memory_format=torch.channels_last) + + @torch.jit.script_method + def contiguousChannelsLast3d(self, x: Tensor) -> Tensor: + return x.contiguous(memory_format=torch.channels_last_3d) diff --git a/test/mobile/model_test/builtin_ops.py b/test/mobile/model_test/builtin_ops.py new file mode 100644 index 00000000000000..75b57f7b0613d8 --- /dev/null +++ b/test/mobile/model_test/builtin_ops.py @@ -0,0 +1,125 @@ +import torch + + +# https://pytorch.org/docs/stable/jit_builtin_functions.html#builtin-functions + + +class TSBuiltinOpsModule(torch.nn.Module): + def __init__(self): + super(TSBuiltinOpsModule, self).__init__() + + def forward(self): + x = torch.tensor(1) + y = torch.tensor(0.5) + b = float(1) + s = "abcde" + l = ["1", "2", "test", "a{}b"] + d = {"key": 1} + d2 = {0: 100} + return len( + # type + bool(x), + bool(x.item()), + int(y), + int(y.item()), + float(x), + float(x.item()), + # math + x & x, + bool(x) & bool(x), + int(x) & int(x), + x | x, + bool(x) | bool(x), + int(x) | int(x), + x << x, + int(x) << int(x), + x >> x, + 
int(x) >> int(x), + x ^ x, + bool(x) ^ bool(x), + int(x) ^ int(x), + b * float(x), + b * int(x), + b + float(x), + b - float(x), + x.item() + y.item(), + x.item() - y.item(), + x.item() * y.item(), + x.item() / y.item(), + float(x) < float(y), + float(x) <= float(y), + float(x) > float(y), + float(x) > int(y), + float(x) >= float(y), + float(x) >= int(y), + float(x) == float(y), + float(x) == int(y), + float(x) != float(y), + int(x) != float(y), + float(x) / float(y), + int(x) / int(y), + max(x), + max(x.item(), y.item()), + max(int(x), int(y)), + max(float(x), float(y)), + min(x), + min(x.item(), y.item()), + min(int(x), int(y)), + min(float(x), float(y)), + int(l[0]), + float(l[0]), + # string + str(torch.tensor(1)), + l[2].find("t"), + l[2].replace("t", "x"), + l[2].lower(), + l[2].startswith("t"), + l[2].split("t"), + l[2].strip(), + l[2].rstrip(), + l[2].lstrip(), + l[2][slice(2)], + l[3].format("x"), + ord(l[2][0]), + len(torch.randn(3)), + len(l), + len(l[2]), + len(d), + len(d2), + ) + + +class TSCollectionOpsModule(torch.nn.Module): + def __init__(self): + super(TSCollectionOpsModule, self).__init__() + + def forward(self): + s = "abcde" + # list + l = ["1", "2", "test"] + l.reverse() + l.reverse() + l[1] = "3" + l.extend(["4"]) + # str dict + d = {"key": 1} + d.clear() + d.update({"key": 0}) + if "key" in d: + d["key"] = 2 + # int dict + d2 = {0: 100} + if 0 in d2: + d2.clear() + d2[0] = 100 + + return len( + s[torch.tensor(1)], + d["key"], + d2[0], + d.keys(), + d.items(), + d.values(), + d2.values(), + l.pop(), + ) diff --git a/test/mobile/model_test/coverage.yaml b/test/mobile/model_test/coverage.yaml new file mode 100644 index 00000000000000..5433fea4df1020 --- /dev/null +++ b/test/mobile/model_test/coverage.yaml @@ -0,0 +1,1094 @@ +_coverage: 87.53 +_covered_ops: 344 +_generated_ops: 693 +_production_ops: 393 +_uncovered_ops: 49 +all_generated_ops: +- aten::Bool.Tensor +- aten::Bool.int +- aten::Float.Scalar +- aten::Float.Tensor +- aten::Float.str +- aten::FloatImplicit +- aten::Int.Scalar +- aten::Int.Tensor +- aten::Int.float +- aten::Int.str +- aten::IntImplicit +- aten::ScalarImplicit +- aten::__and__.Tensor +- aten::__and__.bool +- aten::__and__.int +- aten::__contains__.int +- aten::__contains__.int_list +- aten::__contains__.str +- aten::__contains__.str_list +- aten::__derive_index +- aten::__getitem__.str +- aten::__getitem__.t +- aten::__lshift__.Tensor +- aten::__lshift__.int +- aten::__or__.Tensor +- aten::__or__.bool +- aten::__or__.int +- aten::__range_length +- aten::__rshift__.Tensor +- aten::__rshift__.int +- aten::__xor__.Tensor +- aten::__xor__.bool +- aten::__xor__.int +- aten::_infer_size +- aten::_set_item.int +- aten::_set_item.str +- aten::_set_item.t +- aten::_shape_as_tensor +- aten::_unique2 +- aten::abs +- aten::acos +- aten::acosh +- aten::adaptive_avg_pool1d +- aten::adaptive_avg_pool2d +- aten::adaptive_avg_pool3d +- aten::adaptive_max_pool1d +- aten::adaptive_max_pool2d +- aten::adaptive_max_pool3d +- aten::add +- aten::add.Scalar +- aten::add.Tensor +- aten::add.float +- aten::add.int +- aten::add.out +- aten::add.str +- aten::add.t +- aten::add_.Scalar +- aten::add_.Tensor +- aten::add_.t +- aten::addbmm +- aten::addcdiv +- aten::addcmul +- aten::addmm +- aten::addmv +- aten::addr +- aten::all +- aten::allclose +- aten::alpha_dropout +- aten::alpha_dropout_ +- aten::amax +- aten::amin +- aten::aminmax +- aten::angle +- aten::any +- aten::append.t +- aten::arange +- aten::arange.start +- aten::arange.start_step +- aten::argmax +- 
aten::argmin +- aten::argsort +- aten::as_strided +- aten::as_tensor.list +- aten::asin +- aten::asinh +- aten::atan +- aten::atan2 +- aten::atanh +- aten::atleast_1d +- aten::atleast_2d +- aten::atleast_3d +- aten::avg_pool1d +- aten::avg_pool2d +- aten::avg_pool3d +- aten::baddbmm +- aten::bartlett_window +- aten::batch_norm +- aten::bernoulli +- aten::bernoulli_.float +- aten::bilinear +- aten::binary_cross_entropy +- aten::binary_cross_entropy_with_logits +- aten::bincount +- aten::bitwise_and.Tensor +- aten::bitwise_not +- aten::bitwise_or.Tensor +- aten::bitwise_xor.Tensor +- aten::blackman_window +- aten::block_diag +- aten::bmm +- aten::broadcast_tensors +- aten::broadcast_to +- aten::bucketize.Tensor +- aten::cartesian_prod +- aten::cat +- aten::cauchy_ +- aten::cdist +- aten::ceil +- aten::ceil.Scalar +- aten::ceil.float +- aten::celu +- aten::chain_matmul +- aten::channel_shuffle +- aten::chunk +- aten::clamp +- aten::clamp_ +- aten::clamp_min +- aten::clear.int +- aten::clear.str +- aten::clone +- aten::coalesce +- aten::col2im +- aten::column_stack +- aten::combinations +- aten::complex +- aten::conj +- aten::constant_pad_nd +- aten::contiguous +- aten::conv1d +- aten::conv2d +- aten::conv3d +- aten::conv_transpose1d +- aten::conv_transpose2d.input +- aten::conv_transpose3d.input +- aten::copy_ +- aten::copy_.float +- aten::copy_.int +- aten::copysign.Scalar +- aten::copysign.Tensor +- aten::corrcoef +- aten::cos +- aten::cosh +- aten::cosine_embedding_loss +- aten::cosine_similarity +- aten::count_nonzero +- aten::cpu +- aten::cross +- aten::cross_entropy_loss +- aten::ctc_loss.Tensor +- aten::cummax +- aten::cummin +- aten::cumprod +- aten::cumsum +- aten::cumulative_trapezoid.x +- aten::deg2rad +- aten::dense_dim +- aten::dequantize.self +- aten::detach +- aten::detach_ +- aten::diag +- aten::diag_embed +- aten::diagflat +- aten::diagonal +- aten::diagonal_scatter +- aten::diff +- aten::digamma +- aten::dist +- aten::div +- aten::div.Scalar +- aten::div.Tensor +- aten::div.Tensor_mode +- aten::div.float +- aten::div.int +- aten::div_.Tensor +- aten::dot +- aten::dropout +- aten::dropout_ +- aten::dsplit.array +- aten::dstack +- aten::einsum +- aten::element_size +- aten::elu +- aten::embedding +- aten::embedding_bag.padding_idx +- aten::empty.memory_format +- aten::empty_like +- aten::empty_strided +- aten::eq.Scalar +- aten::eq.Tensor +- aten::eq.float +- aten::eq.float_int +- aten::eq.int +- aten::eq.int_list +- aten::eq.str +- aten::equal +- aten::erf +- aten::erfc +- aten::erfinv +- aten::exp +- aten::exp.float +- aten::exp2 +- aten::expand +- aten::expand_as +- aten::expm1 +- aten::exponential_ +- aten::extend.t +- aten::eye +- aten::fake_quantize_per_channel_affine +- aten::fake_quantize_per_tensor_affine +- aten::feature_alpha_dropout +- aten::feature_alpha_dropout_ +- aten::feature_dropout +- aten::feature_dropout_ +- aten::fill_.Scalar +- aten::fill_diagonal_ +- aten::find +- aten::flatten.using_ints +- aten::flip +- aten::fliplr +- aten::flipud +- aten::float_power.Tensor_Scalar +- aten::float_power.Tensor_Tensor +- aten::floor +- aten::floor.float +- aten::floor_divide +- aten::floor_divide.Scalar +- aten::floordiv.int +- aten::fmax +- aten::fmin +- aten::fmod.Scalar +- aten::frac +- aten::fractional_max_pool2d +- aten::fractional_max_pool3d +- aten::frobenius_norm.dim +- aten::frobenius_norm.out +- aten::full +- aten::full_like +- aten::gather +- aten::gcd +- aten::ge.Scalar +- aten::ge.Tensor +- aten::ge.float +- aten::ge.float_int +- aten::ge.int +- aten::gelu 
+- aten::geometric_ +- aten::glu +- aten::grid_sampler +- aten::group_norm +- aten::gru.input +- aten::gru_cell +- aten::gt.Scalar +- aten::gt.Tensor +- aten::gt.float +- aten::gt.float_int +- aten::gt.int +- aten::hamming_window +- aten::hann_window +- aten::hardshrink +- aten::hardsigmoid +- aten::hardsigmoid_ +- aten::hardswish +- aten::hardswish_ +- aten::hardtanh +- aten::hardtanh_ +- aten::heaviside +- aten::hinge_embedding_loss +- aten::histc +- aten::histogram.bin_ct +- aten::hsplit.array +- aten::hstack +- aten::huber_loss +- aten::hypot +- aten::i0 +- aten::igamma +- aten::igammac +- aten::im2col +- aten::imag +- aten::index.Tensor +- aten::index_fill.int_Scalar +- aten::index_put.hacked_twin +- aten::index_put_.hacked_twin +- aten::index_select +- aten::inner +- aten::instance_norm +- aten::is_coalesced +- aten::is_complex +- aten::is_conj +- aten::is_contiguous +- aten::is_floating_point +- aten::is_leaf +- aten::is_nonzero +- aten::is_pinned +- aten::is_set_to +- aten::is_signed +- aten::isclose +- aten::isfinite +- aten::isin.Tensor_Tensor +- aten::isinf +- aten::isnan +- aten::isneginf +- aten::isposinf +- aten::isreal +- aten::istft +- aten::item +- aten::items.str +- aten::kaiser_window +- aten::keys.str +- aten::kl_div +- aten::kron +- aten::kthvalue +- aten::l1_loss +- aten::layer_norm +- aten::lcm +- aten::ldexp.Tensor +- aten::le.Scalar +- aten::le.Tensor +- aten::le.float +- aten::le.int +- aten::leaky_relu +- aten::leaky_relu_ +- aten::len.Dict_int +- aten::len.Dict_str +- aten::len.Tensor +- aten::len.str +- aten::len.t +- aten::lerp.Scalar +- aten::lerp.Tensor +- aten::lgamma +- aten::linalg_matrix_exp +- aten::linalg_matrix_power +- aten::linear +- aten::linspace +- aten::list.t +- aten::log +- aten::log10 +- aten::log1p +- aten::log2 +- aten::log_normal_ +- aten::log_sigmoid +- aten::log_softmax.int +- aten::logaddexp +- aten::logaddexp2 +- aten::logcumsumexp +- aten::logical_and +- aten::logical_and.out +- aten::logical_not +- aten::logical_not.out +- aten::logical_or +- aten::logical_or.out +- aten::logical_xor +- aten::logical_xor.out +- aten::logit +- aten::logspace +- aten::logsumexp +- aten::lower +- aten::lstm.input +- aten::lstm_cell +- aten::lstrip +- aten::lt.Scalar +- aten::lt.Tensor +- aten::lt.float +- aten::lt.int +- aten::margin_ranking_loss +- aten::masked_fill.Scalar +- aten::masked_fill_.Scalar +- aten::masked_select +- aten::matmul +- aten::max +- aten::max.dim +- aten::max.other +- aten::max_pool1d +- aten::max_pool2d +- aten::max_pool3d +- aten::maximum +- aten::mean +- aten::mean.dim +- aten::median +- aten::meshgrid +- aten::meshgrid.indexing +- aten::min +- aten::min.dim +- aten::min.other +- aten::minimum +- aten::mish +- aten::mm +- aten::mode +- aten::movedim.int +- aten::mse_loss +- aten::msort +- aten::mul +- aten::mul.Scalar +- aten::mul.Tensor +- aten::mul.float +- aten::mul.float_int +- aten::mul.int +- aten::mul.int_float +- aten::mul.left_t +- aten::mul.out +- aten::mul_.Scalar +- aten::mul_.Tensor +- aten::multi_margin_loss +- aten::multilabel_margin_loss +- aten::multinomial +- aten::mv +- aten::mvlgamma +- aten::nan_to_num +- aten::nan_to_num_ +- aten::nanmean +- aten::nanmedian +- aten::nanquantile +- aten::nansum +- aten::narrow +- aten::ne.Scalar +- aten::ne.Tensor +- aten::ne.float +- aten::ne.int +- aten::ne.int_float +- aten::ne.int_list +- aten::ne.str +- aten::neg +- aten::neg.int +- aten::new_empty +- aten::new_full +- aten::new_ones +- aten::new_zeros +- aten::nll_loss_nd +- aten::nonzero +- aten::norm.Scalar +- 
aten::norm.ScalarOpt_dim +- aten::norm.ScalarOpt_dim_dtype +- aten::norm.dtype_out +- aten::norm.out +- aten::normal.float_float +- aten::normal_ +- aten::nuclear_norm +- aten::nuclear_norm.dim +- aten::nuclear_norm.dim_out +- aten::nuclear_norm.out +- aten::numel +- aten::one_hot +- aten::ones +- aten::ones_like +- aten::ord +- aten::outer +- aten::pad_sequence +- aten::pairwise_distance +- aten::pdist +- aten::permute +- aten::pixel_shuffle +- aten::pixel_unshuffle +- aten::poisson +- aten::poisson_nll_loss +- aten::polar +- aten::polygamma +- aten::pop.t +- aten::pow.Tensor_Scalar +- aten::pow.Tensor_Tensor +- aten::pow.int_float +- aten::prelu +- aten::prod +- aten::quantile +- aten::quantile.scalar +- aten::quantize_per_channel +- aten::quantize_per_tensor +- aten::quantize_per_tensor.tensor_qparams +- aten::quantized_gru.input +- aten::quantized_lstm.input +- aten::rad2deg +- aten::rand +- aten::rand_like +- aten::randint +- aten::randint.low +- aten::randint_like +- aten::randn +- aten::randn_like +- aten::random_ +- aten::randperm +- aten::range.step +- aten::ravel +- aten::real +- aten::reciprocal +- aten::reflection_pad1d +- aten::reflection_pad2d +- aten::reflection_pad3d +- aten::relu +- aten::relu_ +- aten::remainder.Scalar +- aten::remainder.int +- aten::renorm +- aten::repeat +- aten::repeat_interleave.Tensor +- aten::replace +- aten::replication_pad1d +- aten::replication_pad2d +- aten::replication_pad3d +- aten::requires_grad_ +- aten::reshape +- aten::resize_as_ +- aten::resolve_conj +- aten::resolve_neg +- aten::reverse.t +- aten::rnn_tanh.input +- aten::rnn_tanh_cell +- aten::roll +- aten::rot90 +- aten::round +- aten::round.Scalar +- aten::rrelu +- aten::rsqrt +- aten::rstrip +- aten::scatter.src +- aten::scatter_.src +- aten::scatter_add +- aten::scatter_add_ +- aten::searchsorted.Tensor +- aten::select.int +- aten::select_scatter +- aten::selu +- aten::sgn +- aten::sigmoid +- aten::sign +- aten::signbit +- aten::silu +- aten::sin +- aten::sinc +- aten::sinh +- aten::size +- aten::size.int +- aten::slice.Tensor +- aten::slice.str +- aten::slice.t +- aten::slice_scatter +- aten::smooth_l1_loss +- aten::soft_margin_loss +- aten::softmax.int +- aten::softplus +- aten::softshrink +- aten::sort +- aten::split +- aten::split.Tensor +- aten::split.str +- aten::sqrt +- aten::sqrt.int +- aten::square +- aten::squeeze.dim +- aten::squeeze_.dim +- aten::stack +- aten::startswith +- aten::std +- aten::std_mean +- aten::stft +- aten::str +- aten::strip +- aten::sub +- aten::sub.Scalar +- aten::sub.Tensor +- aten::sub.float +- aten::sub.int +- aten::sub_.Tensor +- aten::sum +- aten::sum.dim_IntList +- aten::sum.int +- aten::t +- aten::take +- aten::take_along_dim +- aten::tan +- aten::tanh +- aten::tensor +- aten::tensor.float +- aten::tensor.int +- aten::tensor_split.indices +- aten::tensor_split.sections +- aten::tensordot +- aten::tensordot.out +- aten::tile +- aten::to.device +- aten::to.dtype +- aten::to.dtype_layout +- aten::to.prim_Device +- aten::topk +- aten::trace +- aten::transpose.int +- aten::trapezoid.x +- aten::trapz.x +- aten::tril +- aten::tril_indices +- aten::triplet_margin_loss +- aten::triu +- aten::triu_indices +- aten::trunc +- aten::trunc_ +- aten::type_as +- aten::unbind.int +- aten::unflatten.int +- aten::unfold +- aten::uniform_ +- aten::unique_consecutive +- aten::unique_dim +- aten::unsqueeze +- aten::unsqueeze_ +- aten::update.str +- aten::upsample_bicubic2d.vec +- aten::upsample_bilinear2d.vec +- aten::upsample_linear1d.vec +- 
aten::upsample_nearest1d.vec +- aten::upsample_nearest2d.vec +- aten::upsample_nearest3d.vec +- aten::upsample_trilinear3d.vec +- aten::values.int +- aten::values.str +- aten::vander +- aten::var +- aten::var_mean +- aten::vdot +- aten::view +- aten::view_as +- aten::view_as_complex +- aten::view_as_real +- aten::vsplit.array +- aten::vstack +- aten::where +- aten::where.ScalarOther +- aten::where.self +- aten::xlogy.Scalar_Other +- aten::xlogy.Scalar_Self +- aten::xlogy.Tensor +- aten::zeros +- aten::zeros.out +- aten::zeros_like +- prepacked::conv2d_clamp_run +- prepacked::linear_clamp_run +- prim::TupleUnpack +- prim::is_meta +- prim::is_quantized +- prim::is_sparse +- prim::max +- prim::max.float +- prim::max.int +- prim::max.self_int +- prim::min +- prim::min.float +- prim::min.int +- prim::min.self_int +- prim::unchecked_cast +- quantized::add +- quantized::add_relu +- quantized::add_scalar +- quantized::batch_norm2d +- quantized::batch_norm3d +- quantized::cat +- quantized::conv1d +- quantized::conv1d_prepack +- quantized::conv1d_relu +- quantized::conv1d_unpack +- quantized::conv2d.new +- quantized::conv2d_prepack +- quantized::conv2d_relu.new +- quantized::conv2d_unpack +- quantized::conv3d.new +- quantized::conv3d_prepack +- quantized::conv3d_relu.new +- quantized::conv3d_unpack +- quantized::conv_transpose1d +- quantized::conv_transpose1d_prepack +- quantized::conv_transpose1d_unpack +- quantized::conv_transpose2d +- quantized::conv_transpose2d_prepack +- quantized::conv_transpose3d_prepack +- quantized::embedding_4bit +- quantized::embedding_byte +- quantized::hardswish +- quantized::instance_norm +- quantized::leaky_relu +- quantized::linear +- quantized::linear_dynamic +- quantized::linear_dynamic_fp16 +- quantized::linear_relu +- quantized::mul +- quantized::mul_scalar +- quantized::quantized_gru_cell_dynamic +- quantized::quantized_lstm_cell_dynamic +- quantized::quantized_rnn_tanh_cell_dynamic +covered_ops: + aten::Bool.Tensor: 19 + aten::Bool.int: 7 + aten::Float.Scalar: 18 + aten::Float.Tensor: 11 + aten::Float.str: 6 + aten::FloatImplicit: 2 + aten::Int.Scalar: 19 + aten::Int.Tensor: 35 + aten::Int.float: 6 + aten::Int.str: 12 + aten::IntImplicit: 11 + aten::ScalarImplicit: 3 + aten::__and__.Tensor: 13 + aten::__and__.bool: 11 + aten::__and__.int: 2 + aten::__contains__.int: 5 + aten::__contains__.int_list: 17 + aten::__contains__.str: 22 + aten::__contains__.str_list: 5 + aten::__derive_index: 24 + aten::__getitem__.str: 20 + aten::__getitem__.t: 178 + aten::__lshift__.int: 2 + aten::__range_length: 23 + aten::__rshift__.int: 2 + aten::__xor__.bool: 10 + aten::_infer_size: 7 + aten::_set_item.int: 7 + aten::_set_item.str: 163 + aten::_set_item.t: 8 + aten::_shape_as_tensor: 10 + aten::adaptive_avg_pool1d: 1 + aten::adaptive_avg_pool2d: 33 + aten::adaptive_avg_pool3d: 1 + aten::add.Scalar: 33 + aten::add.Tensor: 63 + aten::add.float: 5 + aten::add.int: 49 + aten::add.out: 2 + aten::add.str: 29 + aten::add.t: 11 + aten::add_.Scalar: 15 + aten::add_.Tensor: 29 + aten::addcmul: 2 + aten::addmm: 7 + aten::all: 6 + aten::allclose: 1 + aten::any: 14 + aten::append.t: 59 + aten::arange: 16 + aten::arange.start: 6 + aten::arange.start_step: 16 + aten::argmax: 2 + aten::as_strided: 10 + aten::as_tensor.list: 4 + aten::atan: 4 + aten::avg_pool1d: 6 + aten::avg_pool2d: 7 + aten::batch_norm: 15 + aten::binary_cross_entropy: 15 + aten::binary_cross_entropy_with_logits: 3 + aten::bitwise_not: 13 + aten::bmm: 16 + aten::broadcast_tensors: 1 + aten::cat: 90 + aten::ceil: 3 + 
aten::ceil.float: 7 + aten::chunk: 19 + aten::clamp: 36 + aten::clamp_: 12 + aten::clamp_min: 3 + aten::clear.str: 2 + aten::clone: 26 + aten::coalesce: 2 + aten::conj: 1 + aten::constant_pad_nd: 17 + aten::contiguous: 113 + aten::conv1d: 12 + aten::conv2d: 10 + aten::conv_transpose2d.input: 5 + aten::copy_: 15 + aten::copy_.int: 1 + aten::cos: 4 + aten::count_nonzero: 4 + aten::ctc_loss.Tensor: 1 + aten::cumsum: 13 + aten::dequantize.self: 30 + aten::detach: 34 + aten::div: 9 + aten::div.Scalar: 8 + aten::div.Tensor: 71 + aten::div.Tensor_mode: 7 + aten::div.float: 3 + aten::div.int: 7 + aten::div_.Tensor: 7 + aten::dropout: 41 + aten::embedding: 16 + aten::embedding_bag.padding_idx: 2 + aten::empty.memory_format: 11 + aten::empty_like: 11 + aten::empty_strided: 3 + aten::eq.Scalar: 24 + aten::eq.Tensor: 6 + aten::eq.int: 57 + aten::eq.int_list: 20 + aten::eq.str: 43 + aten::exp: 18 + aten::exp.float: 4 + aten::expand: 26 + aten::expand_as: 3 + aten::extend.t: 38 + aten::feature_dropout: 1 + aten::fill_.Scalar: 17 + aten::find: 3 + aten::flatten.using_ints: 45 + aten::flip: 1 + aten::floor: 5 + aten::floor.float: 2 + aten::floor_divide: 4 + aten::floor_divide.Scalar: 7 + aten::floordiv.int: 21 + aten::full: 10 + aten::full_like: 10 + aten::gather: 10 + aten::ge.Scalar: 4 + aten::ge.Tensor: 6 + aten::ge.int: 29 + aten::gelu: 12 + aten::glu: 18 + aten::grid_sampler: 3 + aten::gt.Scalar: 16 + aten::gt.float: 16 + aten::gt.float_int: 3 + aten::gt.int: 52 + aten::hardsigmoid: 3 + aten::hardsigmoid_: 2 + aten::hardswish_: 4 + aten::hardtanh: 3 + aten::hardtanh_: 3 + aten::hstack: 2 + aten::index.Tensor: 23 + aten::index_fill.int_Scalar: 15 + aten::index_select: 31 + aten::is_coalesced: 2 + aten::is_floating_point: 9 + aten::isnan: 1 + aten::item: 40 + aten::items.str: 3 + aten::keys.str: 15 + aten::layer_norm: 26 + aten::le.Scalar: 1 + aten::le.Tensor: 10 + aten::le.float: 2 + aten::le.int: 17 + aten::leaky_relu: 1 + aten::leaky_relu_: 5 + aten::len.Dict_int: 5 + aten::len.Tensor: 19 + aten::len.str: 23 + aten::len.t: 177 + aten::linear: 46 + aten::linspace: 3 + aten::list.t: 24 + aten::log: 18 + aten::log10: 4 + aten::log1p: 5 + aten::log_softmax.int: 31 + aten::logical_and: 1 + aten::logical_not: 10 + aten::logit: 7 + aten::lower: 10 + aten::lstm.input: 4 + aten::lt.Scalar: 8 + aten::lt.Tensor: 1 + aten::lt.float: 16 + aten::lt.int: 46 + aten::masked_fill.Scalar: 16 + aten::matmul: 12 + aten::max: 18 + aten::max.dim: 30 + aten::max.other: 7 + aten::max_pool2d: 10 + aten::maximum: 4 + aten::mean: 10 + aten::mean.dim: 16 + aten::meshgrid.indexing: 2 + aten::min: 2 + aten::min.dim: 4 + aten::min.other: 17 + aten::minimum: 4 + aten::mse_loss: 1 + aten::mul.Scalar: 26 + aten::mul.Tensor: 90 + aten::mul.float: 5 + aten::mul.float_int: 3 + aten::mul.int: 26 + aten::mul.int_float: 4 + aten::mul.left_t: 15 + aten::mul.out: 1 + aten::mul_.Scalar: 11 + aten::mul_.Tensor: 5 + aten::nan_to_num: 3 + aten::nan_to_num_: 10 + aten::narrow: 10 + aten::ne.Scalar: 14 + aten::ne.Tensor: 5 + aten::ne.int: 44 + aten::ne.int_float: 2 + aten::ne.int_list: 20 + aten::ne.str: 3 + aten::neg: 29 + aten::neg.int: 19 + aten::new_zeros: 6 + aten::nll_loss_nd: 3 + aten::nonzero: 4 + aten::norm.Scalar: 1 + aten::norm.ScalarOpt_dim: 4 + aten::numel: 8 + aten::one_hot: 2 + aten::ones: 38 + aten::ones_like: 16 + aten::ord: 20 + aten::permute: 43 + aten::pop.t: 7 + aten::pow.Tensor_Scalar: 3 + aten::pow.int_float: 2 + aten::quantile.scalar: 1 + aten::quantize_per_tensor: 66 + aten::quantize_per_tensor.tensor_qparams: 1 + 
aten::rand: 25 + aten::randint.low: 2 + aten::randn_like: 17 + aten::reciprocal: 1 + aten::reflection_pad2d: 1 + aten::relu: 82 + aten::relu_: 9 + aten::remainder.Scalar: 2 + aten::remainder.int: 22 + aten::repeat: 16 + aten::replace: 1 + aten::replication_pad1d: 1 + aten::replication_pad2d: 2 + aten::replication_pad3d: 1 + aten::requires_grad_: 4 + aten::reshape: 36 + aten::resize_as_: 1 + aten::resolve_conj: 1 + aten::resolve_neg: 1 + aten::reverse.t: 2 + aten::round.Scalar: 4 + aten::rstrip: 1 + aten::scatter_.src: 6 + aten::scatter_add_: 10 + aten::select.int: 57 + aten::selu: 2 + aten::sigmoid: 93 + aten::sin: 4 + aten::size: 66 + aten::size.int: 66 + aten::slice.Tensor: 75 + aten::slice.str: 12 + aten::slice.t: 43 + aten::softmax.int: 63 + aten::softplus: 2 + aten::sort: 18 + aten::split.str: 10 + aten::sqrt: 1 + aten::squeeze.dim: 26 + aten::stack: 30 + aten::startswith: 10 + aten::str: 16 + aten::strip: 3 + aten::sub: 8 + aten::sub.Scalar: 26 + aten::sub.Tensor: 94 + aten::sub.int: 52 + aten::sub_.Tensor: 4 + aten::sum: 17 + aten::sum.dim_IntList: 19 + aten::sum.int: 1 + aten::t: 3 + aten::tanh: 26 + aten::tensor: 51 + aten::tensor.float: 28 + aten::tensor.int: 34 + aten::tensor_split.indices: 4 + aten::to.device: 11 + aten::to.dtype: 23 + aten::to.dtype_layout: 27 + aten::to.prim_Device: 23 + aten::topk: 10 + aten::transpose.int: 33 + aten::triu: 10 + aten::trunc_: 3 + aten::type_as: 6 + aten::unbind.int: 24 + aten::unique_consecutive: 2 + aten::unsqueeze: 34 + aten::unsqueeze_: 6 + aten::update.str: 4 + aten::upsample_bicubic2d.vec: 1 + aten::upsample_bilinear2d.vec: 8 + aten::upsample_linear1d.vec: 1 + aten::upsample_nearest1d.vec: 2 + aten::upsample_nearest2d.vec: 30 + aten::upsample_nearest3d.vec: 2 + aten::upsample_trilinear3d.vec: 1 + aten::values.int: 3 + aten::view: 61 + aten::vstack: 1 + aten::where.ScalarOther: 4 + aten::where.self: 10 + aten::zeros: 75 + aten::zeros.out: 1 + aten::zeros_like: 7 + prepacked::conv2d_clamp_run: 32 + prepacked::linear_clamp_run: 26 + prim::TupleUnpack: 120 + prim::max.float: 7 + prim::max.int: 14 + prim::max.self_int: 17 + prim::min: 4 + prim::min.int: 35 + prim::min.self_int: 25 + prim::unchecked_cast: 100 + quantized::add: 58 + quantized::add_relu: 1 + quantized::batch_norm2d: 1 + quantized::cat: 4 + quantized::conv1d: 1 + quantized::conv2d.new: 55 + quantized::conv2d_prepack: 14 + quantized::conv2d_relu.new: 50 + quantized::conv_transpose2d: 2 + quantized::embedding_4bit: 1 + quantized::embedding_byte: 14 + quantized::hardswish: 1 + quantized::instance_norm: 1 + quantized::leaky_relu: 2 + quantized::linear: 27 + quantized::linear_dynamic: 21 + quantized::linear_dynamic_fp16: 18 + quantized::linear_relu: 2 + quantized::mul: 4 +uncovered_ops: + aten::__getitem__.Dict_int: 4 + aten::__getitem__.Dict_str: 39 + aten::__is__: 83 + aten::__isnot__: 81 + aten::__not__: 32 + aten::_aminmax: 4 + aten::_convolution: 12 + aten::_convolution.deprecated: 3 + aten::_make_per_tensor_quantized_tensor: 2 + aten::_pack_padded_sequence: 10 + aten::_pad_packed_sequence: 10 + aten::_reshape_from_tensor: 10 + aten::backward: 23 + aten::copy_.Tensor: 27 + aten::dequantize.list: 1 + aten::dequantize.tensor: 36 + aten::dim: 36 + aten::format: 58 + aten::get.default_str: 14 + aten::index_put_: 16 + aten::lstm.data: 8 + aten::nll_loss: 1 + aten::nll_loss2d: 1 + aten::quantized_lstm.data: 2 + aten::rsub.Scalar: 5 + aten::sparse_coo_tensor.indices: 1 + aten::sparse_resize_and_clear_: 1 + aten::to.prim_dtype: 38 + aten::true_divide.Tensor: 2 + 
aten::upsample_nearest2d: 7 + prepacked::conv2d_clamp_prepack: 2 + prepacked::conv2d_transpose_clamp_prepack: 1 + prepacked::conv2d_transpose_clamp_run: 1 + prim::ModuleContainerIndex.list: 2 + prim::NumToTensor.Scalar: 15 + prim::Print: 1 + prim::RaiseException: 103 + prim::TupleIndex: 157 + prim::Uninitialized: 80 + prim::device: 46 + prim::dtype: 45 + prim::is_cuda: 1 + quantized::conv2d: 4 + quantized::conv_prepack: 5 + quantized::linear_prepack: 29 + quantized::linear_prepack_fp16: 25 + quantized::linear_unpack: 4 + quantized::linear_unpack_fp16: 4 + quantized::mul.Scalar: 1 diff --git a/test/mobile/model_test/gen_test_model.py b/test/mobile/model_test/gen_test_model.py new file mode 100644 index 00000000000000..e9e3908630be40 --- /dev/null +++ b/test/mobile/model_test/gen_test_model.py @@ -0,0 +1,243 @@ +import io +import sys +import torch +import yaml +from android_api_module import AndroidAPIModule +from builtin_ops import ( + TSBuiltinOpsModule, + TSCollectionOpsModule, +) +from math_ops import ( + PointwiseOpsModule, + ReductionOpsModule, + ComparisonOpsModule, + OtherMathOpsModule, + SpectralOpsModule, + BlasLapackOpsModule, +) +from nn_ops import ( + NNConvolutionModule, + NNPoolingModule, + NNPaddingModule, + NNNormalizationModule, + NNActivationModule, + NNRecurrentModule, + NNTransformerModule, + NNLinearModule, + NNDropoutModule, + NNSparseModule, + NNDistanceModule, + NNLossFunctionModule, + NNVisionModule, + NNShuffleModule, + NNUtilsModule, +) +from quantization_ops import ( + GeneralQuantModule, + DynamicQuantModule, + StaticQuantModule, + FusedQuantModule, +) +from sampling_ops import SamplingOpsModule +from tensor_ops import ( + TensorOpsModule, + TensorCreationOpsModule, + TensorIndexingOpsModule, + TensorTypingOpsModule, + TensorViewOpsModule, +) +from torch.jit.mobile import _load_for_lite_interpreter +from torchvision_models import MobileNetV2Module + +test_path_ios = "ios/TestApp/models/" +test_path_android = "android/pytorch_android/src/androidTest/assets/" + +production_ops_path = "test/mobile/model_test/model_ops.yaml" +coverage_out_path = "test/mobile/model_test/coverage.yaml" + +all_modules = { + # math ops + "pointwise_ops": PointwiseOpsModule(), + "reduction_ops": ReductionOpsModule(), + "comparison_ops": ComparisonOpsModule(), + "spectral_ops": SpectralOpsModule(), + "other_math_ops": OtherMathOpsModule(), + "blas_lapack_ops": BlasLapackOpsModule(), + # sampling + "sampling_ops": SamplingOpsModule(), + # tensor ops + "tensor_general_ops": TensorOpsModule(), + "tensor_creation_ops": TensorCreationOpsModule(), + "tensor_indexing_ops": TensorIndexingOpsModule(), + "tensor_typing_ops": TensorTypingOpsModule(), + "tensor_view_ops": TensorViewOpsModule(), + # nn ops + "convolution_ops": NNConvolutionModule(), + "pooling_ops": NNPoolingModule(), + "padding_ops": NNPaddingModule(), + "activation_ops": NNActivationModule(), + "normalization_ops": NNNormalizationModule(), + "recurrent_ops": NNRecurrentModule(), + "transformer_ops": NNTransformerModule(), + "linear_ops": NNLinearModule(), + "dropout_ops": NNDropoutModule(), + "sparse_ops": NNSparseModule(), + "distance_function_ops": NNDistanceModule(), + "loss_function_ops": NNLossFunctionModule(), + "vision_function_ops": NNVisionModule(), + "shuffle_ops": NNShuffleModule(), + "nn_utils_ops": NNUtilsModule(), + # quantization ops + "general_quant_ops": GeneralQuantModule(), + "dynamic_quant_ops": DynamicQuantModule(), + "static_quant_ops": StaticQuantModule(), + "fused_quant_ops": FusedQuantModule(), + # 
TorchScript builtin ops + "torchscript_builtin_ops": TSBuiltinOpsModule(), + "torchscript_collection_ops": TSCollectionOpsModule(), + # vision + "mobilenet_v2": MobileNetV2Module(), + # android api module + "android_api_module": AndroidAPIModule(), +} + +models_need_trace = [ + "static_quant_ops", +] + + +def calcOpsCoverage(ops): + with open(production_ops_path) as input_yaml_file: + production_ops_dict = yaml.safe_load(input_yaml_file) + + production_ops = set(production_ops_dict["root_operators"].keys()) + all_generated_ops = set(ops) + covered_ops = production_ops.intersection(all_generated_ops) + uncovered_ops = production_ops - covered_ops + coverage = round(100 * len(covered_ops) / len(production_ops), 2) + + # weighted coverage (take op occurrences into account) + total_occurrences = sum(production_ops_dict["root_operators"].values()) + covered_ops_dict = {op: production_ops_dict["root_operators"][op] for op in covered_ops} + uncovered_ops_dict = {op: production_ops_dict["root_operators"][op] for op in uncovered_ops} + covered_occurrences = sum(covered_ops_dict.values()) + occurrences_coverage = round(100 * covered_occurrences / total_occurrences, 2) + + print(f"\n{len(uncovered_ops)} uncovered ops: {uncovered_ops}\n") + print(f"Generated {len(all_generated_ops)} ops") + print(f"Covered {len(covered_ops)}/{len(production_ops)} ({coverage}%) production ops") + print(f"Covered {covered_occurrences}/{total_occurrences} ({occurrences_coverage}%) occurrences") + print(f"pytorch ver {torch.__version__}\n") + + with open(coverage_out_path, "w") as f: + yaml.safe_dump( + { + "_covered_ops": len(covered_ops), + "_production_ops": len(production_ops), + "_generated_ops": len(all_generated_ops), + "_uncovered_ops": len(uncovered_ops), + "_coverage": round(coverage, 2), + "uncovered_ops": uncovered_ops_dict, + "covered_ops": covered_ops_dict, + "all_generated_ops": sorted(list(all_generated_ops)), + }, + f, + ) + + +def getModuleFromName(model_name): + if model_name not in all_modules: + print("Cannot find test model for " + model_name) + return None, [] + + module = all_modules[model_name] + if not isinstance(module, torch.nn.Module): + module = module.getModule() + + has_bundled_inputs = False # module.find_method("get_all_bundled_inputs") + + if model_name in models_need_trace: + module = torch.jit.trace(module, []) + else: + module = torch.jit.script(module) + + ops = torch.jit.export_opnames(module) + print(ops) + + # try to run the model + runModule(module) + + return module, ops + + +def runModule(module): + buffer = io.BytesIO(module._save_to_buffer_for_lite_interpreter()) + buffer.seek(0) + lite_module = _load_for_lite_interpreter(buffer) + if lite_module.find_method("get_all_bundled_inputs"): + # run with the first bundled input + input = lite_module.run_method("get_all_bundled_inputs")[0] + lite_module.forward(*input) + else: + # assuming model has no input + lite_module() + + +# generate all models in the given folder. +# If it's "on the fly" mode, add "_temp" suffix to the model file. 
+def generateAllModels(folder, on_the_fly=False): + all_ops = [] + for name in all_modules: + module, ops = getModuleFromName(name) + all_ops = all_ops + ops + path = folder + name + ("_temp.ptl" if on_the_fly else ".ptl") + module._save_for_lite_interpreter(path) + print("model saved to " + path) + calcOpsCoverage(all_ops) + + +# generate/update a given model for storage +def generateModel(name): + module, ops = getModuleFromName(name) + if module is None: + return + path_ios = test_path_ios + name + ".ptl" + path_android = test_path_android + name + ".ptl" + module._save_for_lite_interpreter(path_ios) + module._save_for_lite_interpreter(path_android) + print("model saved to " + path_ios + " and " + path_android) + + +def main(argv): + if argv is None or len(argv) != 1: + print( + """ +This script generates models for mobile tests. For each model we have a "storage" version +and an "on-the-fly" version. The "on-the-fly" version is generated during the test, and +should not be committed to the repo. +The "storage" version is for backward compatibility tests (a model generated today should +still run on the master branch in the next 6 months). We can use this script to update a model +that is no longer supported. +- use 'python gen_test_model.py android-test' to generate on-the-fly models for android +- use 'python gen_test_model.py ios-test' to generate on-the-fly models for ios +- use 'python gen_test_model.py android' to generate checked-in models for android +- use 'python gen_test_model.py ios' to generate checked-in models for ios +- use 'python gen_test_model.py <model_name>' to update the given storage model +""" + ) + return + + if argv[0] == "android": + generateAllModels(test_path_android, on_the_fly=False) + elif argv[0] == "ios": + generateAllModels(test_path_ios, on_the_fly=False) + elif argv[0] == "android-test": + generateAllModels(test_path_android, on_the_fly=True) + elif argv[0] == "ios-test": + generateAllModels(test_path_ios, on_the_fly=True) + else: + generateModel(argv[0]) + + +if __name__ == "__main__": + main(sys.argv[1:]) diff --git a/test/mobile/model_test/math_ops.py b/test/mobile/model_test/math_ops.py new file mode 100644 index 00000000000000..f89e3bca70d6d3 --- /dev/null +++ b/test/mobile/model_test/math_ops.py @@ -0,0 +1,469 @@ +# https://pytorch.org/docs/stable/torch.html#math-operations + +import math + +import torch + + +class PointwiseOpsModule(torch.nn.Module): + def __init__(self): + super(PointwiseOpsModule, self).__init__() + + def forward(self): + return self.pointwise_ops() + + def pointwise_ops(self): + a = torch.randn(4) + b = torch.randn(4) + t = torch.tensor([-1, -2, 3], dtype=torch.int8) + r = torch.tensor([0, 1, 10, 0], dtype=torch.int8) + t = torch.tensor([-1, -2, 3], dtype=torch.int8) + s = torch.tensor([4, 0, 1, 0], dtype=torch.int8) + f = torch.zeros(3) + g = torch.tensor([-1, 0, 1]) + w = torch.tensor([0.3810, 1.2774, -0.2972, -0.3719, 0.4637]) + return len( + torch.abs(torch.tensor([-1, -2, 3])), + torch.absolute(torch.tensor([-1, -2, 3])), + torch.acos(a), + torch.arccos(a), + torch.acosh(a.uniform_(1.0, 2.0)), + torch.add(a, 20), + torch.add(a, b, out=a), + b.add(a), + b.add(a, out=b), + b.add_(a), + b.add(1), + torch.add(a, torch.randn(4, 1), alpha=10), + torch.addcdiv( + torch.randn(1, 3), torch.randn(3, 1), torch.randn(1, 3), value=0.1 + ), + torch.addcmul( + torch.randn(1, 3), torch.randn(3, 1), torch.randn(1, 3), value=0.1 + ), + torch.angle(a), + torch.asin(a), + torch.arcsin(a), + torch.asinh(a), + torch.arcsinh(a), + torch.atan(a), + torch.arctan(a), + 
torch.atanh(a.uniform_(-1.0, 1.0)), + torch.arctanh(a.uniform_(-1.0, 1.0)), + torch.atan2(a, a), + torch.bitwise_not(t), + torch.bitwise_and(t, torch.tensor([1, 0, 3], dtype=torch.int8)), + torch.bitwise_or(t, torch.tensor([1, 0, 3], dtype=torch.int8)), + torch.bitwise_xor(t, torch.tensor([1, 0, 3], dtype=torch.int8)), + torch.ceil(a), + torch.ceil(float(torch.tensor(0.5))), + torch.ceil(torch.tensor(0.5).item()), + torch.clamp(a, min=-0.5, max=0.5), + torch.clamp(a, min=0.5), + torch.clamp(a, max=0.5), + torch.clip(a, min=-0.5, max=0.5), + torch.conj(a), + torch.copysign(a, 1), + torch.copysign(a, b), + torch.cos(a), + torch.cosh(a), + torch.deg2rad( + torch.tensor([[180.0, -180.0], [360.0, -360.0], [90.0, -90.0]]) + ), + torch.div(a, b), + a.div(b), + a.div(1), + a.div_(b), + torch.divide(a, b, rounding_mode="trunc"), + torch.divide(a, b, rounding_mode="floor"), + torch.digamma(torch.tensor([1.0, 0.5])), + torch.erf(torch.tensor([0.0, -1.0, 10.0])), + torch.erfc(torch.tensor([0.0, -1.0, 10.0])), + torch.erfinv(torch.tensor([0.0, 0.5, -1.0])), + torch.exp(torch.tensor([0.0, math.log(2.0)])), + torch.exp(float(torch.tensor(1))), + torch.exp2(torch.tensor([0.0, math.log(2.0), 3.0, 4.0])), + torch.expm1(torch.tensor([0.0, math.log(2.0)])), + torch.fake_quantize_per_channel_affine( + torch.randn(2, 2, 2), + (torch.randn(2) + 1) * 0.05, + torch.zeros(2), + 1, + 0, + 255, + ), + torch.fake_quantize_per_tensor_affine(a, 0.1, 0, 0, 255), + torch.float_power(torch.randint(10, (4,)), 2), + torch.float_power(torch.arange(1, 5), torch.tensor([2, -3, 4, -5])), + torch.floor(a), + torch.floor(float(torch.tensor(1))), + torch.floor_divide(torch.tensor([4.0, 3.0]), torch.tensor([2.0, 2.0])), + torch.floor_divide(torch.tensor([4.0, 3.0]), 1.4), + torch.fmod(torch.tensor([-3, -2, -1, 1, 2, 3]), 2), + torch.fmod(torch.tensor([1, 2, 3, 4, 5]), 1.5), + torch.frac(torch.tensor([1.0, 2.5, -3.2])), + torch.randn(4, dtype=torch.cfloat).imag, + torch.ldexp(torch.tensor([1.0]), torch.tensor([1])), + torch.ldexp(torch.tensor([1.0]), torch.tensor([1, 2, 3, 4])), + torch.lerp(torch.arange(1.0, 5.0), torch.empty(4).fill_(10), 0.5), + torch.lerp( + torch.arange(1.0, 5.0), + torch.empty(4).fill_(10), + torch.full_like(torch.arange(1.0, 5.0), 0.5), + ), + torch.lgamma(torch.arange(0.5, 2, 0.5)), + torch.log(torch.arange(5) + 10), + torch.log10(torch.rand(5)), + torch.log1p(torch.randn(5)), + torch.log2(torch.rand(5)), + torch.logaddexp(torch.tensor([-1.0]), torch.tensor([-1, -2, -3])), + torch.logaddexp( + torch.tensor([-100.0, -200.0, -300.0]), torch.tensor([-1, -2, -3]) + ), + torch.logaddexp( + torch.tensor([1.0, 2000.0, 30000.0]), torch.tensor([-1, -2, -3]) + ), + torch.logaddexp2(torch.tensor([-1.0]), torch.tensor([-1, -2, -3])), + torch.logaddexp2( + torch.tensor([-100.0, -200.0, -300.0]), torch.tensor([-1, -2, -3]) + ), + torch.logaddexp2( + torch.tensor([1.0, 2000.0, 30000.0]), torch.tensor([-1, -2, -3]) + ), + torch.logical_and(r, s), + torch.logical_and(r.double(), s.double()), + torch.logical_and(r.double(), s), + torch.logical_and(r, s, out=torch.empty(4, dtype=torch.bool)), + torch.logical_not(torch.tensor([0, 1, -10], dtype=torch.int8)), + torch.logical_not(torch.tensor([0.0, 1.5, -10.0], dtype=torch.double)), + torch.logical_not( + torch.tensor([0.0, 1.0, -10.0], dtype=torch.double), + out=torch.empty(3, dtype=torch.int16), + ), + torch.logical_or(r, s), + torch.logical_or(r.double(), s.double()), + torch.logical_or(r.double(), s), + torch.logical_or(r, s, out=torch.empty(4, dtype=torch.bool)), + 
torch.logical_xor(r, s), + torch.logical_xor(r.double(), s.double()), + torch.logical_xor(r.double(), s), + torch.logical_xor(r, s, out=torch.empty(4, dtype=torch.bool)), + torch.logit(torch.rand(5), eps=1e-6), + torch.hypot(torch.tensor([4.0]), torch.tensor([3.0, 4.0, 5.0])), + torch.i0(torch.arange(5, dtype=torch.float32)), + torch.igamma(a, b), + torch.igammac(a, b), + torch.mul(torch.randn(3), 100), + b.mul(a), + b.mul(5), + b.mul(a, out=b), + b.mul_(a), + b.mul_(5), + torch.multiply(torch.randn(4, 1), torch.randn(1, 4)), + torch.mvlgamma(torch.empty(2, 3).uniform_(1.0, 2.0), 2), + torch.tensor([float("nan"), float("inf"), -float("inf"), 3.14]), + torch.nan_to_num(w), + torch.nan_to_num_(w), + torch.nan_to_num(w, nan=2.0), + torch.nan_to_num(w, nan=2.0, posinf=1.0), + torch.neg(torch.randn(5)), + # torch.nextafter(torch.tensor([1, 2]), torch.tensor([2, 1])) == torch.tensor([eps + 1, 2 - eps]), + torch.polygamma(1, torch.tensor([1.0, 0.5])), + torch.polygamma(2, torch.tensor([1.0, 0.5])), + torch.polygamma(3, torch.tensor([1.0, 0.5])), + torch.polygamma(4, torch.tensor([1.0, 0.5])), + torch.pow(a, 2), + torch.pow(2, float(torch.tensor(0.5))), + torch.pow(torch.arange(1.0, 5.0), torch.arange(1.0, 5.0)), + torch.rad2deg( + torch.tensor([[3.142, -3.142], [6.283, -6.283], [1.570, -1.570]]) + ), + torch.randn(4, dtype=torch.cfloat).real, + torch.reciprocal(a), + torch.remainder(torch.tensor([-3.0, -2.0]), 2), + torch.remainder(torch.tensor([1, 2, 3, 4, 5]), 1.5), + torch.round(a), + torch.round(torch.tensor(0.5).item()), + torch.rsqrt(a), + torch.sigmoid(a), + torch.sign(torch.tensor([0.7, -1.2, 0.0, 2.3])), + torch.sgn(a), + torch.signbit(torch.tensor([0.7, -1.2, 0.0, 2.3])), + torch.sin(a), + torch.sinc(a), + torch.sinh(a), + torch.sqrt(a), + torch.square(a), + torch.sub(torch.tensor((1, 2)), torch.tensor((0, 1)), alpha=2), + b.sub(a), + b.sub_(a), + b.sub(5), + torch.sum(5), + torch.tan(a), + torch.tanh(a), + torch.true_divide(a, a), + torch.trunc(a), + torch.trunc_(a), + torch.xlogy(f, g), + torch.xlogy(f, g), + torch.xlogy(f, 4), + torch.xlogy(2, g), + ) + + +class ReductionOpsModule(torch.nn.Module): + def __init__(self): + super(ReductionOpsModule, self).__init__() + + def forward(self): + return self.reduction_ops() + + def reduction_ops(self): + a = torch.randn(4) + b = torch.randn(4) + c = torch.tensor(0.5) + return len( + torch.argmax(a), + torch.argmin(a), + torch.amax(a), + torch.amin(a), + torch.aminmax(a), + torch.all(a), + torch.any(a), + torch.max(a), + a.max(a), + torch.max(a, 0), + torch.min(a), + a.min(a), + torch.min(a, 0), + torch.dist(a, b), + torch.logsumexp(a, 0), + torch.mean(a), + torch.mean(a, 0), + torch.nanmean(a), + torch.median(a), + torch.nanmedian(a), + torch.mode(a), + torch.norm(a), + a.norm(2), + torch.norm(a, dim=0), + torch.norm(c, torch.tensor(2)), + torch.nansum(a), + torch.prod(a), + torch.quantile(a, torch.tensor([0.25, 0.5, 0.75])), + torch.quantile(a, 0.5), + torch.nanquantile(a, torch.tensor([0.25, 0.5, 0.75])), + torch.std(a), + torch.std_mean(a), + torch.sum(a), + torch.unique(a), + torch.unique_consecutive(a), + torch.var(a), + torch.var_mean(a), + torch.count_nonzero(a), + ) + + +class ComparisonOpsModule(torch.nn.Module): + def __init__(self): + super(ComparisonOpsModule, self).__init__() + + def forward(self): + a = torch.tensor(0) + b = torch.tensor(1) + return len( + torch.allclose(a, b), + torch.argsort(a), + torch.eq(a, b), + torch.eq(a, 1), + torch.equal(a, b), + torch.ge(a, b), + torch.ge(a, 1), + torch.greater_equal(a, b), + 
torch.greater_equal(a, 1), + torch.gt(a, b), + torch.gt(a, 1), + torch.greater(a, b), + torch.isclose(a, b), + torch.isfinite(a), + torch.isin(a, b), + torch.isinf(a), + torch.isposinf(a), + torch.isneginf(a), + torch.isnan(a), + torch.isreal(a), + torch.kthvalue(a, 1), + torch.le(a, b), + torch.le(a, 1), + torch.less_equal(a, b), + torch.lt(a, b), + torch.lt(a, 1), + torch.less(a, b), + torch.maximum(a, b), + torch.minimum(a, b), + torch.fmax(a, b), + torch.fmin(a, b), + torch.ne(a, b), + torch.ne(a, 1), + torch.not_equal(a, b), + torch.sort(a), + torch.topk(a, 1), + torch.msort(a), + ) + + +class OtherMathOpsModule(torch.nn.Module): + def __init__(self): + super(OtherMathOpsModule, self).__init__() + + def forward(self): + return self.other_ops() + + def other_ops(self): + a = torch.randn(4) + b = torch.randn(4) + c = torch.randint(0, 8, (5,), dtype=torch.int64) + e = torch.randn(4, 3) + f = torch.randn(4, 4, 4) + size = [0, 1] + dims = [0, 1] + return len( + torch.atleast_1d(a), + torch.atleast_2d(a), + torch.atleast_3d(a), + torch.bincount(c), + torch.block_diag(a), + torch.broadcast_tensors(a), + torch.broadcast_to(a, (4)), + # torch.broadcast_shapes(a), + torch.bucketize(a, b), + torch.cartesian_prod(a), + torch.cdist(e, e), + torch.clone(a), + torch.combinations(a), + torch.corrcoef(a), + # torch.cov(a), + torch.cross(e, e), + torch.cummax(a, 0), + torch.cummin(a, 0), + torch.cumprod(a, 0), + torch.cumsum(a, 0), + torch.diag(a), + torch.diag_embed(a), + torch.diagflat(a), + torch.diagonal(e), + torch.diff(a), + torch.einsum("iii", f), + torch.flatten(a), + torch.flip(e, dims), + torch.fliplr(e), + torch.flipud(e), + torch.kron(a, b), + torch.rot90(e), + torch.gcd(c, c), + torch.histc(a), + torch.histogram(a), + torch.meshgrid(a), + torch.meshgrid(a, indexing="xy"), + torch.lcm(c, c), + torch.logcumsumexp(a, 0), + torch.ravel(a), + torch.renorm(e, 1, 0, 5), + torch.repeat_interleave(c), + torch.roll(a, 1, 0), + torch.searchsorted(a, b), + torch.tensordot(e, e), + torch.trace(e), + torch.tril(e), + torch.tril_indices(3, 3), + torch.triu(e), + torch.triu_indices(3, 3), + torch.vander(a), + torch.view_as_real(torch.randn(4, dtype=torch.cfloat)), + torch.view_as_complex(torch.randn(4, 2)).real, + torch.resolve_conj(a), + torch.resolve_neg(a), + ) + + +class SpectralOpsModule(torch.nn.Module): + def __init__(self): + super(SpectralOpsModule, self).__init__() + + def forward(self): + return self.spectral_ops() + + def spectral_ops(self): + a = torch.randn(10) + b = torch.randn(10, 8, 4, 2) + return len( + torch.stft(a, 8), + torch.stft(a, torch.tensor(8)), + torch.istft(b, 8), + torch.bartlett_window(2, dtype=torch.float), + torch.blackman_window(2, dtype=torch.float), + torch.hamming_window(4, dtype=torch.float), + torch.hann_window(4, dtype=torch.float), + torch.kaiser_window(4, dtype=torch.float), + ) + + +class BlasLapackOpsModule(torch.nn.Module): + def __init__(self): + super(BlasLapackOpsModule, self).__init__() + + def forward(self): + return self.blas_lapack_ops() + + def blas_lapack_ops(self): + m = torch.randn(3, 3) + a = torch.randn(10, 3, 4) + b = torch.randn(10, 4, 3) + v = torch.randn(3) + return len( + torch.addbmm(m, a, b), + torch.addmm(torch.randn(2, 3), torch.randn(2, 3), torch.randn(3, 3)), + torch.addmv(torch.randn(2), torch.randn(2, 3), torch.randn(3)), + torch.addr(torch.zeros(3, 3), v, v), + torch.baddbmm(m, a, b), + torch.bmm(a, b), + torch.chain_matmul(torch.randn(3, 3), torch.randn(3, 3), torch.randn(3, 3)), + # torch.cholesky(a), # deprecated + # 
torch.cholesky_inverse(torch.randn(3, 3)), # had some error + # torch.cholesky_solve(torch.randn(3, 3), torch.randn(3, 3)), + torch.dot(v, v), + # torch.linalg.eig(m), # not build with lapack + # torch.geqrf(a), + torch.ger(v, v), + torch.inner(m, m), + # torch.inverse(m), + # torch.det(m), + # torch.logdet(m), + # torch.slogdet(m), + # torch.lstsq(m, m), + # torch.lu(m), + # torch.lu_solve(m, *torch.lu(m)), + # torch.lu_unpack(*torch.lu(m)), + torch.matmul(m, m), + torch.matrix_power(m, 2), + # torch.matrix_rank(m), + torch.matrix_exp(m), + torch.mm(m, m), + torch.mv(m, v), + # torch.orgqr(a, m), + # torch.ormqr(a, m, v), + torch.outer(v, v), + # torch.pinverse(m), + # torch.qr(a), + # torch.solve(m, m), + # torch.svd(a), + # torch.svd_lowrank(a), + # torch.pca_lowrank(a), + # torch.symeig(a), # deprecated + # torch.lobpcg(a, b), # not supported + torch.trapz(m, m), + torch.trapezoid(m, m), + torch.cumulative_trapezoid(m, m), + # torch.triangular_solve(m, m), + torch.vdot(v, v), + ) diff --git a/test/mobile/model_test/model_ops.yaml b/test/mobile/model_test/model_ops.yaml new file mode 100644 index 00000000000000..06a3640e4cbe79 --- /dev/null +++ b/test/mobile/model_test/model_ops.yaml @@ -0,0 +1,752 @@ +root_operators: + aten::Bool.Tensor: 19 + aten::Bool.int: 7 + aten::Float.Scalar: 18 + aten::Float.Tensor: 11 + aten::Float.str: 6 + aten::FloatImplicit: 2 + aten::Int.Scalar: 19 + aten::Int.Tensor: 35 + aten::Int.float: 6 + aten::Int.str: 12 + aten::IntImplicit: 11 + aten::ScalarImplicit: 3 + aten::__and__.Tensor: 13 + aten::__and__.bool: 11 + aten::__and__.int: 2 + aten::__contains__.int: 5 + aten::__contains__.int_list: 17 + aten::__contains__.str: 22 + aten::__contains__.str_list: 5 + aten::__derive_index: 24 + aten::__getitem__.Dict_int: 4 + aten::__getitem__.Dict_str: 39 + aten::__getitem__.str: 20 + aten::__getitem__.t: 178 + aten::__is__: 83 + aten::__isnot__: 81 + aten::__lshift__.int: 2 + aten::__not__: 32 + aten::__range_length: 23 + aten::__rshift__.int: 2 + aten::__xor__.bool: 10 + aten::_aminmax: 4 + aten::_convolution: 12 + aten::_convolution.deprecated: 3 + aten::_infer_size: 7 + aten::_make_per_tensor_quantized_tensor: 2 + aten::_pack_padded_sequence: 10 + aten::_pad_packed_sequence: 10 + aten::_reshape_from_tensor: 10 + aten::_set_item.int: 7 + aten::_set_item.str: 163 + aten::_set_item.t: 8 + aten::_shape_as_tensor: 10 + aten::adaptive_avg_pool1d: 1 + aten::adaptive_avg_pool2d: 33 + aten::adaptive_avg_pool3d: 1 + aten::add.Scalar: 33 + aten::add.Tensor: 63 + aten::add.float: 5 + aten::add.int: 49 + aten::add.out: 2 + aten::add.str: 29 + aten::add.t: 11 + aten::add_.Scalar: 15 + aten::add_.Tensor: 29 + aten::addcmul: 2 + aten::addmm: 7 + aten::all: 6 + aten::allclose: 1 + aten::any: 14 + aten::append.t: 59 + aten::arange: 16 + aten::arange.start: 6 + aten::arange.start_step: 16 + aten::argmax: 2 + aten::as_strided: 10 + aten::as_tensor.list: 4 + aten::atan: 4 + aten::avg_pool1d: 6 + aten::avg_pool2d: 7 + aten::backward: 23 + aten::batch_norm: 15 + aten::binary_cross_entropy: 15 + aten::binary_cross_entropy_with_logits: 3 + aten::bitwise_not: 13 + aten::bmm: 16 + aten::broadcast_tensors: 1 + aten::cat: 90 + aten::ceil: 3 + aten::ceil.float: 7 + aten::chunk: 19 + aten::clamp: 36 + aten::clamp_: 12 + aten::clamp_min: 3 + aten::clear.str: 2 + aten::clone: 26 + aten::coalesce: 2 + aten::conj: 1 + aten::constant_pad_nd: 17 + aten::contiguous: 113 + aten::conv1d: 12 + aten::conv2d: 10 + aten::conv_transpose2d.input: 5 + aten::copy_: 15 + aten::copy_.Tensor: 27 + 
aten::copy_.int: 1 + aten::cos: 4 + aten::count_nonzero: 4 + aten::ctc_loss.Tensor: 1 + aten::cumsum: 13 + aten::dequantize.list: 1 + aten::dequantize.self: 30 + aten::dequantize.tensor: 36 + aten::detach: 34 + aten::dim: 36 + aten::div: 9 + aten::div.Scalar: 8 + aten::div.Tensor: 71 + aten::div.Tensor_mode: 7 + aten::div.float: 3 + aten::div.int: 7 + aten::div_.Tensor: 7 + aten::dropout: 41 + aten::embedding: 16 + aten::embedding_bag.padding_idx: 2 + aten::empty.memory_format: 11 + aten::empty_like: 11 + aten::empty_strided: 3 + aten::eq.Scalar: 24 + aten::eq.Tensor: 6 + aten::eq.int: 57 + aten::eq.int_list: 20 + aten::eq.str: 43 + aten::exp: 18 + aten::exp.float: 4 + aten::expand: 26 + aten::expand_as: 3 + aten::extend.t: 38 + aten::feature_dropout: 1 + aten::fill_.Scalar: 17 + aten::find: 3 + aten::flatten.using_ints: 45 + aten::flip: 1 + aten::floor: 5 + aten::floor.float: 2 + aten::floor_divide: 4 + aten::floor_divide.Scalar: 7 + aten::floordiv.int: 21 + aten::format: 58 + aten::full: 10 + aten::full_like: 10 + aten::gather: 10 + aten::ge.Scalar: 4 + aten::ge.Tensor: 6 + aten::ge.int: 29 + aten::gelu: 12 + aten::get.default_str: 14 + aten::glu: 18 + aten::grid_sampler: 3 + aten::gt.Scalar: 16 + aten::gt.float: 16 + aten::gt.float_int: 3 + aten::gt.int: 52 + aten::hardsigmoid: 3 + aten::hardsigmoid_: 2 + aten::hardswish_: 4 + aten::hardtanh: 3 + aten::hardtanh_: 3 + aten::hstack: 2 + aten::index.Tensor: 23 + aten::index_fill.int_Scalar: 15 + aten::index_put_: 16 + aten::index_select: 31 + aten::is_coalesced: 2 + aten::is_floating_point: 9 + aten::isnan: 1 + aten::item: 40 + aten::items.str: 3 + aten::keys.str: 15 + aten::layer_norm: 26 + aten::le.Scalar: 1 + aten::le.Tensor: 10 + aten::le.float: 2 + aten::le.int: 17 + aten::leaky_relu: 1 + aten::leaky_relu_: 5 + aten::len.Dict_int: 5 + aten::len.Tensor: 19 + aten::len.str: 23 + aten::len.t: 177 + aten::linear: 46 + aten::linspace: 3 + aten::list.t: 24 + aten::log: 18 + aten::log10: 4 + aten::log1p: 5 + aten::log_softmax.int: 31 + aten::logical_and: 1 + aten::logical_not: 10 + aten::logit: 7 + aten::lower: 10 + aten::lstm.data: 8 + aten::lstm.input: 4 + aten::lt.Scalar: 8 + aten::lt.Tensor: 1 + aten::lt.float: 16 + aten::lt.int: 46 + aten::masked_fill.Scalar: 16 + aten::matmul: 12 + aten::max: 18 + aten::max.dim: 30 + aten::max.other: 7 + aten::max_pool2d: 10 + aten::maximum: 4 + aten::mean: 10 + aten::mean.dim: 16 + aten::meshgrid.indexing: 2 + aten::min: 2 + aten::min.dim: 4 + aten::min.other: 17 + aten::minimum: 4 + aten::mse_loss: 1 + aten::mul.Scalar: 26 + aten::mul.Tensor: 90 + aten::mul.float: 5 + aten::mul.float_int: 3 + aten::mul.int: 26 + aten::mul.int_float: 4 + aten::mul.left_t: 15 + aten::mul.out: 1 + aten::mul_.Scalar: 11 + aten::mul_.Tensor: 5 + aten::nan_to_num: 3 + aten::nan_to_num_: 10 + aten::narrow: 10 + aten::ne.Scalar: 14 + aten::ne.Tensor: 5 + aten::ne.int: 44 + aten::ne.int_float: 2 + aten::ne.int_list: 20 + aten::ne.str: 3 + aten::neg: 29 + aten::neg.int: 19 + aten::new_zeros: 6 + aten::nll_loss: 1 + aten::nll_loss2d: 1 + aten::nll_loss_nd: 3 + aten::nonzero: 4 + aten::norm.Scalar: 1 + aten::norm.ScalarOpt_dim: 4 + aten::numel: 8 + aten::one_hot: 2 + aten::ones: 38 + aten::ones_like: 16 + aten::ord: 20 + aten::permute: 43 + aten::pop.t: 7 + aten::pow.Tensor_Scalar: 3 + aten::pow.int_float: 2 + aten::quantile.scalar: 1 + aten::quantize_per_tensor: 66 + aten::quantize_per_tensor.tensor_qparams: 1 + aten::quantized_lstm.data: 2 + aten::rand: 25 + aten::randint.low: 2 + aten::randn_like: 17 + aten::reciprocal: 1 + 
aten::reflection_pad2d: 1 + aten::relu: 82 + aten::relu_: 9 + aten::remainder.Scalar: 2 + aten::remainder.int: 22 + aten::repeat: 16 + aten::replace: 1 + aten::replication_pad1d: 1 + aten::replication_pad2d: 2 + aten::replication_pad3d: 1 + aten::requires_grad_: 4 + aten::reshape: 36 + aten::resize_as_: 1 + aten::resolve_conj: 1 + aten::resolve_neg: 1 + aten::reverse.t: 2 + aten::round.Scalar: 4 + aten::rstrip: 1 + aten::rsub.Scalar: 5 + aten::scatter_.src: 6 + aten::scatter_add_: 10 + aten::select.int: 57 + aten::selu: 2 + aten::sigmoid: 93 + aten::sin: 4 + aten::size: 66 + aten::size.int: 66 + aten::slice.Tensor: 75 + aten::slice.str: 12 + aten::slice.t: 43 + aten::softmax.int: 63 + aten::softplus: 2 + aten::sort: 18 + aten::sparse_coo_tensor.indices: 1 + aten::sparse_resize_and_clear_: 1 + aten::split.str: 10 + aten::sqrt: 1 + aten::squeeze.dim: 26 + aten::stack: 30 + aten::startswith: 10 + aten::str: 16 + aten::strip: 3 + aten::sub: 8 + aten::sub.Scalar: 26 + aten::sub.Tensor: 94 + aten::sub.int: 52 + aten::sub_.Tensor: 4 + aten::sum: 17 + aten::sum.dim_IntList: 19 + aten::sum.int: 1 + aten::t: 3 + aten::tanh: 26 + aten::tensor: 51 + aten::tensor.float: 28 + aten::tensor.int: 34 + aten::tensor_split.indices: 4 + aten::to.device: 11 + aten::to.dtype: 23 + aten::to.dtype_layout: 27 + aten::to.prim_Device: 23 + aten::to.prim_dtype: 38 + aten::topk: 10 + aten::transpose.int: 33 + aten::triu: 10 + aten::true_divide.Tensor: 2 + aten::trunc_: 3 + aten::type_as: 6 + aten::unbind.int: 24 + aten::unique_consecutive: 2 + aten::unsqueeze: 34 + aten::unsqueeze_: 6 + aten::update.str: 4 + aten::upsample_bicubic2d.vec: 1 + aten::upsample_bilinear2d.vec: 8 + aten::upsample_linear1d.vec: 1 + aten::upsample_nearest1d.vec: 2 + aten::upsample_nearest2d: 7 + aten::upsample_nearest2d.vec: 30 + aten::upsample_nearest3d.vec: 2 + aten::upsample_trilinear3d.vec: 1 + aten::values.int: 3 + aten::view: 61 + aten::vstack: 1 + aten::where.ScalarOther: 4 + aten::where.self: 10 + aten::zeros: 75 + aten::zeros.out: 1 + aten::zeros_like: 7 + prepacked::conv2d_clamp_prepack: 2 + prepacked::conv2d_clamp_run: 32 + prepacked::conv2d_transpose_clamp_prepack: 1 + prepacked::conv2d_transpose_clamp_run: 1 + prepacked::linear_clamp_run: 26 + prim::ModuleContainerIndex.list: 2 + prim::NumToTensor.Scalar: 15 + prim::Print: 1 + prim::RaiseException: 103 + prim::TupleIndex: 157 + prim::TupleUnpack: 120 + prim::Uninitialized: 80 + prim::device: 46 + prim::dtype: 45 + prim::is_cuda: 1 + prim::max.float: 7 + prim::max.int: 14 + prim::max.self_int: 17 + prim::min: 4 + prim::min.int: 35 + prim::min.self_int: 25 + prim::unchecked_cast: 100 + quantized::add: 58 + quantized::add_relu: 1 + quantized::batch_norm2d: 1 + quantized::cat: 4 + quantized::conv1d: 1 + quantized::conv2d: 4 + quantized::conv2d.new: 55 + quantized::conv2d_prepack: 14 + quantized::conv2d_relu.new: 50 + quantized::conv_prepack: 5 + quantized::conv_transpose2d: 2 + quantized::embedding_4bit: 1 + quantized::embedding_byte: 14 + quantized::hardswish: 1 + quantized::instance_norm: 1 + quantized::leaky_relu: 2 + quantized::linear: 27 + quantized::linear_dynamic: 21 + quantized::linear_dynamic_fp16: 18 + quantized::linear_prepack: 29 + quantized::linear_prepack_fp16: 25 + quantized::linear_relu: 2 + quantized::linear_unpack: 4 + quantized::linear_unpack_fp16: 4 + quantized::mul: 4 + quantized::mul.Scalar: 1 +traced_operators: + aten::__and__.Tensor: 13 + aten::__iand__.Tensor: 1 + aten::__ior__.Tensor: 1 + aten::_adaptive_avg_pool2d: 23 + aten::_aminmax: 4 + 
aten::_batch_norm_impl_index: 15 + aten::_cat: 95 + aten::_coalesce: 2 + aten::_coalesced_: 3 + aten::_convolution: 34 + aten::_convolution.deprecated: 3 + aten::_ctc_loss: 1 + aten::_embedding_bag: 2 + aten::_embedding_bag_backward: 1 + aten::_embedding_bag_sparse_backward: 1 + aten::_empty_affine_quantized: 87 + aten::_empty_per_channel_affine_quantized: 28 + aten::_index_put_impl_: 16 + aten::_indices: 4 + aten::_local_scalar_dense: 188 + aten::_log_softmax: 28 + aten::_log_softmax_backward_data: 4 + aten::_make_per_tensor_quantized_tensor: 2 + aten::_nnz: 3 + aten::_pack_padded_sequence: 10 + aten::_pack_padded_sequence_backward: 3 + aten::_pad_packed_sequence: 10 + aten::_reshape_alias: 93 + aten::_reshape_from_tensor: 10 + aten::_s_where: 15 + aten::_shape_as_tensor: 10 + aten::_slow_conv2d_backward.output_mask: 3 + aten::_slow_conv2d_forward: 33 + aten::_softmax: 63 + aten::_sparse_coo_tensor_unsafe: 4 + aten::_sparse_coo_tensor_with_dims_and_tensors: 5 + aten::_to_copy: 188 + aten::_unsafe_view: 28 + aten::_values: 4 + aten::abs: 1 + aten::abs.out: 1 + aten::adaptive_avg_pool2d: 29 + aten::add.Scalar: 30 + aten::add.Tensor: 72 + aten::add.out: 2 + aten::add_.Scalar: 11 + aten::add_.Tensor: 48 + aten::addmm: 41 + aten::alias: 14 + aten::all: 8 + aten::allclose: 1 + aten::aminmax: 4 + aten::any: 14 + aten::any.dim: 1 + aten::arange: 10 + aten::arange.start: 26 + aten::arange.start_out: 28 + aten::arange.start_step: 8 + aten::argmax: 2 + aten::as_strided: 188 + aten::as_strided_: 39 + aten::atan: 4 + aten::atleast_1d.Sequence: 2 + aten::atleast_2d.Sequence: 1 + aten::avg_pool2d: 7 + aten::batch_norm: 15 + aten::bernoulli_.float: 2 + aten::binary_cross_entropy: 13 + aten::binary_cross_entropy_backward: 12 + aten::binary_cross_entropy_with_logits: 3 + aten::binary_cross_entropy_with_logits_backward: 2 + aten::bitwise_and.Tensor: 13 + aten::bitwise_and_.Tensor: 1 + aten::bitwise_not: 13 + aten::bitwise_or_.Tensor: 1 + aten::bmm: 18 + aten::broadcast_tensors: 1 + aten::cat: 95 + aten::ceil: 4 + aten::ceil_: 1 + aten::chunk: 20 + aten::clamp: 38 + aten::clamp_: 12 + aten::clamp_min: 73 + aten::clamp_min.out: 74 + aten::clamp_min_: 4 + aten::clone: 134 + aten::coalesce: 2 + aten::conj: 1 + aten::constant_pad_nd: 14 + aten::contiguous: 139 + aten::conv1d: 12 + aten::conv2d: 7 + aten::conv_transpose2d.input: 5 + aten::convolution: 19 + aten::convolution_backward: 3 + aten::copy_: 188 + aten::copy_sparse_to_sparse_: 3 + aten::cos: 4 + aten::count_nonzero: 4 + aten::count_nonzero.dim_IntList: 4 + aten::ctc_loss.Tensor: 1 + aten::cudnn_is_acceptable: 12 + aten::cumsum: 14 + aten::dense_dim: 3 + aten::dequantize.self: 63 + aten::dequantize.tensors: 1 + aten::detach: 49 + aten::div.Scalar: 188 + aten::div.Tensor: 188 + aten::div.Tensor_mode: 8 + aten::div_.Scalar: 27 + aten::div_.Tensor: 34 + aten::dropout: 41 + aten::elu: 2 + aten::embedding: 16 + aten::embedding_backward: 4 + aten::embedding_bag.padding_idx: 2 + aten::embedding_dense_backward: 4 + aten::embedding_sparse_backward: 1 + aten::empty.memory_format: 188 + aten::empty_like: 162 + aten::empty_strided: 188 + aten::eq.Scalar: 25 + aten::eq.Tensor: 188 + aten::exp: 15 + aten::exp_: 3 + aten::expand: 63 + aten::expand_as: 17 + aten::feature_dropout: 1 + aten::fill_.Scalar: 188 + aten::flatten.using_ints: 42 + aten::flip: 1 + aten::floor: 6 + aten::floor_divide: 7 + aten::floor_divide.Scalar: 7 + aten::full: 21 + aten::full_like: 10 + aten::gather: 11 + aten::ge.Scalar: 2 + aten::gelu: 12 + aten::glu: 18 + aten::grid_sampler: 3 + 
aten::grid_sampler_2d: 3 + aten::gt.Scalar: 16 + aten::hardsigmoid: 3 + aten::hardsigmoid_: 2 + aten::hardswish_: 4 + aten::hardtanh: 3 + aten::hstack: 2 + aten::index.Tensor: 20 + aten::index_add_: 4 + aten::index_fill.int_Scalar: 1 + aten::index_fill_.int_Scalar: 1 + aten::index_put_: 16 + aten::index_select: 28 + aten::index_select_backward: 3 + aten::is_coalesced: 3 + aten::is_floating_point: 8 + aten::isclose: 1 + aten::isfinite: 1 + aten::isnan: 1 + aten::item: 188 + aten::layer_norm: 26 + aten::le.Scalar: 2 + aten::le.Tensor: 1 + aten::leaky_relu: 1 + aten::leaky_relu_: 5 + aten::lerp_.Tensor: 1 + aten::linear: 51 + aten::linspace: 3 + aten::linspace.out: 3 + aten::log: 15 + aten::log10: 4 + aten::log1p: 5 + aten::log_: 3 + aten::log_softmax.int: 28 + aten::logical_and: 1 + aten::logical_and.out: 2 + aten::logical_and_: 1 + aten::logit: 7 + aten::lstm.data: 8 + aten::lstm.input: 4 + aten::lt.Scalar: 8 + aten::lt.Tensor: 1 + aten::masked_fill.Scalar: 3 + aten::masked_fill_.Scalar: 18 + aten::matmul: 31 + aten::max: 27 + aten::max.dim: 31 + aten::max.other: 4 + aten::max_pool2d: 7 + aten::maximum: 4 + aten::mean: 16 + aten::mean.dim: 26 + aten::meshgrid.indexing: 2 + aten::min: 25 + aten::min.dim: 5 + aten::min.other: 4 + aten::minimum: 5 + aten::mm: 40 + aten::mul.Scalar: 31 + aten::mul.Tensor: 103 + aten::mul.out: 12 + aten::mul_.Scalar: 11 + aten::mul_.Tensor: 7 + aten::nan_to_num: 3 + aten::nan_to_num.out: 13 + aten::nan_to_num_: 10 + aten::narrow: 188 + aten::native_batch_norm: 15 + aten::native_layer_norm: 26 + aten::native_layer_norm_backward: 1 + aten::ne.Scalar: 15 + aten::ne.Tensor: 6 + aten::neg: 29 + aten::new_empty_strided: 188 + aten::nll_loss: 4 + aten::nll_loss_backward: 4 + aten::nll_loss_forward: 4 + aten::nll_loss_nd: 3 + aten::nonzero: 16 + aten::norm.Scalar: 1 + aten::norm.ScalarOpt_dim: 5 + aten::normal_: 17 + aten::one_hot: 2 + aten::ones: 188 + aten::ones_like: 25 + aten::permute: 44 + aten::pow.Tensor_Scalar: 3 + aten::q_per_channel_scales: 28 + aten::q_per_channel_zero_points: 28 + aten::q_scale: 65 + aten::q_zero_point: 85 + aten::qscheme: 85 + aten::quantile.scalar: 1 + aten::quantize_per_tensor: 84 + aten::quantize_per_tensor.tensor_qparams: 1 + aten::quantized_lstm.data: 2 + aten::quantized_max_pool2d: 3 + aten::rand: 25 + aten::randint.low: 2 + aten::randn_like: 17 + aten::random_.from: 2 + aten::reciprocal: 1 + aten::reflection_pad2d: 1 + aten::relu: 79 + aten::relu_: 4 + aten::remainder.Scalar: 2 + aten::remainder.Tensor: 2 + aten::repeat: 14 + aten::replication_pad2d: 2 + aten::requires_grad_: 2 + aten::reshape: 69 + aten::resize_: 188 + aten::resize_as_: 18 + aten::resolve_conj: 70 + aten::resolve_neg: 1 + aten::result_type.Scalar: 3 + aten::rsub.Scalar: 5 + aten::scalar_tensor: 1 + aten::scatter_.src: 6 + aten::scatter_.value: 2 + aten::scatter_add_: 10 + aten::select.int: 77 + aten::select_backward: 1 + aten::selu: 2 + aten::set_.source_Storage: 186 + aten::set_.source_Storage_storage_offset: 186 + aten::sigmoid: 90 + aten::sigmoid_: 14 + aten::sigmoid_backward: 17 + aten::sin: 4 + aten::slice.Tensor: 188 + aten::slice_backward: 4 + aten::slow_conv_transpose2d: 6 + aten::softmax.int: 63 + aten::softplus: 2 + aten::sort: 20 + aten::sparse_coo_tensor.indices: 1 + aten::sparse_dim: 3 + aten::sparse_resize_and_clear_: 1 + aten::split.Tensor: 20 + aten::sqrt: 1 + aten::squeeze: 13 + aten::squeeze.dim: 38 + aten::squeeze_.dim: 36 + aten::stack: 39 + aten::sub.Scalar: 23 + aten::sub.Tensor: 105 + aten::sub_.Scalar: 1 + aten::sub_.Tensor: 7 + aten::sum: 18 
+ aten::sum.IntList_out: 29 + aten::sum.dim_IntList: 41 + aten::t: 49 + aten::tanh: 40 + aten::tanh_: 14 + aten::tanh_backward: 5 + aten::tensor_split.indices: 4 + aten::thnn_conv2d: 33 + aten::threshold_backward: 17 + aten::to.device: 35 + aten::to.dtype: 188 + aten::to.dtype_layout: 184 + aten::topk: 10 + aten::transpose.int: 73 + aten::triu: 10 + aten::true_divide.Tensor: 2 + aten::trunc_: 4 + aten::type_as: 6 + aten::unbind.int: 38 + aten::unfold: 14 + aten::uniform_: 25 + aten::unique_consecutive: 2 + aten::unsafe_chunk: 14 + aten::unsafe_split.Tensor: 14 + aten::unsqueeze: 56 + aten::unsqueeze_: 31 + aten::upsample_bilinear2d: 7 + aten::upsample_bilinear2d.vec: 7 + aten::upsample_nearest2d: 31 + aten::upsample_nearest2d.vec: 27 + aten::value_selecting_reduction_backward: 3 + aten::view: 95 + aten::vstack: 1 + aten::where.ScalarOther: 4 + aten::where.self: 15 + aten::zero_: 188 + aten::zeros: 188 + aten::zeros.out: 1 + aten::zeros_like: 6 + prepacked::conv2d_clamp_prepack: 1 + prepacked::conv2d_clamp_run: 32 + prepacked::conv2d_transpose_clamp_run: 1 + prepacked::linear_clamp_run: 26 + quantized::add: 58 + quantized::add_relu: 1 + quantized::batch_norm2d: 1 + quantized::cat: 4 + quantized::conv1d: 1 + quantized::conv2d: 4 + quantized::conv2d.new: 55 + quantized::conv2d_prepack: 14 + quantized::conv2d_relu.new: 50 + quantized::conv_prepack: 5 + quantized::conv_transpose2d: 2 + quantized::embedding_byte: 14 + quantized::hardswish: 1 + quantized::instance_norm: 1 + quantized::leaky_relu: 2 + quantized::linear: 27 + quantized::linear_dynamic: 21 + quantized::linear_prepack: 29 + quantized::linear_relu: 2 + quantized::mul: 4 + quantized::mul.Scalar: 1 diff --git a/test/mobile/model_test/nn_ops.py b/test/mobile/model_test/nn_ops.py new file mode 100644 index 00000000000000..338359c964084a --- /dev/null +++ b/test/mobile/model_test/nn_ops.py @@ -0,0 +1,427 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + +# https://pytorch.org/docs/stable/nn.html +class NNConvolutionModule(torch.nn.Module): + def __init__(self): + super(NNConvolutionModule, self).__init__() + self.input1d = torch.randn(1, 4, 36) + self.input2d = torch.randn(1, 4, 30, 10) + self.input3d = torch.randn(1, 4, 10, 4, 4) + self.module1d = nn.ModuleList( + [ + nn.Conv1d(4, 33, 3), + nn.ConvTranspose1d(4, 33, 3), + nn.Fold(output_size=(5, 10), kernel_size=(2, 2)), + ] + ) + self.module2d = nn.ModuleList( + [ + nn.Conv2d(4, 33, 3), + nn.ConvTranspose2d(4, 33, 3), + nn.Unfold(kernel_size=3), + ] + ) + self.module3d = nn.ModuleList( + [ + nn.Conv3d(4, 33, 2), + nn.ConvTranspose3d(4, 33, 3), + ] + ) + + def forward(self): + return len(( + [module(self.input1d) for i, module in enumerate(self.module1d)], + [module(self.input2d) for i, module in enumerate(self.module2d)], + [module(self.input3d) for i, module in enumerate(self.module3d)], + )) + + +class NNPoolingModule(torch.nn.Module): + def __init__(self): + super(NNPoolingModule, self).__init__() + self.input1d = torch.randn(1, 16, 50) + self.module1d = nn.ModuleList( + [ + nn.MaxPool1d(3, stride=2), + nn.AvgPool1d(3, stride=2), + nn.LPPool1d(2, 3, stride=2), + nn.AdaptiveMaxPool1d(3), + nn.AdaptiveAvgPool1d(3), + ] + ) + + self.input2d = torch.randn(1, 16, 30, 10) + self.module2d = nn.ModuleList( + [ + nn.MaxPool2d((3, 2), stride=(2, 1)), + nn.AvgPool2d((3, 2), stride=(2, 1)), + nn.FractionalMaxPool2d(3, output_ratio=(0.5, 0.5)), + nn.LPPool2d(2, 3, stride=(2, 1)), + nn.AdaptiveMaxPool2d((5, 7)), + nn.AdaptiveAvgPool2d((7)), + ] + ) + + self.input3d = 
torch.randn(1, 16, 20, 4, 4) + self.module3d = nn.ModuleList( + [ + nn.MaxPool3d(2), + nn.AvgPool3d(2), + nn.FractionalMaxPool3d(2, output_ratio=(0.5, 0.5, 0.5)), + nn.AdaptiveMaxPool3d((5, 7, 9)), + nn.AdaptiveAvgPool3d((5, 7, 9)), + ] + ) + # TODO max_unpool + + def forward(self): + return len(( + [module(self.input1d) for i, module in enumerate(self.module1d)], + [module(self.input2d) for i, module in enumerate(self.module2d)], + [module(self.input3d) for i, module in enumerate(self.module3d)], + )) + + +class NNPaddingModule(torch.nn.Module): + def __init__(self): + super(NNPaddingModule, self).__init__() + self.input1d = torch.randn(1, 4, 50) + self.module1d = nn.ModuleList( + [ + nn.ReflectionPad1d(2), + nn.ReplicationPad1d(2), + nn.ConstantPad1d(2, 3.5), + ] + ) + + self.input2d = torch.randn(1, 4, 30, 10) + self.module2d = nn.ModuleList( + [ + nn.ReflectionPad2d(2), + nn.ReplicationPad2d(2), + nn.ZeroPad2d(2), + nn.ConstantPad2d(2, 3.5), + ] + ) + + self.input3d = torch.randn(1, 4, 10, 4, 4) + self.module3d = nn.ModuleList( + [ + nn.ReflectionPad3d(1), + nn.ReplicationPad3d(3), + nn.ConstantPad3d(3, 3.5), + ] + ) + + def forward(self): + return len(( + [module(self.input1d) for i, module in enumerate(self.module1d)], + [module(self.input2d) for i, module in enumerate(self.module2d)], + [module(self.input3d) for i, module in enumerate(self.module3d)], + )) + + +class NNNormalizationModule(torch.nn.Module): + def __init__(self): + super(NNNormalizationModule, self).__init__() + self.input1d = torch.randn(1, 4, 50) + self.module1d = nn.ModuleList( + [ + nn.BatchNorm1d(4), + nn.InstanceNorm1d(4), + ] + ) + + self.input2d = torch.randn(1, 4, 30, 10) + self.module2d = nn.ModuleList( + [ + nn.BatchNorm2d(4), + nn.GroupNorm(4, 4), + nn.InstanceNorm2d(4), + nn.LayerNorm([4, 30, 10]), + nn.LocalResponseNorm(2), + ] + ) + + self.input3d = torch.randn(1, 4, 10, 4, 4) + self.module3d = nn.ModuleList( + [ + nn.BatchNorm3d(4), + nn.InstanceNorm3d(4), + nn.ChannelShuffle(2), + ] + ) + + def forward(self): + return len(( + [module(self.input1d) for i, module in enumerate(self.module1d)], + [module(self.input2d) for i, module in enumerate(self.module2d)], + [module(self.input3d) for i, module in enumerate(self.module3d)], + )) + + +class NNActivationModule(torch.nn.Module): + def __init__(self): + super(NNActivationModule, self).__init__() + self.activations = nn.ModuleList( + [ + nn.ELU(), + nn.Hardshrink(), + nn.Hardsigmoid(), + nn.Hardtanh(), + nn.Hardswish(), + nn.LeakyReLU(), + nn.LogSigmoid(), + # nn.MultiheadAttention(), + nn.PReLU(), + nn.ReLU(), + nn.ReLU6(), + nn.RReLU(), + nn.SELU(), + nn.CELU(), + nn.GELU(), + nn.Sigmoid(), + nn.SiLU(), + nn.Mish(), + nn.Softplus(), + nn.Softshrink(), + nn.Softsign(), + nn.Tanh(), + nn.Tanhshrink(), + # nn.Threshold(0.1, 20), + nn.GLU(), + nn.Softmin(), + nn.Softmax(), + nn.Softmax2d(), + nn.LogSoftmax(), + # nn.AdaptiveLogSoftmaxWithLoss(), + ] + ) + + def forward(self): + input = torch.randn(2, 3, 4) + return len(( + [module(input) for i, module in enumerate(self.activations)], + )) + + +class NNRecurrentModule(torch.nn.Module): + def __init__(self): + super(NNRecurrentModule, self).__init__() + self.rnn = nn.ModuleList( + [ + nn.RNN(4, 8, 2), + nn.RNNCell(4, 8), + ] + ) + self.gru = nn.ModuleList([nn.GRU(4, 8, 2), nn.GRUCell(4, 8)]) + self.lstm = nn.ModuleList( + [ + nn.LSTM(4, 8, 2), + nn.LSTMCell(4, 8), + ] + ) + + def forward(self): + input = torch.randn(5, 3, 4) + h = torch.randn(2, 3, 8) + c = torch.randn(2, 3, 8) + r = self.rnn[0](input, h) + r 
= self.rnn[1](input[0], h[0]) + r = self.gru[0](input, h) + r = self.gru[1](input[0], h[0]) + r = self.lstm[0](input, (h, c)) + r = self.lstm[1](input[0], (h[0], c[0])) + return len(r) + + +class NNTransformerModule(torch.nn.Module): + def __init__(self): + super(NNTransformerModule, self).__init__() + self.transformers = nn.ModuleList( + [ + nn.Transformer( + d_model=2, nhead=2, num_encoder_layers=1, num_decoder_layers=1 + ), + nn.TransformerEncoder( + nn.TransformerEncoderLayer(d_model=2, nhead=2), num_layers=1 + ), + nn.TransformerDecoder( + nn.TransformerDecoderLayer(d_model=2, nhead=2), num_layers=1 + ), + ] + ) + + def forward(self): + input = torch.rand(1, 16, 2) + tgt = torch.rand((1, 16, 2)) + r = self.transformers[0](input, tgt) + r = self.transformers[1](input) + r = self.transformers[2](input, tgt) + return len(r) + + +class NNLinearModule(torch.nn.Module): + def __init__(self): + super(NNLinearModule, self).__init__() + self.linears = nn.ModuleList( + [ + nn.Identity(54), + nn.Linear(20, 20), + nn.Bilinear(20, 20, 40), + # nn.LazyLinear(20, 30), + ] + ) + + def forward(self): + input = torch.randn(32, 20) + r = self.linears[0](input) + r = self.linears[1](input) + r = self.linears[2](input, input) + return len(r) + + +class NNDropoutModule(torch.nn.Module): + def __init__(self): + super(NNDropoutModule, self).__init__() + + def forward(self): + a = torch.randn(8, 4) + b = torch.randn(8, 4, 4, 4) + c = torch.randn(8, 4, 4, 4, 4) + return len( + F.dropout(a), + F.dropout2d(b), + F.dropout3d(c), + F.alpha_dropout(a), + F.feature_alpha_dropout(c), + ) + + +class NNSparseModule(torch.nn.Module): + def __init__(self): + super(NNSparseModule, self).__init__() + + def forward(self): + input = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 9]]) + input2 = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9]) + embedding_matrix = torch.rand(10, 3) + offsets = torch.tensor([0, 4]) + return len( + F.embedding(input, embedding_matrix), + F.embedding_bag(input2, embedding_matrix, offsets), + F.one_hot(torch.arange(0, 5) % 3, num_classes=5), + ) + + +class NNDistanceModule(torch.nn.Module): + def __init__(self): + super(NNDistanceModule, self).__init__() + + def forward(self): + a = torch.randn(8, 4) + b = torch.randn(8, 4) + return len( + F.pairwise_distance(a, b), + F.cosine_similarity(a, b), + F.pdist(a), + ) + + +class NNLossFunctionModule(torch.nn.Module): + def __init__(self): + super(NNLossFunctionModule, self).__init__() + self.x = torch.FloatTensor([[0.1, 0.2, 0.4, 0.8]]) + self.y = torch.LongTensor([[3, 0, -1, 1]]) + + def forward(self): + a = torch.randn(3, 2) + b = torch.rand(3, 2) + c = torch.rand(3) + log_probs = torch.randn(50, 16, 20).log_softmax(2).detach() + targets = torch.randint(1, 20, (16, 30), dtype=torch.long) + input_lengths = torch.full((16,), 50, dtype=torch.long) + target_lengths = torch.randint(10, 30, (16,), dtype=torch.long) + return len( + F.binary_cross_entropy(torch.sigmoid(a), b), + F.binary_cross_entropy_with_logits(torch.sigmoid(a), b), + F.poisson_nll_loss(a, b), + F.cosine_embedding_loss(a, b, c), + F.cross_entropy(a, b), + F.ctc_loss(log_probs, targets, input_lengths, target_lengths), + # F.gaussian_nll_loss(a, b, torch.ones(5, 1)), # ENTER is not supported in mobile module + F.hinge_embedding_loss(a, b), + F.kl_div(a, b), + F.l1_loss(a, b), + F.mse_loss(a, b), + F.margin_ranking_loss(c, c, c), + F.multilabel_margin_loss(self.x, self.y), + F.multilabel_soft_margin_loss(self.x, self.y), + F.multi_margin_loss(self.x, torch.tensor([3])), + F.nll_loss(a, torch.tensor([1, 0, 1])), 
+ F.huber_loss(a, b), + F.smooth_l1_loss(a, b), + F.soft_margin_loss(a, b), + F.triplet_margin_loss(a, b, -b), + # F.triplet_margin_with_distance_loss(a, b, -b), # can't take variable number of arguments + ) + + +class NNVisionModule(torch.nn.Module): + def __init__(self): + super(NNVisionModule, self).__init__() + self.input = torch.randn(1, 4, 9, 9) + self.vision_modules = nn.ModuleList( + [ + nn.PixelShuffle(2), + nn.PixelUnshuffle(3), + nn.Upsample(scale_factor=2, mode="nearest"), + nn.Upsample(scale_factor=2, mode="bilinear"), + nn.Upsample(scale_factor=2, mode="bicubic"), + nn.UpsamplingNearest2d(scale_factor=2), + nn.UpsamplingBilinear2d(scale_factor=2), + ] + ) + self.linear_sample = nn.Upsample(scale_factor=2, mode="linear") + self.trilinear_sample = nn.Upsample(scale_factor=2, mode="trilinear") + + def forward(self): + input = torch.randn(1, 3, 16, 16) + for i, module in enumerate(self.vision_modules): + r = module(self.input) + return len( + r, + self.linear_sample(torch.randn(4, 9, 9)), + self.trilinear_sample(torch.randn(1, 3, 4, 9, 9)), + F.grid_sample(input, torch.ones(1, 4, 4, 2)), + ) + + +class NNShuffleModule(torch.nn.Module): + def __init__(self): + super(NNShuffleModule, self).__init__() + self.shuffle = nn.ChannelShuffle(2) + + def forward(self): + return len(self.shuffle(torch.randn(1, 4, 2, 2)),) + + +class NNUtilsModule(torch.nn.Module): + def __init__(self): + super(NNUtilsModule, self).__init__() + self.flatten = nn.Sequential( + nn.Linear(50, 50), + nn.Unflatten(1, (2, 5, 5)) + ) + + def forward(self): + a = [torch.tensor([1, 2, 3]), torch.tensor([3, 4])] + b = nn.utils.rnn.pad_sequence(a, batch_first=True) + # c = nn.utils.rnn.pack_padded_sequence(b, batch_first=True, lengths=torch.tensor([3, 2])) + input = torch.randn(2, 50) + return len( + self.flatten(input), + b, + ) diff --git a/test/mobile/model_test/quantization_ops.py b/test/mobile/model_test/quantization_ops.py new file mode 100644 index 00000000000000..d0fdb346545e7d --- /dev/null +++ b/test/mobile/model_test/quantization_ops.py @@ -0,0 +1,227 @@ +import torch +import torch.nn as nn + + +class GeneralQuantModule(torch.nn.Module): + def __init__(self): + super(GeneralQuantModule, self).__init__() + self.embedding = torch.nn.quantized.Embedding( + num_embeddings=10, embedding_dim=12 + ) + self.embedding_input = torch.tensor([9, 6, 5, 7, 8, 8, 9, 2, 8]) + self.func = torch.nn.quantized.QFunctional() + self.conv1 = torch.nn.quantized.ConvTranspose1d(16, 33, 3, stride=2) + self.conv2 = torch.nn.quantized.ConvTranspose2d(16, 33, 3, stride=2) + self.conv3 = torch.nn.quantized.ConvTranspose3d(16, 33, 3, stride=2) + + def forward(self): + a = torch.quantize_per_tensor(torch.tensor([3.0]), 1.0, 0, torch.qint32) + b = torch.quantize_per_tensor(torch.tensor(4.0), 1.0, 0, torch.qint32) + c = torch.quantize_per_tensor( + torch.tensor([3.0]), torch.tensor(1.0), torch.tensor(0), torch.qint32 + ) + input1 = torch.randn(1, 16, 4) + input2 = torch.randn(1, 16, 4, 4) + input3 = torch.randn(1, 16, 4, 4, 4) + return len( + self.func.add(a, b), + self.func.cat((a, a), 0), + self.func.mul(a, b), + self.func.add_relu(a, b), + self.func.add_scalar(a, b), + self.func.mul_scalar(a, b), + self.embedding(self.embedding_input), + self.conv1( + torch.quantize_per_tensor( + input1, scale=1.0, zero_point=0, dtype=torch.quint8 + ) + ), + self.conv2( + torch.quantize_per_tensor( + input2, scale=1.0, zero_point=0, dtype=torch.quint8 + ) + ), + c, + # self.conv3(torch.quantize_per_tensor(input3, scale=1.0, zero_point=0, 
dtype=torch.quint8)), # failed on iOS + ) + + +class DynamicQuantModule: + def __init__(self): + super(DynamicQuantModule, self).__init__() + self.module = self.M() + + def getModule(self): + return torch.quantization.quantize_dynamic(self.module, dtype=torch.qint8) + + class M(torch.nn.Module): + def __init__(self): + super(DynamicQuantModule.M, self).__init__() + self.rnn = nn.RNN(4, 8, 2) + self.rnncell = nn.RNNCell(4, 8) + self.gru = nn.GRU(4, 8, 2) + self.grucell = nn.GRUCell(4, 8) + self.lstm = nn.LSTM(4, 8, 2) + self.lstmcell = nn.LSTMCell(4, 8) + self.linears = nn.ModuleList( + [ + nn.Identity(54), + nn.Linear(20, 20), + nn.Bilinear(20, 20, 40), + ] + ) + self.transformers = nn.ModuleList( + [ + nn.Transformer( + d_model=2, nhead=2, num_encoder_layers=1, num_decoder_layers=1 + ), + nn.TransformerEncoder( + nn.TransformerEncoderLayer(d_model=2, nhead=2), num_layers=1 + ), + nn.TransformerDecoder( + nn.TransformerDecoderLayer(d_model=2, nhead=2), num_layers=1 + ), + ] + ) + # self.a = torch.nn.utils.rnn.pad_sequence([torch.tensor([1,2,3]), torch.tensor([3,4])], batch_first=True) + + def forward(self): + input = torch.randn(5, 3, 4) + h = torch.randn(2, 3, 8) + c = torch.randn(2, 3, 8) + linear_input = torch.randn(32, 20) + trans_input = torch.randn(1, 16, 2) + tgt = torch.rand(1, 16, 2) + + return len(( + self.rnn(input, h), + self.rnncell(input[0], h[0]), + self.gru(input, h), + self.grucell(input[0], h[0]), + self.lstm(input, (h, c)), + # self.lstm(torch.nn.utils.rnn.pack_padded_sequence(self.a, lengths=torch.tensor([3,2,1])), (h, c)), + self.lstmcell(input[0], (h[0], c[0])), + self.transformers[0](trans_input, tgt), + self.transformers[1](trans_input), + self.transformers[2](trans_input, tgt), + self.linears[0](linear_input), + self.linears[1](linear_input), + self.linears[2](linear_input, linear_input), + )) + + +class StaticQuantModule: + def __init__(self): + super(StaticQuantModule, self).__init__() + + def getModule(self): + model_fp32 = self.M() + model_fp32.eval() + model_fp32.qconfig = torch.quantization.get_default_qconfig("qnnpack") + model_fp32_prepared = torch.quantization.prepare(model_fp32) + model_int8 = torch.quantization.convert(model_fp32_prepared) + return model_int8 + + class M(torch.nn.Module): + def __init__(self): + super(StaticQuantModule.M, self).__init__() + self.quant = torch.quantization.QuantStub() + self.input1d = torch.randn(4, 2, 2) + self.input2d = torch.randn((4, 2, 4, 4)) + self.input3d = torch.randn(4, 2, 2, 4, 4) + self.linear_input = torch.randn(32, 20) + + self.layer1 = nn.Sequential( + nn.Conv1d(2, 2, 1), nn.InstanceNorm1d(1), nn.Hardswish() + ) + self.layer2 = nn.Sequential( + nn.Conv2d(2, 2, 1), + nn.BatchNorm2d(2), + nn.InstanceNorm2d(1), + nn.LeakyReLU(), + ) + self.layer3 = nn.Sequential( + nn.Conv3d(2, 2, 1), nn.BatchNorm3d(2), nn.InstanceNorm3d(1), nn.ReLU() + ) + self.layer4 = nn.Sequential(nn.Linear(4, 3)) + self.dequant = torch.quantization.DeQuantStub() + + def forward(self): + x = self.quant(self.input1d) + x = self.layer1(x) + x = self.dequant(x) + + y = self.input2d + y = self.quant(y) + y = self.layer2(y) + y = self.layer4(y) + y = self.dequant(y) + + z = self.quant(self.input3d) + z = self.layer3(z) + z = self.dequant(z) + + return (x, y, z) + + +class FusedQuantModule: + def __init__(self): + super(FusedQuantModule, self).__init__() + + def getModule(self): + model_fp32 = self.M() + model_fp32.eval() + model_fp32.qconfig = torch.quantization.get_default_qconfig("qnnpack") + model_fp32_fused = torch.quantization.fuse_modules( 
+ model_fp32, + [ + ["conv1d", "relu1"], + ["conv2d", "relu2"], + ["conv3d", "relu3"], + ["linear", "relu4"], + ], + ) + model_fp32_prepared = torch.quantization.prepare(model_fp32_fused) + model_int8 = torch.quantization.convert(model_fp32_prepared) + return model_int8 + + class M(torch.nn.Module): + def __init__(self): + super(FusedQuantModule.M, self).__init__() + self.quant = torch.quantization.QuantStub() + self.input1d = torch.randn(4, 2, 2) + self.input2d = torch.randn((4, 2, 4, 4)) + self.input3d = torch.randn(4, 2, 2, 4, 4) + self.conv1d = nn.Conv1d(2, 2, 1) + self.conv2d = nn.Conv2d(2, 2, 1) + self.conv3d = nn.Conv3d(2, 2, 1) + self.linear = nn.Linear(4, 2) + self.relu1 = nn.ReLU() + self.relu2 = nn.ReLU() + self.relu3 = nn.ReLU() + self.relu4 = nn.ReLU() + self.dequant = torch.quantization.DeQuantStub() + + def forward(self): + x = self.input1d + y = self.input2d + z = self.input3d + + x = self.quant(x) + x = self.conv1d(x) + x = self.relu1(x) + x = self.dequant(x) + + y = self.quant(y) + y = self.conv2d(y) + y = self.relu2(y) + y = self.dequant(y) + + z = self.quant(z) + z = self.conv3d(z) + z = self.relu3(z) + z = self.linear(z) + z = self.relu4(z) + z = self.dequant(z) + + return (x, y, z) diff --git a/test/mobile/model_test/sampling_ops.py b/test/mobile/model_test/sampling_ops.py new file mode 100644 index 00000000000000..a1ac71a3a31903 --- /dev/null +++ b/test/mobile/model_test/sampling_ops.py @@ -0,0 +1,37 @@ +import torch + + +# https://pytorch.org/docs/stable/torch.html#random-sampling + +class SamplingOpsModule(torch.nn.Module): + def __init__(self): + super(SamplingOpsModule, self).__init__() + + def forward(self): + a = torch.empty(3, 3).uniform_(0.0, 1.0) + size = (1, 4) + weights = torch.tensor([0, 10, 3, 0], dtype=torch.float) + return len( + # torch.seed(), + # torch.manual_seed(0), + torch.bernoulli(a), + # torch.initial_seed(), + torch.multinomial(weights, 2), + torch.normal(2.0, 3.0, size), + torch.poisson(a), + torch.rand(2, 3), + torch.rand_like(a), + torch.randint(10, size), + torch.randint_like(a, 4), + torch.rand(4), + torch.randn_like(a), + torch.randperm(4), + a.bernoulli_(), + a.cauchy_(), + a.exponential_(), + a.geometric_(0.5), + a.log_normal_(), + a.normal_(), + a.random_(), + a.uniform_(), + ) diff --git a/test/mobile/model_test/tensor_ops.py b/test/mobile/model_test/tensor_ops.py new file mode 100644 index 00000000000000..9e04c6703d27cd --- /dev/null +++ b/test/mobile/model_test/tensor_ops.py @@ -0,0 +1,279 @@ +import torch + + +class TensorOpsModule(torch.nn.Module): + def __init__(self): + super(TensorOpsModule, self).__init__() + + def forward(self): + return self.tensor_general_ops() + + def tensor_general_ops(self): + a = torch.randn(4) + b = torch.tensor([1.5]) + x = torch.ones((2,)) + c = torch.randn(4, dtype=torch.cfloat) + w = torch.rand(4, 4, 4, 4) + v = torch.rand(4, 4, 4, 4) + return len( + # torch.is_tensor(a), + # torch.is_storage(a), + torch.is_complex(a), + torch.is_conj(a), + torch.is_floating_point(a), + torch.is_nonzero(b), + # torch.set_default_dtype(torch.float32), + # torch.get_default_dtype(), + # torch.set_default_tensor_type(torch.DoubleTensor), + torch.numel(a), + # torch.set_printoptions(), + # torch.set_flush_denormal(False), + # https://pytorch.org/docs/stable/tensors.html#tensor-class-reference + # x.new_tensor([[0, 1], [2, 3]]), + x.new_full((3, 4), 3.141592), + x.new_empty((2, 3)), + x.new_ones((2, 3)), + x.new_zeros((2, 3)), + x.is_cuda, + x.is_quantized, + x.is_meta, + x.device, + x.dim(), + c.real, + c.imag, + # 
x.backward(), + x.clone(), + w.contiguous(), + w.contiguous(memory_format=torch.channels_last), + w.copy_(v), + w.copy_(1), + w.copy_(0.5), + x.cpu(), + # x.cuda(), + # x.data_ptr(), + x.dense_dim(), + w.fill_diagonal_(0), + w.element_size(), + w.exponential_(), + w.fill_(0), + w.geometric_(0.5), + a.index_fill(0, torch.tensor([0, 2]), 1), + a.index_put_([torch.argmax(a)], torch.tensor(1.0)), + a.index_put([torch.argmax(a)], torch.tensor(1.0)), + w.is_contiguous(), + c.is_complex(), + w.is_conj(), + w.is_floating_point(), + w.is_leaf, + w.is_pinned(), + w.is_set_to(w), + # w.is_shared, + w.is_coalesced(), + w.coalesce(), + w.is_signed(), + w.is_sparse, + torch.tensor([1]).item(), + x.log_normal_(), + # x.masked_scatter_(), + # x.masked_scatter(), + # w.normal(), + w.numel(), + # w.pin_memory(), + # w.put_(0, torch.tensor([0, 1], w)), + x.repeat(4, 2), + a.clamp_(0), + a.clamp(0), + a.clamp_min(0), + a.hardsigmoid_(), + a.hardsigmoid(), + a.hardswish_(), + a.hardswish(), + a.hardtanh_(), + a.hardtanh(), + a.leaky_relu_(), + a.leaky_relu(), + a.relu_(), + a.relu(), + a.resize_as_(a), + a.type_as(a), + a._shape_as_tensor(), + a.requires_grad_(False), + ) + + +class TensorCreationOpsModule(torch.nn.Module): + def __init__(self): + super(TensorCreationOpsModule, self).__init__() + + def forward(self): + return self.tensor_creation_ops() + + def tensor_creation_ops(self): + i = torch.tensor([[0, 1, 1], [2, 0, 2]]) + v = torch.tensor([3, 4, 5], dtype=torch.float32) + real = torch.tensor([1, 2], dtype=torch.float32) + imag = torch.tensor([3, 4], dtype=torch.float32) + inp = torch.tensor([-1.5, 0.0, 2.0]) + values = torch.tensor([0.5]) + quantized = torch.quantize_per_channel( + torch.tensor([[-1.0, 0.0], [1.0, 2.0]]), + torch.tensor([0.1, 0.01]), + torch.tensor([10, 0]), + 0, + torch.quint8, + ) + return len( + torch.tensor([[0.1, 1.2], [2.2, 3.1], [4.9, 5.2]]), + # torch.sparse_coo_tensor(i, v, [2, 3]), # not work for iOS + torch.as_tensor([1, 2, 3]), + torch.as_strided(torch.randn(3, 3), (2, 2), (1, 2)), + torch.zeros(2, 3), + torch.zeros((2, 3)), + torch.zeros([2, 3], out=i), + torch.zeros(5), + torch.zeros_like(torch.empty(2, 3)), + torch.ones(2, 3), + torch.ones((2, 3)), + torch.ones([2, 3]), + torch.ones(5), + torch.ones_like(torch.empty(2, 3)), + torch.arange(5), + torch.arange(1, 4), + torch.arange(1, 2.5, 0.5), + torch.range(1, 4), + torch.range(1, 4, 0.5), + torch.linspace(3.0, 3.0, steps=1), + torch.logspace(start=2, end=2, steps=1, base=2.0), + torch.eye(3), + torch.empty(2, 3), + torch.empty_like(torch.empty(2, 3), dtype=torch.int64), + torch.empty_strided((2, 3), (1, 2)), + torch.full((2, 3), 3.141592), + torch.full_like(torch.full((2, 3), 3.141592), 2.71828), + torch.quantize_per_tensor( + torch.tensor([-1.0, 0.0, 1.0, 2.0]), 0.1, 10, torch.quint8 + ), + torch.dequantize(quantized), + torch.complex(real, imag), + torch.polar(real, imag), + torch.heaviside(inp, values), + ) + + +class TensorIndexingOpsModule(torch.nn.Module): + def __init__(self): + super(TensorIndexingOpsModule, self).__init__() + + def forward(self): + return self.tensor_indexing_ops() + + def tensor_indexing_ops(self): + x = torch.randn(2, 4) + y = torch.randn(4, 4) + t = torch.tensor([[0, 0], [1, 0]]) + mask = x.ge(0.5) + i = [0, 1] + return len( + torch.cat((x, x, x), 0), + torch.concat((x, x, x), 0), + torch.conj(x), + torch.chunk(x, 2), + torch.dsplit(torch.randn(2, 2, 4), i), + torch.column_stack((x, x)), + torch.dstack((x, x)), + torch.gather(x, 0, t), + torch.hsplit(x, i), + torch.hstack((x, x)), + 
torch.index_select(x, 0, torch.tensor([0, 1])), + x.index(t), + torch.masked_select(x, mask), + torch.movedim(x, 1, 0), + torch.moveaxis(x, 1, 0), + torch.narrow(x, 0, 0, 2), + torch.nonzero(x), + torch.permute(x, (0, 1)), + torch.reshape(x, (-1,)), + torch.row_stack((x, x)), + torch.select(x, 0, 0), + torch.scatter(x, 0, t, x), + x.scatter(0, t, x.clone()), + torch.diagonal_scatter(y, torch.ones(4)), + torch.select_scatter(y, torch.ones(4), 0, 0), + torch.slice_scatter(x, x), + torch.scatter_add(x, 0, t, x), + x.scatter_(0, t, y), + x.scatter_add_(0, t, y), + # torch.scatter_reduce(x, 0, t, reduce="sum"), + torch.split(x, 1), + torch.squeeze(x, 0), + torch.stack([x, x]), + torch.swapaxes(x, 0, 1), + torch.swapdims(x, 0, 1), + torch.t(x), + torch.take(x, t), + torch.take_along_dim(x, torch.argmax(x)), + torch.tensor_split(x, 1), + torch.tensor_split(x, [0, 1]), + torch.tile(x, (2, 2)), + torch.transpose(x, 0, 1), + torch.unbind(x), + torch.unsqueeze(x, -1), + torch.vsplit(x, i), + torch.vstack((x, x)), + torch.where(x), + torch.where(t > 0, t, 0), + torch.where(t > 0, t, t), + ) + + +class TensorTypingOpsModule(torch.nn.Module): + def __init__(self): + super(TensorTypingOpsModule, self).__init__() + + def forward(self): + return self.tensor_typing_ops() + + def tensor_typing_ops(self): + x = torch.randn(1, 3, 4, 4) + return len( + x.to(torch.float), + x.to(torch.double), + x.to(torch.cfloat), + x.to(torch.cdouble), + x.to(torch.half), + x.to(torch.bfloat16), + x.to(torch.uint8), + x.to(torch.int8), + x.to(torch.short), + x.to(torch.int), + x.to(torch.long), + x.to(torch.bool), + x.to(torch.device("cpu")), + x.to(device="cpu", dtype=torch.float), + x.to(memory_format=torch.channels_last), + ) + + +class TensorViewOpsModule(torch.nn.Module): + def __init__(self): + super(TensorViewOpsModule, self).__init__() + + def forward(self): + return self.tensor_view_ops() + + def tensor_view_ops(self): + x = torch.randn(4, 4, 1) + y = torch.randn(4, 4, 2) + return len( + x[0, 2:], + x.detach(), + x.detach_(), + x.diagonal(), + x.expand(-1, -1, 3), + x.expand_as(y), + x.select(0, 1), + x.unflatten(1, (2, 2)), + x.unfold(1, 2, 2), + x.view(16), + x.view_as(torch.randn(16)), + ) diff --git a/test/mobile/model_test/torchvision_models.py b/test/mobile/model_test/torchvision_models.py new file mode 100644 index 00000000000000..232afbc54b1eee --- /dev/null +++ b/test/mobile/model_test/torchvision_models.py @@ -0,0 +1,24 @@ +import torch +import torchvision +from torch.utils.bundled_inputs import augment_model_with_bundled_inputs +from torch.utils.mobile_optimizer import optimize_for_mobile + + +class MobileNetV2Module: + def __init__(self): + super(MobileNetV2Module, self).__init__() + + def getModule(self): + model = torchvision.models.mobilenet_v2(pretrained=True) + model.eval() + example = torch.zeros(1, 3, 224, 224) + traced_script_module = torch.jit.trace(model, example) + optimized_module = optimize_for_mobile(traced_script_module) + augment_model_with_bundled_inputs( + optimized_module, + [ + (example, ), + ], + ) + optimized_module(example) + return optimized_module diff --git a/test/mobile/model_test/update_production_ops.py b/test/mobile/model_test/update_production_ops.py new file mode 100644 index 00000000000000..6bb685e6296d4f --- /dev/null +++ b/test/mobile/model_test/update_production_ops.py @@ -0,0 +1,35 @@ +""" +This is a script to aggregate production ops from xplat/pytorch_models/build/all_mobile_model_configs.yaml. +Specify the file path in the first argument. 
The results will be dumped to model_ops.yaml """ + +import sys +import yaml + +root_operators = {} +traced_operators = {} +kernel_metadata = {} + +with open(sys.argv[1]) as input_yaml_file: + model_infos = yaml.safe_load(input_yaml_file) + for info in model_infos: + for op in info["root_operators"]: + # aggregate occurrence per op + root_operators[op] = 1 + (root_operators[op] if op in root_operators else 0) + for op in info["traced_operators"]: + # aggregate occurrence per op + traced_operators[op] = 1 + (traced_operators[op] if op in traced_operators else 0) + # merge dtypes for each kernel + for kernel, dtypes in info["kernel_metadata"].items(): + new_dtypes = dtypes + (kernel_metadata[kernel] if kernel in kernel_metadata else []) + kernel_metadata[kernel] = list(set(new_dtypes)) + + +# Only test these built-in ops. No custom ops or non-CPU ops. +namespaces = ["aten", "prepacked", "prim", "quantized"] +root_operators = {x: root_operators[x] for x in root_operators if x.split("::")[0] in namespaces} +traced_operators = {x: traced_operators[x] for x in traced_operators if x.split("::")[0] in namespaces} + +out_path = "test/mobile/model_test/model_ops.yaml" +with open(out_path, "w") as f: + yaml.safe_dump({"root_operators": root_operators}, f) diff --git a/test/mobile/test_lite_script_module.py b/test/mobile/test_lite_script_module.py index 90abdab4ceea8a..638ac37eb88b39 100644 --- a/test/mobile/test_lite_script_module.py +++ b/test/mobile/test_lite_script_module.py @@ -522,6 +522,49 @@ def forward(self, x): input = torch.randn(4, 1, 4, 4) self._compare_script_and_mobile(model=model_int8, input=input) + def test_bundled_input_with_dynamic_type(self): + class Model(torch.nn.Module): + def __init__(self): + super(Model, self).__init__() + + def forward( + self, + x: Dict[int, torch.Tensor], + y: Dict[int, torch.Tensor], + z: Dict[int, torch.Tensor], + ): + return x + + model = Model() + script_module = torch.jit.script(model) + + sample_input = { + script_module.forward: [ + ( + {0: torch.ones(1)}, + {1: torch.ones(1)}, + {2: torch.ones(1)}, + ) + ] + } + + bundled_model = torch.utils.bundled_inputs.bundle_inputs( + script_module, sample_input + ) + + buf = bundled_model._save_to_buffer_for_lite_interpreter() + mobile_module = _load_for_lite_interpreter(io.BytesIO(buf)) + + i = mobile_module.run_method("get_all_bundled_inputs") + + self.assertEqual( + i[0], + ( + {0: torch.ones(1)}, + {1: torch.ones(1)}, + {2: torch.ones(1)}, + ), + ) if __name__ == '__main__': run_tests() diff --git a/test/onnx/autograd_helper.py b/test/onnx/autograd_helper.py new file mode 100644 index 00000000000000..a5c07bf1a26c58 --- /dev/null +++ b/test/onnx/autograd_helper.py @@ -0,0 +1,18 @@ +# Owner(s): ["module: onnx"] + +import torch + +# Autograd function that is a replica of the autograd function in +# test_utility_funs.py (test_autograd_module_name) +class CustomFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + return input.clamp(min=0) + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + grad_input = grad_output.clone() + grad_input[input < 0] = 0 + return grad_input diff --git a/test/onnx/expect/TestOperators.test_acos.expect b/test/onnx/expect/TestOperators.test_acos.expect index bcf9463956104c..40fc61e29b7f9a 100644 --- a/test/onnx/expect/TestOperators.test_acos.expect +++ b/test/onnx/expect/TestOperators.test_acos.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version:
"CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_add_broadcast.expect b/test/onnx/expect/TestOperators.test_add_broadcast.expect index 72d69469339fe3..569b2400df8819 100644 --- a/test/onnx/expect/TestOperators.test_add_broadcast.expect +++ b/test/onnx/expect/TestOperators.test_add_broadcast.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_add_left_broadcast.expect b/test/onnx/expect/TestOperators.test_add_left_broadcast.expect index 81a0689a51bd82..ffa632ca475b8c 100644 --- a/test/onnx/expect/TestOperators.test_add_left_broadcast.expect +++ b/test/onnx/expect/TestOperators.test_add_left_broadcast.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_add_size1_broadcast.expect b/test/onnx/expect/TestOperators.test_add_size1_broadcast.expect index ffdf6efd6228dc..9917880a8a228f 100644 --- a/test/onnx/expect/TestOperators.test_add_size1_broadcast.expect +++ b/test/onnx/expect/TestOperators.test_add_size1_broadcast.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect b/test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect index 72d69469339fe3..569b2400df8819 100644 --- a/test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect +++ b/test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect b/test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect index a00ddab51e96ef..96d2dca593256a 100644 --- a/test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect +++ b/test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_addconstant.expect b/test/onnx/expect/TestOperators.test_addconstant.expect index 8494b62a3ed716..0e1570eb62da57 100644 --- a/test/onnx/expect/TestOperators.test_addconstant.expect +++ b/test/onnx/expect/TestOperators.test_addconstant.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_addmm.expect b/test/onnx/expect/TestOperators.test_addmm.expect index ee46983d9e4129..1ef0a81e2a9054 100644 --- a/test/onnx/expect/TestOperators.test_addmm.expect +++ b/test/onnx/expect/TestOperators.test_addmm.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" 
graph { @@ -102,5 +102,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_argmax.expect b/test/onnx/expect/TestOperators.test_argmax.expect index a10b3bbe32899d..38add716ff3677 100644 --- a/test/onnx/expect/TestOperators.test_argmax.expect +++ b/test/onnx/expect/TestOperators.test_argmax.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -17,6 +17,11 @@ graph { i: 0 type: INT } + attribute { + name: "select_last_index" + i: 0 + type: INT + } } name: "torch_jit" input { @@ -50,5 +55,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_asin.expect b/test/onnx/expect/TestOperators.test_asin.expect index a6197dea5ffe6e..f5a44b850eb1c6 100644 --- a/test/onnx/expect/TestOperators.test_asin.expect +++ b/test/onnx/expect/TestOperators.test_asin.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_at_op.expect b/test/onnx/expect/TestOperators.test_at_op.expect index 46f9008a6ea5f6..8d4ba07ddcc854 100644 --- a/test/onnx/expect/TestOperators.test_at_op.expect +++ b/test/onnx/expect/TestOperators.test_at_op.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -13,6 +13,11 @@ graph { s: "add" type: STRING } + attribute { + name: "overload_name" + s: "" + type: STRING + } } name: "torch_jit" input { @@ -49,5 +54,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_atan.expect b/test/onnx/expect/TestOperators.test_atan.expect index d9a034a2504271..c8d189e1415ef2 100644 --- a/test/onnx/expect/TestOperators.test_atan.expect +++ b/test/onnx/expect/TestOperators.test_atan.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_avg_pool2d.expect b/test/onnx/expect/TestOperators.test_avg_pool2d.expect index cb5da7e05037aa..344022ec268877 100644 --- a/test/onnx/expect/TestOperators.test_avg_pool2d.expect +++ b/test/onnx/expect/TestOperators.test_avg_pool2d.expect @@ -1,40 +1,43 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { + node { + output: "onnx::Pad_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 8 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } node { input: "onnx::Pad_0" - output: "onnx::AveragePool_1" - name: "Pad_0" + input: "onnx::Pad_1" + output: "onnx::AveragePool_2" + name: "Pad_1" op_type: "Pad" attribute { name: "mode" s: "constant" type: STRING } - attribute { - name: "pads" - ints: 0 - ints: 0 - ints: 0 - ints: 0 - ints: 0 - ints: 0 - ints: 0 - ints: 0 - type: INTS - } - attribute { - name: "value" - f: 0 - type: FLOAT - } } node { - input: "onnx::AveragePool_1" - output: "2" - name: "AveragePool_1" + input: "onnx::AveragePool_2" + output: "3" + name: 
"AveragePool_2" op_type: "AveragePool" + attribute { + name: "ceil_mode" + i: 0 + type: INT + } attribute { name: "kernel_shape" ints: 3 @@ -80,7 +83,7 @@ graph { } } output { - name: "2" + name: "3" type { tensor_type { elem_type: 1 @@ -103,5 +106,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_baddbmm.expect b/test/onnx/expect/TestOperators.test_baddbmm.expect index 66fe45123b9f18..fc7eb0f8295e64 100644 --- a/test/onnx/expect/TestOperators.test_baddbmm.expect +++ b/test/onnx/expect/TestOperators.test_baddbmm.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -119,5 +119,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_basic.expect b/test/onnx/expect/TestOperators.test_basic.expect index 88d53eb0ff75d5..3d151aefabdb13 100644 --- a/test/onnx/expect/TestOperators.test_basic.expect +++ b/test/onnx/expect/TestOperators.test_basic.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -76,5 +76,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_batchnorm.expect b/test/onnx/expect/TestOperators.test_batchnorm.expect index 1bd402f6533eb9..d9c9ec338c8cb4 100644 --- a/test/onnx/expect/TestOperators.test_batchnorm.expect +++ b/test/onnx/expect/TestOperators.test_batchnorm.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -145,5 +145,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_batchnorm_1d.expect b/test/onnx/expect/TestOperators.test_batchnorm_1d.expect index 426fb72af70207..a4d2e1f102498a 100644 --- a/test/onnx/expect/TestOperators.test_batchnorm_1d.expect +++ b/test/onnx/expect/TestOperators.test_batchnorm_1d.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -133,5 +133,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect b/test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect index 88f4fdc578f140..a421443cdcda51 100644 --- a/test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect +++ b/test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -135,5 +135,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect b/test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect index 0d80bedd8ab5c8..a556e38c7198a5 100644 --- a/test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect +++ b/test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -93,5 +93,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_batchnorm_training.expect b/test/onnx/expect/TestOperators.test_batchnorm_training.expect index 9090a8ff187779..5e8f2049e14337 100644 --- a/test/onnx/expect/TestOperators.test_batchnorm_training.expect +++ b/test/onnx/expect/TestOperators.test_batchnorm_training.expect @@ -1,4 +1,4 @@ 
-ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -149,5 +149,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_chunk.expect b/test/onnx/expect/TestOperators.test_chunk.expect index f4973676048086..575245c807eb63 100644 --- a/test/onnx/expect/TestOperators.test_chunk.expect +++ b/test/onnx/expect/TestOperators.test_chunk.expect @@ -1,28 +1,158 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Split_0" - output: "1" - output: "2" - name: "Split_0" - op_type: "Split" + input: "onnx::Shape_0" + output: "onnx::Gather_1" + name: "Shape_0" + op_type: "Shape" + } + node { + output: "onnx::Gather_2" + name: "Constant_1" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Gather_1" + input: "onnx::Gather_2" + output: "onnx::Add_3" + name: "Gather_2" + op_type: "Gather" attribute { name: "axis" i: 0 type: INT } + } + node { + output: "onnx::Slice_4" + name: "Constant_3" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Add_5" + name: "Constant_4" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Add_3" + input: "onnx::Add_5" + output: "onnx::Div_6" + name: "Add_5" + op_type: "Add" + } + node { + output: "onnx::Div_7" + name: "Constant_6" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Div_6" + input: "onnx::Div_7" + output: "onnx::Mul_8" + name: "Div_7" + op_type: "Div" + } + node { + output: "onnx::Mul_9" + name: "Constant_8" + op_type: "Constant" attribute { - name: "split" - ints: 2 - ints: 1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR } } + node { + input: "onnx::Mul_8" + input: "onnx::Mul_9" + output: "onnx::Slice_10" + name: "Mul_9" + op_type: "Mul" + } + node { + input: "onnx::Shape_0" + input: "onnx::Slice_4" + input: "onnx::Slice_10" + input: "onnx::Gather_2" + output: "11" + name: "Slice_10" + op_type: "Slice" + } + node { + output: "onnx::Mul_12" + name: "Constant_11" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Mul_8" + input: "onnx::Mul_12" + output: "onnx::Slice_13" + name: "Mul_12" + op_type: "Mul" + } + node { + input: "onnx::Shape_0" + input: "onnx::Slice_10" + input: "onnx::Slice_13" + input: "onnx::Gather_2" + output: "14" + name: "Slice_13" + op_type: "Slice" + } name: "torch_jit" input { - name: "onnx::Split_0" + name: "onnx::Shape_0" type { tensor_type { elem_type: 1 @@ -35,7 +165,7 @@ graph { } } output { - name: "1" + name: "11" type { tensor_type { elem_type: 1 @@ -48,7 +178,7 @@ graph { } } output { - name: "2" + name: "14" type { tensor_type { elem_type: 1 @@ -62,5 +192,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_clip.expect b/test/onnx/expect/TestOperators.test_clip.expect index 
50293fd9cf9421..81606851e7851e 100644 --- a/test/onnx/expect/TestOperators.test_clip.expect +++ b/test/onnx/expect/TestOperators.test_clip.expect @@ -1,24 +1,26 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::Clip_0" - output: "1" + input: "onnx::Clip_6" + input: "onnx::Clip_7" + output: "5" name: "Clip_0" op_type: "Clip" - attribute { - name: "max" - f: 0.5 - type: FLOAT - } - attribute { - name: "min" - f: -0.5 - type: FLOAT - } } name: "torch_jit" + initializer { + data_type: 1 + name: "onnx::Clip_6" + raw_data: "\000\000\000\277" + } + initializer { + data_type: 1 + name: "onnx::Clip_7" + raw_data: "\000\000\000?" + } input { name: "onnx::Clip_0" type { @@ -36,7 +38,7 @@ graph { } } output { - name: "1" + name: "5" type { tensor_type { elem_type: 1 @@ -53,5 +55,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_clip_max.expect b/test/onnx/expect/TestOperators.test_clip_max.expect index bb7bd0fc6db2f1..ceb89b3048c670 100644 --- a/test/onnx/expect/TestOperators.test_clip_max.expect +++ b/test/onnx/expect/TestOperators.test_clip_max.expect @@ -1,19 +1,21 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::Clip_0" - output: "1" + input: "" + input: "onnx::Clip_7" + output: "5" name: "Clip_0" op_type: "Clip" - attribute { - name: "max" - f: 0.1 - type: FLOAT - } } name: "torch_jit" + initializer { + data_type: 1 + name: "onnx::Clip_7" + raw_data: "\315\314\314=" + } input { name: "onnx::Clip_0" type { @@ -37,22 +39,22 @@ graph { } } output { - name: "1" + name: "5" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Clip5_dim_0" } dim { - dim_value: 2 + dim_param: "Clip5_dim_1" } dim { - dim_value: 3 + dim_param: "Clip5_dim_2" } dim { - dim_value: 4 + dim_param: "Clip5_dim_3" } } } @@ -60,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_clip_min.expect b/test/onnx/expect/TestOperators.test_clip_min.expect index cda3b105ccbaaf..22826be3fd5434 100644 --- a/test/onnx/expect/TestOperators.test_clip_min.expect +++ b/test/onnx/expect/TestOperators.test_clip_min.expect @@ -1,19 +1,21 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::Clip_0" - output: "1" + input: "onnx::Clip_7" + input: "" + output: "5" name: "Clip_0" op_type: "Clip" - attribute { - name: "min" - f: -0.1 - type: FLOAT - } } name: "torch_jit" + initializer { + data_type: 1 + name: "onnx::Clip_7" + raw_data: "\315\314\314\275" + } input { name: "onnx::Clip_0" type { @@ -37,22 +39,22 @@ graph { } } output { - name: "1" + name: "5" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Clip5_dim_0" } dim { - dim_value: 2 + dim_param: "Clip5_dim_1" } dim { - dim_value: 3 + dim_param: "Clip5_dim_2" } dim { - dim_value: 4 + dim_param: "Clip5_dim_3" } } } @@ -60,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_concat2.expect b/test/onnx/expect/TestOperators.test_concat2.expect index b5102e0f86647d..f5b6aec0c2293e 100644 --- a/test/onnx/expect/TestOperators.test_concat2.expect +++ b/test/onnx/expect/TestOperators.test_concat2.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -65,5 +65,5 @@ graph { } } opset_import { 
- version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_conv.expect b/test/onnx/expect/TestOperators.test_conv.expect index 55fe131ef3ac10..f1078cef39c176 100644 --- a/test/onnx/expect/TestOperators.test_conv.expect +++ b/test/onnx/expect/TestOperators.test_conv.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -118,5 +118,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect b/test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect index 980f9fab61fcbb..18e3c683e9bc92 100644 --- a/test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect +++ b/test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -96,5 +96,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_convtranspose.expect b/test/onnx/expect/TestOperators.test_convtranspose.expect index 22241584dc6c21..0beedca2f2920e 100644 --- a/test/onnx/expect/TestOperators.test_convtranspose.expect +++ b/test/onnx/expect/TestOperators.test_convtranspose.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -124,5 +124,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_cos.expect b/test/onnx/expect/TestOperators.test_cos.expect index 7b08d883c7b802..1185bca62c5975 100644 --- a/test/onnx/expect/TestOperators.test_cos.expect +++ b/test/onnx/expect/TestOperators.test_cos.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dict.expect b/test/onnx/expect/TestOperators.test_dict.expect index 42a4855c818c1b..e041d535d768b4 100644 --- a/test/onnx/expect/TestOperators.test_dict.expect +++ b/test/onnx/expect/TestOperators.test_dict.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dict_str.expect b/test/onnx/expect/TestOperators.test_dict_str.expect index 3e72400d5f421d..eaab2752fb7dcd 100644 --- a/test/onnx/expect/TestOperators.test_dict_str.expect +++ b/test/onnx/expect/TestOperators.test_dict_str.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -63,5 +63,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dim.expect b/test/onnx/expect/TestOperators.test_dim.expect index 77480e6173bb95..59e910a646ca99 100644 --- a/test/onnx/expect/TestOperators.test_dim.expect +++ b/test/onnx/expect/TestOperators.test_dim.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -28,5 +28,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dropout.expect b/test/onnx/expect/TestOperators.test_dropout.expect index 407a1af477f7c8..27aab5c718211c 100644 --- a/test/onnx/expect/TestOperators.test_dropout.expect +++ 
b/test/onnx/expect/TestOperators.test_dropout.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -42,5 +42,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dropout_default.expect b/test/onnx/expect/TestOperators.test_dropout_default.expect index 523ec6bf8e307b..89c0e988aacbcd 100644 --- a/test/onnx/expect/TestOperators.test_dropout_default.expect +++ b/test/onnx/expect/TestOperators.test_dropout_default.expect @@ -1,23 +1,46 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "x" - output: "onnx::ReduceMax_1" - output: "2" - name: "Dropout_0" - op_type: "Dropout" + output: "onnx::Dropout_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + data_type: 1 + raw_data: "\000\000\000?" + } + type: TENSOR + } + } + node { + output: "onnx::Dropout_2" + name: "Constant_1" + op_type: "Constant" attribute { - name: "ratio" - f: 0.5 - type: FLOAT + name: "value" + t { + data_type: 9 + raw_data: "\001" + } + type: TENSOR } } node { - input: "onnx::ReduceMax_1" - output: "3" - name: "ReduceMax_1" + input: "x" + input: "onnx::Dropout_1" + input: "onnx::Dropout_2" + output: "onnx::ReduceMax_3" + output: "4" + name: "Dropout_2" + op_type: "Dropout" + } + node { + input: "onnx::ReduceMax_3" + output: "5" + name: "ReduceMax_3" op_type: "ReduceMax" attribute { name: "keepdims" @@ -43,7 +66,7 @@ graph { } } output { - name: "3" + name: "5" type { tensor_type { elem_type: 1 @@ -54,5 +77,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_dropout_training.expect b/test/onnx/expect/TestOperators.test_dropout_training.expect index 523ec6bf8e307b..89c0e988aacbcd 100644 --- a/test/onnx/expect/TestOperators.test_dropout_training.expect +++ b/test/onnx/expect/TestOperators.test_dropout_training.expect @@ -1,23 +1,46 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "x" - output: "onnx::ReduceMax_1" - output: "2" - name: "Dropout_0" - op_type: "Dropout" + output: "onnx::Dropout_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + data_type: 1 + raw_data: "\000\000\000?" 
+ } + type: TENSOR + } + } + node { + output: "onnx::Dropout_2" + name: "Constant_1" + op_type: "Constant" attribute { - name: "ratio" - f: 0.5 - type: FLOAT + name: "value" + t { + data_type: 9 + raw_data: "\001" + } + type: TENSOR } } node { - input: "onnx::ReduceMax_1" - output: "3" - name: "ReduceMax_1" + input: "x" + input: "onnx::Dropout_1" + input: "onnx::Dropout_2" + output: "onnx::ReduceMax_3" + output: "4" + name: "Dropout_2" + op_type: "Dropout" + } + node { + input: "onnx::ReduceMax_3" + output: "5" + name: "ReduceMax_3" op_type: "ReduceMax" attribute { name: "keepdims" @@ -43,7 +66,7 @@ graph { } } output { - name: "3" + name: "5" type { tensor_type { elem_type: 1 @@ -54,5 +77,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_elu.expect b/test/onnx/expect/TestOperators.test_elu.expect index c43a3827fce974..9fc2d5aab1fed4 100644 --- a/test/onnx/expect/TestOperators.test_elu.expect +++ b/test/onnx/expect/TestOperators.test_elu.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_embedding_bags.expect b/test/onnx/expect/TestOperators.test_embedding_bags.expect index ee9be8e861fb9e..dfa1afddee3010 100644 --- a/test/onnx/expect/TestOperators.test_embedding_bags.expect +++ b/test/onnx/expect/TestOperators.test_embedding_bags.expect @@ -1,42 +1,359 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "weight" - input: "input" - input: "offsets" - output: "3" - output: "4" - output: "5" - output: "6" - op_type: "ATen" + output: "onnx::Cast_3" + op_type: "Constant" attribute { - name: "include_last_offset" - i: 0 + name: "value" + t { + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Cast_3" + output: "onnx::Loop_4" + op_type: "Cast" + attribute { + name: "to" + i: 9 type: INT } + } + node { + output: "5" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "input" + output: "onnx::Gather_6" + op_type: "Shape" + } + node { + output: "onnx::Gather_7" + op_type: "Constant" + attribute { + name: "value" + t { + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Gather_6" + input: "onnx::Gather_7" + output: "onnx::Unsqueeze_8" + op_type: "Gather" attribute { - name: "mode" - i: 1 + name: "axis" + i: 0 type: INT } + } + node { + output: "onnx::Unsqueeze_9" + op_type: "Constant" attribute { - name: "operator" - s: "embedding_bag" - type: STRING + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + input: "onnx::Unsqueeze_8" + input: "onnx::Unsqueeze_9" + output: "onnx::Concat_10" + op_type: "Unsqueeze" + } + node { + input: "offsets" + input: "onnx::Concat_10" + output: "onnx::Slice_11" + op_type: "Concat" attribute { - name: "scale_grad_by_freq" + name: "axis" i: 0 type: INT } + } + node { + output: "onnx::Slice_12" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_13" + op_type: "Constant" + attribute { + name: "value" + t { 
+ dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_14" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\177" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_15" + op_type: "Constant" attribute { - name: "sparse" + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Slice_11" + input: "onnx::Slice_13" + input: "onnx::Slice_14" + input: "onnx::Slice_12" + input: "onnx::Slice_15" + output: "onnx::Shape_16" + op_type: "Slice" + } + node { + input: "onnx::Shape_16" + output: "onnx::Gather_17" + op_type: "Shape" + } + node { + output: "onnx::Gather_18" + op_type: "Constant" + attribute { + name: "value" + t { + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Gather_17" + input: "onnx::Gather_18" + output: "onnx::Loop_19" + op_type: "Gather" + attribute { + name: "axis" i: 0 type: INT } } + node { + input: "onnx::Loop_19" + input: "onnx::Loop_4" + output: "20" + op_type: "Loop" + attribute { + name: "body" + g { + node { + input: "onnx::Slice_11" + input: "21" + output: "23" + name: "Gather_0" + op_type: "Gather" + attribute { + name: "axis" + i: 0 + type: INT + } + } + node { + input: "onnx::Shape_16" + input: "21" + output: "24" + name: "Gather_1" + op_type: "Gather" + attribute { + name: "axis" + i: 0 + type: INT + } + } + node { + output: "25" + name: "Constant_2" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "23" + input: "25" + output: "26" + name: "Unsqueeze_3" + op_type: "Unsqueeze" + } + node { + output: "27" + name: "Constant_4" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "24" + input: "27" + output: "28" + name: "Unsqueeze_5" + op_type: "Unsqueeze" + } + node { + input: "input" + input: "26" + input: "28" + input: "5" + output: "29" + name: "Slice_6" + op_type: "Slice" + } + node { + input: "weight" + input: "29" + output: "30" + name: "Gather_7" + op_type: "Gather" + attribute { + name: "axis" + i: 0 + type: INT + } + } + node { + input: "30" + output: "31" + name: "ReduceMean_8" + op_type: "ReduceMean" + attribute { + name: "axes" + ints: 0 + type: INTS + } + attribute { + name: "keepdims" + i: 0 + type: INT + } + } + node { + input: "onnx::Loop_4" + output: "32" + name: "Cast_9" + op_type: "Cast" + attribute { + name: "to" + i: 9 + type: INT + } + } + name: "torch_jit1" + input { + name: "21" + type { + tensor_type { + elem_type: 7 + shape { + } + } + } + } + input { + name: "22" + type { + tensor_type { + elem_type: 9 + shape { + } + } + } + } + output { + name: "32" + type { + tensor_type { + elem_type: 9 + shape { + } + } + } + } + output { + name: "31" + type { + tensor_type { + elem_type: 1 + shape { + dim { + dim_param: "Loop20_dim_1" + } + } + } + } + } + } + type: GRAPH + } + } name: "torch_jit" initializer { dims: 10 @@ -88,16 +405,16 @@ graph { } } output { - name: "3" + name: "20" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Loop20_dim_0" } dim { - dim_value: 8 + dim_param: "Loop20_dim_1" } } } @@ -105,5 +422,5 @@ graph { } } opset_import { - version: 9 + version: 13 } 
diff --git a/test/onnx/expect/TestOperators.test_empty_like.expect b/test/onnx/expect/TestOperators.test_empty_like.expect index 1293acb1e16fba..e4f6c6ede2cab1 100644 --- a/test/onnx/expect/TestOperators.test_empty_like.expect +++ b/test/onnx/expect/TestOperators.test_empty_like.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -36,5 +36,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_equal.expect b/test/onnx/expect/TestOperators.test_equal.expect index 21c1e7f3caed14..5a9877d484f895 100644 --- a/test/onnx/expect/TestOperators.test_equal.expect +++ b/test/onnx/expect/TestOperators.test_equal.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -72,5 +72,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_erf.expect b/test/onnx/expect/TestOperators.test_erf.expect index 6568ca8418d6a7..f8f70c37598dc8 100644 --- a/test/onnx/expect/TestOperators.test_erf.expect +++ b/test/onnx/expect/TestOperators.test_erf.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -55,5 +55,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_exp.expect b/test/onnx/expect/TestOperators.test_exp.expect index b270bab2097512..49d9f74cb20d98 100644 --- a/test/onnx/expect/TestOperators.test_exp.expect +++ b/test/onnx/expect/TestOperators.test_exp.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_expand.expect b/test/onnx/expect/TestOperators.test_expand.expect index 988830e43c83fd..6634173a0a63aa 100644 --- a/test/onnx/expect/TestOperators.test_expand.expect +++ b/test/onnx/expect/TestOperators.test_expand.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -131,5 +131,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_flatten.expect b/test/onnx/expect/TestOperators.test_flatten.expect index 48def60c9f25ba..12160e8b9e6640 100644 --- a/test/onnx/expect/TestOperators.test_flatten.expect +++ b/test/onnx/expect/TestOperators.test_flatten.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -9,29 +9,59 @@ graph { op_type: "Shape" } node { - input: "onnx::Slice_1" - output: "onnx::Concat_2" - name: "Slice_1" - op_type: "Slice" + output: "onnx::Slice_2" + name: "Constant_1" + op_type: "Constant" attribute { - name: "axes" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_3" + name: "Constant_2" + op_type: "Constant" attribute { - name: "ends" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_4" + name: "Constant_3" + op_type: "Constant" attribute { - name: "starts" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } } 
node { - output: "onnx::Concat_3" - name: "Constant_2" + input: "onnx::Slice_1" + input: "onnx::Slice_3" + input: "onnx::Slice_4" + input: "onnx::Slice_2" + output: "onnx::Concat_5" + name: "Slice_4" + op_type: "Slice" + } + node { + output: "onnx::Concat_6" + name: "Constant_5" op_type: "Constant" attribute { name: "value" @@ -44,10 +74,10 @@ graph { } } node { - input: "onnx::Concat_2" - input: "onnx::Concat_3" - output: "onnx::Reshape_4" - name: "Concat_3" + input: "onnx::Concat_5" + input: "onnx::Concat_6" + output: "onnx::Reshape_7" + name: "Concat_6" op_type: "Concat" attribute { name: "axis" @@ -57,9 +87,9 @@ graph { } node { input: "onnx::Shape_0" - input: "onnx::Reshape_4" - output: "5" - name: "Reshape_4" + input: "onnx::Reshape_7" + output: "8" + name: "Reshape_7" op_type: "Reshape" } name: "torch_jit" @@ -86,7 +116,7 @@ graph { } } output { - name: "5" + name: "8" type { tensor_type { elem_type: 1 @@ -100,5 +130,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_flatten2D.expect b/test/onnx/expect/TestOperators.test_flatten2D.expect index 041886291c9b38..f60b1ba7066ffa 100644 --- a/test/onnx/expect/TestOperators.test_flatten2D.expect +++ b/test/onnx/expect/TestOperators.test_flatten2D.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -54,5 +54,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_frobenius_norm.expect b/test/onnx/expect/TestOperators.test_frobenius_norm.expect index b1af3261b0dd5c..fba4585b18b853 100644 --- a/test/onnx/expect/TestOperators.test_frobenius_norm.expect +++ b/test/onnx/expect/TestOperators.test_frobenius_norm.expect @@ -1,35 +1,49 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { + node { + output: "onnx::ReduceSum_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 2 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } node { input: "x" input: "x" - output: "onnx::ReduceSum_1" - name: "Mul_0" + output: "onnx::ReduceSum_2" + name: "Mul_1" op_type: "Mul" } node { + input: "onnx::ReduceSum_2" input: "onnx::ReduceSum_1" - output: "onnx::Sqrt_2" - name: "ReduceSum_1" + output: "onnx::Sqrt_3" + name: "ReduceSum_2" op_type: "ReduceSum" - attribute { - name: "axes" - ints: 0 - ints: 1 - type: INTS - } attribute { name: "keepdims" i: 1 type: INT } + attribute { + name: "noop_with_empty_axes" + i: 0 + type: INT + } } node { - input: "onnx::Sqrt_2" - output: "3" - name: "Sqrt_2" + input: "onnx::Sqrt_3" + output: "4" + name: "Sqrt_3" op_type: "Sqrt" } name: "torch_jit" @@ -53,7 +67,7 @@ graph { } } output { - name: "3" + name: "4" type { tensor_type { elem_type: 1 @@ -73,5 +87,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_full.expect b/test/onnx/expect/TestOperators.test_full.expect index d3526e4b1c568e..fc8acf5ee80dce 100644 --- a/test/onnx/expect/TestOperators.test_full.expect +++ b/test/onnx/expect/TestOperators.test_full.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -36,5 +36,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_full_like.expect b/test/onnx/expect/TestOperators.test_full_like.expect index 
d3526e4b1c568e..fc8acf5ee80dce 100644 --- a/test/onnx/expect/TestOperators.test_full_like.expect +++ b/test/onnx/expect/TestOperators.test_full_like.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -36,5 +36,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_gather.expect b/test/onnx/expect/TestOperators.test_gather.expect index dde397e206cd89..609f89853ac694 100644 --- a/test/onnx/expect/TestOperators.test_gather.expect +++ b/test/onnx/expect/TestOperators.test_gather.expect @@ -1,114 +1,22 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - output: "onnx::OneHot_2" - name: "Constant_0" - op_type: "Constant" - attribute { - name: "value" - t { - dims: 2 - data_type: 7 - raw_data: "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000" - } - type: TENSOR - } - } - node { - output: "onnx::Gather_3" - name: "Constant_1" - op_type: "Constant" - attribute { - name: "value" - t { - dims: 1 - data_type: 7 - raw_data: "\001\000\000\000\000\000\000\000" - } - type: TENSOR - } - } - node { - input: "onnx::Shape_0" - output: "onnx::Gather_4" - name: "Shape_2" - op_type: "Shape" - } - node { - input: "onnx::Gather_4" - input: "onnx::Gather_3" - output: "onnx::OneHot_5" - name: "Gather_3" - op_type: "Gather" - attribute { - name: "axis" - i: 0 - type: INT - } - } - node { - input: "onnx::OneHot_1" - input: "onnx::OneHot_5" - input: "onnx::OneHot_2" - output: "onnx::Cast_6" - name: "OneHot_4" - op_type: "OneHot" + input: "onnx::GatherElements_0" + input: "onnx::GatherElements_1" + output: "2" + name: "GatherElements_0" + op_type: "GatherElements" attribute { name: "axis" i: 1 type: INT } } - node { - input: "onnx::Cast_6" - output: "onnx::Mul_7" - name: "Cast_5" - op_type: "Cast" - attribute { - name: "to" - i: 1 - type: INT - } - } - node { - input: "onnx::Shape_0" - output: "onnx::Mul_8" - name: "Unsqueeze_6" - op_type: "Unsqueeze" - attribute { - name: "axes" - ints: 2 - type: INTS - } - } - node { - input: "onnx::Mul_8" - input: "onnx::Mul_7" - output: "onnx::ReduceSum_9" - name: "Mul_7" - op_type: "Mul" - } - node { - input: "onnx::ReduceSum_9" - output: "10" - name: "ReduceSum_8" - op_type: "ReduceSum" - attribute { - name: "axes" - ints: 1 - type: INTS - } - attribute { - name: "keepdims" - i: 0 - type: INT - } - } name: "torch_jit" input { - name: "onnx::Shape_0" + name: "onnx::GatherElements_0" type { tensor_type { elem_type: 1 @@ -127,7 +35,7 @@ graph { } } input { - name: "onnx::OneHot_1" + name: "onnx::GatherElements_1" type { tensor_type { elem_type: 7 @@ -146,7 +54,7 @@ graph { } } output { - name: "10" + name: "2" type { tensor_type { elem_type: 1 @@ -166,5 +74,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_ge.expect b/test/onnx/expect/TestOperators.test_ge.expect index 5246ccf0eb767c..8d578a4d25bd0b 100644 --- a/test/onnx/expect/TestOperators.test_ge.expect +++ b/test/onnx/expect/TestOperators.test_ge.expect @@ -1,23 +1,17 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Less_0" - input: "onnx::Less_1" - output: "onnx::Not_2" - name: "Less_0" - op_type: "Less" - } - node { - input: "onnx::Not_2" - output: "3" - name: "Not_1" - op_type: "Not" + input: "onnx::GreaterOrEqual_0" + input: "onnx::GreaterOrEqual_1" + output: "2" + name: "GreaterOrEqual_0" 
+ op_type: "GreaterOrEqual" } name: "torch_jit" input { - name: "onnx::Less_0" + name: "onnx::GreaterOrEqual_0" type { tensor_type { elem_type: 6 @@ -33,7 +27,7 @@ graph { } } input { - name: "onnx::Less_1" + name: "onnx::GreaterOrEqual_1" type { tensor_type { elem_type: 6 @@ -49,7 +43,7 @@ graph { } } output { - name: "3" + name: "2" type { tensor_type { elem_type: 9 @@ -66,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_gelu.expect b/test/onnx/expect/TestOperators.test_gelu.expect index d59cafa2617837..dfc7d1d88468d1 100644 --- a/test/onnx/expect/TestOperators.test_gelu.expect +++ b/test/onnx/expect/TestOperators.test_gelu.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -122,5 +122,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_gt.expect b/test/onnx/expect/TestOperators.test_gt.expect index 903ea41b9051fb..5aab77798bf648 100644 --- a/test/onnx/expect/TestOperators.test_gt.expect +++ b/test/onnx/expect/TestOperators.test_gt.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -72,5 +72,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_hardtanh.expect b/test/onnx/expect/TestOperators.test_hardtanh.expect index 70a3732b20700c..1268a4c14cfd15 100644 --- a/test/onnx/expect/TestOperators.test_hardtanh.expect +++ b/test/onnx/expect/TestOperators.test_hardtanh.expect @@ -1,23 +1,41 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "input" - output: "1" - name: "Clip_0" - op_type: "Clip" + output: "onnx::Clip_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "max" - f: 0.5 - type: FLOAT + name: "value" + t { + data_type: 1 + raw_data: "\000\000\000\277" + } + type: TENSOR } + } + node { + output: "onnx::Clip_2" + name: "Constant_1" + op_type: "Constant" attribute { - name: "min" - f: -0.5 - type: FLOAT + name: "value" + t { + data_type: 1 + raw_data: "\000\000\000?" 
+ } + type: TENSOR } } + node { + input: "input" + input: "onnx::Clip_1" + input: "onnx::Clip_2" + output: "3" + name: "Clip_2" + op_type: "Clip" + } name: "torch_jit" input { name: "input" @@ -36,7 +54,7 @@ graph { } } output { - name: "1" + name: "3" type { tensor_type { elem_type: 1 @@ -53,5 +71,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_implicit_expand.expect b/test/onnx/expect/TestOperators.test_implicit_expand.expect index db37957247fb9f..3c94edc85b4b38 100644 --- a/test/onnx/expect/TestOperators.test_implicit_expand.expect +++ b/test/onnx/expect/TestOperators.test_implicit_expand.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_index.expect b/test/onnx/expect/TestOperators.test_index.expect index 1ea803f067b7aa..330d2de0d7fca6 100644 --- a/test/onnx/expect/TestOperators.test_index.expect +++ b/test/onnx/expect/TestOperators.test_index.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -59,5 +59,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_isnan.expect b/test/onnx/expect/TestOperators.test_isnan.expect index db7a6831750001..198d3bdb238706 100644 --- a/test/onnx/expect/TestOperators.test_isnan.expect +++ b/test/onnx/expect/TestOperators.test_isnan.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -37,5 +37,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_layer_norm_aten.expect b/test/onnx/expect/TestOperators.test_layer_norm_aten.expect index dad821eb13e337..d7b7ac56113014 100644 --- a/test/onnx/expect/TestOperators.test_layer_norm_aten.expect +++ b/test/onnx/expect/TestOperators.test_layer_norm_aten.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -29,6 +29,11 @@ graph { s: "layer_norm" type: STRING } + attribute { + name: "overload_name" + s: "" + type: STRING + } } name: "torch_jit" initializer { @@ -123,5 +128,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_le.expect b/test/onnx/expect/TestOperators.test_le.expect index c0b0d67f5a931b..374a0d0e0d5212 100644 --- a/test/onnx/expect/TestOperators.test_le.expect +++ b/test/onnx/expect/TestOperators.test_le.expect @@ -1,23 +1,17 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Greater_0" - input: "onnx::Greater_1" - output: "onnx::Not_2" - name: "Greater_0" - op_type: "Greater" - } - node { - input: "onnx::Not_2" - output: "3" - name: "Not_1" - op_type: "Not" + input: "onnx::LessOrEqual_0" + input: "onnx::LessOrEqual_1" + output: "2" + name: "LessOrEqual_0" + op_type: "LessOrEqual" } name: "torch_jit" input { - name: "onnx::Greater_0" + name: "onnx::LessOrEqual_0" type { tensor_type { elem_type: 6 @@ -33,7 +27,7 @@ graph { } } input { - name: "onnx::Greater_1" + name: "onnx::LessOrEqual_1" type { tensor_type { elem_type: 6 @@ -49,7 +43,7 @@ graph { } } output { - name: "3" + name: "2" type { tensor_type { elem_type: 9 @@ -66,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff 
--git a/test/onnx/expect/TestOperators.test_linear.expect b/test/onnx/expect/TestOperators.test_linear.expect index 372c34223bafd4..71c64dfe5a5085 100644 --- a/test/onnx/expect/TestOperators.test_linear.expect +++ b/test/onnx/expect/TestOperators.test_linear.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -102,5 +102,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_log_sigmoid.expect b/test/onnx/expect/TestOperators.test_log_sigmoid.expect index 993490e9e1dd2a..2681f1193102c3 100644 --- a/test/onnx/expect/TestOperators.test_log_sigmoid.expect +++ b/test/onnx/expect/TestOperators.test_log_sigmoid.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -61,5 +61,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_logsoftmax.expect b/test/onnx/expect/TestOperators.test_logsoftmax.expect index d01223a4c57984..1c4de89b6402cd 100644 --- a/test/onnx/expect/TestOperators.test_logsoftmax.expect +++ b/test/onnx/expect/TestOperators.test_logsoftmax.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_lt.expect b/test/onnx/expect/TestOperators.test_lt.expect index 57b6366e7b2abb..2dbcc07cd9e17e 100644 --- a/test/onnx/expect/TestOperators.test_lt.expect +++ b/test/onnx/expect/TestOperators.test_lt.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -72,5 +72,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_max.expect b/test/onnx/expect/TestOperators.test_max.expect index 295f32c6f87b9c..d9fcc0fb5f7a36 100644 --- a/test/onnx/expect/TestOperators.test_max.expect +++ b/test/onnx/expect/TestOperators.test_max.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_maxpool.expect b/test/onnx/expect/TestOperators.test_maxpool.expect index 13dabcfd506e39..f43712bbfc58f3 100644 --- a/test/onnx/expect/TestOperators.test_maxpool.expect +++ b/test/onnx/expect/TestOperators.test_maxpool.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -7,6 +7,11 @@ graph { output: "1" name: "MaxPool_0" op_type: "MaxPool" + attribute { + name: "ceil_mode" + i: 0 + type: INT + } attribute { name: "kernel_shape" ints: 3 @@ -65,5 +70,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_maxpool_indices.expect b/test/onnx/expect/TestOperators.test_maxpool_indices.expect index 249112abedfac3..46c23e3a4caecd 100644 --- a/test/onnx/expect/TestOperators.test_maxpool_indices.expect +++ b/test/onnx/expect/TestOperators.test_maxpool_indices.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -8,6 +8,11 @@ graph { output: "onnx::Sub_2" name: "MaxPool_0" op_type: "MaxPool" + attribute { + name: "ceil_mode" + i: 0 + type: INT + } attribute { name: "kernel_shape" ints: 3 @@ -43,31 
+48,61 @@ graph { } } node { - input: "onnx::Slice_4" - output: "onnx::Sub_5" - name: "Slice_2" - op_type: "Slice" + output: "onnx::Slice_5" + name: "Constant_2" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_6" + name: "Constant_3" + op_type: "Constant" attribute { - name: "ends" - ints: 1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_7" + name: "Constant_4" + op_type: "Constant" attribute { - name: "starts" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR } } + node { + input: "onnx::Slice_4" + input: "onnx::Slice_6" + input: "onnx::Slice_7" + input: "onnx::Slice_5" + output: "onnx::Sub_8" + name: "Slice_5" + op_type: "Slice" + } node { input: "onnx::Sub_2" - input: "onnx::Sub_5" - output: "6" - name: "Sub_3" + input: "onnx::Sub_8" + output: "9" + name: "Sub_6" op_type: "Sub" } name: "torch_jit" @@ -110,7 +145,7 @@ graph { } } output { - name: "6" + name: "9" type { tensor_type { elem_type: 7 @@ -130,5 +165,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_mean.expect b/test/onnx/expect/TestOperators.test_mean.expect index 8148bfdb54b3b4..b53b8c2f1248fd 100644 --- a/test/onnx/expect/TestOperators.test_mean.expect +++ b/test/onnx/expect/TestOperators.test_mean.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -48,5 +48,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_mean_dtype.expect b/test/onnx/expect/TestOperators.test_mean_dtype.expect index dfda5eba27e0e7..92ce0ae3aa9925 100644 --- a/test/onnx/expect/TestOperators.test_mean_dtype.expect +++ b/test/onnx/expect/TestOperators.test_mean_dtype.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -59,5 +59,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_meshgrid.expect b/test/onnx/expect/TestOperators.test_meshgrid.expect index ba0edfb1c3985a..05b9de875d9413 100644 --- a/test/onnx/expect/TestOperators.test_meshgrid.expect +++ b/test/onnx/expect/TestOperators.test_meshgrid.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -318,5 +318,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_min.expect b/test/onnx/expect/TestOperators.test_min.expect index 12945fa60e9fab..28ca14779f71c8 100644 --- a/test/onnx/expect/TestOperators.test_min.expect +++ b/test/onnx/expect/TestOperators.test_min.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -60,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_mm.expect b/test/onnx/expect/TestOperators.test_mm.expect index 4b436e8ca2491c..9492d651fd9ece 100644 --- a/test/onnx/expect/TestOperators.test_mm.expect +++ b/test/onnx/expect/TestOperators.test_mm.expect @@ -1,27 +1,12 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" 
producer_version: "CURRENT_VERSION" graph { - node { - output: "onnx::Gemm_2" - name: "Constant_0" - op_type: "Constant" - attribute { - name: "value" - t { - dims: 1 - data_type: 1 - raw_data: "\000\000\200?" - } - type: TENSOR - } - } node { input: "onnx::Gemm_0" input: "onnx::Gemm_1" - input: "onnx::Gemm_2" - output: "3" - name: "Gemm_1" + output: "2" + name: "Gemm_0" op_type: "Gemm" attribute { name: "alpha" @@ -68,7 +53,7 @@ graph { } } output { - name: "3" + name: "2" type { tensor_type { elem_type: 1 @@ -85,5 +70,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_narrow.expect b/test/onnx/expect/TestOperators.test_narrow.expect index 52e3e9c8ebffce..a7b13c89a646c0 100644 --- a/test/onnx/expect/TestOperators.test_narrow.expect +++ b/test/onnx/expect/TestOperators.test_narrow.expect @@ -1,29 +1,35 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::Slice_0" - output: "1" + input: "onnx::Slice_14" + input: "onnx::Slice_15" + input: "onnx::Slice_16" + output: "12" name: "Slice_0" op_type: "Slice" - attribute { - name: "axes" - ints: 0 - type: INTS - } - attribute { - name: "ends" - ints: 2 - type: INTS - } - attribute { - name: "starts" - ints: 0 - type: INTS - } } name: "torch_jit" + initializer { + dims: 1 + data_type: 7 + name: "onnx::Slice_14" + raw_data: "\000\000\000\000\000\000\000\000" + } + initializer { + dims: 1 + data_type: 7 + name: "onnx::Slice_15" + raw_data: "\002\000\000\000\000\000\000\000" + } + initializer { + dims: 1 + data_type: 7 + name: "onnx::Slice_16" + raw_data: "\000\000\000\000\000\000\000\000" + } input { name: "onnx::Slice_0" type { @@ -41,7 +47,7 @@ graph { } } output { - name: "1" + name: "12" type { tensor_type { elem_type: 1 @@ -58,5 +64,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_ne.expect b/test/onnx/expect/TestOperators.test_ne.expect index 55d35128cb33c5..ab053fbcf67e19 100644 --- a/test/onnx/expect/TestOperators.test_ne.expect +++ b/test/onnx/expect/TestOperators.test_ne.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -78,5 +78,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_nonzero.expect b/test/onnx/expect/TestOperators.test_nonzero.expect index 9090e3959742c7..cfcb1f505f8789 100644 --- a/test/onnx/expect/TestOperators.test_nonzero.expect +++ b/test/onnx/expect/TestOperators.test_nonzero.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -58,5 +58,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_norm_p1.expect b/test/onnx/expect/TestOperators.test_norm_p1.expect index df15562f6072ea..ec5e12b90a1690 100644 --- a/test/onnx/expect/TestOperators.test_norm_p1.expect +++ b/test/onnx/expect/TestOperators.test_norm_p1.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -62,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_norm_p2.expect b/test/onnx/expect/TestOperators.test_norm_p2.expect index 1fadd7a7706fac..0388ec620821e2 100644 --- a/test/onnx/expect/TestOperators.test_norm_p2.expect +++ b/test/onnx/expect/TestOperators.test_norm_p2.expect @@ -1,4 +1,4 
@@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -62,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_ones_like.expect b/test/onnx/expect/TestOperators.test_ones_like.expect index 30e234cc935c26..fafec789b1741c 100644 --- a/test/onnx/expect/TestOperators.test_ones_like.expect +++ b/test/onnx/expect/TestOperators.test_ones_like.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -36,5 +36,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_pad.expect b/test/onnx/expect/TestOperators.test_pad.expect index ab8125247058ca..0a25fb0eaf8751 100644 --- a/test/onnx/expect/TestOperators.test_pad.expect +++ b/test/onnx/expect/TestOperators.test_pad.expect @@ -1,31 +1,190 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { + node { + input: "onnx::ConstantOfShape_27" + output: "onnx::Concat_10" + name: "ConstantOfShape_0" + op_type: "ConstantOfShape" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Concat_28" + input: "onnx::Concat_10" + output: "onnx::Reshape_11" + name: "Concat_1" + op_type: "Concat" + attribute { + name: "axis" + i: 0 + type: INT + } + } + node { + output: "onnx::Reshape_12" + name: "Constant_2" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 2 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\377\002\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Reshape_11" + input: "onnx::Reshape_12" + output: "onnx::Slice_13" + name: "Reshape_3" + op_type: "Reshape" + } + node { + output: "onnx::Slice_14" + name: "Constant_4" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_15" + name: "Constant_5" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\377" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_16" + name: "Constant_6" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\200" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_17" + name: "Constant_7" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\377" + } + type: TENSOR + } + } + node { + input: "onnx::Slice_13" + input: "onnx::Slice_15" + input: "onnx::Slice_16" + input: "onnx::Slice_14" + input: "onnx::Slice_17" + output: "onnx::Transpose_18" + name: "Slice_8" + op_type: "Slice" + } + node { + input: "onnx::Transpose_18" + output: "onnx::Reshape_19" + name: "Transpose_9" + op_type: "Transpose" + attribute { + name: "perm" + ints: 1 + ints: 0 + type: INTS + } + } + node { + output: "onnx::Reshape_20" + name: "Constant_10" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\377" + } + type: TENSOR + } + } + node { + input: "onnx::Reshape_19" + input: "onnx::Reshape_20" + output: "onnx::Cast_21" + name: "Reshape_11" + op_type: "Reshape" + } + node { + input: "onnx::Cast_21" + output: "onnx::Pad_22" + name: "Cast_12" + op_type: 
"Cast" + attribute { + name: "to" + i: 7 + type: INT + } + } node { input: "input" - output: "1" - name: "Pad_0" + input: "onnx::Pad_22" + output: "23" + name: "Pad_13" op_type: "Pad" attribute { name: "mode" s: "reflect" type: STRING } - attribute { - name: "pads" - ints: 0 - ints: 0 - ints: 0 - ints: 2 - ints: 0 - ints: 0 - ints: 1 - ints: 3 - type: INTS - } } name: "torch_jit" + initializer { + dims: 1 + data_type: 7 + name: "onnx::ConstantOfShape_27" + raw_data: "\004\000\000\000\000\000\000\000" + } + initializer { + dims: 4 + data_type: 7 + name: "onnx::Concat_28" + raw_data: "\002\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000" + } input { name: "input" type { @@ -49,22 +208,22 @@ graph { } } output { - name: "1" + name: "23" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Pad23_dim_0" } dim { - dim_value: 1 + dim_param: "Pad23_dim_1" } dim { - dim_value: 3 + dim_param: "Pad23_dim_2" } dim { - dim_value: 9 + dim_param: "Pad23_dim_3" } } } @@ -72,5 +231,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_params.expect b/test/onnx/expect/TestOperators.test_params.expect index 1d1bd7d4936e13..67064d8087ae46 100644 --- a/test/onnx/expect/TestOperators.test_params.expect +++ b/test/onnx/expect/TestOperators.test_params.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -92,5 +92,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_params_onnx_irv4.expect b/test/onnx/expect/TestOperators.test_params_onnx_irv4.expect index d6ddd543f354b2..8dbc34a20640bf 100644 --- a/test/onnx/expect/TestOperators.test_params_onnx_irv4.expect +++ b/test/onnx/expect/TestOperators.test_params_onnx_irv4.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -76,5 +76,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_permute2.expect b/test/onnx/expect/TestOperators.test_permute2.expect index 42310b8337109a..7f7b6afd9d2d9e 100644 --- a/test/onnx/expect/TestOperators.test_permute2.expect +++ b/test/onnx/expect/TestOperators.test_permute2.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -77,5 +77,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_pow.expect b/test/onnx/expect/TestOperators.test_pow.expect index 56dd281d9d2a6e..f20fd955509048 100644 --- a/test/onnx/expect/TestOperators.test_pow.expect +++ b/test/onnx/expect/TestOperators.test_pow.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -78,5 +78,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_prelu.expect b/test/onnx/expect/TestOperators.test_prelu.expect index f38134f579ddcd..f2bcb50ef77720 100644 --- a/test/onnx/expect/TestOperators.test_prelu.expect +++ b/test/onnx/expect/TestOperators.test_prelu.expect @@ -1,11 +1,11 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::PRelu_0" - input: "onnx::PRelu_4" - output: "3" + input: "onnx::PRelu_5" + output: "4" name: "PRelu_0" op_type: "PRelu" } @@ -15,7 
+15,7 @@ graph { dims: 1 dims: 1 data_type: 1 - name: "onnx::PRelu_4" + name: "onnx::PRelu_5" raw_data: "\000\000\200>\000\000\200>" } input { @@ -41,7 +41,7 @@ graph { } } input { - name: "onnx::PRelu_4" + name: "onnx::PRelu_5" type { tensor_type { elem_type: 1 @@ -60,7 +60,7 @@ graph { } } output { - name: "3" + name: "4" type { tensor_type { elem_type: 1 @@ -83,5 +83,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_prod.expect b/test/onnx/expect/TestOperators.test_prod.expect index 33b1f0e44f3ec0..0cfeafa4da32c8 100644 --- a/test/onnx/expect/TestOperators.test_prod.expect +++ b/test/onnx/expect/TestOperators.test_prod.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -48,5 +48,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_prod_dtype.expect b/test/onnx/expect/TestOperators.test_prod_dtype.expect index d9359ba40686fa..26a63ac840ad2e 100644 --- a/test/onnx/expect/TestOperators.test_prod_dtype.expect +++ b/test/onnx/expect/TestOperators.test_prod_dtype.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -59,5 +59,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_rand.expect b/test/onnx/expect/TestOperators.test_rand.expect index 02e239ba584dc4..b4d2dbd6cb1909 100644 --- a/test/onnx/expect/TestOperators.test_rand.expect +++ b/test/onnx/expect/TestOperators.test_rand.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -69,5 +69,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_randn.expect b/test/onnx/expect/TestOperators.test_randn.expect index ef8c51827d893c..bc2d0b23dd7b2d 100644 --- a/test/onnx/expect/TestOperators.test_randn.expect +++ b/test/onnx/expect/TestOperators.test_randn.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -69,5 +69,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect b/test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect index 044a4b47cdb8bf..7e5fefad2eb701 100644 --- a/test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect +++ b/test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect @@ -1,17 +1,27 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::ReduceSum_0" - output: "1" - name: "ReduceSum_0" - op_type: "ReduceSum" + output: "onnx::ReduceSum_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: -1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\377\377\377\377\377\377\377\377" + } + type: TENSOR } + } + node { + input: "onnx::ReduceSum_0" + input: "onnx::ReduceSum_1" + output: "2" + name: "ReduceSum_1" + op_type: "ReduceSum" attribute { name: "keepdims" i: 0 @@ -36,7 +46,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -50,5 +60,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_mean.expect b/test/onnx/expect/TestOperators.test_reduced_mean.expect index 
f5da3dc6d104f8..ce69ab65a6a6d4 100644 --- a/test/onnx/expect/TestOperators.test_reduced_mean.expect +++ b/test/onnx/expect/TestOperators.test_reduced_mean.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -62,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect b/test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect index 231d847669e6a7..71d9d296aecd05 100644 --- a/test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect +++ b/test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -73,5 +73,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect b/test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect index 3ab9b2629d3d14..98bb26aaea36b2 100644 --- a/test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect +++ b/test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -66,5 +66,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_prod.expect b/test/onnx/expect/TestOperators.test_reduced_prod.expect index 2d281995da12df..cdfbc0f5fbb69c 100644 --- a/test/onnx/expect/TestOperators.test_reduced_prod.expect +++ b/test/onnx/expect/TestOperators.test_reduced_prod.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -62,5 +62,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect b/test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect index a6bcac89d3d0a8..641d21cb9c79a5 100644 --- a/test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect +++ b/test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -73,5 +73,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect b/test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect index edfe354880d3c5..62befc2cf1cff7 100644 --- a/test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect +++ b/test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -65,5 +65,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_sum.expect b/test/onnx/expect/TestOperators.test_reduced_sum.expect index 69f8abdc48ee63..e03a204a3f9987 100644 --- a/test/onnx/expect/TestOperators.test_reduced_sum.expect +++ b/test/onnx/expect/TestOperators.test_reduced_sum.expect @@ -1,18 +1,27 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::ReduceSum_0" - output: "1" - name: "ReduceSum_0" - op_type: "ReduceSum" + output: "onnx::ReduceSum_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: 1 - ints: 2 - type: INTS + name: "value" + t { + dims: 2 + 
data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + input: "onnx::ReduceSum_0" + input: "onnx::ReduceSum_1" + output: "2" + name: "ReduceSum_1" + op_type: "ReduceSum" attribute { name: "keepdims" i: 0 @@ -43,7 +52,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -60,5 +69,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect b/test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect index 59fdbf24d48e0e..e8ffa49295a5ca 100644 --- a/test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect +++ b/test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect @@ -1,11 +1,25 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Cast_0" output: "onnx::ReduceSum_1" - name: "Cast_0" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + input: "onnx::Cast_0" + output: "onnx::ReduceSum_2" + name: "Cast_1" op_type: "Cast" attribute { name: "to" @@ -14,15 +28,11 @@ graph { } } node { + input: "onnx::ReduceSum_2" input: "onnx::ReduceSum_1" - output: "2" - name: "ReduceSum_1" + output: "3" + name: "ReduceSum_2" op_type: "ReduceSum" - attribute { - name: "axes" - ints: 0 - type: INTS - } attribute { name: "keepdims" i: 0 @@ -53,7 +63,7 @@ graph { } } output { - name: "2" + name: "3" type { tensor_type { elem_type: 11 @@ -73,5 +83,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect b/test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect index 6c3498d8978698..7d05fdc26041c7 100644 --- a/test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect +++ b/test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect @@ -1,17 +1,27 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::ReduceSum_0" - output: "1" - name: "ReduceSum_0" - op_type: "ReduceSum" + output: "onnx::ReduceSum_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + input: "onnx::ReduceSum_0" + input: "onnx::ReduceSum_1" + output: "2" + name: "ReduceSum_1" + op_type: "ReduceSum" attribute { name: "keepdims" i: 1 @@ -42,7 +52,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -65,5 +75,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reducemax.expect b/test/onnx/expect/TestOperators.test_reducemax.expect index 015621e36cc3fc..bbd770761f3a09 100644 --- a/test/onnx/expect/TestOperators.test_reducemax.expect +++ b/test/onnx/expect/TestOperators.test_reducemax.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -48,5 +48,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_reducemin.expect b/test/onnx/expect/TestOperators.test_reducemin.expect index ba713c955d5397..a555fac90f0a67 100644 --- a/test/onnx/expect/TestOperators.test_reducemin.expect +++ 
b/test/onnx/expect/TestOperators.test_reducemin.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -48,5 +48,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_remainder.expect b/test/onnx/expect/TestOperators.test_remainder.expect index 75799ad14ec68f..ecf44141260e57 100644 --- a/test/onnx/expect/TestOperators.test_remainder.expect +++ b/test/onnx/expect/TestOperators.test_remainder.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -89,5 +89,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_repeat.expect b/test/onnx/expect/TestOperators.test_repeat.expect index e87fce2e4792db..5206bce0d88ff9 100644 --- a/test/onnx/expect/TestOperators.test_repeat.expect +++ b/test/onnx/expect/TestOperators.test_repeat.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -98,5 +98,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect b/test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect index fb0730a99e5534..2dbb3a436d42b5 100644 --- a/test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect +++ b/test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -92,5 +92,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_rrelu.expect b/test/onnx/expect/TestOperators.test_rrelu.expect index 959a842d29b846..3fb75ab0bb4a93 100644 --- a/test/onnx/expect/TestOperators.test_rrelu.expect +++ b/test/onnx/expect/TestOperators.test_rrelu.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -72,5 +72,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_rsqrt.expect b/test/onnx/expect/TestOperators.test_rsqrt.expect index 3f0b2f654fc2d8..32e4df543ae9b7 100644 --- a/test/onnx/expect/TestOperators.test_rsqrt.expect +++ b/test/onnx/expect/TestOperators.test_rsqrt.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -63,5 +63,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_rsub.expect b/test/onnx/expect/TestOperators.test_rsub.expect index fcc5e3e46f9293..75344bfc68deeb 100644 --- a/test/onnx/expect/TestOperators.test_rsub.expect +++ b/test/onnx/expect/TestOperators.test_rsub.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -57,5 +57,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_scatter_add.expect b/test/onnx/expect/TestOperators.test_scatter_add.expect index 7e5604971ec6c8..fd7514e306303b 100644 --- a/test/onnx/expect/TestOperators.test_scatter_add.expect +++ b/test/onnx/expect/TestOperators.test_scatter_add.expect @@ -1,9 +1,9 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - output: "onnx::Scatter_3" + output: "onnx::ScatterElements_3" name: "Constant_0" 
op_type: "Constant" attribute { @@ -18,12 +18,12 @@ graph { } } node { - input: "onnx::Scatter_3" - input: "onnx::Scatter_1" - input: "onnx::Scatter_2" + input: "onnx::ScatterElements_3" + input: "onnx::ScatterElements_1" + input: "onnx::ScatterElements_2" output: "onnx::Add_4" - name: "Scatter_1" - op_type: "Scatter" + name: "ScatterElements_1" + op_type: "ScatterElements" attribute { name: "axis" i: 1 @@ -55,7 +55,7 @@ graph { } } input { - name: "onnx::Scatter_1" + name: "onnx::ScatterElements_1" type { tensor_type { elem_type: 7 @@ -71,7 +71,7 @@ graph { } } input { - name: "onnx::Scatter_2" + name: "onnx::ScatterElements_2" type { tensor_type { elem_type: 1 @@ -104,5 +104,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_selu.expect b/test/onnx/expect/TestOperators.test_selu.expect index 9469c9432c8042..7cdc4dc8bac4e2 100644 --- a/test/onnx/expect/TestOperators.test_selu.expect +++ b/test/onnx/expect/TestOperators.test_selu.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -55,5 +55,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_shape_value_map.expect b/test/onnx/expect/TestOperators.test_shape_value_map.expect index c0044e4f4cebd8..174551f9a7c5bd 100644 --- a/test/onnx/expect/TestOperators.test_shape_value_map.expect +++ b/test/onnx/expect/TestOperators.test_shape_value_map.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -34,23 +34,33 @@ graph { } } node { - input: "onnx::Unsqueeze_3" - output: "onnx::Concat_7" - name: "Unsqueeze_3" - op_type: "Unsqueeze" + output: "onnx::Unsqueeze_7" + name: "Constant_3" + op_type: "Constant" attribute { - name: "axes" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } } node { - input: "onnx::Concat_7" - input: "onnx::Concat_21" - input: "onnx::Concat_22" - input: "onnx::Concat_23" - output: "onnx::Reshape_11" - name: "Concat_4" + input: "onnx::Unsqueeze_3" + input: "onnx::Unsqueeze_7" + output: "onnx::Concat_8" + name: "Unsqueeze_4" + op_type: "Unsqueeze" + } + node { + input: "onnx::Concat_8" + input: "onnx::Concat_26" + input: "onnx::Concat_27" + input: "onnx::Concat_28" + output: "onnx::Reshape_15" + name: "Concat_5" op_type: "Concat" attribute { name: "axis" @@ -60,66 +70,62 @@ graph { } node { input: "x" - input: "onnx::Reshape_11" - output: "onnx::Transpose_12" - name: "Reshape_5" + input: "onnx::Reshape_15" + output: "onnx::Transpose_16" + name: "Reshape_6" op_type: "Reshape" } node { - input: "onnx::Transpose_12" - output: "onnx::Softmax_13" - name: "Transpose_6" + input: "onnx::Transpose_16" + output: "x.1" + name: "Transpose_7" op_type: "Transpose" attribute { name: "perm" ints: 0 - ints: 3 - ints: 1 ints: 2 + ints: 1 + ints: 3 type: INTS } } node { - input: "onnx::Softmax_13" - output: "onnx::Transpose_14" - name: "Softmax_7" + input: "x.1" + output: "onnx::Reshape_18" + name: "Softmax_8" op_type: "Softmax" attribute { name: "axis" - i: 3 + i: 1 type: INT } } node { - input: "onnx::Transpose_14" - output: "onnx::Reshape_15" - name: "Transpose_8" - op_type: "Transpose" + output: "onnx::Unsqueeze_20" + name: "Constant_9" + op_type: "Constant" attribute { - name: "perm" - ints: 0 - ints: 3 - ints: 2 - ints: 1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: 
"\000\000\000\000\000\000\000\000" + } + type: TENSOR } } node { input: "onnx::Unsqueeze_3" - output: "onnx::Concat_17" - name: "Unsqueeze_9" + input: "onnx::Unsqueeze_20" + output: "onnx::Concat_21" + name: "Unsqueeze_10" op_type: "Unsqueeze" - attribute { - name: "axes" - ints: 0 - type: INTS - } } node { - input: "onnx::Concat_17" - input: "onnx::Concat_24" - output: "onnx::Reshape_19" - name: "Concat_10" + input: "onnx::Concat_21" + input: "onnx::Concat_29" + output: "onnx::Reshape_24" + name: "Concat_11" op_type: "Concat" attribute { name: "axis" @@ -128,35 +134,35 @@ graph { } } node { - input: "onnx::Reshape_15" - input: "onnx::Reshape_19" - output: "20" - name: "Reshape_11" + input: "onnx::Reshape_18" + input: "onnx::Reshape_24" + output: "25" + name: "Reshape_12" op_type: "Reshape" } name: "torch_jit" initializer { dims: 1 data_type: 7 - name: "onnx::Concat_21" + name: "onnx::Concat_26" raw_data: "\001\000\000\000\000\000\000\000" } initializer { dims: 1 data_type: 7 - name: "onnx::Concat_22" + name: "onnx::Concat_27" raw_data: "\002\000\000\000\000\000\000\000" } initializer { dims: 1 data_type: 7 - name: "onnx::Concat_23" + name: "onnx::Concat_28" raw_data: "\377\377\377\377\377\377\377\377" } initializer { dims: 1 data_type: 7 - name: "onnx::Concat_24" + name: "onnx::Concat_29" raw_data: "\377\377\377\377\377\377\377\377" } input { @@ -182,7 +188,7 @@ graph { } } output { - name: "20" + name: "25" type { tensor_type { elem_type: 1 @@ -199,5 +205,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_sign.expect b/test/onnx/expect/TestOperators.test_sign.expect index 0cf0a0fa4417d0..6cb9200dc07357 100644 --- a/test/onnx/expect/TestOperators.test_sign.expect +++ b/test/onnx/expect/TestOperators.test_sign.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_sin.expect b/test/onnx/expect/TestOperators.test_sin.expect index 2e5710f70d4300..4ca6284c48d90c 100644 --- a/test/onnx/expect/TestOperators.test_sin.expect +++ b/test/onnx/expect/TestOperators.test_sin.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_slice.expect b/test/onnx/expect/TestOperators.test_slice.expect index 755625522ace89..15aa37bc2f7eb5 100644 --- a/test/onnx/expect/TestOperators.test_slice.expect +++ b/test/onnx/expect/TestOperators.test_slice.expect @@ -1,28 +1,73 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Slice_0" - output: "1" - name: "Slice_0" - op_type: "Slice" + output: "onnx::Slice_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: 1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_2" + name: "Constant_1" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_3" + name: "Constant_2" + op_type: "Constant" attribute { - name: "ends" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: 
"\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_4" + name: "Constant_3" + op_type: "Constant" attribute { - name: "starts" - ints: 1 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\001\000\000\000\000\000\000\000" + } + type: TENSOR } } + node { + input: "onnx::Slice_0" + input: "onnx::Slice_2" + input: "onnx::Slice_3" + input: "onnx::Slice_1" + input: "onnx::Slice_4" + output: "5" + name: "Slice_4" + op_type: "Slice" + } name: "torch_jit" input { name: "onnx::Slice_0" @@ -41,7 +86,7 @@ graph { } } output { - name: "1" + name: "5" type { tensor_type { elem_type: 1 @@ -58,5 +103,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_split.expect b/test/onnx/expect/TestOperators.test_split.expect index bd11058b1a5e13..e1616e4a52cdf5 100644 --- a/test/onnx/expect/TestOperators.test_split.expect +++ b/test/onnx/expect/TestOperators.test_split.expect @@ -1,26 +1,34 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { + node { + output: "onnx::Split_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 3 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000" + } + type: TENSOR + } + } node { input: "tensor" - output: "1" + input: "onnx::Split_1" output: "2" output: "3" - name: "Split_0" + output: "4" + name: "Split_1" op_type: "Split" attribute { name: "axis" i: 1 type: INT } - attribute { - name: "split" - ints: 2 - ints: 2 - ints: 2 - type: INTS - } } name: "torch_jit" input { @@ -40,7 +48,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -56,7 +64,7 @@ graph { } } output { - name: "2" + name: "3" type { tensor_type { elem_type: 1 @@ -72,7 +80,7 @@ graph { } } output { - name: "3" + name: "4" type { tensor_type { elem_type: 1 @@ -89,5 +97,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_split_with_sizes.expect b/test/onnx/expect/TestOperators.test_split_with_sizes.expect index 359135cdb01a78..964ba363a56e38 100644 --- a/test/onnx/expect/TestOperators.test_split_with_sizes.expect +++ b/test/onnx/expect/TestOperators.test_split_with_sizes.expect @@ -1,26 +1,34 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { + node { + output: "onnx::Split_1" + name: "Constant_0" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 3 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000" + } + type: TENSOR + } + } node { input: "tensor" - output: "1" + input: "onnx::Split_1" output: "2" output: "3" - name: "Split_0" + output: "4" + name: "Split_1" op_type: "Split" attribute { name: "axis" i: 1 type: INT } - attribute { - name: "split" - ints: 2 - ints: 1 - ints: 3 - type: INTS - } } name: "torch_jit" input { @@ -40,7 +48,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -56,7 +64,7 @@ graph { } } output { - name: "2" + name: "3" type { tensor_type { elem_type: 1 @@ -72,7 +80,7 @@ graph { } } output { - name: "3" + name: "4" type { tensor_type { elem_type: 1 @@ -89,5 +97,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_sqrt.expect b/test/onnx/expect/TestOperators.test_sqrt.expect index 
67e86c2836dd8b..91fc7bac0b7755 100644 --- a/test/onnx/expect/TestOperators.test_sqrt.expect +++ b/test/onnx/expect/TestOperators.test_sqrt.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_std.expect b/test/onnx/expect/TestOperators.test_std.expect index 957ac1937fb229..69df37b90452a5 100644 --- a/test/onnx/expect/TestOperators.test_std.expect +++ b/test/onnx/expect/TestOperators.test_std.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -185,5 +185,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_sum.expect b/test/onnx/expect/TestOperators.test_sum.expect index d923dab06db294..6722064ace203e 100644 --- a/test/onnx/expect/TestOperators.test_sum.expect +++ b/test/onnx/expect/TestOperators.test_sum.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -48,5 +48,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_sum_dtype.expect b/test/onnx/expect/TestOperators.test_sum_dtype.expect index 3457c4d7e88bb4..2b5f417b0eee71 100644 --- a/test/onnx/expect/TestOperators.test_sum_dtype.expect +++ b/test/onnx/expect/TestOperators.test_sum_dtype.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -59,5 +59,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_tan.expect b/test/onnx/expect/TestOperators.test_tan.expect index 1ff7b8ee19a030..84bc3e9420df1e 100644 --- a/test/onnx/expect/TestOperators.test_tan.expect +++ b/test/onnx/expect/TestOperators.test_tan.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_transpose.expect b/test/onnx/expect/TestOperators.test_transpose.expect index 41227d0b934a68..f1350a1b262334 100644 --- a/test/onnx/expect/TestOperators.test_transpose.expect +++ b/test/onnx/expect/TestOperators.test_transpose.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -43,5 +43,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_type_as.expect b/test/onnx/expect/TestOperators.test_type_as.expect index 2af30c6ebc31a4..31803483edbd72 100644 --- a/test/onnx/expect/TestOperators.test_type_as.expect +++ b/test/onnx/expect/TestOperators.test_type_as.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -37,5 +37,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_unfold.expect b/test/onnx/expect/TestOperators.test_unfold.expect index 58675ad825e7c5..9b5e20281d2015 100644 --- a/test/onnx/expect/TestOperators.test_unfold.expect +++ b/test/onnx/expect/TestOperators.test_unfold.expect @@ -1,76 +1,156 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Slice_0" - output: "onnx::Unsqueeze_1" - name: 
"Slice_0" - op_type: "Slice" + output: "onnx::Slice_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_2" + name: "Constant_1" + op_type: "Constant" attribute { - name: "ends" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_3" + name: "Constant_2" + op_type: "Constant" attribute { - name: "starts" - ints: 0 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } } node { input: "onnx::Slice_0" - output: "onnx::Unsqueeze_2" - name: "Slice_1" + input: "onnx::Slice_2" + input: "onnx::Slice_3" + input: "onnx::Slice_1" + output: "onnx::Unsqueeze_4" + name: "Slice_3" op_type: "Slice" + } + node { + output: "onnx::Slice_5" + name: "Constant_4" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_6" + name: "Constant_5" + op_type: "Constant" attribute { - name: "ends" - ints: 4 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } + } + node { + output: "onnx::Slice_7" + name: "Constant_6" + op_type: "Constant" attribute { - name: "starts" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\004\000\000\000\000\000\000\000" + } + type: TENSOR } } node { - input: "onnx::Unsqueeze_1" - output: "onnx::Concat_3" - name: "Unsqueeze_2" - op_type: "Unsqueeze" + input: "onnx::Slice_0" + input: "onnx::Slice_6" + input: "onnx::Slice_7" + input: "onnx::Slice_5" + output: "onnx::Unsqueeze_8" + name: "Slice_7" + op_type: "Slice" + } + node { + output: "onnx::Unsqueeze_9" + name: "Constant_8" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } } node { - input: "onnx::Unsqueeze_2" - output: "onnx::Concat_4" - name: "Unsqueeze_3" + input: "onnx::Unsqueeze_4" + input: "onnx::Unsqueeze_9" + output: "onnx::Concat_10" + name: "Unsqueeze_9" op_type: "Unsqueeze" + } + node { + output: "onnx::Unsqueeze_11" + name: "Constant_10" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } } node { - input: "onnx::Concat_3" - input: "onnx::Concat_4" - output: "5" - name: "Concat_4" + input: "onnx::Unsqueeze_8" + input: "onnx::Unsqueeze_11" + output: "onnx::Concat_12" + name: "Unsqueeze_11" + op_type: "Unsqueeze" + } + node { + input: "onnx::Concat_10" + input: "onnx::Concat_12" + output: "13" + name: "Concat_12" op_type: "Concat" attribute { name: "axis" @@ -99,7 +179,7 @@ graph { } } output { - name: "5" + name: "13" type { tensor_type { elem_type: 1 @@ -122,5 +202,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_unsqueeze.expect b/test/onnx/expect/TestOperators.test_unsqueeze.expect index 215b76683f3fbb..49a61c2b845151 100644 --- a/test/onnx/expect/TestOperators.test_unsqueeze.expect +++ b/test/onnx/expect/TestOperators.test_unsqueeze.expect @@ 
-1,18 +1,28 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - input: "onnx::Unsqueeze_0" - output: "1" - name: "Unsqueeze_0" - op_type: "Unsqueeze" + output: "onnx::Unsqueeze_1" + name: "Constant_0" + op_type: "Constant" attribute { - name: "axes" - ints: 2 - type: INTS + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" + } + type: TENSOR } } + node { + input: "onnx::Unsqueeze_0" + input: "onnx::Unsqueeze_1" + output: "2" + name: "Unsqueeze_1" + op_type: "Unsqueeze" + } name: "torch_jit" input { name: "onnx::Unsqueeze_0" @@ -31,7 +41,7 @@ graph { } } output { - name: "1" + name: "2" type { tensor_type { elem_type: 1 @@ -51,5 +61,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect b/test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect index a05dc823168696..e1f31dc406a0d1 100644 --- a/test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect +++ b/test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect @@ -1,24 +1,40 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "x" - input: "onnx::Upsample_5" - output: "4" - name: "Upsample_0" - op_type: "Upsample" + input: "" + input: "onnx::Resize_6" + output: "5" + name: "Resize_0" + op_type: "Resize" + attribute { + name: "coordinate_transformation_mode" + s: "asymmetric" + type: STRING + } + attribute { + name: "cubic_coeff_a" + f: -0.75 + type: FLOAT + } attribute { name: "mode" s: "nearest" type: STRING } + attribute { + name: "nearest_mode" + s: "floor" + type: STRING + } } name: "torch_jit" initializer { dims: 4 data_type: 1 - name: "onnx::Upsample_5" + name: "onnx::Resize_6" raw_data: "\000\000\200?\000\000\200?\000\000\000@\000\000\000@" } input { @@ -44,22 +60,22 @@ graph { } } output { - name: "4" + name: "5" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Resize5_dim_0" } dim { - dim_value: 2 + dim_param: "Resize5_dim_1" } dim { - dim_value: 6 + dim_param: "Resize5_dim_2" } dim { - dim_value: 8 + dim_param: "Resize5_dim_3" } } } @@ -67,5 +83,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect b/test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect index a05dc823168696..e1f31dc406a0d1 100644 --- a/test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect +++ b/test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect @@ -1,24 +1,40 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "x" - input: "onnx::Upsample_5" - output: "4" - name: "Upsample_0" - op_type: "Upsample" + input: "" + input: "onnx::Resize_6" + output: "5" + name: "Resize_0" + op_type: "Resize" + attribute { + name: "coordinate_transformation_mode" + s: "asymmetric" + type: STRING + } + attribute { + name: "cubic_coeff_a" + f: -0.75 + type: FLOAT + } attribute { name: "mode" s: "nearest" type: STRING } + attribute { + name: "nearest_mode" + s: "floor" + type: STRING + } } name: "torch_jit" initializer { dims: 4 data_type: 1 - name: "onnx::Upsample_5" + name: "onnx::Resize_6" raw_data: "\000\000\200?\000\000\200?\000\000\000@\000\000\000@" } input { @@ -44,22 +60,22 @@ graph { } } output { - name: "4" + 
name: "5" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Resize5_dim_0" } dim { - dim_value: 2 + dim_param: "Resize5_dim_1" } dim { - dim_value: 6 + dim_param: "Resize5_dim_2" } dim { - dim_value: 8 + dim_param: "Resize5_dim_3" } } } @@ -67,5 +83,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_upsample_nearest_size.expect b/test/onnx/expect/TestOperators.test_upsample_nearest_size.expect index e597ddfa5c5d30..cbd32608d2ae0e 100644 --- a/test/onnx/expect/TestOperators.test_upsample_nearest_size.expect +++ b/test/onnx/expect/TestOperators.test_upsample_nearest_size.expect @@ -1,34 +1,112 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { - output: "onnx::Upsample_1" - name: "Constant_0" + input: "x" + output: "onnx::Slice_2" + name: "Shape_0" + op_type: "Shape" + } + node { + output: "onnx::Slice_3" + name: "Constant_1" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_4" + name: "Constant_2" + op_type: "Constant" + attribute { + name: "value" + t { + dims: 1 + data_type: 7 + raw_data: "\000\000\000\000\000\000\000\000" + } + type: TENSOR + } + } + node { + output: "onnx::Slice_5" + name: "Constant_3" op_type: "Constant" attribute { name: "value" t { - dims: 4 - data_type: 1 - raw_data: "\000\000\200?\000\000\200?\253\252\252@\000\000\200@" + dims: 1 + data_type: 7 + raw_data: "\002\000\000\000\000\000\000\000" } type: TENSOR } } + node { + input: "onnx::Slice_2" + input: "onnx::Slice_4" + input: "onnx::Slice_5" + input: "onnx::Slice_3" + output: "onnx::Concat_6" + name: "Slice_4" + op_type: "Slice" + } + node { + input: "onnx::Concat_6" + input: "onnx::Concat_12" + output: "onnx::Resize_8" + name: "Concat_5" + op_type: "Concat" + attribute { + name: "axis" + i: 0 + type: INT + } + } node { input: "x" - input: "onnx::Upsample_1" - output: "2" - name: "Upsample_1" - op_type: "Upsample" + input: "" + input: "" + input: "onnx::Resize_8" + output: "11" + name: "Resize_6" + op_type: "Resize" + attribute { + name: "coordinate_transformation_mode" + s: "asymmetric" + type: STRING + } + attribute { + name: "cubic_coeff_a" + f: -0.75 + type: FLOAT + } attribute { name: "mode" s: "nearest" type: STRING } + attribute { + name: "nearest_mode" + s: "floor" + type: STRING + } } name: "torch_jit" + initializer { + dims: 2 + data_type: 7 + name: "onnx::Concat_12" + raw_data: "\020\000\000\000\000\000\000\000\020\000\000\000\000\000\000\000" + } input { name: "x" type { @@ -52,22 +130,22 @@ graph { } } output { - name: "2" + name: "11" type { tensor_type { elem_type: 1 shape { dim { - dim_value: 1 + dim_param: "Resize11_dim_0" } dim { - dim_value: 2 + dim_param: "Resize11_dim_1" } dim { - dim_value: 16 + dim_param: "Resize11_dim_2" } dim { - dim_value: 16 + dim_param: "Resize11_dim_3" } } } @@ -75,5 +153,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_view.expect b/test/onnx/expect/TestOperators.test_view.expect index cb79b41812229f..0976258229695a 100644 --- a/test/onnx/expect/TestOperators.test_view.expect +++ b/test/onnx/expect/TestOperators.test_view.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -55,5 +55,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git 
a/test/onnx/expect/TestOperators.test_view_flatten.expect b/test/onnx/expect/TestOperators.test_view_flatten.expect index ae9d957dd9fd8a..ac814160d5bd1a 100644 --- a/test/onnx/expect/TestOperators.test_view_flatten.expect +++ b/test/onnx/expect/TestOperators.test_view_flatten.expect @@ -1,11 +1,11 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { node { input: "onnx::Reshape_0" - input: "onnx::Reshape_9" - output: "6" + input: "onnx::Reshape_11" + output: "8" name: "Reshape_0" op_type: "Reshape" } @@ -13,7 +13,7 @@ graph { initializer { dims: 2 data_type: 7 - name: "onnx::Reshape_9" + name: "onnx::Reshape_11" raw_data: "\001\000\000\000\000\000\000\000\030\000\000\000\000\000\000\000" } input { @@ -39,7 +39,7 @@ graph { } } output { - name: "6" + name: "8" type { tensor_type { elem_type: 1 @@ -56,5 +56,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/expect/TestOperators.test_zeros_like.expect b/test/onnx/expect/TestOperators.test_zeros_like.expect index 1293acb1e16fba..e4f6c6ede2cab1 100644 --- a/test/onnx/expect/TestOperators.test_zeros_like.expect +++ b/test/onnx/expect/TestOperators.test_zeros_like.expect @@ -1,4 +1,4 @@ -ir_version: 4 +ir_version: 7 producer_name: "pytorch" producer_version: "CURRENT_VERSION" graph { @@ -36,5 +36,5 @@ graph { } } opset_import { - version: 9 + version: 13 } diff --git a/test/onnx/test_models.py b/test/onnx/test_models.py index 5d22c255f8320f..1f2e0258edc1e4 100644 --- a/test/onnx/test_models.py +++ b/test/onnx/test_models.py @@ -46,15 +46,15 @@ def toC(x): class TestModels(TestCase): + opset_version = 9 # Caffe2 doesn't support the default. keep_initializers_as_inputs = False - from torch.onnx.symbolic_helper import _export_onnx_opset_version - opset_version = _export_onnx_opset_version def exportTest(self, model, inputs, rtol=1e-2, atol=1e-7): with torch.onnx.select_model_mode_for_export(model, None): graph = torch.onnx.utils._trace(model, inputs, OperatorExportTypes.ONNX) torch._C._jit_pass_lint(graph) - verify(model, inputs, backend, rtol=rtol, atol=atol) + verify(model, inputs, backend, rtol=rtol, atol=atol, + opset_version=self.opset_version) def test_ops(self): x = Variable( @@ -245,12 +245,12 @@ def test_shufflenet(self): @skipIfUnsupportedMinOpsetVersion(11) def test_fcn(self): x = Variable(torch.randn(BATCH_SIZE, 3, 224, 224).fill_(1.0)) - self.exportTest(toC(fcn_resnet101()), toC(x), rtol=1e-3, atol=1e-5) + self.exportTest(toC(fcn_resnet101(pretrained=False, pretrained_backbone=False)), toC(x), rtol=1e-3, atol=1e-5) @skipIfUnsupportedMinOpsetVersion(11) def test_deeplab(self): x = Variable(torch.randn(BATCH_SIZE, 3, 224, 224).fill_(1.0)) - self.exportTest(toC(deeplabv3_resnet101()), toC(x), rtol=1e-3, atol=1e-5) + self.exportTest(toC(deeplabv3_resnet101(pretrained=False, pretrained_backbone=False)), toC(x), rtol=1e-3, atol=1e-5) def test_r3d_18_video(self): x = Variable(torch.randn(1, 3, 4, 112, 112).fill_(1.0)) diff --git a/test/onnx/test_pytorch_common.py b/test/onnx/test_pytorch_common.py index 13b4585a5def84..35a408eca244d5 100644 --- a/test/onnx/test_pytorch_common.py +++ b/test/onnx/test_pytorch_common.py @@ -50,17 +50,17 @@ def skipIfUnsupportedMinOpsetVersion(min_opset_version): def skip_dec(func): def wrapper(self): if self.opset_version < min_opset_version: - raise unittest.SkipTest("Skip verify test for unsupported opset_version") + raise unittest.SkipTest(f"Unsupported opset_version: {self.opset_version} < {min_opset_version}") return func(self) 
return wrapper return skip_dec -# skips tests for all versions above min_opset_version. -def skipIfUnsupportedMaxOpsetVersion(min_opset_version): +# skips tests for all versions above max_opset_version. +def skipIfUnsupportedMaxOpsetVersion(max_opset_version): def skip_dec(func): def wrapper(self): - if self.opset_version > min_opset_version: - raise unittest.SkipTest("Skip verify test for unsupported opset_version") + if self.opset_version > max_opset_version: + raise unittest.SkipTest(f"Unsupported opset_version: {self.opset_version} > {max_opset_version}") return func(self) return wrapper return skip_dec @@ -107,14 +107,5 @@ def wrapper(self): return wrapper return skip_dec -def skipIfONNXShapeInference(onnx_shape_inference): - def skip_dec(func): - def wrapper(self): - if self.onnx_shape_inference is onnx_shape_inference: - raise unittest.SkipTest("Skip verify test for unsupported opset_version") - return func(self) - return wrapper - return skip_dec - def flatten(x): return tuple(function._iter_filter(lambda o: isinstance(o, torch.Tensor))(x)) diff --git a/test/onnx/test_pytorch_onnx_caffe2.py b/test/onnx/test_pytorch_onnx_caffe2.py index 72ff9392254525..31c2287893a20d 100644 --- a/test/onnx/test_pytorch_onnx_caffe2.py +++ b/test/onnx/test_pytorch_onnx_caffe2.py @@ -117,8 +117,7 @@ def do_export(model, inputs, *args, **kwargs): class TestCaffe2Backend_opset9(unittest.TestCase): - from torch.onnx.symbolic_helper import _export_onnx_opset_version - opset_version = _export_onnx_opset_version + opset_version = 9 embed_params = False def setUp(self): diff --git a/test/onnx/test_pytorch_onnx_caffe2_quantized.py b/test/onnx/test_pytorch_onnx_caffe2_quantized.py index b427b85a2b56f6..bb84ab698a9dbd 100644 --- a/test/onnx/test_pytorch_onnx_caffe2_quantized.py +++ b/test/onnx/test_pytorch_onnx_caffe2_quantized.py @@ -31,7 +31,9 @@ def generic_test(self, model, sample_inputs, input_names=None, decimal=3, relaxe f = io.BytesIO() torch.onnx.export(q_model, pt_inputs, f, input_names=input_names, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) + operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK, + # Caffe2 doesn't support newer opset versions + opset_version=9) f.seek(0) onnx_model = onnx.load(f) caffe_res = c2.run_model(onnx_model, dict(zip(input_names, sample_inputs)))[0] @@ -94,7 +96,9 @@ def export_to_onnx(self, model, input, input_names): model = torch.jit.load(buf) f = io.BytesIO() torch.onnx.export(model, input, f, input_names=input_names, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) + operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK, + # Caffe2 doesn't support newer opset versions + opset_version=9) f.seek(0) onnx_model = onnx.load(f) diff --git a/test/onnx/test_pytorch_onnx_onnxruntime.py b/test/onnx/test_pytorch_onnx_onnxruntime.py index bdda7fb2ebc778..59ac7bb0e3a3dd 100644 --- a/test/onnx/test_pytorch_onnx_onnxruntime.py +++ b/test/onnx/test_pytorch_onnx_onnxruntime.py @@ -23,8 +23,7 @@ RnnModelWithPackedSequenceWithState, RnnModelWithPackedSequenceWithoutState) from test_pytorch_common import (skipIfUnsupportedMinOpsetVersion, skipIfUnsupportedOpsetVersion, - skipIfNoLapack, disableScriptTest, skipIfONNXShapeInference, - skipIfUnsupportedMaxOpsetVersion, skipForAllOpsetVersions) + skipIfNoLapack, disableScriptTest, skipIfUnsupportedMaxOpsetVersion) from test_pytorch_common import BATCH_SIZE from test_pytorch_common import RNN_BATCH_SIZE, RNN_SEQUENCE_LENGTH, RNN_INPUT_SIZE, RNN_HIDDEN_SIZE 
from typing import List, Tuple, Optional, Dict @@ -82,9 +81,7 @@ def to_numpy(elem): def convert_to_onnx(model, input=None, opset_version=9, do_constant_folding=True, keep_initializers_as_inputs=True, dynamic_axes=None, input_names=None, output_names=None, - fixed_batch_size=False, training=None, - onnx_shape_inference=True): - # export the model to ONNX + fixed_batch_size=False, training=None): f = io.BytesIO() input_copy = copy.deepcopy(input) torch.onnx._export(model, input_copy, f, @@ -93,8 +90,7 @@ def convert_to_onnx(model, input=None, opset_version=9, do_constant_folding=True keep_initializers_as_inputs=keep_initializers_as_inputs, dynamic_axes=dynamic_axes, input_names=input_names, output_names=output_names, - fixed_batch_size=fixed_batch_size, training=training, - onnx_shape_inference=onnx_shape_inference) + fixed_batch_size=fixed_batch_size, training=training) # compute onnxruntime output prediction so = onnxruntime.SessionOptions() @@ -177,8 +173,7 @@ def run_model_test(self, model, batch_size=2, state_dict=None, do_constant_folding=do_constant_folding, keep_initializers_as_inputs=self.keep_initializers_as_inputs, dynamic_axes=dynamic_axes, input_names=input_names, - output_names=output_names, fixed_batch_size=fixed_batch_size, training=training, - onnx_shape_inference=self.onnx_shape_inference) + output_names=output_names, fixed_batch_size=fixed_batch_size, training=training) # compute onnxruntime output prediction if remained_onnx_input_idx is not None: input_onnx = [] @@ -289,11 +284,15 @@ def set_rng_seed(seed): random.seed(seed) np.random.seed(seed) -class TestONNXRuntime(unittest.TestCase): - from torch.onnx.symbolic_helper import _export_onnx_opset_version - opset_version = _export_onnx_opset_version +class _TestONNXRuntime: + """Abstract base class for test cases. + + Intentionally not a sub-class of unittest.TestCase so that unittest / pytest + don't run it directly. unittest.TestCase is mixed in as another base class when + creating concrete sub-types. See MakeTestCase(). + """ + opset_version = -1 # Sub-classes must override keep_initializers_as_inputs = True # For IR version 3 type export.
- onnx_shape_inference = True def setUp(self): torch.manual_seed(0) @@ -617,8 +616,8 @@ def get_test_images(self) -> Tuple[List[torch.Tensor], List[torch.Tensor]]: @skipIfUnsupportedMinOpsetVersion(11) @disableScriptTest() # Faster RCNN model is not scriptable def test_faster_rcnn(self): - model = torchvision.models.detection.faster_rcnn.fasterrcnn_resnet50_fpn(pretrained=False, min_size=200, - max_size=300) + model = torchvision.models.detection.faster_rcnn.fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=True, + min_size=200, max_size=300) model.eval() x1 = torch.randn(3, 200, 300, requires_grad=True) x2 = torch.randn(3, 200, 300, requires_grad=True) @@ -664,8 +663,8 @@ def test_paste_mask_in_image(self): @skipIfUnsupportedMinOpsetVersion(11) @disableScriptTest() def test_mask_rcnn(self): - model = torchvision.models.detection.mask_rcnn.maskrcnn_resnet50_fpn(pretrained=False, min_size=200, - max_size=300) + model = torchvision.models.detection.mask_rcnn.maskrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=True, + min_size=200, max_size=300) images, test_images = self.get_test_images() self.run_test(model, (images,), rtol=1e-3, atol=1e-5) self.run_test(model, (images,), input_names=["images_tensors"], output_names=["boxes", "labels", "scores", "masks"], @@ -705,8 +704,8 @@ def test_heatmaps_to_keypoints(self): @skipIfUnsupportedMinOpsetVersion(11) @disableScriptTest() def test_keypoint_rcnn(self): - model = torchvision.models.detection.keypoint_rcnn.keypointrcnn_resnet50_fpn(pretrained=False, min_size=200, - max_size=300) + model = torchvision.models.detection.keypoint_rcnn.keypointrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=False, + min_size=200, max_size=300) images, test_images = self.get_test_images() self.run_test(model, (images,), rtol=1e-3, atol=1e-5) self.run_test(model, (images,), input_names=["images_tensors"], @@ -1436,7 +1435,6 @@ def forward(self, input1, input2, input3): # Conversion of Transpose depends on input shape to be known. # The following test only works when onnx shape inference is enabled. - @skipIfONNXShapeInference(False) def test_transpose_infer_shape(self): class TransposeModule(torch.jit.ScriptModule): def __init__(self): @@ -1664,7 +1662,6 @@ def forward(self, x): # Operator rank mismatch between outputs of two branches for opsets below 11. @skipIfUnsupportedMinOpsetVersion(11) - @skipIfONNXShapeInference(False) def test_floating_point_infer_dtype(self): class FloatingPoint(torch.jit.ScriptModule): @torch.jit.script_method @@ -1775,7 +1772,6 @@ def forward(self, x): x = torch.randn(2, 3, 4) self.run_test(ArithmeticModule(), x, remained_onnx_input_idx=[]) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_arithmetic_prim_bool(self): class ArithmeticModule(torch.nn.Module): def forward(self, x, y: int, z: bool, t: float): @@ -1812,7 +1808,6 @@ def forward(self, x): # In scripting the first transpose node do not carry shape and dtype info. # The following test only works when onnx shape inference is enabled. - @skipIfONNXShapeInference(False) def test_arithmetic_infer_dtype(self): class ArithmeticModule(torch.jit.ScriptModule): @torch.jit.script_method @@ -1895,7 +1890,6 @@ def forward(self, x, y): # In scripting x, y do not carry shape and dtype info. # The following test only works when onnx shape inference is enabled. 
- @skipIfONNXShapeInference(False) def test_div_promotion_script(self): class DivModule(torch.nn.Module): def forward(self, x, y): @@ -2689,8 +2683,7 @@ def forward(self, x): x = torch.empty(2, 3, 3, dtype=torch.double).uniform_(0, 1) self.run_test(Bernoulli(), x) - # Enable test when fix for allowzero is in ORT - @skipForAllOpsetVersions() + @unittest.skip("Bug in ORT, skip test until rel-1.11.") @skipIfUnsupportedMinOpsetVersion(14) def test_reshape_allowzero(self): class ReshapeModel(torch.nn.Module): @@ -2858,7 +2851,6 @@ def test_interpolate_adaptive_pooling_error(self): with self.assertRaises(RuntimeError) as cm: self._interpolate(x, "area", False, True) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_groupnorm(self): model = torch.nn.GroupNorm(3, 6, 0.002) x = torch.randn(4, 6, 180, 180, 180) @@ -2872,7 +2864,6 @@ def test_groupnorm(self): x = torch.randn(4, 6, 180, 180) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_groupnorm_noaffine(self): model = torch.nn.GroupNorm(4, 8, 0.002, affine=False) x = torch.randn(3, 8, 224, 224) @@ -3474,6 +3465,16 @@ def forward(self, x): x = torch.arange(1., 6., requires_grad=True) self.run_test(MyModule(), x) + @skipIfUnsupportedMinOpsetVersion(10) + def test_topk_int32_k(self): + class Model(torch.nn.Module): + def forward(self, x, k): + return torch.topk(x, k) + + x = torch.arange(1., 6.) + k = torch.tensor(3, dtype=torch.int32) + self.run_test(Model(), (x, k)) + @skipIfUnsupportedMinOpsetVersion(11) def test_topk_smallest_unsorted(self): class MyModule(torch.nn.Module): @@ -3570,7 +3571,6 @@ def test_batchnorm1d_noaffine(self): x = torch.randn(10, 10, 128) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_batchnorm1d_norunningstats(self): x = torch.randn(10, 10) model = torch.nn.BatchNorm1d(10, track_running_stats=False) @@ -3589,7 +3589,6 @@ def test_batchnorm2d_noaffine(self): model = torch.nn.BatchNorm2d(3, affine=False) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_batchnorm2d_norunningstats(self): x = torch.randn(10, 3, 128, 128) model = torch.nn.BatchNorm2d(3, track_running_stats=False) @@ -3614,7 +3613,6 @@ def test_instancenorm1d_runningstats(self): model = torch.nn.InstanceNorm1d(5, affine=False, track_running_stats=True) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm1d_norunningstats(self): x = torch.randn(10, 5, 128) model = torch.nn.InstanceNorm1d(5, affine=True, track_running_stats=False) @@ -3632,7 +3630,6 @@ def test_instancenorm2d_runningstats(self): model = torch.nn.InstanceNorm2d(3, affine=False, track_running_stats=True) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm2d_norunningstats(self): x = torch.randn(10, 3, 128, 128) model = torch.nn.InstanceNorm2d(3, affine=True, track_running_stats=False) @@ -3650,7 +3647,6 @@ def test_instancenorm3d_runningstats(self): model = torch.nn.InstanceNorm3d(3, affine=False, track_running_stats=True) self.run_test(model, x) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm3d_norunningstats(self): x = torch.randn(10, 3, 128, 128, 128) model = 
torch.nn.InstanceNorm3d(3, affine=True, track_running_stats=False) @@ -3736,6 +3732,17 @@ def forward(self, src, index): index = torch.tensor([[0, 1], [0, 1], [0, 1]], dtype=torch.int64) self.run_test(ScatterModel(), (src, index)) + @skipIfUnsupportedMinOpsetVersion(9) + def test_bucketize(self): + class BucketModel(torch.nn.Module): + def forward(self, input, boundaries): + return torch.bucketize(input, boundaries), \ + torch.bucketize(input, boundaries, right=True) + + input = torch.tensor([[2, 5, 10], [6, 8, 3]]) + boundaries = torch.tensor([1, 5, 7, 8, 10]) + self.run_test(BucketModel(), (input, boundaries)) + @skipIfUnsupportedMinOpsetVersion(9) def test_one_hot(self): class OneHot(torch.nn.Module): @@ -4327,6 +4334,15 @@ def forward(self, input, other): y = torch.randn(4, 1, requires_grad=True) self.run_test(model, (x, y)) + def test_amax_amin(self): + class Model(torch.nn.Module): + def forward(self, x): + return torch.amax(x, dim=0, keepdim=True), torch.amin(x, dim=[0, 1], keepdim=False) + + model = Model() + x = torch.randn(4, 4) + self.run_test(model, x) + @skipIfUnsupportedMinOpsetVersion(9) def test_arange_end(self): class ArangeScript(torch.jit.ScriptModule): @@ -4702,6 +4718,15 @@ def forward(self, x): x = torch.tensor([[1, 2], [3, 4]]) self.run_test(RepeatsDimsModel2(), (x,)) + @skipIfUnsupportedMinOpsetVersion(9) + def test_repeat_interleave_noop(self): + class Model(torch.nn.Module): + def forward(self, x): + return x.repeat_interleave(1, dim=1) + + x = torch.randn(4, 1, 8) + self.run_test(Model(), (x,)) + @skipIfUnsupportedMinOpsetVersion(13) def test_dynamic_repeat_interleave(self): class SingleDynamicModel(torch.nn.Module): @@ -4894,6 +4919,9 @@ def forward(self, input): x = torch.randint(10, (1, 2, 3, 4)) self.run_test(FlattenModel(), x) + x = torch.randn(4) + self.run_test(FlattenModel(), x) + def test_flatten2d(self): class FlattenModel(torch.nn.Module): def forward(self, input): @@ -5277,7 +5305,6 @@ def forward(self, x): inputs = torch.randn(16) self.run_test(model, inputs) - @skipIfONNXShapeInference(False) @skipIfUnsupportedMinOpsetVersion(11) def test_loop_transpose(self): class LoopModel(torch.nn.Module): @@ -5619,19 +5646,25 @@ def forward(self, x): self.run_test(OnesModel(), x, input_names=["x"], dynamic_axes={"x": [0, 1, 2]}) self.run_test(OnesModel(), x, remained_onnx_input_idx=[]) - @skipIfONNXShapeInference(True) + @skipIfUnsupportedMinOpsetVersion(9) + @disableScriptTest() # torch.zeros/torch.ones with size tensor of dim != 0 not scriptable. 
+ def test_zeros_ones_with_tensor_input(self): + class ZeroAndOnes(torch.nn.Module): + def forward(self, x): + return torch.zeros(x, 1), torch.ones(x, 1) + + x = torch.tensor([2]) + self.run_test(ZeroAndOnes(), (x, )) + @skipIfUnsupportedMinOpsetVersion(9) def test_tolist(self): class List(torch.jit.ScriptModule): @torch.jit.script_method def forward(self, input): - cur_shape = torch._shape_as_tensor(input) - final_shape: List[int] = cur_shape.tolist() - pad_tensor = torch.zeros([1, 2] + final_shape) - return pad_tensor + res: List[int] = input.tolist() + return res - x = torch.randn(2, 3) - self.run_test(List(), (x,)) + self.run_test(List(), (torch.randint(100, (1,)),)) @skipIfUnsupportedMinOpsetVersion(9) def test_list_pass(self): @@ -6124,7 +6157,6 @@ def forward(self, x): input_names=["x"], test_with_inputs=[y]) - @skipIfONNXShapeInference(False) def test_unfold_infer_shape(self): class UnfoldModule(torch.jit.ScriptModule): def __init__(self): @@ -6742,24 +6774,19 @@ def forward(self, x, pad: List[int]): @skipIfUnsupportedMaxOpsetVersion(10) + @disableScriptTest() # TODO: the logic in symbolic_opset9 doesn't handle script def test_unsupported_pad(self): class Pad(torch.nn.Module): - def forward(self, x, pad): + def forward(self, x, pad: List[int]): return torch.nn.functional.pad(x, pad) - def run(): - x = torch.randn(2, 2, 4, 4) - y = pad = (torch.tensor(2, dtype=torch.int32), torch.tensor(4, dtype=torch.int32)) - p = Pad() - f = io.BytesIO() - torch.onnx._export(p, (x, y), f) + x = torch.randn(2, 2, 4, 4) + y = [2, 4] - with self.assertRaises(RuntimeError) as cm: - run() + with self.assertRaisesRegex(RuntimeError, ("Unsupported: ONNX export of Pad.*" + + "The sizes of the padding must be constant")): + self.run_test(Pad(), (x, y)) - the_exception = cm.exception - self.assertEqual("Unsupported: ONNX export of Pad in opset 9. The sizes of the padding must be constant. 
" + - "Please try opset version 11.", the_exception.args[0]) @skipIfUnsupportedMinOpsetVersion(9) def test_if_fold(self): @@ -6880,7 +6907,6 @@ def forward(self, x, y): self.run_test(IfFoldModel(), (x, y)) @skipIfUnsupportedMinOpsetVersion(11) - @skipIfONNXShapeInference(False) def test_uninitialized(self): class UninitializedModel(torch.nn.Module): def forward(self, y): @@ -6895,7 +6921,6 @@ def forward(self, y): self.run_test(UninitializedModel(), x) @skipIfUnsupportedMinOpsetVersion(11) - @skipIfONNXShapeInference(False) def test_uninitialized_dynamic(self): class UninitializedModel(torch.nn.Module): def forward(self, y): @@ -6914,7 +6939,6 @@ def forward(self, y): # onnx::Identity of sequence supported for ONNX opset >= 14 @skipIfUnsupportedMinOpsetVersion(14) - @skipIfONNXShapeInference(False) def test_uninitialized_tensorList(self): class UninitializedTensorListModel(torch.nn.Module): def forward(self, x): @@ -6930,7 +6954,6 @@ def forward(self, x): # onnx::Identity of sequence supported for ONNX opset >= 14 @skipIfUnsupportedMinOpsetVersion(14) - @skipIfONNXShapeInference(False) def test_uninitialized_tensorList_dynamic(self): class UninitializedTensorListModel(torch.nn.Module): def forward(self, x): @@ -6947,7 +6970,6 @@ def forward(self, x): # onnx::Identity of sequence supported for ONNX opset >= 14 @skipIfUnsupportedMinOpsetVersion(14) - @skipIfONNXShapeInference(False) def test_uninitialized_intList(self): class UninitializedListModel(torch.nn.Module): def forward(self, x): @@ -6966,7 +6988,6 @@ def forward(self, x): # onnx::Identity of sequence supported for ONNX opset >= 14 @skipIfUnsupportedMinOpsetVersion(14) - @skipIfONNXShapeInference(False) def test_uninitialized_tensorList_shape(self): class UninitializedModel(torch.nn.Module): def forward(self, x): @@ -7270,6 +7291,12 @@ def forward(self, x): for x in [torch.randn(3, 4), torch.randn(3, 4).to(dtype=torch.bool)]: self.run_test(EinsumModelTranspose(), input=(x,)) + @skipIfUnsupportedMinOpsetVersion(9) + def test_cosine_similarity(self): + x = torch.randn(5, 3, 2) + y = torch.randn(5, 3, 2) + self.run_test(torch.nn.CosineSimilarity(dim=2), input=(x, y)) + @skipIfUnsupportedMinOpsetVersion(12) def test_crossentropyloss(self): for ignore_index in [-100, 1]: @@ -8135,7 +8162,6 @@ def forward(self, x: torch.Tensor): self.run_test(MyModule(), x) - @skipIfONNXShapeInference(False) @skipIfUnsupportedMinOpsetVersion(11) def test_if_transpose(self): class IfModel(torch.nn.Module): @@ -8151,7 +8177,6 @@ def forward(self, x): output_names=["output_1"], dynamic_axes={"output_1": [0, 1]}) - @skipIfONNXShapeInference(False) @skipIfUnsupportedMinOpsetVersion(13) def test_if_list(self): class IfModel(torch.nn.Module): @@ -8560,7 +8585,6 @@ def forward(self, input): x = torch.randn(6, 4, 3, 3) self.run_test(FakeQuantizePerChannelModel(), (x)) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_batchnorm_training(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8585,7 +8609,6 @@ def forward(self, x): model_export.train() self.run_test(model_export, (x, ), training=torch.onnx.TrainingMode.PRESERVE, rtol=1e-3, atol=1e-5) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_batchnorm_training_mode_fix_layer(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8636,7 +8659,6 @@ def forward(self, x): model_export.eval() self.run_test(model_export, (x,), training=torch.onnx.TrainingMode.PRESERVE, rtol=1e-3, atol=1e-5) - 
@skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm_training(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8661,7 +8683,6 @@ def forward(self, x): model_export.train() self.run_test(model_export, (x, ), training=torch.onnx.TrainingMode.PRESERVE, rtol=1e-3, atol=1e-5) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm_training_mode_fix_layer(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8687,7 +8708,6 @@ def forward(self, x): model_export.train() self.run_test(model_export, (x,), training=torch.onnx.TrainingMode.PRESERVE, rtol=1e-3, atol=1e-5) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_instancenorm_eval_mode_train_layer(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8789,7 +8809,6 @@ def forward(self, x): np.testing.assert_allclose(ratio_pytorch, ratio_ort, rtol=0.01, atol=0.01) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_conv_bn(self): class MyModule(torch.nn.Module): def __init__(self): @@ -8807,7 +8826,6 @@ def forward(self, x): self.run_test(model_export, (x,), training=torch.onnx.TrainingMode.EVAL) self.run_test(model_export, (x,), training=torch.onnx.TrainingMode.TRAINING, rtol=1e-3, atol=1e-5) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_multiple_conv_bn(self): class MyModule(torch.nn.Module): def __init__(self): @@ -10486,7 +10504,6 @@ def symbolic_custom_invalid_add(g, input, other, alpha=None): loaded_model = onnx.load_from_string(f.getvalue()) - @skipIfUnsupportedMinOpsetVersion(9) # https://github.com/microsoft/onnxruntime/issues/9663 def test_tuple_output_from_if_with_raised_exception(self): class M(torch.nn.Module): def __init__(self): @@ -10527,10 +10544,16 @@ def forward(self, x): @skipIfUnsupportedMinOpsetVersion(10) def test_quantized_linear(self): model = torch.nn.quantized.Linear(4, 8) + # Set fixed weight to avoid flaky test. + weight = torch.quantize_per_tensor( + torch.arange(32, dtype=torch.float).view(8, 4), + 0.5, 0, torch.qint8) # Set non-zero bias. - bias = torch.arange(8).to(torch.float) - model.set_weight_bias(model.weight(), bias) + bias = torch.arange(8, dtype=torch.float) + model.set_weight_bias(weight, bias) + # Set fixed input to avoid flaky test. input = torch.randn(4, 4) + input = torch.arange(16, dtype=torch.float).view(4, 4) - 8 input_tensor = torch.quantize_per_tensor(input, 0.5, 128, torch.quint8) # Currently, we need convert the model to ScriptModule before export. # The reason is that PackedParams contains int (not tensor). 
@@ -10633,6 +10656,29 @@ def forward(self, x): x = torch.quantize_per_tensor(torch.randn(3, 4), 0.2, 0, torch.qint8) self.run_test(torch.jit.trace(Module(), x), x) + @skipIfUnsupportedMinOpsetVersion(9) + def test_convolution_allow_tf32(self): + class Module(torch.nn.Module): + def __init__(self, allow_tf32): + super().__init__() + + self.allow_tf32 = allow_tf32 + weight = torch.rand(32, 3, 3, 3) + self.weight = torch.nn.Parameter(weight) + + def forward(self, x): + if self.allow_tf32: + return torch._convolution(x, self.weight, None, [2, 2], [0, 0], [1, 1], False, [0, 0], + 1, False, False, True, True) + else: + return torch._convolution(x, self.weight, None, [2, 2], [0, 0], [1, 1], False, [0, 0], + 1, False, False, True) + + x = torch.randn(1, 3, 224, 224) + self.run_test(Module(False), x, rtol=1e-3, atol=1e-6) + self.run_test(Module(True), x, rtol=1e-3, atol=1e-6) + + def make_test(name, base, layer, bidirectional, initial_state, variable_length, dropout, script_test_min_opset_version, **extra_kwargs): @@ -10664,7 +10710,7 @@ def f(self): **extra_kwargs) f.__name__ = test_name - setattr(TestONNXRuntime, f.__name__, f) + setattr(_TestONNXRuntime, f.__name__, f) def setup_rnn_tests(): layers_opts = [ @@ -10722,7 +10768,7 @@ def setup_rnn_tests(): test_count += 1 # sanity check that a representative example does exist - TestONNXRuntime.test_gru_trilayer_forward_with_initial_state_without_sequence_lengths_with_dropout + _TestONNXRuntime.test_gru_trilayer_forward_with_initial_state_without_sequence_lengths_with_dropout # make sure no one accidentally disables all the tests without # noticing @@ -10732,82 +10778,42 @@ def setup_rnn_tests(): setup_rnn_tests() -# opset 7 tests -TestONNXRuntime_opset7 = type(str("TestONNXRuntime_opset7"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=7)) - -# opset 8 tests -TestONNXRuntime_opset8 = type(str("TestONNXRuntime_opset8"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=8)) - - -# opset 10 tests -TestONNXRuntime_opset10 = type(str("TestONNXRuntime_opset10"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=10)) - -# opset 11 tests -TestONNXRuntime_opset11 = type(str("TestONNXRuntime_opset11"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=11)) - -# opset 12 tests -TestONNXRuntime_opset12 = type(str("TestONNXRuntime_opset12"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=12)) - -# opset 9 tests, with keep_initializers_as_inputs=False for -# IR version 4 style export. -TestONNXRuntime_opset9_IRv4 = type(str("TestONNXRuntime_opset9_IRv4"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, - keep_initializers_as_inputs=False)) - - -# opset 10 tests, with keep_initializers_as_inputs=False for -# IR version 4 style export. -TestONNXRuntime_opset10_IRv4 = type(str("TestONNXRuntime_opset10_IRv4"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=10, - keep_initializers_as_inputs=False)) - - -# opset 11 tests, with keep_initializers_as_inputs=False for -# IR version 4 style export. -TestONNXRuntime_opset11_IRv4 = type(str("TestONNXRuntime_opset11_IRv4"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=11, - keep_initializers_as_inputs=False)) - -# opset 12 tests, with keep_initializers_as_inputs=False for -# IR version 4 style export. 
-TestONNXRuntime_opset12_IRv4 = type(str("TestONNXRuntime_opset12_IRv4"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=12, - keep_initializers_as_inputs=False)) - -# opset 13 tests -TestONNXRuntime_opset13 = type(str("TestONNXRuntime_opset13"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=13, - keep_initializers_as_inputs=False, - onnx_shape_inference=True)) - -# opset 14 tests -TestONNXRuntime_opset14 = type(str("TestONNXRuntime_opset14"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=14, - keep_initializers_as_inputs=False, - onnx_shape_inference=True)) - -# opset 15 tests -TestONNXRuntime_opset15 = type(str("TestONNXRuntime_opset15"), - (unittest.TestCase,), - dict(TestONNXRuntime.__dict__, opset_version=15, - keep_initializers_as_inputs=False, - onnx_shape_inference=True)) +def MakeTestCase(opset_version: int, keep_initializers_as_inputs: bool = True) -> type: + name = f"TestONNXRuntime_opset{opset_version}" + if not keep_initializers_as_inputs: + name += "_IRv4" + return type(str(name), + (unittest.TestCase,), + dict(_TestONNXRuntime.__dict__, + opset_version=opset_version, + keep_initializers_as_inputs=keep_initializers_as_inputs)) + + +TestONNXRuntime_opset7 = MakeTestCase(7) + +TestONNXRuntime_opset8 = MakeTestCase(8) + +TestONNXRuntime_opset9 = MakeTestCase(9) + +TestONNXRuntime_opset9_IRv4 = MakeTestCase(9, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset10 = MakeTestCase(10) + +TestONNXRuntime_opset10_IRv4 = MakeTestCase(10, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset11 = MakeTestCase(11) + +TestONNXRuntime_opset11_IRv4 = MakeTestCase(11, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset12 = MakeTestCase(12) + +TestONNXRuntime_opset12_IRv4 = MakeTestCase(12, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset13 = MakeTestCase(13, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset14 = MakeTestCase(14, keep_initializers_as_inputs=False) + +TestONNXRuntime_opset15 = MakeTestCase(15, keep_initializers_as_inputs=False) if __name__ == "__main__": diff --git a/test/onnx/test_pytorch_onnx_onnxruntime_cuda.py b/test/onnx/test_pytorch_onnx_onnxruntime_cuda.py index 575d4caa16cebb..00a5b223bfa18e 100644 --- a/test/onnx/test_pytorch_onnx_onnxruntime_cuda.py +++ b/test/onnx/test_pytorch_onnx_onnxruntime_cuda.py @@ -99,6 +99,21 @@ def forward(self, x): x = torch.ones(3, 4, requires_grad=True, dtype=torch.float16, device=torch.device("cuda")) self.run_test(MyModule(), x, rtol=1e-3, atol=1e-5) + @skipIfNoCuda + def test_deduplicate_initializers_diff_devices(self): + class Model(torch.nn.Module): + def __init__(self): + super().__init__() + self.w = torch.nn.Parameter(torch.ones(2, 3, device=torch.device("cpu"))) + self.b = torch.nn.Parameter(torch.ones(3, device=torch.device("cuda"))) + + def forward(self, x, y): + return torch.matmul(self.w, x), y + self.b + + x = torch.randn(3, 3, device=torch.device("cpu")) + y = torch.randn(3, 3, device=torch.device("cuda")) + self.run_test(Model(), (x, y)) + TestONNXRuntime_cuda.setUp = TestONNXRuntime.setUp TestONNXRuntime_cuda.run_test = TestONNXRuntime.run_test diff --git a/test/onnx/test_pytorch_onnx_shape_inference.py b/test/onnx/test_pytorch_onnx_shape_inference.py index ecd3641c8fd796..7d636facaf67c5 100644 --- a/test/onnx/test_pytorch_onnx_shape_inference.py +++ b/test/onnx/test_pytorch_onnx_shape_inference.py @@ -6,6 +6,7 @@ from torch.onnx.symbolic_helper import (_set_onnx_shape_inference, _onnx_main_opset, 
_set_opset_version) +from test_pytorch_common import skipIfUnsupportedMinOpsetVersion def expect_tensor(scalar_type, shape=None): def verify(actual_type): @@ -75,6 +76,23 @@ def test_constant_of_shape_dynamic(self): constant_of_shape = g.op("ConstantOfShape", shape, value_t=torch.tensor([2.0])) self.run_test(g, constant_of_shape.node(), expect_tensor("Float", shape=(None, None, None, None))) + def test_gather_dynamic_index(self): + g = self.create_empty_graph() + input = g.addInput() + input.setType(input.type().with_dtype(torch.float).with_sizes([None, 3, 16, 16])) + indices = g.addInput() + indices.setType(indices.type().with_dtype(torch.int64).with_sizes([None])) + output = g.op("Gather", input, indices, axis_i=1) + self.run_test(g, output.node(), expect_tensor("Float", shape=([None, None, 16, 16]))) + + def test_gather_scalar_index(self): + g = self.create_empty_graph() + input = g.addInput() + input.setType(input.type().with_dtype(torch.float).with_sizes([None, 3, 16, 16])) + indices = self.insert_tensor_constant(g, torch.tensor(1)) + output = g.op("Gather", input, indices, axis_i=1) + self.run_test(g, output.node(), expect_tensor("Float", shape=([None, 16, 16]))) + def test_reshape(self): g = self.create_empty_graph() constant = self.insert_tensor_constant(g, torch.ones(2, 16, 5, 5)) @@ -102,6 +120,15 @@ def test_reshape_symbolic(self): output = g.op("Reshape", input, constant) self.run_test(g, output.node(), expect_tensor(None, shape=(None, None, 16))) + @skipIfUnsupportedMinOpsetVersion(14) + def test_reshape_allowzero(self): + g = self.create_empty_graph() + input = g.addInput() + input.setType(input.type().with_sizes([3, 4, 0])) + constant = self.insert_tensor_constant(g, torch.tensor([0, 4, 3])) + output = g.op("Reshape", input, constant, allowzero_i=1) + self.run_test(g, output.node(), expect_tensor(None, shape=(0, 4, 3))) + def test_slice(self): g = self.create_empty_graph() input = g.addInput() diff --git a/test/onnx/test_utility_funs.py b/test/onnx/test_utility_funs.py index 0f0c1e482a6603..22fe21e7291e84 100644 --- a/test/onnx/test_utility_funs.py +++ b/test/onnx/test_utility_funs.py @@ -15,8 +15,10 @@ _unpack_list, parse_args) import torch.utils.cpp_extension +from autograd_helper import CustomFunction as CustomFunction2 from test_pytorch_common import (skipIfUnsupportedMinOpsetVersion, - skipIfUnsupportedMaxOpsetVersion) + skipIfUnsupportedMaxOpsetVersion, + skipIfNoCuda) from verify import verify import torchvision @@ -956,7 +958,7 @@ def test_onnx_fallthrough(self): # Test aten export of op with symbolic for aten x = torch.randn(100, 128) y = torch.randn(100, 128) - model = torch.nn.CosineSimilarity(dim=1, eps=1e-6) + model = torch.nn.PairwiseDistance(p=2, eps=1e-6) graph, _, __ = self._model_to_graph(model, (x, y), operator_export_type=OperatorExportTypes.ONNX_FALLTHROUGH, @@ -965,7 +967,8 @@ def test_onnx_fallthrough(self): iter = graph.nodes() self.assertEqual(next(iter).kind(), "onnx::Constant") self.assertEqual(next(iter).kind(), "onnx::Constant") - self.assertEqual(next(iter).kind(), "aten::cosine_similarity") + self.assertEqual(next(iter).kind(), "onnx::Constant") + self.assertEqual(next(iter).kind(), "aten::pairwise_distance") # prim::ListConstruct is exported as onnx::SequenceConstruct for opset >= 11 @skipIfUnsupportedMaxOpsetVersion(10) @@ -1038,6 +1041,36 @@ def forward(self, input): iter = graph.nodes() self.assertEqual(next(iter).kind(), "prim::PythonOp") + def test_autograd_module_name(self): + class CustomFunction(torch.autograd.Function): + 
@staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + return input.clamp(min=0) + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + grad_input = grad_output.clone() + grad_input[input < 0] = 0 + return grad_input + + class Custom(torch.nn.Module): + def forward(self, input): + return CustomFunction.apply(input) + CustomFunction2.apply(input) + + model = Custom() + batch = torch.FloatTensor(1, 3) + + graph, _, _ = self._model_to_graph(model, batch, + input_names=["batch"], dynamic_axes={"batch": [0, 1]}) + iter = graph.nodes() + autograd1 = next(iter) + autograd2 = next(iter) + self.assertEqual(autograd1.kind(), "prim::PythonOp") + self.assertEqual(autograd2.kind(), "prim::PythonOp") + self.assertNotEqual(autograd1.s("module"), autograd2.s("module")) + def test_unused_initializers(self): class Model(torch.nn.Module): def __init__(self): @@ -1252,6 +1285,13 @@ def forward(self, x): graph = onnx.load(io.BytesIO(f.getvalue())) self.assertSetEqual(set([i.name for i in graph.graph.initializer]), param_name_set) + model.train() + f = io.BytesIO() + torch.onnx.export(model, (x,), f, training=TrainingMode.PRESERVE, + opset_version=self.opset_version) + graph = onnx.load(io.BytesIO(f.getvalue())) + self.assertSetEqual(set([i.name for i in graph.graph.initializer]), param_name_set) + # Test eval mode. model.eval() f = io.BytesIO() @@ -1267,6 +1307,24 @@ def test_deduplicate_initializers(self): def test_deduplicate_initializers_torchscript(self): self._test_deduplicate_initializers(torchscript=True) + @skipIfNoCuda + def test_deduplicate_initializers_diff_devices(self): + class Model(torch.nn.Module): + def __init__(self): + super().__init__() + self.w_cpu = torch.nn.Parameter(torch.ones(3, device=torch.device("cpu"))) + self.w_cuda = torch.nn.Parameter(torch.ones(3, device=torch.device("cuda"))) + + def forward(self, x, y): + return x + self.w_cpu, y + self.w_cuda + + x = torch.randn(3, 3, device=torch.device("cpu")) + y = torch.randn(3, 3, device=torch.device("cuda")) + f = io.BytesIO() + torch.onnx.export(Model(), (x, y), f, opset_version=self.opset_version) + graph = onnx.load(io.BytesIO(f.getvalue())) + self.assertSetEqual(set([i.name for i in graph.graph.initializer]), {"w_cpu"}) + def test_duplicated_output_node(self): class DuplicatedOutputNet(torch.nn.Module): def __init__(self, input_size, num_classes): diff --git a/test/package/package_e/test_nn_module.pt b/test/package/package_e/test_nn_module.pt new file mode 100644 index 00000000000000..1c1a8964a8a42f Binary files /dev/null and b/test/package/package_e/test_nn_module.pt differ diff --git a/test/package/test_dependency_api.py b/test/package/test_dependency_api.py index be867528282dca..9f1a9c9899e8b3 100644 --- a/test/package/test_dependency_api.py +++ b/test/package/test_dependency_api.py @@ -182,7 +182,7 @@ def test_pickle_mocked(self): obj2 = package_a.PackageAObject(obj) buffer = BytesIO() - with self.assertRaises(NotImplementedError): + with self.assertRaises(PackagingError): with PackageExporter(buffer) as he: he.mock(include="package_a.subpackage") he.intern("**") diff --git a/test/package/test_misc.py b/test/package/test_misc.py index 659355b62e5988..480217b8feb3b9 100644 --- a/test/package/test_misc.py +++ b/test/package/test_misc.py @@ -2,12 +2,15 @@ # Owner(s): ["oncall: package/deploy"] import inspect +import platform from io import BytesIO +from pathlib import Path from textwrap import dedent +from unittest import skipIf from torch.package import PackageExporter, 
PackageImporter, is_from_package from torch.package.package_exporter import PackagingError -from torch.testing._internal.common_utils import run_tests +from torch.testing._internal.common_utils import IS_FBCODE, IS_SANDCASTLE, run_tests try: from .common import PackageTestCase @@ -31,6 +34,7 @@ def test_file_structure(self): """\ ├── .data │ ├── extern_modules + │ ├── python_version │ └── version ├── main │ └── main @@ -54,6 +58,7 @@ def test_file_structure(self): """\ ├── .data │ ├── extern_modules + │ ├── python_version │ └── version ├── main │ └── main @@ -99,6 +104,36 @@ def test_file_structure(self): import_exclude, ) + def test_python_version(self): + """ + Tests that the current python version is stored in the package and is available + via PackageImporter's python_version() method. + """ + buffer = BytesIO() + + with PackageExporter(buffer) as he: + from package_a.test_module import SimpleTest + + he.intern("**") + obj = SimpleTest() + he.save_pickle("obj", "obj.pkl", obj) + + buffer.seek(0) + hi = PackageImporter(buffer) + + self.assertEqual(hi.python_version(), platform.python_version()) + + @skipIf( + IS_FBCODE or IS_SANDCASTLE, + "Tests that use temporary files are disabled in fbcode", + ) + def test_load_python_version_from_package(self): + """Tests loading a package with a python version embedded""" + importer1 = PackageImporter( + f"{Path(__file__).parent}/package_e/test_nn_module.pt" + ) + self.assertEqual(importer1.python_version(), "3.9.7") + def test_file_structure_has_file(self): """ Test Directory's has_file() method. diff --git a/test/quantization/ao_migration/test_quantization_fx.py b/test/quantization/ao_migration/test_quantization_fx.py index b47ffbcf72871c..0728595dba8745 100644 --- a/test/quantization/ao_migration/test_quantization_fx.py +++ b/test/quantization/ao_migration/test_quantization_fx.py @@ -197,7 +197,7 @@ def test_function_import_fx_utils(self): 'create_qparam_nodes', 'all_node_args_have_no_tensors', 'node_return_type_is_int', - 'node_bool_tensor_arg_indexes', + 'get_non_observable_arg_indexes_and_types', 'is_get_tensor_info_node', 'maybe_get_next_module' ] diff --git a/test/quantization/core/test_quantized_module.py b/test/quantization/core/test_quantized_module.py index d001aad7242b5d..7cbab3be475e19 100644 --- a/test/quantization/core/test_quantized_module.py +++ b/test/quantization/core/test_quantized_module.py @@ -27,6 +27,7 @@ override_quantized_engine, override_qengines, qengine_is_qnnpack, + qengine_is_onednn, ) from hypothesis import assume, given from hypothesis import strategies as st @@ -99,7 +100,9 @@ def _test_linear_api_impl(self, batch_size, in_features, out_features, use_bias, zero_points=zero_point_tensor, axis=0, dtype=torch.qint8) else: - W_q = torch.quantize_per_tensor(W, 0.1, 4, torch.qint8) + # ONEDNN only supports symmetric quantization of weight + W_zp = 0 if qengine_is_onednn() else 4 + W_q = torch.quantize_per_tensor(W, 0.1, W_zp, torch.qint8) X = torch.rand(batch_size, in_features).float() X_q = torch.quantize_per_tensor(X, 0.2, 10, torch.quint8) @@ -434,7 +437,7 @@ def test_conv1d_api(self): X_scale = 1.3 X_zero_point = 2 W_scale = [0.5] - W_zero_point = [3] + W_zero_point = [0] if qengine_is_onednn() else [3] Y_scale = 5.0 Y_zero_point = 4 if torch.backends.quantized.engine == 'qnnpack': @@ -501,7 +504,7 @@ def test_conv2d_api(self): X_scale = 1.3 X_zero_point = 2 W_scale = [0.5] - W_zero_point = [3] + W_zero_point = [0] if qengine_is_onednn() else [3] Y_scale = 5.0 Y_zero_point = 4 # use_fused -> quantized class @@ -570,7
+573,7 @@ def test_conv3d_api(self): X_scale = 1.3 X_zero_point = 2 W_scale = [0.5] - W_zero_point = [3] + W_zero_point = [0] if qengine_is_onednn() else [3] Y_scale = 5.0 Y_zero_point = 4 # use_fused -> quantized class @@ -1200,7 +1203,8 @@ def test_dynamic_convtranspose3d(self): def test_linear_api(self, batch_size, in_features, out_features, use_bias, use_default_observer): """test API functionality for nn.quantized.dynamic.Linear""" W = torch.rand(out_features, in_features).float() - W_scale, W_zp = _calculate_dynamic_qparams(W, torch.qint8) + qscheme = torch.per_tensor_symmetric if qengine_is_onednn() else torch.per_tensor_affine + W_scale, W_zp = _calculate_dynamic_qparams(W, torch.qint8, qscheme=qscheme) W_q = torch.quantize_per_tensor(W, W_scale, W_zp, torch.qint8) X = torch.rand(batch_size, in_features).float() B = torch.rand(out_features).float() if use_bias else None @@ -1311,8 +1315,8 @@ def test_lstm_api(self, dtype, bidirectional): bias_keys.append(key_name1) bias_keys.append(key_name2) - if not (dtype == torch.float16 and torch.backends.quantized.engine == "qnnpack"): - # fp16 dynamic quant is not supported for qnnpack + if not (dtype == torch.float16 and torch.backends.quantized.engine in ("qnnpack", "onednn")): + # fp16 dynamic quant is not supported for qnnpack or onednn x = torch.randn(seq_len, batch, input_size) h = torch.randn(num_layers * (bidirectional + 1), batch, hidden_size) c = torch.randn(num_layers * (bidirectional + 1), batch, hidden_size) @@ -1362,8 +1366,8 @@ def test_gru_api(self): # instantiated for all engines and dtypes for dtype in [torch.qint8, torch.float16]: - if dtype == torch.float16 and torch.backends.quantized.engine == "qnnpack": - # fp16 dynamic quant is not supported for qnnpack + if dtype == torch.float16 and torch.backends.quantized.engine in ("qnnpack", "onednn"): + # fp16 dynamic quant is not supported for qnnpack or onednn continue # Test default instantiation seq_len = 4 @@ -1435,8 +1439,8 @@ def test_cell_api(self, dtype): 'RNNReLU': torch.ops.quantized.quantized_rnn_relu_cell_dynamic} for rnn_type in cell_dict.keys(): - if not (dtype == torch.float16 and torch.backends.quantized.engine == "qnnpack"): - # fp16 dynamic quant is not supported for qnnpack + if not (dtype == torch.float16 and torch.backends.quantized.engine in ("qnnpack", "onednn")): + # fp16 dynamic quant is not supported for qnnpack or onednn kwargs = {'input_size': input_size, 'hidden_size': hidden_size, 'bias': bias, 'dtype': dtype} if rnn_type == 'RNNReLU': kwargs['nonlinearity'] = "relu" @@ -1545,22 +1549,7 @@ def test_rnn(self): hidden_size = 7 num_layers = 2 bias = True - weight_keys = [] - bias_keys = [] for bidirectional in [True, False]: - num_directions = 2 if bidirectional else 1 - for layer in range(num_layers): - for direction in range(num_directions): - suffix = '_reverse' if direction == 1 else '' - key_name1 = 'weight_ih_l{layer_idx}{suffix}'.format(layer_idx=layer, suffix=suffix) - key_name2 = 'weight_hh_l{layer_idx}{suffix}'.format(layer_idx=layer, suffix=suffix) - weight_keys.append(key_name1) - weight_keys.append(key_name2) - key_name1 = 'bias_ih_l{layer_idx}{suffix}'.format(layer_idx=layer, suffix=suffix) - key_name2 = 'bias_hh_l{layer_idx}{suffix}'.format(layer_idx=layer, suffix=suffix) - bias_keys.append(key_name1) - bias_keys.append(key_name2) - x = torch.randn(seq_len, batch, input_size) h = torch.randn(num_layers * (bidirectional + 1), batch, hidden_size) c = torch.randn(num_layers * (bidirectional + 1), batch, hidden_size) @@ -1575,11 +1564,11 
@@ def test_rnn(self): # initialize ref rnn module weight_qparams = { 'qscheme': torch.per_tensor_affine, - 'dtype': torch.quint8, + 'dtype': torch.qint8, 'scale': 2.0, 'zero_point': 5 } - weight_qparams_dict = {key: weight_qparams for key in fp32_rnn._flat_weights_names} + weight_qparams_dict = {key: weight_qparams for key in fp32_rnn._flat_weights_names if key.startswith("weight")} ref_rnn = nnqr.LSTM( input_size=input_size, hidden_size=hidden_size, @@ -1589,10 +1578,20 @@ def test_rnn(self): dropout=0.0, bidirectional=bidirectional, weight_qparams_dict=weight_qparams_dict) - ref_rnn._flat_weights = fp32_rnn._flat_weights + for wn in fp32_rnn._flat_weights_names: + setattr(ref_rnn, wn, copy.deepcopy(getattr(fp32_rnn, wn))) + + ref_rnn._flat_weights = copy.deepcopy(fp32_rnn._flat_weights) # quantize and dequantize the weights for fp32_rnn module - fp32_rnn._flat_weights = [self._quant_dequant_weight(w, weight_qparams) for w in fp32_rnn._flat_weights] + flat_weights = [] + for wn in fp32_rnn._flat_weights_names: + if wn.startswith("weight"): + weight = self._quant_dequant_weight(getattr(fp32_rnn, wn), weight_qparams) + else: + weight = getattr(fp32_rnn, wn) + flat_weights.append(weight) + fp32_rnn._flat_weights = flat_weights fp32_res = fp32_rnn(x, (h, c)) ref_res = ref_rnn(x, (h, c)) diff --git a/test/quantization/core/test_quantized_op.py b/test/quantization/core/test_quantized_op.py index b6079b37b6a2fd..c1d1251e03e12c 100644 --- a/test/quantization/core/test_quantized_op.py +++ b/test/quantization/core/test_quantized_op.py @@ -26,9 +26,13 @@ from torch.testing._internal.common_quantization import skipIfNoFBGEMM, skipIfNoQNNPACK from torch.testing._internal.common_quantized import _quantize, _dequantize, _calculate_dynamic_qparams, \ override_quantized_engine, supported_qengines, override_qengines, _snr -from torch.testing._internal.common_quantized import qengine_is_qnnpack +from torch.testing._internal.common_quantized import ( + qengine_is_qnnpack, + qengine_is_onednn, +) from torch.ao.quantization import PerChannelMinMaxObserver from torch.testing._internal.common_cuda import TEST_CUDNN +import torch.backends.xnnpack from typing import Optional @@ -71,7 +75,7 @@ def avoid_vpmaddubsw_overflow_linear( # Reference quantized Linear operator -def qlinear_ref(X_q, X_scale, X_zp, W_q, W_scale, W_zp, b_q, Y_scale, Y_zp): +def qlinear_ref(X_q, X_scale, X_zp, W_q, W_scale, W_zp, b_q, Y_scale, Y_zp, dtype=np.uint8): X_q = np.reshape(X_q, (-1, X_q.shape[X_q.ndim - 1])) row_offsets_ref = X_q.sum(axis=1).astype(np.int32).reshape((-1, 1)) col_offsets_ref = W_q.sum(axis=1).astype(np.int32).reshape((1, -1)) @@ -85,7 +89,7 @@ def qlinear_ref(X_q, X_scale, X_zp, W_q, W_scale, W_zp, b_q, Y_scale, Y_zp): ) if b_q is not None: Prod_XqWq_ref += b_q - Y_q_ref = _quantize(Prod_XqWq_ref, Y_scale / (X_scale * W_scale), Y_zp) + Y_q_ref = _quantize(Prod_XqWq_ref, Y_scale / (X_scale * W_scale), Y_zp, dtype=dtype) return Y_q_ref """Computes the output shape given pooling parameters.""" @@ -825,6 +829,44 @@ def test_qadd_relu_same_qparams(self): self.assertEqual(qCrelu_hat, qCrelu_out_hat, msg="AddReLU.out failed") + """Tests the correctness of the cudnn add and add_relu op + (Similar to test_qadd_relu_different_qparams, will probably merge in the future)""" + @unittest.skipIf(not TEST_CUDNN, "cudnn is not enabled.") + @unittest.skip("Local only - currently the qconv2d_cudnn op is built " + "with USE_EXPERIMENTAL_CUDNN_V8_API, we can enable the test " + "after it is built by default") + def
test_qadd_relu_cudnn(self): + dtype = torch.qint8 + add_relu = torch.ops.quantized.add_relu + add = torch.ops.quantized.add + + # NB: This is a strange size so that we exercise both the vectorized + # implementation (64-element chunks at a time) as well as the scalar + # implementation + A = torch.arange(-128, 130, dtype=torch.float).to(torch.device("cuda")) + B = torch.arange(-128, 130, dtype=torch.float).to(torch.device("cuda")) + scale_A = 2.5 + scale_B = 6.3 + scale_C = 12.9 + zero_point = 0 + qA = torch.quantize_per_tensor(A, scale=scale_A, zero_point=zero_point, + dtype=dtype) + qB = torch.quantize_per_tensor(B, scale=scale_B, zero_point=zero_point, + dtype=dtype) + # Add ground truth + C = (qA.dequantize() + qB.dequantize()).to(device="cpu").numpy() + qC = _quantize(C, scale_C, zero_point, dtype=np_dtype[dtype]) + qC_hat = add(qA, qB, scale=scale_C, zero_point=zero_point).to(device="cpu") + np.testing.assert_equal(qC, qC_hat.int_repr(), + "Quantized addition failed.") + + # Add + ReLU ground truth + Crelu = C.copy() + Crelu[C < 0] = 0 + qCrelu = _quantize(Crelu, scale_C, zero_point, dtype=np_dtype[dtype]) + qCrelu_hat = add_relu(qA, qB, scale=scale_C, zero_point=zero_point).to(device="cpu") + np.testing.assert_equal(qCrelu, qCrelu_hat.int_repr(), + "Quantized addition with ReLU failed.") """Tests the correctness of the add and add_relu op.""" def test_qadd_relu_different_qparams(self): @@ -992,9 +1034,20 @@ def test_qmul_relu_different_qparams(self): msg="mulReLU.out failed") """Tests the correctness of the matmul op.""" - def test_qmatmul(self): - A = torch.randn(size=(3, 4), dtype=torch.float32) * 3 - B = torch.randn(size=(4, 5), dtype=torch.float32) * 3 + @given(num_dims=st.integers(2, 5), + outer_dims=st.lists(st.integers(2, 6), min_size=3, max_size=3), + m=st.integers(2, 6), + k=st.integers(2, 6), + n=st.integers(2, 6), + dtypes=st.sampled_from(((torch.qint8, np.int8), + (torch.quint8, np.uint8)))) + def test_qmatmul(self, num_dims, outer_dims, m, k, n, dtypes): + (torch_dtype, np_dtype) = dtypes + + size_a = outer_dims[:num_dims - 2] + [m, k] + size_b = outer_dims[:num_dims - 2] + [k, n] + A = torch.randn(size=size_a, dtype=torch.float32) * 3 + B = torch.randn(size=size_b, dtype=torch.float32) * 3 scale_A = 3.1 zero_point_A = 7 @@ -1004,15 +1057,22 @@ def test_qmatmul(self): scale_C = 1.3 zero_point_C = 5 - qA = torch.quantize_per_tensor(A, scale=scale_A, zero_point=zero_point_A, - dtype=torch.qint8) - qB = torch.quantize_per_tensor(B, scale=scale_B, zero_point=zero_point_B, - dtype=torch.qint8) + qA = torch.quantize_per_tensor(A, + scale=scale_A, + zero_point=zero_point_A, + dtype=torch_dtype) + qB = torch.quantize_per_tensor(B, + scale=scale_B, + zero_point=zero_point_B, + dtype=torch_dtype) # matmul ground truth C = torch.matmul(qA.dequantize(), qB.dequantize()).numpy() - qC = _quantize(C, scale_C, zero_point_C, dtype=np.int8) - qC_hat = torch.ops.quantized.matmul(qA, qB, scale=scale_C, zero_point=zero_point_C) + qC = _quantize(C, scale_C, zero_point_C, dtype=(np_dtype)) + qC_hat = torch.ops.quantized.matmul(qA, + qB, + scale=scale_C, + zero_point=zero_point_C) np.testing.assert_equal(qC, qC_hat.int_repr(), "Quantized multiplication failed.") @@ -1023,10 +1083,16 @@ def test_qmatmul(self): scales_B = torch.rand(size=(B.shape[axis],)) zero_points_B = torch.randint(low=0, high=5, size=(B.shape[axis],)) - qA = torch.quantize_per_channel(A, scales=scales_A, zero_points=zero_points_A, - axis=axis, dtype=torch.qint8) - qB = torch.quantize_per_channel(B, scales=scales_B,
zero_points=zero_points_B, - axis=axis, dtype=torch.qint8) + qA = torch.quantize_per_channel(A, + scales=scales_A, + zero_points=zero_points_A, + axis=axis, + dtype=torch.qint8) + qB = torch.quantize_per_channel(B, + scales=scales_B, + zero_points=zero_points_B, + axis=axis, + dtype=torch.qint8) np.testing.assert_raises_regex(RuntimeError, ".*per-tensor.*", torch.ops.quantized.matmul, @@ -1161,6 +1227,52 @@ def test_max_pool1d(self, X, kernel, stride, dilation, padding, ceil_mode): self.assertEqual(a_ref, a_hat.dequantize(), msg="ops.quantized.max_pool1d results are off") + # TODO: merge this test with test_max_pool2d when USE_EXPERIMENTAL_CUDNN_V8_API flag is enabled in CI + """Tests 2D cudnn max pool operation on quantized tensors.""" + @given(X=hu.tensor(shapes=hu.array_shapes(min_dims=3, max_dims=4, + min_side=1, max_side=10), + # cudnn's support for quantized pooling is limited to + # int8 currently + qparams=hu.qparams(dtypes=[torch.qint8])), + kernel=st.sampled_from((3, 5, 7)), + stride=st.sampled_from((None, 1, 2)), + # currently there is no support for dilation for cudnn + # pooling + dilation=st.integers(1, 1), + padding=st.integers(0, 2), + ceil_mode=st.booleans()) + @unittest.skipIf(not TEST_CUDNN, "cudnn is not enabled.") + @unittest.skip("Local only - currently the qconv2d_cudnn op is bulid " + "with USE_EXPERIMENTAL_CUDNN_V8_API, we can enable the test " + "after it is built by default") + def test_max_pool2d_cudnn(self, X, kernel, stride, dilation, padding, ceil_mode): + X, (scale, zero_point, torch_type) = X + assume(kernel // 2 >= padding) # Kernel cannot be overhanging! + iH, iW = X.shape[-2:] + oH = pool_output_shape(iH, kernel, padding, stride, dilation, ceil_mode) + assume(oH > 0) + oW = pool_output_shape(iW, kernel, padding, stride, dilation, ceil_mode) + assume(oW > 0) + + a = torch.from_numpy(X).to(device="cuda") + a_pool = torch.nn.functional.max_pool2d(a, kernel_size=kernel, + stride=stride, + padding=padding, dilation=dilation, + ceil_mode=ceil_mode) + a_ref = torch.quantize_per_tensor(a_pool, scale=scale, + zero_point=zero_point, dtype=torch_type) + a_ref = a_ref.dequantize() + qa = torch.quantize_per_tensor(a, scale=scale, zero_point=zero_point, + dtype=torch_type) + + # Test the ops.quantized separately, because None is not treated. + a_hat = torch.ops.quantized.max_pool2d( + qa, kernel_size=_pair(kernel), + stride=_pair(kernel if stride is None else stride), + padding=_pair(padding), dilation=_pair(dilation), ceil_mode=ceil_mode) + self.assertEqual(a_ref, a_hat.dequantize(), + msg="ops.quantized.max_pool2d results are off") + """Tests 2D max pool operation on quantized tensors.""" @given(X=hu.tensor(shapes=hu.array_shapes(min_dims=3, max_dims=4, min_side=1, max_side=10), @@ -2633,7 +2745,7 @@ def forward( ] q_data = [] - reduce_range = (qengine == 'fbgemm') + reduce_range = (qengine in ('fbgemm', 'onednn')) for idx, x in enumerate(fp_data): scale, zero_point = _calculate_dynamic_qparams( x, dtype=dtype, reduce_range=reduce_range) @@ -2654,7 +2766,13 @@ def forward( mha.eval() # Prepare - mha.qconfig = torch.ao.quantization.get_default_qconfig(qengine) + if qengine_is_onednn(): + # `reduce_range` is False by default for ONEDNN backend + # but the test fails on earlier CPUs without VNNI. 
+ # So we use a default qconfig with `reduce_range=True` here + mha.qconfig = torch.ao.quantization.get_default_qconfig() + else: + mha.qconfig = torch.ao.quantization.get_default_qconfig(qengine) mha_prepared = torch.ao.quantization.prepare( mha, prepare_custom_config_dict=custom_module_config) @@ -2747,7 +2865,7 @@ def test_qlinear(self, batch_size, input_channels, output_channels, (b_value_max - b_value_min) + b_value_min ).astype(np.int32) if use_bias else None - if torch.backends.quantized.engine == 'fbgemm': + if torch.backends.quantized.engine in ('fbgemm', 'onednn'): avoid_vpmaddubsw_overflow_linear( batch_size, input_channels, @@ -2880,6 +2998,19 @@ def test_qlinear_legacy(self, batch_size, input_channels, output_channels): self.assertEqual(Y_fp32, Y_fp32_ref, msg="torch.ops.quantized.fbgemm_linear_dynamic results are off") + @skipIfNoFBGEMM + @given( + input_channels=st.integers(16, 32), + output_channels=st.integers(4, 8), + exponent=st.integers(0, 8)) + def test_linear_prepack_fp16_numerics(self, input_channels, output_channels, exponent): + w = torch.randn(output_channels, input_channels) * 10**exponent + bias = None + w_packed_fp16 = torch.ops.quantized.linear_prepack_fp16(w, bias) + w_unpacked_fp16 = torch.ops.quantized.linear_unpack_fp16(w_packed_fp16) + w_fp16 = w.to(torch.float16).to(torch.float32) + self.assertTrue(torch.equal(w_fp16, w_unpacked_fp16[0])) + @skipIfNoFBGEMM def test_qlinear_dynamic_fp16(self): @@ -2971,8 +3102,8 @@ def test_qlstmGRU(self, num_batches, input_size, hidden_size, for rnn_type in ['LSTM', 'GRU']: for dtype in [torch.qint8, torch.float16]: - # Fp16 quantization is not supported for qnnpack - if torch.backends.quantized.engine == 'qnnpack' and dtype == torch.float16: + # Fp16 quantization is not supported for qnnpack or onednn + if torch.backends.quantized.engine in ('qnnpack', 'onednn') and dtype == torch.float16: continue if torch.backends.quantized.engine == 'qnnpack': @@ -3105,8 +3236,8 @@ def test_qrnncell(self, num_batches, input_size, hidden_size, per_channel_quant) for rnn_type in ['LSTMCell', 'GRUCell', 'RNNTanh', 'RNNReLU']: for dtype in [torch.qint8, torch.float16]: - # Fp16 quantization is not supported for qnnpack - if torch.backends.quantized.engine == 'qnnpack' and dtype == torch.float16: + # Fp16 quantization is not supported for qnnpack or onednn + if torch.backends.quantized.engine in ('qnnpack', 'onednn') and dtype == torch.float16: continue if torch.backends.quantized.engine == 'qnnpack': @@ -3247,6 +3378,7 @@ class TestQuantizedLinear(TestCase): def test_qlinear(self, batch_size, input_channels, output_channels, use_bias, use_relu, use_multi_dim_input, use_channelwise): decimal_val = 4 + dtypes = [torch.quint8] if torch.backends.quantized.engine == 'qnnpack': # QNNPACK supports uint8 in the kernels. In the op we shift the int8 # weight values to uint8 to be on par with fbgemm. However, this causes @@ -3254,24 +3386,165 @@ def test_qlinear(self, batch_size, input_channels, output_channels, use_bias, # off by one results. 
decimal_val = 0 + # only qnnpack qengine supports qint8 when xnnpack is available + if torch.backends.xnnpack.enabled: + dtypes.append(torch.qint8) + + for dtype in dtypes: + # No support for channelwise in xnnpack (int8) + # ONEDNN does not support qint8 + if dtype == torch.qint8 and (use_channelwise or qengine_is_onednn()): + return + + nptype = np_dtype[dtype] + qlinear_prepack = torch.ops.quantized.linear_prepack + if use_relu: + qlinear = torch.ops.quantized.linear_relu + else: + qlinear = torch.ops.quantized.linear + if use_multi_dim_input: + batch_size *= 3 # Test the multi-dim input tensor + X_scale = 1.5 + X_zp = 5 + X_value_min = -128 if dtype == torch.qint8 else 0 + X_value_max = 127 if dtype == torch.qint8 else 255 + X_q0 = np.round( + np.random.rand(batch_size, input_channels) * + (X_value_max - X_value_min) + + X_value_min + ).astype(nptype) + + W_scales = np.random.rand(output_channels) + # xnnpack forces W_zp to 0 when using symmetric quantization + # ONEDNN only supports symmetric quantization of weight + if dtype == torch.qint8 or qengine_is_onednn(): + W_zps = np.zeros(output_channels).astype(np.int) + else: + W_zps = np.round(np.random.rand(output_channels) * 100 - 50).astype(np.int) + # when using symmetric quantization + # special restriction for xnnpack fully connected op weight + # [-127, 127] instead of [-128, 127] + W_value_min = -127 if dtype == torch.qint8 else -128 + W_value_max = 127 + W_q0 = np.round( + np.random.rand(output_channels, input_channels) + * (W_value_max - W_value_min) + + W_value_min + ).astype(np.int8) # weight is always int8_t + b_value_min = -10 + b_value_max = 10 + b_q0 = np.round( + np.random.rand(output_channels) * + (b_value_max - b_value_min) + b_value_min + ).astype(np.int32) if use_bias else None + if torch.backends.quantized.engine in ('fbgemm', 'onednn'): + avoid_vpmaddubsw_overflow_linear( + batch_size, + input_channels, + output_channels, + X_q0, + X_value_min, + X_value_max, + W_q0, + W_value_min, + W_value_max, + ) + X = torch.from_numpy(_dequantize( + X_q0, X_scale, X_zp)).to(dtype=torch.float) + X_q = torch.quantize_per_tensor( + X, scale=X_scale, zero_point=X_zp, dtype=dtype) + if use_channelwise: + W = torch.from_numpy(_dequantize(W_q0, W_scales.reshape( + (-1, 1)), W_zps.reshape((-1, 1)))).to(dtype=torch.float) + W_q = torch.quantize_per_channel(W, scales=torch.from_numpy(W_scales), + zero_points=torch.from_numpy(W_zps), axis=0, dtype=torch.qint8) + b = torch.from_numpy(_dequantize( + b_q0, X_scale * W_scales, 0)).to(dtype=torch.float) if use_bias else None + b_q = torch.quantize_per_channel(b, scales=torch.from_numpy(X_scale * W_scales), + zero_points=torch.zeros(output_channels, dtype=torch.long), + axis=0, dtype=torch.qint32) if use_bias else None + else: + W = torch.from_numpy(_dequantize( + W_q0, W_scales[0], W_zps[0])).to(dtype=torch.float) + W_q = torch.quantize_per_tensor(W, scale=W_scales[0], zero_point=( + W_zps[0].astype(int).item()), dtype=torch.qint8) + b = torch.from_numpy(_dequantize( + b_q0, X_scale * (W_scales[0].item()), 0)).to(dtype=torch.float) if use_bias else None + b_q = torch.quantize_per_tensor( + b, scale=X_scale * (W_scales[0].item()), zero_point=0, dtype=torch.qint32) if use_bias else None + # Compare X_scale * W_scale * input_channels * X_value_max * W_value_max with + # Y_scale * 255 (max for uint8). 
+ Y_scale = 125.1234 + Y_zp = 5 + # Weight prepacking operator for quantized Linear + float_bias = b if use_bias else None + W_prepack = qlinear_prepack(W_q, float_bias) + if use_multi_dim_input: + X_q = X_q.view(3, int(batch_size / 3), input_channels) + # Quantized Linear operator with prepacked weight + Y_q = qlinear(X_q, W_prepack, Y_scale, Y_zp) + if not use_channelwise: + # Test the per-tensor quantization only + # Reference quantized Linear operator + Y_q_ref = qlinear_ref(X_q0, X_scale, X_zp, W_q0, + W_scales[0], W_zps[0], b_q0, Y_scale, Y_zp, dtype=nptype) + if use_relu: + Y_q_ref[Y_q_ref < Y_zp] = Y_zp + if use_multi_dim_input: + Y_q_ref = np.reshape( + Y_q_ref, (3, int(batch_size / 3), output_channels)) + # Assert equal + np.testing.assert_array_almost_equal(Y_q_ref, Y_q.int_repr().numpy(), decimal=decimal_val) + # Test both per-tensor and per-channel quantization + # Reference quantized result from PyTorch Linear operator + W_fp32 = W_q.dequantize().to(dtype=torch.float) + X_fp32 = X_q.dequantize().to(dtype=torch.float) + b_fp32 = b_q.dequantize().to(dtype=torch.float) if use_bias else None + Y_fp32_ref = F.linear(X_fp32, W_fp32, b_fp32) + if use_relu: + Y_fp32_ref[Y_fp32_ref < 0.0] = 0.0 + Y_q_ref2 = torch.quantize_per_tensor( + Y_fp32_ref, Y_scale, Y_zp, dtype) + # Assert equal + np.testing.assert_array_almost_equal( + Y_q_ref2.int_repr().numpy(), Y_q.int_repr().numpy(), decimal=decimal_val) + + @given(batch_size=st.integers(1, 4), + input_channels=st.integers(16, 32), + output_channels=st.integers(4, 8), + use_bias=st.sampled_from([False]), + use_relu=st.sampled_from([False]), + use_multi_dim_input=st.booleans(), + use_channelwise=st.sampled_from([False])) # channelwise currently not supported for qlinear cudnn + @skipIfNoFBGEMM + @unittest.skipIf(not TEST_CUDNN, "cudnn is not enabled.") + @unittest.skip("Local only - currently the qconv2d_cudnn op is bulid " + "with USE_EXPERIMENTAL_CUDNN_V8_API, we can enable the test " + "after it is built by default") + # TODO: check with yang regarding CUDNN flags + def test_qlinear_cudnn(self, batch_size, input_channels, output_channels, use_bias, + use_relu, use_multi_dim_input, use_channelwise): qlinear_prepack = torch.ops.quantized.linear_prepack + batch_size = 1 + input_channels = 10 + output_channels = 20 + use_bias = False + use_relu = False + use_channelwise = False if use_relu: - qlinear = torch.ops.quantized.linear_relu + qlinear_op = torch.ops.quantized.linear_relu else: - qlinear = torch.ops.quantized.linear - if use_multi_dim_input: - batch_size *= 3 # Test the multi-dim input tensor + qlinear_op = torch.ops.quantized.linear X_scale = 1.5 - X_zp = 5 - X_value_min = 0 - X_value_max = 225 + X_zp = 0 + X_value_min = -128 + X_value_max = 127 X_q0 = np.round( np.random.rand(batch_size, input_channels) * (X_value_max - X_value_min) - + X_value_min - ).astype(np.uint8) - W_scales = np.random.rand(output_channels) - W_zps = np.round(np.random.rand(output_channels) * 100 - 50).astype(np.int) + + X_value_min).astype(np.int8) + W_scale = 2.5 + W_zp = 0 W_value_min = -128 W_value_max = 127 W_q0 = np.round( @@ -3285,6 +3558,15 @@ def test_qlinear(self, batch_size, input_channels, output_channels, use_bias, np.random.rand(output_channels) * (b_value_max - b_value_min) + b_value_min ).astype(np.int32) if use_bias else None + if use_bias: + b_value_min = -10 + b_value_max = 10 + b_q0 = np.round( + np.random.rand(output_channels) * + (b_value_max - b_value_min) + b_value_min + ).astype(np.int32) + else: + bias = None 
avoid_vpmaddubsw_overflow_linear( batch_size, input_channels, @@ -3296,65 +3578,31 @@ def test_qlinear(self, batch_size, input_channels, output_channels, use_bias, W_value_min, W_value_max, ) + quant_dtype = torch.qint8 X = torch.from_numpy(_dequantize( - X_q0, X_scale, X_zp)).to(dtype=torch.float) + X_q0, X_scale, X_zp)).to(dtype=torch.float).to(device="cuda") X_q = torch.quantize_per_tensor( - X, scale=X_scale, zero_point=X_zp, dtype=torch.quint8) - if use_channelwise: - W = torch.from_numpy(_dequantize(W_q0, W_scales.reshape( - (-1, 1)), W_zps.reshape((-1, 1)))).to(dtype=torch.float) - W_q = torch.quantize_per_channel(W, scales=torch.from_numpy(W_scales), - zero_points=torch.from_numpy(W_zps), axis=0, dtype=torch.qint8) - b = torch.from_numpy(_dequantize( - b_q0, X_scale * W_scales, 0)).to(dtype=torch.float) if use_bias else None - b_q = torch.quantize_per_channel(b, scales=torch.from_numpy(X_scale * W_scales), - zero_points=torch.zeros(output_channels, dtype=torch.long), - axis=0, dtype=torch.qint32) if use_bias else None - else: - W = torch.from_numpy(_dequantize( - W_q0, W_scales[0], W_zps[0])).to(dtype=torch.float) - W_q = torch.quantize_per_tensor(W, scale=W_scales[0], zero_point=( - W_zps[0].astype(int).item()), dtype=torch.qint8) - b = torch.from_numpy(_dequantize( - b_q0, X_scale * (W_scales[0].item()), 0)).to(dtype=torch.float) if use_bias else None - b_q = torch.quantize_per_tensor( - b, scale=X_scale * (W_scales[0].item()), zero_point=0, dtype=torch.qint32) if use_bias else None - # Compare X_scale * W_scale * input_channels * X_value_max * W_value_max with - # Y_scale * 255 (max for uint8). - Y_scale = 125.1234 - Y_zp = 5 + X, scale=X_scale, zero_point=X_zp, dtype=quant_dtype) + W = torch.from_numpy(_dequantize( + W_q0, W_scale, W_zp)).to(dtype=torch.float).to(device="cuda") + W_q = torch.quantize_per_tensor(W, scale=W_scale, zero_point=W_zp, dtype=quant_dtype) + b = torch.from_numpy(_dequantize( + b_q0, X_scale * (W_zp), 0)).to(dtype=torch.float).to(device="cuda") if use_bias else None + b_q = torch.quantize_per_tensor( + b, scale=X_scale * W_scale, zero_point=0, dtype=quant_dtype) if use_bias else None + Y_scale = 0.5 + Y_zp = 0 # Weight prepacking operator for quantized Linear float_bias = b if use_bias else None - W_prepack = qlinear_prepack(W_q, float_bias) - if use_multi_dim_input: - X_q = X_q.view(3, int(batch_size / 3), input_channels) + W_prepack = qlinear_prepack(W_q, float_bias if use_bias else None) # Quantized Linear operator with prepacked weight - Y_q = qlinear(X_q, W_prepack, Y_scale, Y_zp) - if not use_channelwise: - # Test the per-tensor quantization only - # Reference quantized Linear operator - Y_q_ref = qlinear_ref(X_q0, X_scale, X_zp, W_q0, - W_scales[0], W_zps[0], b_q0, Y_scale, Y_zp) - if use_relu: - Y_q_ref[Y_q_ref < Y_zp] = Y_zp - if use_multi_dim_input: - Y_q_ref = np.reshape( - Y_q_ref, (3, int(batch_size / 3), output_channels)) - # Assert equal - np.testing.assert_array_almost_equal(Y_q_ref, Y_q.int_repr().numpy(), decimal=decimal_val) - # Test both per-tensor and per-channel quantization - # Reference quantized result from PyTorch Linear operator - W_fp32 = W_q.dequantize().to(dtype=torch.float) - X_fp32 = X_q.dequantize().to(dtype=torch.float) - b_fp32 = b_q.dequantize().to(dtype=torch.float) if use_bias else None - Y_fp32_ref = F.linear(X_fp32, W_fp32, b_fp32) + Y_q = qlinear_op(X_q, W_prepack, Y_scale, Y_zp).to(device="cpu") + Y_q_ref = qlinear_ref(X_q0, X_scale, X_zp, W_q0, + W_scale, W_zp, b_q0, Y_scale, Y_zp, dtype=np.int8) if use_relu: - 
Y_fp32_ref[Y_fp32_ref < 0.0] = 0.0 - Y_q_ref2 = torch.quantize_per_tensor( - Y_fp32_ref, Y_scale, Y_zp, torch.quint8) - # Assert equal - np.testing.assert_array_almost_equal( - Y_q_ref2.int_repr().numpy(), Y_q.int_repr().numpy(), decimal=decimal_val) + Y_q_ref[Y_q_ref < Y_zp] = Y_zp + decimal_val = 0 + np.testing.assert_array_almost_equal(Y_q_ref, Y_q.int_repr().numpy(), decimal=decimal_val) """Tests the correctness of the quantized::linear_unpack op.""" @given(W=hu.tensor(shapes=hu.array_shapes(2, 2,), @@ -3371,6 +3619,13 @@ def test_qlinear_unpack(self, W, use_channelwise): qlinear_prepack = torch.ops.quantized.linear_prepack qlinear_unpack = torch.ops.quantized.linear_unpack + # ONEDNN only supports symmetric quantization of weight + if qengine_is_onednn(): + if use_channelwise: + W_zps = torch.zeros(output_channels).to(torch.int64) + else: + W_zp = 0 + W = torch.from_numpy(W) if use_channelwise: W_q = torch.quantize_per_channel( @@ -3834,6 +4089,10 @@ def _test_qconv_unpack_impl(self, qconv_prepack_fn, qconv_unpack_fn, inputs, if channelwise and transposed: # currently transposed conv and per-channel per quantization does not work return + # ONEDNN only supports symmetric quantization of weight and zero output padding + if qengine_is_onednn(): + W_zero_point = 0 + o_pads = len(o_pads) * [0] if o_pads is not None else None if channelwise: if transposed: output_channels = W.shape[1] # IC OC/G @@ -3972,6 +4231,9 @@ def _test_qconv_impl( weight_dtype=torch.qint8, output_dtype=torch.quint8, ): + # ONEDNN only supports symmetric quantization of weight + if qengine_is_onednn() and W_zero_point is not None: + W_zero_point = len(W_zero_point) * [0] (X, W), (X_q, W_q), bias_float = self._make_qconv_tensors( batch_size, input_channels_per_group, input_feature_map_shape, output_channels_per_group, groups, kernels, @@ -4056,7 +4318,7 @@ def _test_qconv_impl( Y_scale=st.floats(4.2, 5.6), Y_zero_point=st.integers(0, 4), use_bias=st.booleans(), - use_relu=st.sampled_from([False]), + use_relu=st.booleans(), use_channelwise=st.booleans()) @override_qengines def test_qconv2d( @@ -4104,12 +4366,22 @@ def test_qconv2d( dilations, groups, ) - self._test_qconv_impl( - qconv, qconv_prepack, conv_op, batch_size, - input_channels_per_group, (height, width), - output_channels_per_group, groups, kernels, strides, pads, None, - dilations, X_scale, X_zero_point, W_scale, W_zero_point, - Y_scale, Y_zero_point, use_bias, use_relu, use_channelwise, False) + + act_qdtypes = [torch.quint8] + # Only qnnpack qengine supportes qint8 + if qengine_is_qnnpack() and torch.backends.xnnpack.enabled: + act_qdtypes.append(torch.qint8) + + for X_qdtype in act_qdtypes: + if X_qdtype == torch.qint8: + W_zero_point = [0 for i in range(len(W_zero_point))] + + self._test_qconv_impl( + qconv, qconv_prepack, conv_op, batch_size, + input_channels_per_group, (height, width), + output_channels_per_group, groups, kernels, strides, pads, None, + dilations, X_scale, X_zero_point, W_scale, W_zero_point, + Y_scale, Y_zero_point, use_bias, use_relu, use_channelwise, False, input_dtype=X_qdtype, output_dtype=X_qdtype) @given(batch_size=st.integers(1, 3), # only multiples of 16 are supported right now, might be fixed in @@ -4181,9 +4453,9 @@ def test_qconv2d_cudnn( dilations = (dilation, dilation) if use_relu: - qconv = torch.ops.quantized.conv2d_relu_cudnn + qconv = torch.ops.quantized.conv2d_relu else: - qconv = torch.ops.quantized.conv2d_cudnn + qconv = torch.ops.quantized.conv2d conv_op = torch.nn.Conv2d( input_channels, output_channels, @@ 
-4194,7 +4466,7 @@ def test_qconv2d_cudnn( groups, ).to(torch.device("cuda")) self._test_qconv_impl( - qconv, None, conv_op, batch_size, + qconv, torch.ops.quantized.conv2d_prepack, conv_op, batch_size, input_channels_per_group, (height, width), output_channels_per_group, groups, kernels, strides, pads, None, dilations, X_scale, X_zero_point, W_scale, W_zero_point, @@ -4270,13 +4542,14 @@ def trace_handler(p): weight_int8 = torch.quantize_per_tensor(weight, 1, 0, torch.qint8).contiguous(memory_format=torch.channels_last) scale = 1.0 zero_point = 0 - conv_op = torch.ops.quantized.conv2d_cudnn + conv_op = torch.ops.quantized.conv2d + weight_prepacked = torch.ops.quantized.conv2d_prepack(weight_int8, None, stride, padding, dilation, groups) with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=my_schedule, on_trace_ready=trace_handler) as prof: for i in range(30): - conv_op(input_int8, weight_int8, None, stride, padding, dilation, groups, scale, zero_point) + conv_op(input_int8, weight_prepacked, scale, zero_point) prof.step() print("int8 benchmark result:") @@ -4324,7 +4597,7 @@ def test_qconv_transpose1d( return # Currently only the QNNPACK is supported if qengine_is_qnnpack() and (IS_PPC or TEST_WITH_UBSAN): return # QNNPACK doesn't support these - assume(o_pad < stride or o_pad < dilation) + assume(o_pad < stride and o_pad < dilation) input_channels = input_channels_per_group * groups output_channels = output_channels_per_group * groups @@ -4347,40 +4620,51 @@ def test_qconv_transpose1d( dilation=dilations, bias=use_bias ) - X_q, W_q, bias_float = self._test_qconv_impl( - qconv, qconv_prepack, conv_op, batch_size, - input_channels_per_group, (width, ), - output_channels_per_group, groups, kernels, strides, pads, o_pads, - dilations, X_scale, X_zero_point, W_scale, W_zero_point, - Y_scale, Y_zero_point, use_bias, use_relu=False, - use_channelwise=False, use_transpose=True) - # check that this doesn't error - test_conv = torch.nn.quantized.ConvTranspose1d(input_channels, output_channels, 1) - test_conv(X_q) + act_qdtypes = [torch.quint8] + # Only qnnpack qengine supportes qint8 + if qengine_is_qnnpack() and torch.backends.xnnpack.enabled: + act_qdtypes.append(torch.qint8) - # Test the module implementation - qconv_op = torch.nn.quantized.ConvTranspose1d( - in_channels=input_channels, - out_channels=output_channels, - kernel_size=kernels, - stride=strides, - padding=pads, - output_padding=o_pads, - groups=groups, - dilation=dilations, - bias=use_bias - ) - qconv_op.scale = Y_scale - qconv_op.zero_point = Y_zero_point - qconv_op.set_weight_bias(W_q, bias_float) + for X_qdtype in act_qdtypes: + if X_qdtype == torch.qint8: + W_zero_point = [0 for i in range(len(W_zero_point))] - Y_dq_ref = conv_op(X_q.dequantize()) - Y_q_ref = torch.quantize_per_tensor(Y_dq_ref, scale=Y_scale, - zero_point=Y_zero_point, - dtype=torch.quint8) - Y_q = qconv_op(X_q) - self.assertEqual(Y_q_ref, Y_q) + X_q, W_q, bias_float = self._test_qconv_impl( + qconv, qconv_prepack, conv_op, batch_size, + input_channels_per_group, (width, ), + output_channels_per_group, groups, kernels, strides, pads, o_pads, + dilations, X_scale, X_zero_point, W_scale, W_zero_point, + Y_scale, Y_zero_point, use_bias, use_relu=False, + use_channelwise=False, use_transpose=True, input_dtype=X_qdtype, output_dtype=X_qdtype) + + # check that this doesn't error + test_conv = torch.nn.quantized.ConvTranspose1d(input_channels, output_channels, 1) + test_conv.scale = Y_scale + test_conv(X_q) + + # Test the module 
implementation + qconv_op = torch.nn.quantized.ConvTranspose1d( + in_channels=input_channels, + out_channels=output_channels, + kernel_size=kernels, + stride=strides, + padding=pads, + output_padding=o_pads, + groups=groups, + dilation=dilations, + bias=use_bias + ) + qconv_op.scale = Y_scale + qconv_op.zero_point = Y_zero_point + qconv_op.set_weight_bias(W_q, bias_float) + + Y_dq_ref = conv_op(X_q.dequantize()) + Y_q_ref = torch.quantize_per_tensor(Y_dq_ref, scale=Y_scale, + zero_point=Y_zero_point, + dtype=X_qdtype) + Y_q = qconv_op(X_q) + self.assertEqual(Y_q_ref, Y_q) """Tests the correctness of quantized convolution op.""" @@ -4433,8 +4717,11 @@ def test_qconv_transpose2d( use_bias): if qengine_is_qnnpack() and (IS_PPC or TEST_WITH_UBSAN): return # QNNPACK doesn't support these - assume(o_pad_h < stride_h or o_pad_h < dilation) - assume(o_pad_w < stride_w or o_pad_w < dilation) + # ONEDNN does not support output paddings + if qengine_is_onednn() and (o_pad_h, o_pad_w) != (0, 0): + return + assume(o_pad_h < stride_h and o_pad_h < dilation) + assume(o_pad_w < stride_w and o_pad_w < dilation) input_channels = input_channels_per_group * groups output_channels = output_channels_per_group * groups @@ -4457,40 +4744,50 @@ def test_qconv_transpose2d( dilation=dilations, bias=use_bias ) - X_q, W_q, bias_float = self._test_qconv_impl( - qconv, qconv_prepack, conv_op, batch_size, - input_channels_per_group, (height, width), - output_channels_per_group, groups, kernels, strides, pads, o_pads, - dilations, X_scale, X_zero_point, W_scale, W_zero_point, - Y_scale, Y_zero_point, use_bias, use_relu=False, - use_channelwise=False, use_transpose=True) + act_qdtypes = [torch.quint8] + # Only qnnpack qengine supportes qint8 + if qengine_is_qnnpack() and torch.backends.xnnpack.enabled: + act_qdtypes.append(torch.qint8) - # check that this doesn't error - test_conv = torch.nn.quantized.ConvTranspose2d(input_channels, output_channels, 1) - test_conv(X_q) + for X_qdtype in act_qdtypes: + if X_qdtype == torch.qint8: + W_zero_point = [0 for i in range(len(W_zero_point))] - # Test the module implementation - qconv_op = torch.nn.quantized.ConvTranspose2d( - in_channels=input_channels, - out_channels=output_channels, - kernel_size=kernels, - stride=strides, - padding=pads, - output_padding=o_pads, - groups=groups, - dilation=dilations, - bias=use_bias - ) - qconv_op.scale = Y_scale - qconv_op.zero_point = Y_zero_point - qconv_op.set_weight_bias(W_q, bias_float) + X_q, W_q, bias_float = self._test_qconv_impl( + qconv, qconv_prepack, conv_op, batch_size, + input_channels_per_group, (height, width), + output_channels_per_group, groups, kernels, strides, pads, o_pads, + dilations, X_scale, X_zero_point, W_scale, W_zero_point, + Y_scale, Y_zero_point, use_bias, use_relu=False, + use_channelwise=False, use_transpose=True, input_dtype=X_qdtype, output_dtype=X_qdtype) + + # check that this doesn't error + test_conv = torch.nn.quantized.ConvTranspose2d(input_channels, output_channels, 1) + test_conv.scale = Y_scale + test_conv(X_q) + + # Test the module implementation + qconv_op = torch.nn.quantized.ConvTranspose2d( + in_channels=input_channels, + out_channels=output_channels, + kernel_size=kernels, + stride=strides, + padding=pads, + output_padding=o_pads, + groups=groups, + dilation=dilations, + bias=use_bias + ) + qconv_op.scale = Y_scale + qconv_op.zero_point = Y_zero_point + qconv_op.set_weight_bias(W_q, bias_float) - Y_dq_ref = conv_op(X_q.dequantize()) - Y_q_ref = torch.quantize_per_tensor(Y_dq_ref, scale=Y_scale, - 
zero_point=Y_zero_point, - dtype=torch.quint8) - Y_q = qconv_op(X_q) - self.assertEqual(Y_q_ref, Y_q) + Y_dq_ref = conv_op(X_q.dequantize()) + Y_q_ref = torch.quantize_per_tensor(Y_dq_ref, scale=Y_scale, + zero_point=Y_zero_point, + dtype=X_qdtype) + Y_q = qconv_op(X_q) + self.assertEqual(Y_q_ref, Y_q) """Tests the correctness of quantized convolution op.""" @given(batch_size=st.integers(1, 3), @@ -4552,6 +4849,9 @@ def test_qconv_transpose3d( use_bias): if qengine_is_qnnpack(): return # QNNPACK doesn't support this + # ONEDNN doesn't support output paddings + if qengine_is_onednn() and (o_pad_t, o_pad_h, o_pad_w) != (0, 0, 0): + return assume(o_pad_t < stride_t or o_pad_t < dilation) assume(o_pad_h < stride_h or o_pad_h < dilation) assume(o_pad_w < stride_w or o_pad_w < dilation) @@ -4587,6 +4887,7 @@ def test_qconv_transpose3d( # check that this doesn't error test_conv = torch.nn.quantized.ConvTranspose3d(input_channels, output_channels, 1) + test_conv.scale = Y_scale test_conv(X_q) # Test the module implementation @@ -4736,7 +5037,7 @@ def test_qconv1d( output_channels = output_channels_per_group * groups if torch.backends.quantized.engine == 'qnnpack': use_channelwise = False - true_conv1d = torch.nn.Conv1d( + conv1d = torch.nn.Conv1d( input_channels, output_channels, kernel, @@ -4749,12 +5050,23 @@ def test_qconv1d( qconv = torch.ops.quantized.conv1d if use_relu: qconv = torch.ops.quantized.conv1d_relu - self._test_qconv_impl( - qconv, qconv_prepack, true_conv1d, batch_size, - input_channels_per_group, (length, ), - output_channels_per_group, groups, kernel, [stride], [pad], None, - [dilation], X_scale, X_zero_point, W_scale, W_zero_point, - Y_scale, Y_zero_point, use_bias, use_relu, use_channelwise, False) + + act_qdtypes = [torch.quint8] + # Only qnnpack qengine supportes qint8 + if qengine_is_qnnpack() and torch.backends.xnnpack.enabled: + act_qdtypes.append(torch.qint8) + + for X_qdtype in act_qdtypes: + if X_qdtype == torch.qint8: + W_zero_point = [0 for i in range(len(W_zero_point))] + + self._test_qconv_impl( + qconv, qconv_prepack, conv1d, batch_size, + input_channels_per_group, (length, ), + output_channels_per_group, groups, kernel, [stride], [pad], None, + [dilation], X_scale, X_zero_point, W_scale, W_zero_point, + Y_scale, Y_zero_point, use_bias, use_relu, use_channelwise, False, + input_dtype=X_qdtype, output_dtype=X_qdtype) @given(batch_size=st.integers(1, 4), input_channels_per_group=st.sampled_from([2, 4, 5, 8, 16]), @@ -5089,7 +5401,7 @@ def test_qnnpack_sigmoid_sweep(self): """Tests the correctness of the quantized::add (qnnpack) op.""" @settings(suppress_health_check=(HealthCheck.filter_too_much,)) @given(A=hu.tensor(shapes=hu.array_shapes(1, 5, 1, 5), - qparams=hu.qparams(dtypes=torch.quint8)), + qparams=hu.qparams(dtypes=[torch.quint8, torch.qint8])), zero_point=st.sampled_from([0, 2, 5, 15, 127]), scale_A=st.sampled_from([0.001, 0.057, 0.889, 12.3]), scale_B=st.sampled_from([0.008, 0.0821, 0.67, 7]), @@ -5097,39 +5409,96 @@ def test_qnnpack_sigmoid_sweep(self): def test_qnnpack_add(self, A, zero_point, scale_A, scale_B, scale_C): with override_quantized_engine('qnnpack'): A_temp = A - A, (scale_a, zero_point_A, torch_type) = A_temp - B, (scale_b, zero_point_B, torch_type) = A_temp - A = torch.from_numpy(A) - B = torch.from_numpy(B) - - assume(scale_A // scale_C >= 2**-14) - assume(scale_A // scale_C < 2**8) - assume(scale_B // scale_C >= 2**-14) - assume(scale_B // scale_C < 2**8) - - zero_point_C = 127 - qA = torch.quantize_per_tensor(A, scale=scale_A, 
zero_point=zero_point, - dtype=torch.quint8) - qB = torch.quantize_per_tensor(B, scale=scale_B, zero_point=zero_point, - dtype=torch.quint8) + for channels_last in [True, False]: + if channels_last and len(A_temp[0].shape) != 4: + continue + A, (scale_a, zero_point_A, torch_type) = A_temp + B, (scale_b, zero_point_B, torch_type) = A_temp + A = torch.from_numpy(A) + B = torch.from_numpy(B) - # Add ground truth - C = (qA.dequantize() + qB.dequantize()).numpy() + if torch_type == torch.qint8 and not torch.backends.xnnpack.enabled: + continue - qC = _quantize(C, scale_C, zero_point_C) + if channels_last: + A = A.to(memory_format=torch.channels_last) + B = B.to(memory_format=torch.channels_last) + assume(scale_A // scale_C >= 2**-14) + assume(scale_A // scale_C < 2**8) + assume(scale_B // scale_C >= 2**-14) + assume(scale_B // scale_C < 2**8) - qC_qnnp = torch.ops.quantized.add(qA, qB, scale_C, zero_point_C) + zero_point_C = 127 + np_dtype = np.uint8 - np.testing.assert_equal(qC, qC_qnnp.int_repr(), - "Quantized addition failed.") + if torch_type == torch.qint8: + zero_point_C = 0 + np_dtype = np.int8 - Crelu = C.copy() - Crelu[C < 0] = 0 - qCrelu = torch.quantize_per_tensor(torch.from_numpy(Crelu), scale_C, - zero_point_C, dtype=torch.quint8) - qCrelu_hat = torch.ops.quantized.add_relu(qA, qB, scale=scale_C, zero_point=zero_point_C) - np.testing.assert_equal(qCrelu.int_repr().numpy(), qCrelu_hat.int_repr(), - "Quantized addition with ReLU failed.") + qA = torch.quantize_per_tensor(A, scale=scale_A, zero_point=zero_point, + dtype=torch_type) + qB = torch.quantize_per_tensor(B, scale=scale_B, zero_point=zero_point, + dtype=torch_type) + + # Add ground truth + C = (qA.dequantize() + qB.dequantize()).numpy() + + qC = _quantize(C, scale_C, zero_point_C, dtype=np_dtype) + + qC_qnnp = torch.ops.quantized.add(qA, qB, scale_C, zero_point_C) + + np.testing.assert_equal(qC, qC_qnnp.int_repr(), + "Quantized addition failed.") + + Crelu = C.copy() + Crelu[C < 0] = 0 + qCrelu = torch.quantize_per_tensor(torch.from_numpy(Crelu), scale_C, + zero_point_C, dtype=torch_type) + qCrelu_hat = torch.ops.quantized.add_relu(qA, qB, scale=scale_C, zero_point=zero_point_C) + np.testing.assert_equal(qCrelu.int_repr().numpy(), qCrelu_hat.int_repr(), + "Quantized addition with ReLU failed.") + + """Tests that quantized add works with broadcasting """ + def test_qnnpack_add_broadcast(self): + def _run_test(A, B): + qA = torch.quantize_per_tensor(A, 0.02, 0, dtype) + qB = torch.quantize_per_tensor(B, 0.04, 2, dtype) + + output_scale = 0.01 + output_zp = 1 + + # ground truth + C = qA.dequantize() + qB.dequantize() + qC = torch.quantize_per_tensor(C, output_scale, output_zp, dtype) + + # quantized + qC_hat_1 = torch.ops.quantized.add(qA, qB, output_scale, output_zp) + qC_hat_2 = torch.ops.quantized.add(qB, qA, output_scale, output_zp) + + self.assertTrue(torch.allclose(qC.dequantize(), qC_hat_1.dequantize())) + self.assertTrue(torch.allclose(qC.dequantize(), qC_hat_2.dequantize())) + + with override_quantized_engine("qnnpack"): + for dtype in (torch.qint8, torch.quint8): + if dtype == torch.qint8 and not torch.backends.xnnpack.enabled: + continue + + for channels_last in [True, False]: + # 4d + A = torch.randn(1, 3, 4, 4) + B = torch.randn(1, 1, 1, 1) + if channels_last: + A = A.to(memory_format=torch.channels_last) + B = B.to(memory_format=torch.channels_last) + _run_test(A, B) + + # 5d + C = torch.randn(1, 3, 4, 4, 4) + D = torch.randn(1, 1, 1, 1, 1) + if channels_last: + C = C.to(memory_format=torch.channels_last_3d) + D = 
D.to(memory_format=torch.channels_last_3d) + _run_test(C, D) """Tests the correctness of quantized::qnnpack_maxpool2d op.""" @given(A=hu.tensor(shapes=hu.array_shapes(4, 4, 3, 5), diff --git a/test/quantization/core/test_workflow_module.py b/test/quantization/core/test_workflow_module.py index 77fb492984c8e1..98c3fa913d015f 100644 --- a/test/quantization/core/test_workflow_module.py +++ b/test/quantization/core/test_workflow_module.py @@ -400,7 +400,7 @@ def test_zero_numel(self): x = obs(x) def _test_memoryless(self, obs_class): - obs = obs_class(memoryless=True) + obs = obs_class(averaging_constant=1) x = torch.randn((3, 3)) obs(x) params = obs.calculate_qparams() @@ -411,10 +411,10 @@ def _test_memoryless(self, obs_class): self.assertEqual(params, obs.calculate_qparams()) def test_memoryless_minmaxobserver(self): - self._test_memoryless(MinMaxObserver) + self._test_memoryless(MovingAverageMinMaxObserver) def test_memoryless_perchannelminmaxobserver(self): - self._test_memoryless(PerChannelMinMaxObserver) + self._test_memoryless(MovingAveragePerChannelMinMaxObserver) # HistogramObserver that works like it does on master class _ReferenceHistogramObserver(HistogramObserver): @@ -758,6 +758,17 @@ def test_fq_serializable_per_channel(self): for key in state_dict: self.assertEqual(state_dict[key], loaded_dict[key]) + def test_quant_min_max_override(self): + observer = default_per_channel_weight_observer + # test no override + fq_module = FakeQuantize(observer) + self.assertEqual(fq_module.activation_post_process.quant_min, -128) + self.assertEqual(fq_module.activation_post_process.quant_max, 127) + # test quant_min/quant_max override + fq_module = FakeQuantize(observer, quant_min=0, quant_max=127) + self.assertEqual(fq_module.activation_post_process.quant_min, 0) + self.assertEqual(fq_module.activation_post_process.quant_max, 127) + def _get_buffer_ids(module): """ Object addresses stay constant if and only if all modifications are in-place diff --git a/test/quantization/eager/test_numeric_suite_eager.py b/test/quantization/eager/test_numeric_suite_eager.py index 3bf969395c517c..3714a1f28c67b4 100644 --- a/test/quantization/eager/test_numeric_suite_eager.py +++ b/test/quantization/eager/test_numeric_suite_eager.py @@ -19,6 +19,8 @@ compare_model_outputs, compare_model_stub, compare_weights, + prepare_model_outputs, + get_matching_activations, ) from torch.testing._internal.common_quantization import ( AnnotatedConvBnReLUModel, @@ -30,6 +32,7 @@ QuantizationTestCase, SingleLayerLinearDynamicModel, test_only_eval_fn, + skip_if_no_torchvision, ) from torch.testing._internal.common_quantized import override_qengines @@ -421,14 +424,12 @@ def test_compare_model_outputs_functional_static(self): q_model(self.img_data_2d[0][0]) q_model = convert(q_model) act_compare_dict = compare_model_outputs(model, q_model, self.img_data_2d[0][0]) - self.assertEqual(len(act_compare_dict), 7) + self.assertEqual(len(act_compare_dict), 5) expected_act_compare_dict_keys = { "mycat.stats", "myadd.stats", "mymul.stats", "myadd_relu.stats", - "my_scalar_add.stats", - "my_scalar_mul.stats", "quant.stats", } self.assertTrue(act_compare_dict.keys() == expected_act_compare_dict_keys) @@ -534,3 +535,50 @@ def test_shadow_logger(self): self.assertEqual(len(logger.stats["float"]), 2) self.assertEqual(len(logger.stats["quantized"]), 2) + + @skip_if_no_torchvision + def _test_vision_model(self, float_model): + float_model.to('cpu') + float_model.eval() + float_model.fuse_model() + float_model.qconfig = 
torch.quantization.default_qconfig + img_data = [(torch.rand(2, 3, 224, 224, dtype=torch.float), torch.randint(0, 1, (2,), dtype=torch.long)) for _ in range(2)] + qmodel = quantize(float_model, torch.quantization.default_eval_fn, [img_data], inplace=False) + + wt_compare_dict = compare_weights(float_model.state_dict(), qmodel.state_dict()) + + def compute_error(x, y): + Ps = torch.norm(x) + Pn = torch.norm(x - y) + return 20 * torch.log10(Ps / Pn) + + data = img_data[0][0] + # Take in floating point and quantized model as well as input data, and returns a dict, with keys + # corresponding to the quantized module names and each entry being a dictionary with two keys 'float' and + # 'quantized', containing the activations of floating point and quantized model at matching locations. + act_compare_dict = compare_model_outputs(float_model, qmodel, data) + + + for key in act_compare_dict: + compute_error(act_compare_dict[key]['float'][0], act_compare_dict[key]['quantized'][0].dequantize()) + + prepare_model_outputs(float_model, qmodel) + + for data in img_data: + float_model(data[0]) + qmodel(data[0]) + + # Find the matching activation between floating point and quantized modules, and return a dict with key + # corresponding to quantized module names and each entry being a dictionary with two keys 'float' + # and 'quantized', containing the matching floating point and quantized activations logged by the logger + act_compare_dict = get_matching_activations(float_model, qmodel) + + @skip_if_no_torchvision + def test_mobilenet_v2(self): + from torchvision.models.quantization import mobilenet_v2 + self._test_vision_model(mobilenet_v2(pretrained=True, quantize=False)) + + @skip_if_no_torchvision + def test_mobilenet_v3(self): + from torchvision.models.quantization import mobilenet_v3_large + self._test_vision_model(mobilenet_v3_large(pretrained=True, quantize=False)) diff --git a/test/quantization/eager/test_quantize_eager_ptq.py b/test/quantization/eager/test_quantize_eager_ptq.py index a8ca0eb3353e2c..ec287cd89fa111 100644 --- a/test/quantization/eager/test_quantize_eager_ptq.py +++ b/test/quantization/eager/test_quantize_eager_ptq.py @@ -3,7 +3,6 @@ import torch import torch.nn as nn import torch.nn.quantized as nnq -import torch.nn.quantized._reference as nnqr from torch.nn.utils.rnn import PackedSequence from torch.ao.quantization import ( quantize, @@ -140,17 +139,7 @@ def forward(self, x): ref_m = prepare(original_ref_m) ref_m(data) - reference_module_mapping = { - QuantStub: nnq.Quantize, - DeQuantStub: nnq.DeQuantize, - nn.Conv1d: nnqr.Conv1d, - nn.Conv2d: nnqr.Conv2d, - nn.Conv3d: nnqr.Conv3d, - nn.ConvTranspose1d: nnqr.ConvTranspose1d, - nn.ConvTranspose2d: nnqr.ConvTranspose2d, - nn.ConvTranspose3d: nnqr.ConvTranspose3d, - } - ref_m = convert(ref_m, mapping=reference_module_mapping) + ref_m = convert(ref_m, is_reference=True) ref_res = ref_m(data) self.assertEqual(res, ref_res) @@ -202,6 +191,85 @@ def test_conv_transpose_3d(self): (16, 1, 10, 10, 10) ) + def test_linear(self): + self._test_reference_module_impl( + nn.Linear, + nnq.Linear, + {'in_features': 5, 'out_features': 10}, + (16, 5) + ) + + @override_qengines + def test_int16_reference_module(self): + + class RefM(torch.nn.Module): + def __init__(self): + super().__init__() + self.conv = nn.ConvTranspose2d(1, 1, 1) + self.quant1 = QuantStub() + self.dequant1 = DeQuantStub() + self.quant2 = QuantStub() + self.dequant2 = DeQuantStub() + + def forward(self, x): + x = self.quant1(x) + x = self.dequant1(x) + x = self.conv(x) + x = 
self.quant2(x) + x = self.dequant2(x) + return x + + + input_size = (16, 1, 10, 10) + data = torch.randn(*input_size, dtype=torch.float) + + original_ref_m = RefM() + rand_w = torch.randn_like(original_ref_m.conv.weight) + rand_b = torch.randn_like(original_ref_m.conv.bias) + original_ref_m.conv.weight = torch.nn.Parameter(rand_w, requires_grad=False) + original_ref_m.conv.bias = torch.nn.Parameter(rand_b, requires_grad=False) + + qengine = torch.backends.quantized.engine + if qengine not in supported_qengines: + return + from torch.ao.quantization.observer import MovingAverageMinMaxObserver + + weight_obs = MovingAverageMinMaxObserver.with_args( + dtype=torch.qint32, + # set qmin and qmax to represent qint16 + quant_min=-1 * (2 ** 15), + quant_max=(2 ** 15) - 1, + qscheme=torch.per_tensor_symmetric, + ) + act_obs = MovingAverageMinMaxObserver.with_args( + dtype=torch.qint32, + quant_min=-1 * (2 ** 15), + quant_max=(2 ** 15) - 1, + ) + custom_qconfig = QConfig(activation=act_obs, weight=weight_obs) + + # quantize the reference model + original_ref_m.eval() + original_ref_m.qconfig = custom_qconfig + + ref_m = prepare(original_ref_m) + # calibration + ref_m(torch.randn(*input_size, dtype=torch.float)) + + ref_m = convert(ref_m, is_reference=True) + + myobs = MovingAverageMinMaxObserver(averaging_constant=0.5, + dtype=torch.qint32, + # set qmin and qmax to represent qint16 + quant_min=-1 * (2 ** 15), + quant_max=(2 ** 15) - 1, + qscheme=torch.per_tensor_symmetric, + ) + result = myobs(rand_w) + qparams = myobs.calculate_qparams() + self.assertEqual(ref_m.conv.weight_scale, qparams[0]) + + def _test_activation_op_impl( self, float_module_class, quantized_module_class, extra_module_kwargs): """ Implementation for testing common activation ops like leaky relu @@ -1391,7 +1459,8 @@ def export_to_onnx(model, input, input_names): model = torch.jit.load(buf) f = io.BytesIO() torch.onnx.export(model, input, f, input_names=input_names, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) + operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK, + opset_version=9) onnx_model = export_to_onnx(model, data, input_names) @skipIfNoFBGEMM diff --git a/test/quantization/eager/test_quantize_eager_qat.py b/test/quantization/eager/test_quantize_eager_qat.py index 02a3175f4c80c7..984e87dacbbcd9 100644 --- a/test/quantization/eager/test_quantize_eager_qat.py +++ b/test/quantization/eager/test_quantize_eager_qat.py @@ -23,6 +23,7 @@ default_qconfig, default_qat_qconfig, default_embedding_qat_qconfig, + default_symmetric_qnnpack_qat_qconfig, get_default_qat_qconfig, FixedQParamsFakeQuantize, FusedMovingAvgObsFakeQuantize, @@ -39,6 +40,7 @@ ManualDropoutQATModel, ManualLinearDynamicQATModel, ManualConvLinearQATModel, + ManualConvLinearSymmQATModel, ManualEmbeddingBagLinear, TwoLayerLinearModel, test_only_eval_fn, @@ -51,6 +53,8 @@ override_qengines, ) +from torch.testing._internal.common_utils import skipIfNoXNNPACK + from hypothesis import given from hypothesis import strategies as st import torch.testing._internal.hypothesis_utils as hu @@ -340,11 +344,45 @@ def checkQuantized(model): model = quantize_qat(model, test_only_train_fn, [self.img_data_2d_train]) checkQuantized(model) + @skipIfNoXNNPACK + def test_conv_linear_symm(self): + r"""Same as test_conv_linear but with Symmetric quantization. 
+ Supported only with qengine=qnnpack, which uses symmetric + kernels from xnnpack library.""" + for qengine in supported_qengines: + if qengine != 'qnnpack': + continue + with override_quantized_engine(qengine): + model = ManualConvLinearSymmQATModel() + + model = prepare_qat(model) + self.checkObservers(model) + + test_only_train_fn(model, self.img_data_2d_train) + model = convert(model) + + def checkQuantized(model): + self.assertEqual(type(model.conv), nnq.Conv2d) + self.assertEqual(type(model.fc1), nnq.Linear) + self.assertEqual(type(model.fc2), nnq.Linear) + test_only_eval_fn(model, self.img_data_2d) + self.checkScriptable(model, self.img_data_2d) + self.checkNoQconfig(model) + + checkQuantized(model) + + model = ManualConvLinearSymmQATModel() + model = quantize_qat(model, test_only_train_fn, [self.img_data_2d_train]) + checkQuantized(model) + def test_dynamic_qat_linear(self): for qengine in supported_qengines: with override_quantized_engine(qengine): # Dynamic QAT without memoryless observers should fail - with self.assertRaisesRegex(ValueError, "Dynamic QAT requires a memoryless observer"): + with self.assertRaisesRegex(ValueError, + "Dynamic QAT requires a memoryless observer." + + "This means a MovingAverage observer with averaging constant equal to 1" + ): model = ManualLinearDynamicQATModel(default_qat_qconfig) model = prepare_qat(model, mapping={torch.nn.Linear: nnqatd.Linear}) @@ -1006,6 +1044,29 @@ def test_linear_bn_numerics(self): r2 = m(data) self.assertTrue(torch.allclose(r1, r2)) + @skipIfNoXNNPACK + @override_qengines + def test_linear_bn_symm_numerics(self): + qengine = torch.backends.quantized.engine + if qengine != "qnnpack": + return # Only qnnpack support symmetric quantization + m_ref = nn.Sequential( + nn.Linear(4, 4), + nn.BatchNorm1d(4), + ) + m_ref_copy = copy.deepcopy(m_ref) + m_ref_copy = torch.ao.quantization.fuse_modules_qat(m_ref_copy, [['0', '1']]) + qconfig = default_symmetric_qnnpack_qat_qconfig + m_ref_copy[0].qconfig = qconfig + m = nniqat.LinearBn1d.from_float(m_ref_copy[0]) + + # without fake_quants, fused QAT module should match fp32 module + m.apply(torch.quantization.disable_fake_quant) + data = torch.randn(4, 4) + r1 = m_ref(data) + r2 = m(data) + self.assertTrue(torch.allclose(r1, r2)) + @override_qengines def test_linear_bn_workflow(self): qengine = torch.backends.quantized.engine diff --git a/test/quantization/fx/test_numeric_suite_fx.py b/test/quantization/fx/test_numeric_suite_fx.py index fe1aad6c7771c3..37e737c94cfbf6 100644 --- a/test/quantization/fx/test_numeric_suite_fx.py +++ b/test/quantization/fx/test_numeric_suite_fx.py @@ -71,6 +71,8 @@ extract_shadow_logger_info, extend_logger_results_with_comparison, ) +from torch.ao.quantization.fx.backend_config import get_native_backend_config_dict +from torch.ao.quantization.fx.backend_config.utils import get_pattern_to_quantize_handlers # Note: these models are not for use outside of this file. 
While it's good @@ -274,7 +276,19 @@ def _wrapped_sigmoid(x): def _wrapped_linear(x, w, b): return F.linear(x, w, b) - +def get_all_quant_patterns(): + """ we are in the process to migrate the frontend of fx graph mode quant + to use backend_config_dict, so some of the patterns are moved to backend_config_dict + this function will include these patterns so that we can still have all the patterns + """ + # TODO: we can remove this call, and get all patterns from backend_config_dict in + # the future when the frontend refactor is done in fx graph mode quantization + all_quant_patterns = get_default_quant_patterns() + # some of the patterns are moved to (native) backend_config_dict so we need to + # add them back here + for pattern, quantize_handler in get_pattern_to_quantize_handlers(get_native_backend_config_dict()).items(): + all_quant_patterns[pattern] = quantize_handler + return all_quant_patterns class TestFXGraphMatcher(QuantizationTestCase): @@ -542,7 +556,6 @@ def forward(self, x): self.assert_types_for_matched_subgraph_pairs( results, expected_types, m1p, m2p) - def test_op_relationship_mapping(self): """ Tests that the mapping of op relationships is complete. @@ -620,7 +633,7 @@ def _op_is_unmatchable(op): op in METHS_UNMATCHABLE ) - default_quant_patterns = get_default_quant_patterns() + default_quant_patterns = get_all_quant_patterns() for pattern, qhandler_cls in default_quant_patterns.items(): base_op = None if isinstance(pattern, tuple): @@ -664,9 +677,6 @@ def _op_is_unmatchable(op): # RNNDynamicQuantizeHandler pass elif qhandler_cls == qp.DefaultNodeQuantizeHandler: - # torch.sum does not have quantized equivalents - if base_op == torch.sum: - continue self.assertTrue( _op_in_base_sets_of_related_ops(base_op), f"{base_op} not in sets of related ops") @@ -682,8 +692,23 @@ def _op_is_unmatchable(op): _op_in_base_sets_of_related_ops(base_op), f"{base_op} not in sets of related ops") else: - raise AssertionError( - f"handing for {qhandler_cls} not implemented") + # torch.sum does not have quantized equivalents + if base_op in [ + torch.sum, + nn.GRUCell, + nn.GRU, + nn.LSTMCell, + nn.RNNCell, + ]: + continue + if isinstance(base_op, tuple): + # skip fusion patterns + continue + # didn't match explicit quantize handler class, we can check if the + # operator is in the related op set directly + if not (_op_in_base_sets_of_related_ops(base_op) or _op_is_unmatchable(base_op)): + raise AssertionError( + f"handling for {qhandler_cls} for op {base_op} not implemented") @skipIfNoFBGEMM def test_user_defined_function(self): @@ -1534,7 +1559,7 @@ def test_op_io_dtype_coverage(self): # 4. go through the ops mapped to each QuantizeHandler type, and verify # correctness. 
- default_quant_patterns = get_default_quant_patterns() + default_quant_patterns = get_all_quant_patterns() for pattern, qhandler_cls in default_quant_patterns.items(): base_op = None if isinstance(pattern, tuple): @@ -1591,8 +1616,26 @@ def test_op_io_dtype_coverage(self): # embedding shadowing is not implemented, for now continue else: - raise AssertionError( - f"handing for {qhandler_cls} not implemented") + if ( + base_op in FUNS_UNMATCHABLE or + base_op in MODS_UNMATCHABLE or + base_op in METHS_UNMATCHABLE + ): + continue + if qhandler_cls(None, {}).is_general_tensor_value_op(): + self.assertTrue( + (base_op in FUNS_IO_TYPE_FP32_OR_INT8) or + (base_op in MODS_IO_TYPE_FP32_OR_INT8) or + (base_op in METHS_IO_TYPE_FP32_OR_INT8), + f"missing IO type handling for {base_op} using {qhandler_cls}") + else: + self.assertTrue( + (base_op in FUNS_IO_TYPE_FP32_OR_INT8) or + (base_op in MODS_IO_TYPE_FP32_OR_INT8) or + (base_op in METHS_IO_TYPE_FP32_OR_INT8) or + (base_op in FUNS_IO_TYPE_FP32) or + (base_op in MODS_IO_TYPE_FP32), + f"missing IO type handling for {base_op} using {qhandler_cls}") @skipIfNoFBGEMM def test_user_defined_function(self): diff --git a/test/quantization/fx/test_quantize_fx.py b/test/quantization/fx/test_quantize_fx.py index 484d53a146424b..56d2194bd0c7ca 100644 --- a/test/quantization/fx/test_quantize_fx.py +++ b/test/quantization/fx/test_quantize_fx.py @@ -80,6 +80,8 @@ get_default_output_activation_post_process_map ) +from torch.ao.quantization.fx.utils import NodeInfo + from torch.ao.quantization.fake_quantize import ( default_affine_fixed_qparams_fake_quant, default_symmetric_fixed_qparams_fake_quant, @@ -133,7 +135,9 @@ import operator import unittest import io -from typing import Callable, Optional +from typing import Callable, Optional, List + + TEST_WITH_ROCM = os.getenv('PYTORCH_TEST_WITH_ROCM', '0') == '1' @@ -596,6 +600,77 @@ def conv_bn_res_relu_extra_inputs_getter(pattern): if node.op == "call_module" and type(named_modules[node.target]) == torch.nn.Conv2d: self.assertTrue(len(node.args) == 2), "Expecting the fused op to have two arguments" + def test_fusion_pattern_with_matchallnode(self): + """This test checks that the node matched by MatchAllNode will be regarded as an input + instead of a module to be fused. For instance, we have two patterns: + (nn.ReLU, (torch.add, MatchAllNode, nn.Conv2d)) + (nn.ReLU, nn.Conv2d) + And we want to fuse the following model + Conv2d -> ReLU + + Conv2d ------ Add -> ReLU + ReLU in the first row is matched as MatchAllNode in the residual pattern. But it won't be + fused as part of that pattern. It needs to be properly fused with the upstream Conv2d. 
+ """ + + class M(torch.nn.Module): + def __init__(self): + super().__init__() + self.conv1 = torch.nn.Conv2d(3, 3, 3) + self.relu1 = torch.nn.ReLU() + self.conv2 = torch.nn.Conv2d(3, 3, 3) + self.relu2 = torch.nn.ReLU() + + def forward(self, x): + y = self.conv1(x) + y = self.relu1(y) + + x = self.conv2(x) + x = torch.add(x, y) + x = self.relu2(x) + return x + + m = M().eval() + + def fuse_conv_relu(is_qat, relu, conv): + return conv + + def fuse_conv_res_relu(is_qat, relu, add_pattern): + _, conv, _ = add_pattern + return conv + + def conv_res_relu_root_node_getter(pattern): + relu, (_, conv, _) = pattern + return conv + + def conv_res_relu_extra_inputs_getter(pattern): + relu, (_, _, extra_input) = pattern + return [extra_input] + + conv_relu_config = { + "pattern": (nn.ReLU, nn.Conv2d), + "fuser_method": fuse_conv_relu, + } + conv_res_relu_config = { + "pattern": (nn.ReLU, (torch.add, nn.Conv2d, MatchAllNode)), + "fuser_method": fuse_conv_res_relu, + "root_node_getter": conv_res_relu_root_node_getter, + "extra_inputs_getter": conv_res_relu_extra_inputs_getter, + } + + backend_config_dict = { + "configs": [ + conv_relu_config, + conv_res_relu_config, + ], + } + m = fuse_fx(m, backend_config_dict=backend_config_dict) + self.assertEqual(type(m.conv1), torch.nn.Conv2d) + self.assertEqual(type(m.conv2), torch.nn.Conv2d) + # check relu are gone since we replaced the both patterns to conv + self.assertFalse(hasattr(m, "relu1")) + self.assertFalse(hasattr(m, "relu2")) + + @skipIfNoFBGEMM class TestQuantizeFx(QuantizationTestCase): def test_pattern_match(self): @@ -947,7 +1022,7 @@ def forward(self, x): qconfig_dict = {'': qconfig} prepared = prepare_fx(m, qconfig_dict) quantized = convert_fx(prepared, is_reference=True) - qparams = (quantized._input_scale_0, quantized._input_zero_point_0) + qparams = (quantized._scale_0, quantized._zero_point_0) weight_obs = qconfig.weight() weight_obs(quantized.weight) # Get the actual value to avoid tensor size mismatch error, torch.Size([]) vs torch.Size([1]) @@ -955,6 +1030,8 @@ def forward(self, x): self.assertEqual(qparams, ref_qparams) def test_conv_bn_relu(self): + """ Tests fusion and quantization for "Conv - Bn" and "Conv - Bn - ReLU" + """ convs = { 1: nn.Conv1d, 2: nn.Conv2d, @@ -995,8 +1072,7 @@ def forward(self, x): x = self.dequant(x) return x - # TODO: add 1d support - options = itertools.product([2, 3], [True, False], self.static_quant_types) + options = itertools.product([1, 2, 3], [True, False], self.static_quant_types) for dim, has_relu, quant_type in options: expected_node = ns.call_module( quantized_conv_relus[dim] if has_relu @@ -1033,11 +1109,57 @@ def forward(self, x): fuse_modules(m_eager, fuse_list, inplace=True) m_eager.qconfig = qconfig m_eager = prepare_fn(m_eager) + prepared_fx = result_dict["prepared"] + m_eager(*self.img_data_dict[dim][0]) m_eager = convert(m_eager) result_eager = m_eager(*self.img_data_dict[dim][0]) self.assertEqual(result, result_eager) + def test_linear_bn(self): + class M(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = nn.Linear(4, 4) + self.bn = nn.BatchNorm1d(4) + self.quant = QuantStub() + self.dequant = DeQuantStub() + + def forward(self, x): + x = self.quant(x) + x = self.linear(x) + x = self.bn(x) + x = self.dequant(x) + return x + + data = (torch.randn(4, 4),) + for quant_type in self.static_quant_types: + expected_node = ns.call_module(nnq.Linear) + m = M() + m_eager = copy.deepcopy(m) + result_dict = self.checkGraphModeFxOp(m, data, quant_type, 
expected_node=expected_node) + result = result_dict["quantized_output"] + + # check numerics vs eager mode + fuse_list = ["linear", "bn"] + qengine = torch.backends.quantized.engine + if quant_type == QuantType.STATIC: + m_eager.eval() + qconfig = get_default_qconfig(qengine) + prepare_fn = prepare + fuse_modules(m_eager, fuse_list, inplace=True) + else: + m_eager.train() + qconfig = get_default_qat_qconfig(qengine) + prepare_fn = prepare_qat + fuse_modules_qat(m_eager, fuse_list, inplace=True) + m_eager.qconfig = qconfig + m_eager = prepare_fn(m_eager) + m_eager(*data) + m_eager = convert(m_eager) + result_eager = m_eager(*data) + self.assertEqual(result, result_eager) + @skipIfNoFBGEMM def test_dynamic_quant_fp16(self): class Linear(torch.nn.Module): @@ -2124,6 +2246,88 @@ def forward(self, x): ref_res = ref_m(data) self.assertEqual(res, ref_res) + @skipIfNoFBGEMM + def test_custom_module_class_input_has_multiple_users(self): + """ Tests that the flow still works when the input of custom module + has multiple users + """ + class CustomModule(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = torch.nn.Linear(3, 3) + + def forward(self, x): + return self.linear(x) + + class ObservedCustomModule(torch.nn.Module): + def __init__(self, linear): + super().__init__() + self.linear = linear + + def forward(self, x): + return self.linear(x) + + @classmethod + def from_float(cls, float_module): + assert hasattr(float_module, 'qconfig') + observed = cls(float_module.linear) + observed.qconfig = float_module.qconfig + return observed + + class StaticQuantCustomModule(torch.nn.Module): + def __init__(self, linear): + super().__init__() + self.linear = linear + + def forward(self, x): + return self.linear(x) + + @classmethod + def from_observed(cls, observed_module): + assert hasattr(observed_module, 'qconfig') + assert hasattr(observed_module, 'activation_post_process') + observed_module.linear.activation_post_process = \ + observed_module.activation_post_process + quantized = cls(nnq.Linear.from_float(observed_module.linear)) + return quantized + + class M(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = torch.nn.Linear(3, 3) + self.custom = CustomModule() + + def forward(self, x0): + x1 = self.custom(x0) + x2 = self.linear(x0) + return x1 + x2 + + prepare_custom_config_dict = { + "float_to_observed_custom_module_class": { + "static": { + CustomModule: ObservedCustomModule + } + } + } + convert_custom_config_dict = { + "observed_to_quantized_custom_module_class": { + "static": { + ObservedCustomModule: StaticQuantCustomModule + } + } + } + m = M().eval() + m = prepare_fx( + m, + {"": default_qconfig}, + prepare_custom_config_dict=prepare_custom_config_dict) + # make sure it works + m = convert_fx( + m, + convert_custom_config_dict=convert_custom_config_dict) + # make sure it runs + m(torch.randn(3, 3)) + @skipIfNoFBGEMM def test_non_traceable_module(self): class NonTraceable(torch.nn.Module): @@ -2425,12 +2629,13 @@ def forward(self, x): self.assertTrue( set(scripted_keys) == set(non_packed_weight_keys), "Expected the scripted model to preserve the state_dict for non-packed weight attributes") + # TODO: probably don't want to hardcode the attribute names, since they are generated for attr_name in [ "mods1_0_input_scale_0", "mods1_0_input_zero_point_0", - "mods1_0_scale_0", "mods1_0_zero_point_0", - "mods1_1_scale_0", "mods1_1_zero_point_0", - "mods2_scale_0", "mods2_zero_point_0"]: - self.assertTrue(hasattr(m, attr_name)) + "mods1_0_scale_1", 
"mods1_0_zero_point_1", + "mods1_1_scale_1", "mods1_1_zero_point_1", + "mods2_scale_1", "mods2_zero_point_1"]: + self.assertTrue(hasattr(m, attr_name), attr_name + " not found.") @skipIfNoFBGEMM def test_packed_weight_fused_op(self): @@ -2543,6 +2748,234 @@ def forward(self, x): mp(torch.rand(4, 4, 4, 4)) mc = convert_fx(mp) + class _NonReferenceTestModel(nn.Module): + def __init__(self, func, lin_in, lin_out): + super().__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.lin = nn.Linear(lin_in, lin_out) + self.func = func + + def forward(self, x, y, z): + x = self.pool(F.relu(self.conv1(x))) + x = torch.flatten(x, 1) + x = self.func(x, y, z) + x = self.lin(x) + return x + + # This function looks at the node specified by the NodeInfo in the key of + # node_info_to_non_tensor_args and checks that the args at specified indices + # are not observed (since they are non tensors). If the args at those indices + # are a tuple/list (which do not show up as nodes) the function checks the + # individual elements of the tuple/list recursively. + def _check_not_observed(self, model, node_info_to_non_tensor_args): + + # this is a helper function (for easier recursion) that checks whether + # arg_node is observed + def _check_node_not_observed(model, arg_node, node): + if isinstance(arg_node, tuple) or isinstance(arg_node, list): + for new_node in arg_node: + _check_node_not_observed(model, new_node, node) + elif arg_node.op == "call_module": + self.assertTrue( + not is_activation_post_process(getattr(model, arg_node.target)), + "Arg: {0} of node: {1} is observed but is not a float tensor".format( + arg_node, node + ), + ) + + for node in model.graph.nodes: + indices = node_info_to_non_tensor_args.get( + NodeInfo(node.op, node.target), [] + ) + for index in indices: + if index < len(node.args): + arg_node = node.args[index] + _check_node_not_observed(model, arg_node, node) + + # This test checks that the model gets prepared correct, doesn't have observers + # on specific ops (see _check_not_observed) and that the prepared model runs + def _test_dtype_propagation(self, model, node_info_to_non_tensor_args, *args): + model.eval() + qconfig_dict = {"": torch.ao.quantization.get_default_qconfig("fbgemm")} + prepared_model = prepare_fx(model, qconfig_dict) + self._check_not_observed(prepared_model, node_info_to_non_tensor_args) + prepared_model(*args) + + def test_masked_fill_nontensor_args_not_observed(self): + def func(x, y, z): + return x.masked_fill(y, z) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), torch.randn(1176) > 0, 0.1] + node_info_to_non_tensor_args = {NodeInfo("call_method", "masked_fill"): [1, 2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_permute_nontensor_args_not_observed(self): + def func(x, y, z): + return x.permute(y, z) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), 0, 1] + node_info_to_non_tensor_args = {NodeInfo("call_method", "permute"): [1, 2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_repeat_nontensor_args_not_observed(self): + def func(x, y, z): + return x.repeat(y, z) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), 2, 1] + node_info_to_non_tensor_args = {NodeInfo("call_method", "repeat"): [1, 2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_reshape_nontensor_args_not_observed(self): + 
def func(x, y, z): + return x.reshape(-1, y) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), 5, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_size_nontensor_args_not_observed(self): + def func(x, y, z): + return x.reshape((-1, x.size(y))) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), 0, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "size"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_transpose_nontensor_args_not_observed(self): + def func(x, y, z): + return x.transpose(y, z) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), 0, 1] + node_info_to_non_tensor_args = {NodeInfo("call_method", "transpose"): [1, 2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_torch_transpose_nontensor_args_not_observed(self): + # TODO: make torch.transpose traceable by fx when using + # variable nontensor arguments + # func = lambda x, y, z: torch.transpose(x, y, z) # error + def func(x, y, z): + return torch.transpose(x, 0, 1) + + model = self._NonReferenceTestModel(func, 5, 1) + node_info_to_non_tensor_args = { + NodeInfo("call_method", torch.transpose): [1, 2] + } + args = [torch.randn(5, 3, 32, 32), 0, 1] + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_unsqueeze_nontensor_args_not_observed(self): + def func(x, y, z): + return x.unsqueeze(y) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), 1, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "unsqueeze"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_unsqueeze__nontensor_args_not_observed(self): + def func(x, y, z): + return x.unsqueeze_(y) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), 1, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "unsqueeze_"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_torch_unsqueeze_nontensor_args_not_observed(self): + # TODO: make torch.unsqueeze scriptable by fx when using + # variable nontensor arguments + # func = lambda x, y, z: torch.unsqueeze(x, y) # error + def func(x, y, z): + return torch.unsqueeze(x, 1) + + model = self._NonReferenceTestModel(func, 1176, 1) + args = [torch.randn(5, 3, 32, 32), 1, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", torch.unsqueeze): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_view_nontensor_args_not_observed(self): + def func(x, y, z): + return x.view(-1, y) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), 5, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "view"): [2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_list_args(self): + def func(x, y, z): + return x.reshape(y) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), [-1, 5], None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_split_list_args(self): + def 
func(x, y, z): + return x.reshape([y, z]) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), -1, 5] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_tuple_args(self): + def func(x, y, z): + return x.reshape(y) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), (-1, 5), None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_split_tuple_args(self): + def func(x, y, z): + return x.reshape((y, z)) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), -1, 5] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_dict_args(self): + def func(x, y, z): + return x.transpose(y["first"], y["second"]) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), {"first": 0, "second": 1}, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "transpose"): [1, 2]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_dict_tuple_args(self): + class reshape_module(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, x, y, z): + return x.reshape(y["shape"]) + + model = self._NonReferenceTestModel(reshape_module(), 5, 1) + args = [torch.randn(5, 3, 32, 32), {"shape": (-1, 5)}, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "reshape"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + + def test_propagate_dtypes_for_known_nodes_dict_split_tuple_args(self): + def func(x, y, z): + return x.reshape((y["first"], y["second"])) + + model = self._NonReferenceTestModel(func, 5, 1) + args = [torch.randn(5, 3, 32, 32), {"first": -1, "second": 5}, None] + node_info_to_non_tensor_args = {NodeInfo("call_method", "transpose"): [1]} + self._test_dtype_propagation(model, node_info_to_non_tensor_args, *args) + def test_assert_on_size_after_quant_layer(self): """ Verifies that calculating a size of a quantized tensor works @@ -2817,11 +3250,12 @@ def forward(self, x): m = convert_fx(m) keys = m.state_dict().keys() m(torch.randn(5, 5)) + # TODO: probably don't want to hardcode the attribute names, since they are generated for attr_name in [ "mods1_0_input_scale_0", "mods1_0_input_zero_point_0", "mods1_0_scale_0", "mods1_0_zero_point_0", "mods1_1_scale_0", "mods1_1_zero_point_0"]: - self.assertTrue(hasattr(m, attr_name)) + self.assertTrue(hasattr(m, attr_name), attr_name + " not found.") def test_no_obs_between_unmatched_node_and_copy_node(self): """ @@ -3153,7 +3587,6 @@ def forward(self, x): def test_preserve_tuple(self): """ Test tuple input type is preserved """ - from typing import List class LSTM(nn.Module): def __init__(self): @@ -3231,23 +3664,101 @@ def forward(self, x): x = self.relu(x) return x - model = M().eval() - dynamic_quantized_ops = { float16_dynamic_qconfig: torch.ops.quantized.linear_relu_dynamic_fp16, default_dynamic_qconfig: torch.ops.quantized.linear_relu_dynamic } - for config in [float16_dynamic_qconfig, default_dynamic_qconfig]: - qconfig = { - "": config + for qconfig in 
[float16_dynamic_qconfig, default_dynamic_qconfig]: + model = M().eval() + qconfig_dict = { + "": qconfig } - m = prepare_fx(model, qconfig) + m = prepare_fx(model, qconfig_dict) m = convert_fx(m) m(torch.rand(5, 5)) node_list = [ ns.call_module(nniqd.LinearReLU), ns.call_module(nniqd.LinearReLU), - ns.call_function(dynamic_quantized_ops[config]), + ns.call_function(dynamic_quantized_ops[qconfig]), + ] + self.checkGraphModuleNodes(m, expected_node_list=node_list) + + @skipIfNoFBGEMM + def test_dynamic_with_fusion_multiple_uses(self): + """ + Tests that dynamic quantization APIs work with Linear + Relu fusion + """ + class LinearRelu(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = torch.nn.Linear(5, 5) + self.relu = torch.nn.ReLU() + + def forward(self, x): + x = self.linear(x) + return self.relu(x) + + class M(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear_relu = LinearRelu() + + def forward(self, x): + x = self.linear_relu(x) + x = self.linear_relu(x) + return x + + for qconfig in [float16_dynamic_qconfig, default_dynamic_qconfig]: + model = M().eval() + qconfig_dict = { + "": qconfig + } + m = prepare_fx(model, qconfig_dict) + m = convert_fx(m) + m(torch.rand(5, 5)) + node_list = [ + ns.call_module(nniqd.LinearReLU), + ns.call_module(nniqd.LinearReLU), + ] + self.checkGraphModuleNodes(m, expected_node_list=node_list) + + @skipIfNoFBGEMM + def test_dynamic_linear_input_multiple_use(self): + """ + Tests input for dynamic linear being used by multiple ops + """ + class LinearRelu(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = torch.nn.Linear(5, 5) + self.relu = torch.nn.ReLU() + + def forward(self, x): + x = self.linear(x) + return self.relu(x) + + class M(torch.nn.Module): + def __init__(self): + super().__init__() + self.mod1 = LinearRelu() + self.mod2 = LinearRelu() + + def forward(self, x): + y1 = self.mod1(x) + y2 = self.mod2(x) + return y1 + y2 + + for qconfig in [float16_dynamic_qconfig, default_dynamic_qconfig]: + model = M().eval() + qconfig_dict = { + "": qconfig + } + m = prepare_fx(model, qconfig_dict) + m = convert_fx(m) + m(torch.rand(5, 5, 5)) + node_list = [ + ns.call_module(nniqd.LinearReLU), + ns.call_module(nniqd.LinearReLU), ] self.checkGraphModuleNodes(m, expected_node_list=node_list) @@ -3499,6 +4010,7 @@ def forward(self, x): ns.call_function(torch.quantize_per_tensor): 1, ns.call_function(torch.ops.quantized.linear): 2, ns.call_function(torch.ops.quantized.add): 1, + ns.call_function(torch.mul): 1, ns.call_method("dequantize"): 1 } order_check = [ @@ -3507,6 +4019,7 @@ def forward(self, x): ns.call_function(torch.ops.quantized.linear), ns.call_function(torch.ops.quantized.add), ns.call_method("dequantize"), + ns.call_function(torch.mul), ns.call_module(nn.Linear), ] @@ -3520,19 +4033,6 @@ def forward(self, x): def _assertFixedQParamsFakeQuantizeEqual(self, fq1, fq2): self.assertEqual(fq1()._observer_ctr, fq2()._observer_ctr) - def test_fixed_qparams_patterns(self): - hard_sigmoid_keys = [torch.nn.Hardsigmoid, torch.nn.functional.hardsigmoid, "hardsigmoid", "hardsigmoid_"] - sigmoid_keys = [torch.nn.Sigmoid, torch.sigmoid, "sigmoid", "sigmoid_"] - tanh_keys = [torch.nn.Tanh, torch.tanh, "tanh", "tanh_"] - for k in hard_sigmoid_keys + sigmoid_keys: - self.assertEqual(DEFAULT_OUTPUT_OBSERVER_MAP[k], default_affine_fixed_qparams_observer) - self._assertFixedQParamsFakeQuantizeEqual(DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP[k], - default_affine_fixed_qparams_fake_quant) - for k in tanh_keys: - 
self.assertEqual(DEFAULT_OUTPUT_OBSERVER_MAP[k], default_symmetric_fixed_qparams_observer) - self._assertFixedQParamsFakeQuantizeEqual(DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP[k], - default_symmetric_fixed_qparams_fake_quant) - def test_register_patterns(self): @register_fusion_pattern("dummy_fusion") class DummyFusion(): @@ -3560,10 +4060,13 @@ class DummyQuant3(): default_affine_fixed_qparams_fake_quant) self._assertFixedQParamsFakeQuantizeEqual(DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP["dummy_quant3"], default_symmetric_fixed_qparams_fake_quant) - self.assertTrue(get_default_output_activation_post_process_map(is_training=True) is - DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP) - self.assertTrue(get_default_output_activation_post_process_map(is_training=False) is - DEFAULT_OUTPUT_OBSERVER_MAP) + output_fake_quantize_map = get_default_output_activation_post_process_map(is_training=True) + output_observer_map = get_default_output_activation_post_process_map(is_training=False) + self.assertEqual(output_observer_map.get("dummy_quant3"), default_symmetric_fixed_qparams_observer) + self._assertFixedQParamsFakeQuantizeEqual(output_fake_quantize_map.get("dummy_quant3"), + default_symmetric_fixed_qparams_fake_quant) + + def test_reuse_input_qconfig(self): class M1(torch.nn.Module): @@ -3652,22 +4155,63 @@ def forward(self, x): break self.assertTrue(found_stack_trace, f"stack trace not found, node: {n.format_node()}, is_reference: False") - def test_stack_trace_preserved_subgraph_rewriter(self): - # a functional relu is taking the subgraph rewriter code path + def test_qat_skip_untraced(self): + class UnTraceableModuleClass(nn.Module): + def __init__(self): + super().__init__() + self.linear = nn.Linear(2, 2) + + def forward(self, x): + return self.linear(x) + + class UnTraceableModuleName(nn.Module): + def __init__(self): + super().__init__() + self.linear = nn.Linear(2, 2) + + def forward(self, x): + return self.linear(x) + class M(nn.Module): + def __init__(self): + super().__init__() + self.untraceable_module_class = UnTraceableModuleClass() + self.untraceable_module_name = UnTraceableModuleClass() + def forward(self, x): - x = F.relu(x) + x = self.untraceable_module_class(x) + x = self.untraceable_module_name(x) return x - m = M().eval() - mp = prepare_fx(m, get_default_qconfig_dict()) - mq = convert_fx(copy.deepcopy(mp), is_reference=False) - found_stack_trace = False - for n in mq.graph.nodes: - if n.op == 'call_function' and n.target == F.relu: - found_stack_trace = n.stack_trace is not None - break - self.assertTrue(found_stack_trace, f"stack trace not found, node: {n.format_node()}, is_reference: True") + mod = M() + + qconfig_dict = {"": torch.quantization.get_default_qat_qconfig()} + prepare_custom_config_dict = { + "non_traceable_module_class": [UnTraceableModuleClass], + "non_traceable_module_name": ["untraceable_module_name"], + } + mod_prep = torch.ao.quantization.quantize_fx.prepare_qat_fx( + mod.train(), qconfig_dict, prepare_custom_config_dict + ) + mod_prep = torch.ao.quantization.quantize_fx.prepare_qat_fx( + mod.train(), qconfig_dict, prepare_custom_config_dict + ) + self.assertTrue( + isinstance(mod_prep.untraceable_module_class.linear, torch.nn.Linear) + ) + self.assertTrue( + isinstance(mod_prep.untraceable_module_name.linear, torch.nn.Linear) + ) + self.assertTrue( + type(mod_prep.untraceable_module_class.linear) + is not torch.nn.qat.modules.linear.Linear, + "prepare_qat_fx shold not convert anything inside untraced module classes", + ) + self.assertTrue( + 
type(mod_prep.untraceable_module_name.linear) + is not torch.nn.qat.modules.linear.Linear, + "prepare_qat_fx shold not convert anything inside modules named in untraced_module_names", + ) def test_qconfig_dict_setup(self): class M(torch.nn.Module): @@ -3710,6 +4254,28 @@ def forward(self, x): self.assertEqual(mod.quant_min, 0) self.assertEqual(mod.quant_max, 255) + def test_prepare_mode(self): + class LinearModel(torch.nn.Module): + def __init__(self): + super().__init__() + self.linear = torch.nn.Linear(5, 10) + + def forward(self, x): + return self.linear(x) + + def _test(prepare_fn, qconfig_dict): + m = LinearModel() + m1 = copy.deepcopy(m) + m1.train() + prepare_fn(m1, qconfig_dict) + m2 = copy.deepcopy(m) + m2.eval() + prepare_fn(m2, qconfig_dict) + + # Ensure prepare_fx and prepare_qat_fx work in both training and eval modes + _test(prepare_fx, get_default_qconfig_dict()) + _test(prepare_qat_fx, get_default_qat_qconfig_dict()) + @skipIfNoFBGEMM class TestQuantizeFxOps(QuantizationTestCase): def setUp(self): @@ -3750,41 +4316,64 @@ def setUp(self): """ @skipIfNoFBGEMM def test_linear_module(self): - class ModuleLinear(torch.nn.Module): - def __init__(self, has_relu=False, f_relu=False): - super(ModuleLinear, self).__init__() + class LinearModel(torch.nn.Module): + def __init__(self): + super(LinearModel, self).__init__() self.linear = torch.nn.Linear(30, 4).float() - if has_relu: - if f_relu: - self.relu = F.relu - else: - self.relu = torch.nn.ReLU() + + def forward(self, x): + return self.linear(x) + + class LinearReLUModel(torch.nn.Module): + def __init__(self, f_relu=False): + super(LinearReLUModel, self).__init__() + self.linear = torch.nn.Linear(30, 4).float() + if f_relu: + self.relu = F.relu else: - self.relu = torch.nn.Identity() + self.relu = torch.nn.ReLU() def forward(self, x): - return self.relu(self.linear(x)) + x = self.linear(x) + x = self.relu(x) + return x + + class LinearBnModel(torch.nn.Module): + def __init__(self): + super(LinearBnModel, self).__init__() + self.linear = torch.nn.Linear(4, 4).float() + self.bn = torch.nn.BatchNorm1d(4) + + def forward(self, x): + x = self.linear(x) + x = self.bn(x) + return x + # Test linear data = (torch.rand((1, 30), dtype=torch.float),) - options = itertools.product( - [ModuleLinear(has_relu=False)], - self.all_quant_types) - quantized_nodes = { - # quant_type: - QuantType.DYNAMIC: ns.call_module(nnqd.Linear), - QuantType.STATIC: ns.call_module(nnq.Linear), - # note that we are checking the final result - QuantType.QAT: ns.call_module(nnq.Linear), - } - for model, quant_type in options: - self.checkGraphModeFxOp( - model, data, quant_type, quantized_nodes[quant_type]) + for quant_type in self.all_quant_types: + model = LinearModel() + quantized_module = nnqd.Linear if quant_type == QuantType.DYNAMIC else nnq.Linear + quantized_node = ns.call_module(quantized_module) + result_dict = self.checkGraphModeFxOp(model, data, quant_type, quantized_node) + if quant_type in self.static_quant_types: + self.assertEqual(result_dict["quantized_output"], result_dict["quantized_reference_output"]) + # TODO: enable test for dynamic quant + # Test linear-relu for f_relu, quant_type in itertools.product([True, False], [QuantType.STATIC, QuantType.QAT]): - for model, quantized_node in [ - (ModuleLinear(has_relu=True, f_relu=f_relu), ns.call_module(nniq.LinearReLU))]: - result_dict = self.checkGraphModeFxOp(model, data, quant_type, quantized_node) - self.assertEqual(result_dict["quantized_output"], result_dict["quantized_reference_output"]) + model 
= LinearReLUModel(f_relu) + quantized_node = ns.call_module(nniq.LinearReLU) + result_dict = self.checkGraphModeFxOp(model, data, quant_type, quantized_node) + self.assertEqual(result_dict["quantized_output"], result_dict["quantized_reference_output"]) + + # Test linear-bn + data = (torch.rand((4, 4), dtype=torch.float),) + for quant_type in self.static_quant_types: + model = LinearBnModel() + quantized_node = ns.call_module(nnq.Linear) + result_dict = self.checkGraphModeFxOp(model, data, quant_type, quantized_node) + self.assertEqual(result_dict["quantized_output"], result_dict["quantized_reference_output"]) @skipIfNoFBGEMM def test_functional_linear(self): @@ -3853,10 +4442,18 @@ def forward(self, x): else: qlinear_fun = quant_type_to_qlinear_fun[quant_type] + if quant_type != QuantType.DYNAMIC: + num_dequantize = 1 + else: + # we will have an extra quantize_per_tensor_dynamic + dequantize for + # nn.Identity right now, but it will be fixed after we use + # backend_config_dict to configure the default pt backend + num_dequantize = int(not has_relu) + convert_node_occurrence = { ns.call_function(torch.quantize_per_tensor): 1 if quant_type != QuantType.DYNAMIC else 0, qlinear_fun: 1, - ns.call_method("dequantize"): 1 if quant_type != QuantType.DYNAMIC else 0 + ns.call_method("dequantize"): num_dequantize, } prepare_expected_node_occurrence = \ quant_type_to_prepare_expected_node_occurrence[quant_type] @@ -3909,8 +4506,11 @@ def forward(self, x): else: qlinear_fun = ns.call_function(torch.ops.quantized.linear_dynamic_fp16) prepare_node_occurrence = { - # weight - ns.call_module(torch.ao.quantization.PlaceholderObserver): 1 + # activation and weight + # TODO: this is temporary behavior, should be fixed after we use + # backend_config_dict to configure default pt quantization behavior + # activation for nn.Identity (not has_relu) + ns.call_module(torch.ao.quantization.PlaceholderObserver): 2 + int(not has_relu) } convert_node_occurrence = { qlinear_fun: 1, @@ -4107,10 +4707,14 @@ def forward(self, x): } prepare_expected_node_occurrence = \ quant_type_to_prepare_expected_node_occurrence[quant_type] - self.checkGraphModeFxOp( + result_dict = self.checkGraphModeFxOp( model, data, quant_type, qconv_fun, prepare_expected_node_occurrence=prepare_expected_node_occurrence, expected_node_occurrence=convert_node_occurrence) + if quant_type != QuantType.DYNAMIC: + self.assertEqual(result_dict["quantized_output"], result_dict["quantized_reference_output"]) + # Ensure packed weights in lowered models are folded + self.assertIn("_packed_weight_0", result_dict["quantized"].state_dict().keys()) @skipIfNoFBGEMM def test_quantized_conv_relu(self): @@ -4260,10 +4864,12 @@ def test_add(self): self._test_binary_op_float16_impl( operator.add, operator.iadd) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_sub(self): self._test_binary_op_float16_impl(operator.sub, operator.isub) self._test_binary_op_float16_impl(torch.sub, None) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_div(self): self._test_binary_op_float16_impl(operator.truediv, operator.itruediv) self._test_binary_op_float16_impl(torch.div, None) @@ -4274,6 +4880,7 @@ def test_mul(self): operator.mul, operator.imul, torch.ops.quantized.mul) self._test_binary_op_float16_impl(operator.mul, operator.imul) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_sum(self): class Sum(torch.nn.Module): def forward(self, x): @@ 
-4297,6 +4904,7 @@ def forward(self, x): expected_node_occurrence=node_occurrence, custom_qconfig_dict=custom_qconfig_dict) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_bmm(self): class BMMMethod(torch.nn.Module): def __init__(self): @@ -4403,7 +5011,7 @@ def forward(self, x): m = M() expected_node_occurrence = { - ns.call_module(torch.ao.quantization.FusedMovingAvgObsFakeQuantize): 6, + ns.call_module(torch.ao.quantization.FusedMovingAvgObsFakeQuantize): 5, } self._test_quantized_add_mul_qat(m, expected_node_occurrence) @@ -4419,14 +5027,13 @@ def forward(self, x): x = torch.mul(x, 1.0) x = self.conv1(x) x = torch.mul(x, 1.0) - # TODO: add support for add + torch.relu? x = torch.relu(x) x = self.conv2(x) return x m = M() expected_node_occurrence = { - ns.call_module(torch.ao.quantization.FusedMovingAvgObsFakeQuantize): 6, + ns.call_module(torch.ao.quantization.FusedMovingAvgObsFakeQuantize): 5, } self._test_quantized_add_mul_qat(m, expected_node_occurrence) @@ -4846,6 +5453,7 @@ def test_softmax_normal(self): self._test_default_node_quant_handler_ops( module, functional, qconfig, is_reference, node_list) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_gelu_reference(self): module = torch.nn.GELU functional = torch.nn.functional.gelu @@ -4861,6 +5469,7 @@ def test_gelu_reference(self): ns.call_function(torch.quantize_per_tensor), ns.call_method('dequantize') ] + # TODO: change these to use backend_config_dict additional_patterns = {torch.nn.GELU: DefaultNodeQuantizeHandler, torch.nn.functional.gelu: DefaultNodeQuantizeHandler} self._test_default_node_quant_handler_ops( @@ -4869,6 +5478,7 @@ def test_gelu_reference(self): self._test_default_node_quant_handler_ops(module, functional, self.custom_qconfig, is_reference, node_list, additional_quant_pattern_dict=self.common_quant_patterns) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_softmax_reference(self): module = torch.nn.Softmax functional = torch.nn.functional.softmax @@ -4892,6 +5502,7 @@ def test_softmax_reference(self): self._test_default_node_quant_handler_ops(module, functional, self.custom_qconfig, is_reference, node_list, additional_quant_pattern_dict=self.common_quant_patterns) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_silu_reference(self): module = torch.nn.SiLU functional = torch.nn.functional.silu @@ -4923,6 +5534,7 @@ def test_silu_reference(self): self._test_default_node_quant_handler_ops(module, functional, self.custom_qconfig, is_reference, node_list, additional_quant_pattern_dict=self.common_quant_patterns) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_mish_reference(self): module = torch.nn.Mish functional = torch.nn.functional.mish @@ -5324,7 +5936,8 @@ def forward(self, x): m = M().eval() m = prepare_fx(m, {"": default_reuse_input_qconfig}) m = convert_fx(m) - print(m) + # make sure it runs + m(torch.rand(1)) def test_getitem(self): """ Make sure we only insert observer for getitem if the following node is matched @@ -5398,7 +6011,6 @@ def forward(self, x): x = self.sigmoid(x) x = torch.sigmoid(x) x = x.sigmoid() - x.sigmoid_() x = self.hardsigmoid(x) x = F.hardsigmoid(x) x = F.hardsigmoid(x, inplace=True) @@ -5406,7 +6018,6 @@ def forward(self, x): # F.tanh is deprecated x = torch.tanh(x) x = x.tanh() - x.tanh_() return x for eval_mode in [True, False]: @@ -5417,12 +6028,12 
@@ def forward(self, x): m.eval() qconfig = default_qconfig prepare = prepare_fx - fq_count = 11 + fq_count = 9 else: m.train() qconfig = default_qat_qconfig prepare = prepare_qat_fx - fq_count = 11 + fq_count = 9 # nothing to fuse so skipping the fuse step m_copy = copy.deepcopy(m) @@ -5465,7 +6076,7 @@ def forward(self, x): expected_node_list=order_check) reference_count_check = { - ns.call_function(torch.quantize_per_tensor) : 13, + ns.call_function(torch.quantize_per_tensor) : 11, ns.call_method('dequantize') : 11 } reference_order_check = [ @@ -5879,6 +6490,7 @@ def forward(self, x): m, expected_node_occurrence=expected_occurrence) + @unittest.skip("This is no longer needed right now, can enable later with new api") def test_qmatmul(self): class M(torch.nn.Module): def forward(self, x, y): @@ -6277,15 +6889,7 @@ def forward(self, input: torch.Tensor, offsets: Optional[torch.Tensor] = None, model = EmbeddingBagLinear().train() prepared_fx_model = prepare_qat_fx(model, qconfig_dict) test_only_train_fn(prepared_fx_model, train_indices) - convert_custom_config_dict = { - "additional_object_mapping": { - "static": { - torch.nn.qat.EmbeddingBag: nn.quantized.EmbeddingBag, - } - } - } quant_model = convert_fx(prepared_fx_model, - convert_custom_config_dict=convert_custom_config_dict, qconfig_dict=qconfig_dict) def checkQuantized(model): diff --git a/test/quantization/serialized/TestSerialization.test_linear_relu_package_quantization_transforms.get_attr_targets.pt b/test/quantization/serialized/TestSerialization.test_linear_relu_package_quantization_transforms.get_attr_targets.pt index bb34a57f962a4d..6887e8c614a52d 100644 Binary files a/test/quantization/serialized/TestSerialization.test_linear_relu_package_quantization_transforms.get_attr_targets.pt and b/test/quantization/serialized/TestSerialization.test_linear_relu_package_quantization_transforms.get_attr_targets.pt differ diff --git a/test/run_test.py b/test/run_test.py index 5b5ce3b8318a26..1b73f3bde01068 100644 --- a/test/run_test.py +++ b/test/run_test.py @@ -104,7 +104,6 @@ def skip_test_p(name: str) -> bool: 'test_kernel_launch_checks', 'test_metal', 'test_nnapi', - 'test_functionalization', 'test_segment_reductions', 'test_static_runtime', 'test_throughput_benchmark', @@ -133,6 +132,7 @@ def skip_test_p(name: str) -> bool: "distributed/elastic/utils/util_test", "distributed/elastic/utils/distributed_test", "distributed/elastic/multiprocessing/api_test", + "test_deploy", ] ) @@ -168,6 +168,7 @@ def skip_test_p(name: str) -> bool: "test_typing", "distributed/elastic/events/lib_test", "distributed/elastic/agent/server/test/api_test", + "test_deploy", ] WINDOWS_BLOCKLIST = [ @@ -210,7 +211,9 @@ def skip_test_p(name: str) -> bool: "distributed/_shard/sharded_tensor/ops/test_binary_cmp", "distributed/_shard/sharded_tensor/ops/test_init", "distributed/_shard/sharded_tensor/ops/test_linear", + "distributed/_shard/sharding_spec/test_sharding_spec", "distributed/_shard/sharded_optim/test_sharded_optim", + "distributed/_shard/test_replicated_tensor", ] + FSDP_TEST ROCM_BLOCKLIST = [ @@ -228,9 +231,10 @@ def skip_test_p(name: str) -> bool: "distributed/_shard/sharded_tensor/ops/test_binary_cmp", "distributed/_shard/sharded_tensor/ops/test_init", "distributed/_shard/sharded_tensor/ops/test_linear", + "distributed/_shard/sharding_spec/test_sharding_spec", "distributed/_shard/sharded_optim/test_sharded_optim", + "distributed/_shard/test_replicated_tensor", "test_determination", - "test_multiprocessing", "test_jit_legacy", "test_type_hints", 
"test_openmp", @@ -257,6 +261,8 @@ def skip_test_p(name: str) -> bool: "test_modules", "test_nn", "test_ops", + "test_ops_gradients", + "test_ops_jit", "test_torch" ] @@ -306,7 +312,6 @@ def skip_test_p(name: str) -> bool: ) JIT_EXECUTOR_TESTS = [ - "test_jit_cuda_fuser", "test_jit_profiling", "test_jit_legacy", "test_jit_fuser_legacy", @@ -867,6 +872,10 @@ def get_selected_tests(options): if options.exclude_distributed_tests: options.exclude.extend(DISTRIBUTED_TESTS) + # these tests failing in CUDA 11.6 temporary disabling. issue https://github.com/pytorch/pytorch/issues/75375 + if torch.version.cuda is not None and LooseVersion(torch.version.cuda) == "11.6": + options.exclude.extend(["distributions/test_constraints"]) + selected_tests = exclude_tests(options.exclude, selected_tests) if sys.platform == "win32" and not options.ignore_win_blocklist: diff --git a/test/test_ao_sparsity.py b/test/test_ao_sparsity.py index 32b95973928e31..6b5c8574c2e679 100644 --- a/test/test_ao_sparsity.py +++ b/test/test_ao_sparsity.py @@ -20,5 +20,8 @@ # Scheduler from ao.sparsity.test_scheduler import TestScheduler # noqa: F401 +# Composability +from ao.sparsity.test_composability import TestComposability # noqa: F401 + if __name__ == '__main__': run_tests() diff --git a/test/test_autograd.py b/test/test_autograd.py index 1d4ef2ce38424f..408c71af075a6f 100644 --- a/test/test_autograd.py +++ b/test/test_autograd.py @@ -14,6 +14,7 @@ import uuid import warnings import operator +import subprocess from copy import deepcopy from collections import OrderedDict from itertools import product @@ -26,7 +27,6 @@ from torch.autograd.function import once_differentiable from torch.autograd.profiler import (profile, record_function, emit_nvtx) from torch.autograd.profiler_util import (_format_time, EventList, FunctionEvent, FunctionEventAvg) -import torch.autograd.functional as autogradF from torch.utils.checkpoint import checkpoint from torch.testing import make_tensor from torch.testing._internal.common_cuda import TEST_CUDA @@ -40,7 +40,7 @@ from torch.testing._internal.common_device_type import (instantiate_device_type_tests, skipCUDAIfRocm, onlyCPU, onlyCUDA, dtypes, dtypesIfCUDA, deviceCountAtLeast, skipMeta) -from torch.testing._internal.common_dtype import get_all_dtypes +from torch.testing._internal.common_dtype import floating_types_and from torch.testing._internal.logging_tensor import no_dispatch import pickle @@ -389,8 +389,8 @@ def test_not_implemented_fwad(self): hint_msg = "Running forward AD for an OP that does not implement it should raise a NotImplementedError" with self.assertRaisesRegex(NotImplementedError, err_msg, msg=hint_msg): - # if forward AD ends up being implemented for torch.atan2, choose a different op - torch.atan2(dual_x, dual_x) + # if forward AD ends up being implemented for torch.igamma, choose a different op + torch.igamma(dual_x, dual_x) def test_accumulate_grad(self): grad_output = torch.ones(5, 5) @@ -2820,7 +2820,7 @@ def test_profiler(self): for evt in p.function_events: if evt.name in names: found_indices.add(names.index(evt.name)) - self.assertEquals(len(found_indices), len(names)) + self.assertEqual(len(found_indices), len(names)) def test_profiler_seq_nr(self): with profile(use_kineto=kineto_available()) as p: @@ -2931,6 +2931,21 @@ def test_record_function_callbacks(self): foo_event = [event for event in function_events if "foo" in event.name][0] self.assertEqual(foo_event.count, 1) + def test_record_function_legacy(self): + # Test the new _record_function ops work + # Note: 
Remove once record_function uses these directly + x = torch.randn(10, 10) + with profile(use_kineto=kineto_available()) as p: + handle = torch.ops.profiler._record_function_enter("bar", None) + try: + y = x * 2 + 4 + finally: + torch.ops.profiler._record_function_exit(handle) + + function_events = p.function_events + foo_event = [event for event in function_events if "bar" in event.name][0] + self.assertEqual(foo_event.count, 1) + def test_profiler_aggregation_fake(self): events = EventList() id = [0] @@ -4815,7 +4830,10 @@ def test_grad_fn_attr_bindings(self): self.assertIsInstance(out.grad_fn._saved_output_size[0], int) self.assertEqual(out.grad_fn._saved_align_corners, False) # bool -> bool self.assertIsInstance(out.grad_fn._saved_align_corners, bool) - self.assertIsNone(out.grad_fn._saved_scale_factors) # c10::optional> -> float[]? + if hasattr(out.grad_fn, '_saved_scale_factors'): + self.assertIsNone(out.grad_fn._saved_scale_factors) # c10::optional> -> float[]? + else: + self.assertIsNone(out.grad_fn._saved_scales) # c10::optional> -> float[]? out = torch.nn.functional.interpolate(a, scale_factor=0.5, mode="linear") self.assertIsNone(out.grad_fn._saved_output_size) @@ -6340,1361 +6358,76 @@ def f(x): memory_with_hooks = torch.cuda.memory_allocated() self.assertEqual(memory_with_hooks, memory_without_grad) + def test_pynode_destruction_deadlock(self): + script = """ +import torch -def index_perm_variable(shape, max_indices): - if not isinstance(shape, tuple): - shape = (shape,) - - index = torch.randperm(max_indices).narrow(0, 0, reduce(mul, shape)).view(shape) - return index - -def bernoulli_scalar(): - return torch.tensor(0, dtype=torch.uint8).bernoulli_() - - -class TestAutogradFunctional(TestCase): - def _assert_same_struct(self, res, base): - # base and res should be Tensors or tuple of Tensors with the same size - if isinstance(base, torch.Tensor): - self.assertTrue(isinstance(res, torch.Tensor)) - self.assertEqual(base.size(), res.size()) - elif isinstance(base, tuple): - self.assertTrue(isinstance(res, tuple)) - self.assertEqual(len(base), len(res)) - for el_base, el_res in zip(base, res): - self.assertTrue(isinstance(el_base, torch.Tensor)) - self.assertTrue(isinstance(el_res, torch.Tensor)) - self.assertEqual(el_base.size(), el_res.size()) - else: - # Wrong base - raise RuntimeError("The base given to `_assert_same_struct` doesn't have" - " the right structure.") - - def _assert_interleaved_struct(self, res, base1, base2): - # base1 and base2 can be Tensors or tuples of Tensors. - # If they are tuples, res should be a tuple as well. 
- # The indexing works as follows for base1, base2 being - # - tuple, tuple: res[i][j][k][l] = (base1[i][k], base2[j][l]) - # - tuple, Tensor: res[i][k][l] = (base1[i][k], base2[l]) - # - Tensor, tuple: res[i][j][l] = (base1[i], base2[j][l]) - # - Tensor, Tensor: res[k][l] = (base1[k], base2[l]) - if isinstance(base1, torch.Tensor) and isinstance(base2, torch.Tensor): - self.assertTrue(isinstance(res, torch.Tensor)) - self.assertEqual(res.size(), base1.size() + base2.size()) - elif isinstance(base1, tuple) and isinstance(base2, torch.Tensor): - self.assertTrue(isinstance(res, tuple)) - self.assertEqual(len(res), len(base1)) - for el_res, el_base1 in zip(res, base1): - self.assertTrue(isinstance(el_res, torch.Tensor)) - self.assertTrue(isinstance(el_base1, torch.Tensor)) - self.assertEqual(el_res.size(), el_base1.size() + base2.size()) - elif isinstance(base1, torch.Tensor) and isinstance(base2, tuple): - self.assertTrue(isinstance(res, tuple)) - self.assertEqual(len(res), len(base2)) - for el_res, el_base2 in zip(res, base2): - self.assertTrue(isinstance(el_res, torch.Tensor)) - self.assertTrue(isinstance(el_base2, torch.Tensor)) - self.assertEqual(el_res.size(), base1.size() + el_base2.size()) - elif isinstance(base1, tuple) and isinstance(base2, tuple): - self.assertTrue(isinstance(res, tuple)) - self.assertEqual(len(res), len(base1)) - for el_res, el_base1 in zip(res, base1): - self.assertTrue(isinstance(el_res, tuple)) - self.assertEqual(len(res), len(base2)) - for el_el_res, el_base2 in zip(el_res, base2): - self.assertTrue(isinstance(el_el_res, torch.Tensor)) - self.assertTrue(isinstance(el_base2, torch.Tensor)) - self.assertEqual(el_el_res.size(), el_base1.size() + el_base2.size()) - else: - # Wrong bases - raise RuntimeError("The bases given to `_assert_interleaved_struct` don't have" - " the right structure.") - - def test_vjp_err_check(self): - def foo(a): - return 3 * a.narrow(0, 0, 3) - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - inp = torch.rand(4) - v = torch.ones(3) - with self.assertRaisesRegex(TypeError, "The inputs given to vjp must be either a Tensor"): - res = autogradF.vjp(foo, (inp, 2), v) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to vjp must"): - res = autogradF.vjp(bar, inp, v) - - with self.assertRaisesRegex(RuntimeError, "The vector v can only be None if the user-provided function returns"): - res = autogradF.vjp(foo, inp) - - with self.assertRaisesRegex(RuntimeError, "The given v should contain a single Tensor."): - res = autogradF.vjp(foo, inp, (torch.ones_like(inp), torch.ones_like(inp))) - - with self.assertRaisesRegex(RuntimeError, "v has invalid size: should be torch.Size"): - res = autogradF.vjp(foo, inp, v[:2]) - - res = autogradF.vjp(foo, inp, v)[1] - self._assert_same_struct(res, inp) - - def test_vjp_err_check_strict(self): - def foo(a): - return a.detach() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone() - - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.vjp(foo, inp, v, strict=True) - res = autogradF.vjp(foo, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) 
- - with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): - res = autogradF.vjp(bar, inp, v, strict=True) - res = autogradF.vjp(bar, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) - - # The Jacobian does not depend on the input - def foo(a): - return a.clone() - - inp.requires_grad_() - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): - res = autogradF.vjp(foo, inp, v, create_graph=True, strict=True) - res = autogradF.vjp(foo, inp, v, create_graph=True, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1], v) - - def test_vjp_no_grad(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(4, 4) - v = torch.ones(4) - with torch.no_grad(): - res = autogradF.vjp(reducer, inputs, v) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - inputs.requires_grad_() - v.requires_grad_() - with torch.no_grad(): - res = autogradF.vjp(reducer, inputs, v, create_graph=True) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - def test_vjp_output(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(4, 4) - v = torch.ones(4) - res = autogradF.vjp(reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - - def adder(x, y): - return 2 * x + 3 * y - - inputs = (torch.rand(2), torch.rand(2)) - v = torch.ones(2) - out, vjp_val = autogradF.vjp(adder, inputs, v) - self._assert_same_struct(vjp_val, inputs) - self.assertIsNone(out.grad_fn) - self.assertIsNone(vjp_val[0].grad_fn) - self.assertIsNone(vjp_val[1].grad_fn) - - def adder(x, y): - return 2 * x + 3 * y, x + y - - inputs = (torch.rand(2), torch.rand(2)) - v = (torch.tensor([1., 0.]), torch.tensor([1., 0.])) - out, vjp_val = autogradF.vjp(adder, inputs, v) - self._assert_same_struct(vjp_val, inputs) - self.assertIsNone(out[0].grad_fn) - self.assertIsNone(out[1].grad_fn) - self.assertIsNone(vjp_val[0].grad_fn) - self.assertIsNone(vjp_val[1].grad_fn) - - def test_vjp_scalar(self): - def reducer(x): - return x.sum() - inputs = torch.rand(4, 4) - v = torch.ones([]) - res = autogradF.vjp(reducer, inputs, v) - self._assert_same_struct(res[0], v) - self._assert_same_struct(res[1], inputs) - - res = autogradF.vjp(reducer, inputs) - self._assert_same_struct(res[0], v) - self._assert_same_struct(res[1], inputs) - - def expander(x): - return x.unsqueeze(0).repeat(4) - inputs = torch.rand([]) - v = torch.ones(4) - res = autogradF.vjp(expander, inputs, v) - self._assert_same_struct(res[0], v) - self._assert_same_struct(res[1], inputs) - - def test_vjp_create_graph(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(2, 2, dtype=torch.double) - v = torch.ones(2, dtype=torch.double) - - inputs.requires_grad_() - v.requires_grad_() - res = autogradF.vjp(reducer, inputs, v, create_graph=True) - self._assert_same_struct(res[1], inputs) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - - gradcheck(lambda inp, v: autogradF.vjp(reducer, inputs, v, create_graph=True), (inputs, v)) - gradgradcheck(lambda inp, v: autogradF.vjp(reducer, inputs, v, create_graph=True), (inputs, v)) - - def adder(x, y): - return 2 * x + 3 * y, x * y - - inputs = (torch.rand(2, dtype=torch.double, 
requires_grad=True), - torch.rand(2, dtype=torch.double, requires_grad=True)) - v = (torch.tensor([1., 0.], dtype=torch.double, requires_grad=True), - torch.tensor([1., 0.], dtype=torch.double, requires_grad=True)) - - gradcheck(lambda *args: autogradF.vjp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) - gradgradcheck(lambda *args: autogradF.vjp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) - - def foo(*args): - x, y = args[:2] - v = args[2:] - - x = x.cos() - val, grad = autogradF.vjp(adder, (x, y), v, create_graph=True) - - return val[0].exp() + val[1].exp() + grad[0].exp() + grad[1].exp() + x.exp() + y.exp() - - gradcheck(foo, inputs + v) - gradgradcheck(foo, inputs + v) - - def test_jvp_err_check(self): - def foo(a): - return 3 * a.narrow(0, 0, 3) - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(TypeError, "The inputs given to jvp must be either a Tensor"): - res = autogradF.jvp(foo, (inp, 2), v) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to jvp must"): - res = autogradF.jvp(bar, inp, v) - - with self.assertRaisesRegex(RuntimeError, "The vector v can only be None if the input to the user-provided function"): - res = autogradF.jvp(foo, inp) - - with self.assertRaisesRegex(RuntimeError, "The given v should contain a single Tensor."): - res = autogradF.jvp(foo, inp, (v, v)) - - with self.assertRaisesRegex(RuntimeError, "v has invalid size: should be torch.Size"): - res = autogradF.jvp(foo, inp, v[:2]) - - res = autogradF.jvp(foo, inp, v)[1] - self._assert_same_struct(res, foo(inp)) - - def test_jvp_err_check_strict(self): - def foo(a): - return a.detach() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone() - - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.jvp(foo, inp, v, strict=True) - res = autogradF.jvp(foo, inp, v, strict=False) - self._assert_same_struct(res[1], res[0]) - self.assertEqual(res[1].abs().sum(), 0.) - - with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): - res = autogradF.jvp(bar, inp, v, strict=True) - res = autogradF.jvp(bar, inp, v, strict=False) - self._assert_same_struct(res[1], res[0]) - self.assertEqual(res[1].abs().sum(), 0.) 
- - # The Jacobian does not depend on the input - def foo(a): - return a.clone() - - inp.requires_grad_() - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): - res = autogradF.jvp(foo, inp, v, create_graph=True, strict=True) - res = autogradF.jvp(foo, inp, v, create_graph=True, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1], v) - - def test_jvp_no_grad(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - with torch.no_grad(): - res = autogradF.jvp(reducer, inputs, v) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - inputs.requires_grad_() - v.requires_grad_() - with torch.no_grad(): - res = autogradF.jvp(reducer, inputs, v, create_graph=True) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - def test_jvp_output(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.jvp(reducer, inputs, v) - self._assert_same_struct(res[1], res[0]) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - - def adder(x, y): - return 2 * x + 3 * y - - inputs = (torch.rand(2), torch.rand(2)) - v = (torch.ones(2), torch.ones(2)) - out, jvp_val = autogradF.jvp(adder, inputs, v) - self._assert_same_struct(jvp_val, out) - self.assertIsNone(out.grad_fn) - self.assertIsNone(jvp_val[0].grad_fn) - self.assertIsNone(jvp_val[1].grad_fn) - - def adder(x, y): - return 2 * x + 3 * y, x + y - - inputs = (torch.rand(2), torch.rand(2)) - v = (torch.tensor([1., 0.]), torch.tensor([1., 0.])) - out, jvp_val = autogradF.jvp(adder, inputs, v) - self._assert_same_struct(jvp_val, out) - self.assertIsNone(out[0].grad_fn) - self.assertIsNone(out[1].grad_fn) - self.assertIsNone(jvp_val[0].grad_fn) - self.assertIsNone(jvp_val[1].grad_fn) - - def test_jvp_scalar(self): - def reducer(x): - return x.sum() - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.jvp(reducer, inputs, v) - self._assert_same_struct(res[0], torch.zeros([])) - self._assert_same_struct(res[1], res[0]) - - def expander(x): - return x.unsqueeze(0).repeat(4) - inputs = torch.rand([]) - v = torch.ones([]) - res = autogradF.jvp(expander, inputs, v) - self._assert_same_struct(res[0], torch.zeros(4)) - self._assert_same_struct(res[1], res[0]) - - res = autogradF.jvp(expander, inputs) - self._assert_same_struct(res[0], torch.zeros(4)) - self._assert_same_struct(res[1], res[0]) - - def test_jvp_create_graph(self): - def reducer(x): - return x.sum(dim=1) - inputs = torch.rand(2, 2, dtype=torch.double) - v = torch.ones(2, 2, dtype=torch.double) - - inputs.requires_grad_() - v.requires_grad_() - res = autogradF.jvp(reducer, inputs, v, create_graph=True) - self._assert_same_struct(res[1], res[0]) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - - gradcheck(lambda inp, v: autogradF.jvp(reducer, inp, v, create_graph=True), (inputs, v)) - gradgradcheck(lambda inp, v: autogradF.jvp(reducer, inp, v, create_graph=True), (inputs, v)) - - def adder(x, y): - return 2 * x + 3 * y, x * y - - inputs = (torch.rand(2, dtype=torch.double, requires_grad=True), - torch.rand(2, dtype=torch.double, requires_grad=True)) - v = (torch.tensor([1., 0.], dtype=torch.double, requires_grad=True), - torch.tensor([1., 0.], dtype=torch.double, requires_grad=True)) - - gradcheck(lambda *args: 
autogradF.jvp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) - gradgradcheck(lambda *args: autogradF.jvp(adder, args[:2], args[2:], create_graph=True)[1], inputs + v) - - def foo(*args): - x, y = args[:2] - v = args[2:] - - x = x.cos() - val, grad = autogradF.jvp(adder, (x, y), v, create_graph=True) - - return val[0].exp() + val[1].exp() + grad[0].exp() + grad[1].exp() + x.exp() + y.exp() - - gradcheck(foo, inputs + v) - gradgradcheck(foo, inputs + v) - - def _test_construct_standard_basis_for(self, inputs): - numels = tuple(tensor.numel() for tensor in inputs) - results = autogradF._construct_standard_basis_for(inputs, numels) - for result, inp in zip(results, inputs): - self.assertEqual(result.dtype, inp.dtype) - self.assertEqual(result.device, inp.device) - results = torch.cat([result.to(device='cpu', dtype=torch.float) - for result in results], dim=1) - expected = torch.eye(results[0].shape[0], dtype=torch.float) - self.assertEqual(results, expected) - - def test_construct_standard_basis_for(self): - test_cases = [ - (torch.randn(2, 3),), - (torch.randn(1),), - (torch.randn([]),), - (torch.randn(1), torch.randn([]), torch.randn([])), - (torch.randn(2), torch.randn(3), torch.randn([])), - (torch.randn(2), torch.randn([]), torch.randn(3)), - (torch.randn(2, 3), torch.randn(3), torch.randn(3, 4, 2)), - (torch.randn(2, dtype=torch.float64), torch.randn(3, dtype=torch.float32)), - ] - - for inputs in test_cases: - self._test_construct_standard_basis_for(inputs) - - @unittest.skipIf(not TEST_CUDA, "test requires CUDA") - def test_construct_standard_basis_for_cuda(self): - test_cases = [ - (torch.randn(2), torch.randn(3, device='cuda')), - (torch.randn(3, device='cuda'), torch.randn(2)), - ] - - for inputs in test_cases: - self._test_construct_standard_basis_for(inputs) - - def _test_vectorize_raises_no_warnings(self, api): - # vmap is an experimental prototype. When someone calls torch.vmap, - # it raises a python warning. This test checks that - # autogradF.{jacobian, hessian} don't raise that experimental prototype - # warning; it is not nice for a public-facing API to raise a warning - # no matter how it is called. 
- def foo(a): - return (a ** 2).sum() - - x = torch.randn(3) - with warnings.catch_warnings(record=True) as wa: - result = api(foo, x, vectorize=True) - self.assertEqual(len(wa), 0) - - def test_jacobian_vectorize_raises_no_warnings(self): - return self._test_vectorize_raises_no_warnings(autogradF.jacobian) - - def test_hessian_vectorize_raises_no_warnings(self): - return self._test_vectorize_raises_no_warnings(autogradF.hessian) - - def _test_jacobian_err_check(self, vectorize): - def foo(a): - return 3 * a.narrow(0, 0, 3) - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - inp = torch.rand(4) - with self.assertRaisesRegex(TypeError, "The inputs given to jacobian must be either a Tensor"): - res = autogradF.jacobian(foo, (inp, 2), vectorize=vectorize) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to jacobian must"): - res = autogradF.jacobian(bar, inp, vectorize=vectorize) - - res = autogradF.jacobian(foo, inp, vectorize=vectorize) - self._assert_interleaved_struct(res, foo(inp), inp) - - def foo(a, b): - return b, 3 * a.narrow(0, 0, 3) - - inp = (torch.rand(4), torch.rand(5)) - - res = autogradF.jacobian(foo, inp, vectorize=vectorize) - self._assert_interleaved_struct(res, foo(*inp), inp) - - def test_jacobian_err_check(self): - return self._test_jacobian_err_check(vectorize=False) - - def test_jacobian_err_check_vectorize(self): - return self._test_jacobian_err_check(vectorize=True) - - def test_jacobian_err_check_strict(self): - def foo(a): - return a.detach() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone() - - inp = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.jacobian(foo, inp, strict=True) - res = autogradF.jacobian(foo, inp, strict=False) - self._assert_interleaved_struct(res, foo(inp), inp) - self.assertEqual(res.abs().sum(), 0.) - - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function is independent of input 0."): - res = autogradF.jacobian(bar, inp, strict=True) - res = autogradF.jacobian(bar, inp, strict=False) - self._assert_interleaved_struct(res, foo(inp), inp) - self.assertEqual(res.abs().sum(), 0.) 
- - # The Jacobian does not depend on the input - def foo(a): - return a.clone() - - inp.requires_grad_() - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function is independent of input 0."): - res = autogradF.jacobian(foo, inp, create_graph=True, strict=True) - res = autogradF.jacobian(foo, inp, create_graph=True, strict=False) - self._assert_interleaved_struct(res, inp, inp) - self.assertEqual(res, torch.eye(4)) - - def test_jacobian_err_check_strict_vectorize(self): - def foo(x): - return x - - inp = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "not supported together"): - res = autogradF.jacobian(foo, inp, strict=True, vectorize=True) - - def test_jacobian_no_grad(self): - def exp_reducer(x): - return x.exp().sum(dim=1) - - inputs = torch.rand(4, 4) - with torch.no_grad(): - res = autogradF.jacobian(exp_reducer, inputs) - self.assertIsNone(res.grad_fn) - self.assertNotEqual(res, torch.zeros(4, 4)) - - with torch.no_grad(): - res = autogradF.jacobian(exp_reducer, inputs, create_graph=True) - self.assertIsNotNone(res.grad_fn) - self.assertNotEqual(res, torch.zeros(4, 4)) - - def _test_jacobian_output(self, vectorize): - def exp_reducer(x): - return x.exp().sum(dim=1) - - inputs = torch.rand(4, 4) - res = autogradF.jacobian(exp_reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, exp_reducer(inputs), inputs) - self.assertIsNone(res.grad_fn) - - def identity(x): - return x.clone() - - inputs = torch.rand(4) - res = autogradF.jacobian(identity, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, identity(inputs), inputs) - self.assertIsNone(res.grad_fn) - self.assertEqual(res, torch.eye(4)) - - def add_exp_reducer(x, y): - return (x + y.exp()).sum(dim=1) - - inputs = (torch.rand(4, 4), torch.rand(4, 4)) - res = autogradF.jacobian(add_exp_reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, add_exp_reducer(*inputs), inputs) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - - def test_jacobian_output(self): - self._test_jacobian_output(vectorize=False) - - def test_jacobian_output_vectorize(self): - self._test_jacobian_output(vectorize=True) - - def _test_jacobian_scalar(self, vectorize): - def reducer(x): - return x.sum() - inputs = torch.rand(4, 4) - res = autogradF.jacobian(reducer, inputs, vectorize=vectorize) - self._assert_same_struct(res, inputs) - - def expander(x): - return x.unsqueeze(0).repeat(4) - inputs = torch.rand([]) - res = autogradF.jacobian(expander, inputs, vectorize=vectorize) - self._assert_same_struct(res, torch.zeros(4)) - - def test_jacobian_scalar(self): - self._test_jacobian_scalar(vectorize=False) - - def test_jacobian_scalar_vectorize(self): - self._test_jacobian_scalar(vectorize=True) - - def _test_jacobian_create_graph(self, vectorize): - def exp_reducer(x): - return x.exp().sum(dim=1) - - inputs = torch.rand(4, 4, dtype=torch.double, requires_grad=True) - res = autogradF.jacobian(exp_reducer, inputs, create_graph=True, vectorize=vectorize) - self._assert_interleaved_struct(res, exp_reducer(inputs), inputs) - self.assertIsNotNone(res.grad_fn) - - gradcheck(lambda inp: autogradF.jacobian(exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) - gradgradcheck(lambda inp: autogradF.jacobian(exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) - - def add_exp_reducer(x, y): - return (x + y).exp().sum(dim=1) - - inputs = (torch.rand(4, 4, dtype=torch.double, requires_grad=True), - torch.rand(4, 4, 
dtype=torch.double, requires_grad=True)) - res = autogradF.jacobian(add_exp_reducer, inputs, create_graph=True, vectorize=vectorize) - self._assert_interleaved_struct(res, add_exp_reducer(*inputs), inputs) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - - gradcheck(lambda *inp: autogradF.jacobian(add_exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) - gradgradcheck(lambda *inp: autogradF.jacobian(add_exp_reducer, inp, create_graph=True, vectorize=vectorize), inputs) - - def foo(x, y): - x = x.cos() - val, jac = autogradF.jacobian(add_exp_reducer, (x, y), create_graph=True, vectorize=vectorize) - - res = val[0].exp().sum() + val[1].exp().sum() + jac[0].exp().sum() - res = res + jac[1].exp().sum() + x.exp().sum() + y.exp().sum() - return res - - gradcheck(foo, inputs) - gradgradcheck(foo, inputs) - - def test_jacobian_create_graph(self): - self._test_jacobian_create_graph(vectorize=False) - - def test_jacobian_create_graph_vectorize(self): - self._test_jacobian_create_graph(vectorize=True) - - def _check_jacobian_vectorize_correctness(self, f, inputs, test_forward_ad=True): - expected = autogradF.jacobian(f, inputs, vectorize=False) - result_backward_mode = autogradF.jacobian(f, inputs, vectorize=True) - self.assertEqual(result_backward_mode, expected) - - if test_forward_ad: - result_forward_mode = autogradF.jacobian(f, inputs, strategy="forward-mode", vectorize=True) - self.assertEqual(result_forward_mode, expected) - - def test_jacobian_vectorize_correctness_simple(self): - def f(x): - return 3 * x ** 2 - - x = torch.randn(2, 3, 5) - self._check_jacobian_vectorize_correctness(f, x) - - def test_jacobian_vectorize_correctness_multi_input(self): - def f(x, y): - return (x.cos() * x) @ y.sin() - - x = torch.randn(2, 3) - y = torch.randn(3, 5) - self._check_jacobian_vectorize_correctness(f, (x, y)) - - def test_jacobian_vectorize_correctness_multi_input_multi_output(self): - def f(x, y): - return (x * x) @ y, x @ (x.sum(1) * y), y.sum() - - x = torch.randn(5, 3) - y = torch.randn(3, 5) - self._check_jacobian_vectorize_correctness(f, (x, y)) - - def test_jacobian_vectorize_correctness_unrelated_outputs(self): - def f(x, y): - return x, y, x, y - - x = torch.randn(2) - y = torch.randn(3) - self._check_jacobian_vectorize_correctness(f, (x, y)) - - def test_jacobian_vectorize_correctness_zero_dim(self): - # zero-dim output - def f(x, y): - return x.sum(), y.sum(), x * y - - x = torch.randn(3) - y = torch.randn(3) - self._check_jacobian_vectorize_correctness(f, (x, y)) - - # zero-dim input - def g(x): - return torch.stack([x, x, x]) - - x = torch.randn([]) - self._check_jacobian_vectorize_correctness(g, x) - - # Mixed zero-dim input / zero-dim output - def h(x, y): - return y.sum(), x * y - - x = torch.randn([]) - y = torch.randn(1) - self._check_jacobian_vectorize_correctness(h, (x, y)) - - @unittest.skipIf(not TEST_CUDA, "test requires CUDA") - def test_jacobian_vectorize_correctness_different_devices(self): - def f(x, y): - return x * y, (x * y).cuda() - - x = torch.randn(3) - y = torch.randn(3) - self._check_jacobian_vectorize_correctness(f, (x, y)) - - def test_jacobian_vectorize_correctness_different_dtype(self): - def f(x, y): - return (x * y).float(), (x * y).double() - - x = torch.randn(3) - y = torch.randn(3) - # The Jacobian computed using forward AD has the dtype of the output - # but the Jacobian computed with reverse AD has dtype of input - self._check_jacobian_vectorize_correctness(f, (x, y), test_forward_ad=False) - - def 
_check_hessian_vectorize_correctness(self, f, inputs): - expected = autogradF.hessian(f, inputs, vectorize=False) - result = autogradF.hessian(f, inputs, vectorize=True) - self.assertEqual(result, expected) - - result_forward_mode = autogradF.hessian(f, inputs, outer_jacobian_strategy="forward-mode", vectorize=True) - self.assertEqual(result_forward_mode, expected) - - def test_hessian_vectorize_correctness_simple(self): - def f(x): - return (3 * x ** 2).sum() - - x = torch.randn(2, 3, 5) - self._check_hessian_vectorize_correctness(f, x) - - def test_hessian_vectorize_correctness_multi_input(self): - def f(x, y, z): - return ((x.relu() * x) @ y.sin() @ z).sum() - - x = torch.randn(2, 3) - y = torch.randn(3, 5) - z = torch.randn(5, 5) - self._check_hessian_vectorize_correctness(f, (x, y, z)) - - def test_hessian_vectorize_correctness_unrelated_outputs(self): - # output unrelated to one input - def f(x, y): - return (x ** 2).sum() - - x = torch.randn(2) - y = torch.randn(3) - self._check_hessian_vectorize_correctness(f, (x, y)) - - # output unrelated to all inputs - def f(x, y): - return torch.ones([]) - - x = torch.randn(2) - y = torch.randn(3) - self._check_hessian_vectorize_correctness(f, (x, y)) - - def _test_hessian_err_check(self, vectorize): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - def bar2(a): - return 3 * a.narrow(0, 0, 3) - - def bar3(a): - return 3 * a.narrow(0, 0, 3), 3 * a.narrow(0, 0, 3) - - inp = torch.rand(4) - with self.assertRaisesRegex(TypeError, "The inputs given to hessian must be either a Tensor"): - res = autogradF.hessian(foo, (inp, 2), vectorize=vectorize) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to hessian must"): - res = autogradF.hessian(bar, inp, vectorize=vectorize) - - err_msg_out = "The Tensor returned by the function given to hessian should contain a single element" - with self.assertRaisesRegex(RuntimeError, err_msg_out): - res = autogradF.hessian(bar2, inp, vectorize=vectorize) - - with self.assertRaisesRegex(RuntimeError, "The function given to hessian should return a single Tensor"): - res = autogradF.hessian(bar3, inp, vectorize=vectorize) - - res = autogradF.hessian(foo, inp, vectorize=vectorize) - self._assert_interleaved_struct(res, inp, inp) - - def foo(a, b): - return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() - - inp = (torch.rand(4), torch.rand(5)) - - res = autogradF.hessian(foo, inp, vectorize=vectorize) - self._assert_interleaved_struct(res, inp, inp) - - def test_hessian_err_check(self): - self._test_hessian_err_check(vectorize=False) - - def test_hessian_err_check_vectorize(self): - self._test_hessian_err_check(vectorize=True) - - def test_hessian_err_check_strict(self): - def foo(a): - return a.detach().sum() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone().sum() - - def bar2(a): - # A Linear function for which the jacobian is independent of the input - return (3 * a).sum() - - inp = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.hessian(foo, inp, strict=True) - res = autogradF.hessian(foo, inp, strict=False) - self._assert_interleaved_struct(res, inp, inp) - self.assertEqual(res.abs().sum(), 0.) 
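# A small standalone sketch (not from the patch) of what the hessian tests
# above verify: for f(x) = (x ** 3).sum() the Hessian is diag(6 * x), and the
# vectorized code path is expected to agree with the non-vectorized one.
import torch
import torch.autograd.functional as autogradF

def f(x):
    return (x ** 3).sum()

x = torch.rand(4)
hess = autogradF.hessian(f, x, vectorize=False)
assert torch.allclose(hess, torch.diag(6 * x))
assert torch.allclose(autogradF.hessian(f, x, vectorize=True), hess)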
- - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0"): - res = autogradF.hessian(bar, inp, strict=True) - res = autogradF.hessian(bar, inp, strict=False) - self._assert_interleaved_struct(res, inp, inp) - self.assertEqual(res.abs().sum(), 0.) - - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): - res = autogradF.hessian(bar2, inp, strict=True) - res = autogradF.hessian(bar2, inp, strict=False) - self._assert_interleaved_struct(res, inp, inp) - self.assertEqual(res.abs().sum(), 0.) - - def test_hessian_err_check_strict_vectorize(self): - def foo(x): - return (x ** 3).sum() - - inp = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "not supported together"): - res = autogradF.hessian(foo, inp, strict=True, vectorize=True) - - def test_hessian_no_grad(self): - def pow_reducer(x): - return x.pow(3).sum() - - inputs = torch.rand(2, 2) - with torch.no_grad(): - res = autogradF.hessian(pow_reducer, inputs) - self.assertIsNone(res[0][0].grad_fn) - self.assertIsNone(res[0][1].grad_fn) - self.assertIsNone(res[1][0].grad_fn) - self.assertIsNone(res[1][1].grad_fn) - self.assertNotEqual(res, torch.zeros(2, 2, 2)) - - with torch.no_grad(): - res = autogradF.hessian(pow_reducer, inputs, create_graph=True) - self.assertIsNotNone(res[0][0].grad_fn) - self.assertIsNotNone(res[0][1].grad_fn) - self.assertIsNotNone(res[1][0].grad_fn) - self.assertIsNotNone(res[1][1].grad_fn) - self.assertNotEqual(res, torch.zeros(2, 2, 2)) - - - def _test_hessian_output(self, vectorize): - def pow_reducer(x): - return x.pow(3).sum() - - inputs = torch.rand(2, 2) - res = autogradF.hessian(pow_reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - self.assertIsNone(res.grad_fn) - - def add_pow_reducer(x, y): - return (x + y).pow(3).sum() - - inputs = (torch.rand(2, 2), torch.rand(2, 2)) - res = autogradF.hessian(add_pow_reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - self.assertIsNone(res[0][0].grad_fn) - self.assertIsNone(res[0][1].grad_fn) - self.assertIsNone(res[1][0].grad_fn) - self.assertIsNone(res[1][1].grad_fn) - - def test_hessian_output(self): - self._test_hessian_output(vectorize=False) - - def test_hessian_output_vectorize(self): - self._test_hessian_output(vectorize=True) - - def _test_hessian_scalar(self, vectorize): - def reducer(x): - return x.sum() - inputs = torch.rand(4, 4) - res = autogradF.hessian(reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - - inputs = torch.rand([]) - res = autogradF.hessian(reducer, inputs, vectorize=vectorize) - self._assert_same_struct(res, inputs) - - def bad_reducer(x): - return x.sum().view(1, 1, 1) - inputs = torch.rand(4, 4) - res = autogradF.hessian(bad_reducer, inputs, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - - def test_hessian_scalar(self): - return self._test_hessian_scalar(vectorize=False) - - def test_hessian_scalar_vectorize(self): - return self._test_hessian_scalar(vectorize=True) - - def _test_hessian_create_graph(self, vectorize): - def pow_reducer(x): - return x.pow(3).sum() - - inputs = torch.rand(2, 2, dtype=torch.double, requires_grad=True) - res = autogradF.hessian(pow_reducer, inputs, create_graph=True, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - self.assertIsNotNone(res.grad_fn) - - gradcheck(lambda inp: autogradF.hessian(pow_reducer, 
inp, create_graph=True, vectorize=vectorize), inputs) - gradgradcheck(lambda inp: autogradF.hessian(pow_reducer, inp, create_graph=True, vectorize=vectorize), inputs) - - def add_pow_reducer(x, y): - return (x + y).pow(3).sum() - - inputs = (torch.rand(2, 2, dtype=torch.double, requires_grad=True), - torch.rand(2, 2, dtype=torch.double, requires_grad=True)) - res = autogradF.hessian(add_pow_reducer, inputs, create_graph=True, vectorize=vectorize) - self._assert_interleaved_struct(res, inputs, inputs) - self.assertIsNotNone(res[0][0].grad_fn) - self.assertIsNotNone(res[0][1].grad_fn) - self.assertIsNotNone(res[1][0].grad_fn) - self.assertIsNotNone(res[1][1].grad_fn) - - def flatten(inp): - return tuple(el_lvl2 for el_lvl1 in inp for el_lvl2 in el_lvl1) - - gradcheck(lambda *inp: flatten(autogradF.hessian(add_pow_reducer, inp, create_graph=True, vectorize=vectorize)), inputs) - gradgradcheck(lambda *inp: flatten(autogradF.hessian(add_pow_reducer, inp, create_graph=True, vectorize=vectorize)), inputs) - - def foo(x, y): - x = x.cos() - val, hess = autogradF.hessian(add_pow_reducer, (x, y), create_graph=True, vectorize=vectorize) - - res = val[0].cos().sum() + val[1].cos().sum() + hess[0].cos().sum() - res = res + hess[1].cos().sum() + x.cos().sum() + y.cos().sum() - return res - - gradcheck(foo, inputs) - gradgradcheck(foo, inputs) - - def test_hessian_create_graph(self): - self._test_hessian_create_graph(vectorize=False) - - def test_hessian_create_graph_vectorize(self): - self._test_hessian_create_graph(vectorize=True) - - def test_vhp_err_check(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - def bar2(a): - return 3 * a.narrow(0, 0, 3) - - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(TypeError, "The inputs given to vhp must be either a Tensor"): - res = autogradF.vhp(foo, (inp, 2), v) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to vhp must"): - res = autogradF.vhp(bar, inp, v) - - err_msg_out = "The Tensor returned by the function given to vhp should contain a single element" - with self.assertRaisesRegex(RuntimeError, err_msg_out): - res = autogradF.vhp(bar2, inp, v) - - with self.assertRaisesRegex(RuntimeError, "v has invalid size:"): - res = autogradF.vhp(foo, inp, torch.rand(5)) - - with self.assertRaisesRegex(TypeError, "The v given to vhp must be either a Tensor or a tuple of Tensors"): - res = autogradF.vhp(foo, inp, (v, 2)) - - res = autogradF.vhp(foo, inp, v) - self._assert_same_struct(res[1], inp) - - def foo(a, b): - return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() - - inp = (torch.rand(4), torch.rand(5)) - v = (torch.rand(4), torch.rand(5)) - - res = autogradF.vhp(foo, inp, v) - self._assert_same_struct(res[1], inp) - - def test_vhp_err_check_strict(self): - def foo(a): - return a.detach().sum() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone().sum() - - def bar2(a): - # A Linear function for which the jacobian is independent of the input - return (3 * a).sum() +class Foo(torch.autograd.Function): + @staticmethod + def forward(ctx, x): + return x.clone() - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.vhp(foo, inp, v, strict=True) - res = autogradF.vhp(foo, inp, v, strict=False) - 
self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) + @staticmethod + def forward(ctx, gO): + return gO.clone() - with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): - res = autogradF.vhp(bar, inp, v, strict=True) - res = autogradF.vhp(bar, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) +def get_out(): + inp = torch.rand(2, requires_grad=True) - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): - res = autogradF.vhp(bar2, inp, v, strict=True) - res = autogradF.vhp(bar2, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) + # The python function is first so that it runs + # last in the backward pass + right = Foo.apply(inp) - def test_vhp_no_grad(self): - def reducer(x): - return x.exp().sum() - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - with torch.no_grad(): - res = autogradF.vhp(reducer, inputs, v) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) + # An op that creates new memory + left1 = inp.clone() + # An op that saves its input + left2 = left1 ** 2 - with torch.no_grad(): - res = autogradF.vhp(reducer, inputs, v, create_graph=True) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - def test_vhp_output(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.vhp(foo, inputs, v) - self._assert_same_struct(res[1], inputs) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - - def bar(a, b): - return (a + 3 * b.narrow(0, 0, 3)).exp().sum() - - inputs = (torch.rand(3), torch.rand(4)) - v = (torch.ones(3), torch.ones(4)) - out, vhp_val = autogradF.vhp(bar, inputs, v) - self._assert_same_struct(vhp_val, inputs) - self.assertIsNone(out.grad_fn) - self.assertIsNone(vhp_val[0].grad_fn) - self.assertIsNone(vhp_val[1].grad_fn) - - def test_vhp_scalar(self): - def reducer(x): - return x.sum() - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.vhp(reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - inputs = torch.rand([]) - v = torch.rand([]) - res = autogradF.vhp(reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - res = autogradF.vhp(reducer, inputs) - self._assert_same_struct(res[1], inputs) - - def bad_reducer(x): - return x.sum().view(1, 1, 1) - inputs = torch.rand(4, 4) - v = torch.rand(4, 4) - res = autogradF.vhp(bad_reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - def test_vhp_create_graph(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - inputs = torch.rand(4, 4, dtype=torch.double, requires_grad=True) - v = torch.ones(4, 4, dtype=torch.double, requires_grad=True) - res = autogradF.vhp(foo, inputs, v, create_graph=True) - self._assert_same_struct(res[1], inputs) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - - gradcheck(lambda inp, v: autogradF.vhp(foo, inp, v, create_graph=True), (inputs, v)) - gradgradcheck(lambda inp, v: autogradF.vhp(foo, inp, v, create_graph=True), (inputs, v)) - - def bar(a, b): - return (a + 3 * b.narrow(0, 0, 3)).exp().sum() - - inputs = (torch.rand(3, dtype=torch.double, requires_grad=True), - torch.rand(4, dtype=torch.double, 
requires_grad=True)) - v = (torch.ones(3, dtype=torch.double, requires_grad=True), - torch.ones(4, dtype=torch.double, requires_grad=True)) - out, vhp_val = autogradF.vhp(bar, inputs, v, create_graph=True) - self._assert_same_struct(vhp_val, inputs) - self.assertIsNotNone(out.grad_fn) - self.assertIsNotNone(vhp_val[0].grad_fn) - self.assertIsNotNone(vhp_val[1].grad_fn) - - gradcheck(lambda *args: autogradF.vhp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) - gradgradcheck(lambda *args: autogradF.vhp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) - - def foo(*args): - x, y = args[:2] - v = args[2:] - - x = x.cos() - val, grad = autogradF.vhp(bar, (x, y), v, create_graph=True) - - return val.cos() + grad[0].cos().sum() + grad[1].cos() + x.cos().sum() + y.cos() - - gradcheck(foo, inputs + v) - gradgradcheck(foo, inputs + v) - - def test_hvp_err_check(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - def bar(a): - return 3 * a.narrow(0, 0, 3), "bar" - - def bar2(a): - return 3 * a.narrow(0, 0, 3) - - inp = torch.rand(4) - v = torch.rand(4) - res = autogradF.hvp(foo, inp, v) - with self.assertRaisesRegex(TypeError, "The inputs given to hvp must be either a Tensor"): - res = autogradF.hvp(foo, (inp, 2), v) - - with self.assertRaisesRegex(TypeError, "The outputs of the user-provided function given to hvp must"): - res = autogradF.hvp(bar, inp, v) - - err_msg_out = "The Tensor returned by the function given to hvp should contain a single element" - with self.assertRaisesRegex(RuntimeError, err_msg_out): - res = autogradF.hvp(bar2, inp, v) - - with self.assertRaisesRegex(RuntimeError, "v has invalid size:"): - res = autogradF.hvp(foo, inp, torch.rand(5)) - - with self.assertRaisesRegex(TypeError, "The v given to hvp must be either a Tensor or a tuple of Tensors"): - res = autogradF.hvp(foo, inp, (v, 2)) - - res = autogradF.hvp(foo, inp, v) - self._assert_same_struct(res[1], inp) - - def foo(a, b): - return (3 * b.narrow(0, 0, 3) * a.narrow(0, 0, 3)).sum() - - inp = (torch.rand(4), torch.rand(5)) - v = (torch.rand(4), torch.rand(5)) - - res = autogradF.hvp(foo, inp, v) - self._assert_same_struct(res[1], inp) - - def test_hvp_err_check_strict(self): - def foo(a): - return a.detach().sum() - - def bar(a): - # Make a non-leaf Tensor that requires_grad but that is not connected to the input - return a.long().float().requires_grad_().clone().sum() - - def bar2(a): - # A Linear function for which the jacobian is independent of the input - return (3 * a).sum() - - inp = torch.rand(4) - v = torch.rand(4) - with self.assertRaisesRegex(RuntimeError, "Output 0 of the user-provided function does not require gradients."): - res = autogradF.hvp(foo, inp, v, strict=True) - res = autogradF.hvp(foo, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) - - with self.assertRaisesRegex(RuntimeError, "The output of the user-provided function is independent of input 0"): - res = autogradF.hvp(bar, inp, v, strict=True) - res = autogradF.hvp(bar, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) - - with self.assertRaisesRegex(RuntimeError, "jacobian of the user-provided function with respect to input 0 is"): - res = autogradF.hvp(bar2, inp, v, strict=True) - res = autogradF.hvp(bar2, inp, v, strict=False) - self._assert_same_struct(res[1], inp) - self.assertEqual(res[1].abs().sum(), 0.) 
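# A small standalone sketch (not from the patch) of the identities behind the
# vhp/hvp tests: for a scalar-valued function, hvp(f, x, v) computes H @ v and
# vhp(f, x, v) computes v @ H, so both can be checked against the full Hessian.
import torch
import torch.autograd.functional as autogradF

def f(x):
    return (x.exp() * x).sum()

x = torch.rand(4)
v = torch.rand(4)

H = autogradF.hessian(f, x)
_, hvp_val = autogradF.hvp(f, x, v)
_, vhp_val = autogradF.vhp(f, x, v)

assert torch.allclose(hvp_val, H @ v)
assert torch.allclose(vhp_val, v @ H)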
- - def test_hvp_no_grad(self): - def reducer(x): - return x.exp().sum() - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - with torch.no_grad(): - res = autogradF.hvp(reducer, inputs, v) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) + # Inplace modify so that the backward for + # left2 always raises an error + left1 += 1 - with torch.no_grad(): - res = autogradF.hvp(reducer, inputs, v, create_graph=True) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - self.assertNotEqual(res[1], torch.zeros(4, 4)) - - def test_hvp_output(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.hvp(foo, inputs, v) - self._assert_same_struct(res[1], inputs) - self.assertIsNone(res[0].grad_fn) - self.assertIsNone(res[1].grad_fn) - - def bar(a, b): - return (a + 3 * b.narrow(0, 0, 3)).exp().sum() - - inputs = (torch.rand(3), torch.rand(4)) - v = (torch.ones(3), torch.ones(4)) - out, hvp_val = autogradF.hvp(bar, inputs, v) - self._assert_same_struct(hvp_val, inputs) - self.assertIsNone(out.grad_fn) - self.assertIsNone(hvp_val[0].grad_fn) - self.assertIsNone(hvp_val[1].grad_fn) - - def test_hvp_scalar(self): - def reducer(x): - return x.exp().sum() - inputs = torch.rand(4, 4) - v = torch.ones(4, 4) - res = autogradF.hvp(reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - inputs = torch.rand([]) - v = torch.rand([]) - res = autogradF.hvp(reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - res = autogradF.hvp(reducer, inputs) - self._assert_same_struct(res[1], inputs) - - def bad_reducer(x): - return x.exp().sum().view(1, 1, 1) - inputs = torch.rand(4, 4) - v = torch.rand(4, 4) - res = autogradF.hvp(bad_reducer, inputs, v) - self._assert_same_struct(res[1], inputs) - - def test_hvp_create_graph(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() - - inputs = torch.rand(4, 4, dtype=torch.double, requires_grad=True) - v = torch.ones(4, 4, dtype=torch.double, requires_grad=True) - res = autogradF.hvp(foo, inputs, v, create_graph=True) - self._assert_same_struct(res[1], inputs) - self.assertIsNotNone(res[0].grad_fn) - self.assertIsNotNone(res[1].grad_fn) - - gradcheck(lambda inp, v: autogradF.hvp(foo, inp, v, create_graph=True), (inputs, v)) - gradgradcheck(lambda inp, v: autogradF.hvp(foo, inp, v, create_graph=True), (inputs, v)) - - def bar(a, b): - return (a + 3 * b.narrow(0, 0, 3)).exp().sum() - - inputs = (torch.rand(3, dtype=torch.double, requires_grad=True), - torch.rand(4, dtype=torch.double, requires_grad=True)) - v = (torch.ones(3, dtype=torch.double, requires_grad=True), - torch.ones(4, dtype=torch.double, requires_grad=True)) - out, hvp_val = autogradF.hvp(bar, inputs, v, create_graph=True) - self._assert_same_struct(hvp_val, inputs) - self.assertIsNotNone(out.grad_fn) - self.assertIsNotNone(hvp_val[0].grad_fn) - self.assertIsNotNone(hvp_val[1].grad_fn) - - gradcheck(lambda *args: autogradF.hvp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) - gradgradcheck(lambda *args: autogradF.hvp(bar, args[:2], args[2:], create_graph=True)[1], inputs + v) - - def foo(*args): - x, y = args[:2] - v = args[2:] - - x = x.cos() - val, grad = autogradF.hvp(bar, (x, y), v, create_graph=True) - - return val.cos() + grad[0].cos().sum() + grad[1].cos() + x.cos().sum() + y.cos() - - gradcheck(foo, inputs + v) - gradgradcheck(foo, inputs + v) - - def test_jacobian_match_vjp_jvp(self): - def 
foo(x): - return x ** 3 + x.sum() + # An op that takes both side as input. + # After running, both side's last op will be in + # the ready queue + # And the op for left will run first as it was + # executed last during the forward + out = left2 + right - inputs = torch.rand(4) - v = torch.rand(4) + return out - jac = autogradF.jacobian(foo, inputs) - jvp = autogradF.jvp(foo, inputs, v)[1] - vjp = autogradF.vjp(foo, inputs, v)[1] +# Nothing should be global variables here as, from what +# I can see, python leaks all the global objects +get_out().sum().backward() - self.assertEqual(jvp, torch.mm(jac, v.unsqueeze(1)).squeeze(1)) - self.assertEqual(vjp, torch.mm(v.unsqueeze(0), jac).squeeze(0)) +# This used to deadlock when the PyNode is being destroyed after +# the error is raised. +""" + try: + subprocess.check_output( + [sys.executable, '-c', script], + stderr=subprocess.STDOUT, + # On Windows, opening the subprocess with the default CWD makes `import torch` + # fail, so just set CWD to this script's directory + cwd=os.path.dirname(os.path.realpath(__file__)), + # It is ok to have an extra long timeout here as a timeout means the test failed + timeout=20) + except subprocess.TimeoutExpired as e: + self.fail(msg="Example code timed out! See the code sample in the test for details.") + except subprocess.CalledProcessError as e: + err_msg = "RuntimeError: one of the variables needed for gradient computation" + self.assertTrue(err_msg in e.output.decode("utf-8")) - def test_hessian_match_vhp_hvp(self): - def foo(a): - return 3 * a.narrow(0, 0, 3).exp().sum() +def index_perm_variable(shape, max_indices): + if not isinstance(shape, tuple): + shape = (shape,) - inputs = torch.rand(4) - v = torch.rand(4) + index = torch.randperm(max_indices).narrow(0, 0, reduce(mul, shape)).view(shape) + return index - hes = autogradF.hessian(foo, inputs) - hvp = autogradF.hvp(foo, inputs, v)[1] - vhp = autogradF.vhp(foo, inputs, v)[1] +def bernoulli_scalar(): + return torch.tensor(0, dtype=torch.uint8).bernoulli_() - self.assertEqual(hvp, torch.mm(hes, v.unsqueeze(1)).squeeze(1)) - self.assertEqual(vhp, torch.mm(v.unsqueeze(0), hes).squeeze(0)) class TestAutogradForwardModeBatchedGrad(TestCase): def test_out_of_place_basic(self): @@ -7939,13 +6672,16 @@ class MySubclass(torch.Tensor): def __new__(cls, data=None): return torch.Tensor._make_subclass(cls, data) + __torch_function__ = torch._C._disabled_torch_function_impl + @classmethod def __torch_dispatch__(cls, func, types, args=(), kwargs=None): - if func == torch.ops.aten.alias: + if func.overloadpacket == torch.ops.aten.alias: counter[0] += 1 - with no_dispatch(): - return MySubclass(torch.ops.aten.alias(*args)) + # Make sure autograd is not disabled here + foo = torch.rand(1, requires_grad=True) + self.assertIsNotNone(foo.exp().grad_fn) with no_dispatch(): return func(*args, **kwargs) @@ -7954,10 +6690,11 @@ def __torch_dispatch__(cls, func, types, args=(), kwargs=None): s = MySubclass(a) with fwAD.dual_level(): + # Only the primal has "alias" called on it fwAD.make_dual(s, torch.rand_like(s)) self.assertEqual(counter[0], 1) fwAD.make_dual(torch.rand_like(s), s) - self.assertEqual(counter[0], 2) + self.assertEqual(counter[0], 1) def test_print(self): with fwAD.dual_level() as level: @@ -8760,7 +7497,7 @@ def test_copy_(self, device): # At the time of writing this test, copy_ is not generated from native_functions.yaml # there was a bug that bfloat16 was not recognized as floating. 
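# A brief standalone sketch (not from the patch) of the forward-mode AD API
# that the dual-tensor tests above rely on: make_dual attaches a tangent to a
# primal inside a dual_level context, and unpack_dual recovers the tangent of
# the output, i.e. the Jacobian-vector product computed alongside the primal.
import torch
import torch.autograd.forward_ad as fwAD

x = torch.randn(3)
tangent = torch.randn(3)

with fwAD.dual_level():
    dual_x = fwAD.make_dual(x, tangent)
    dual_out = dual_x.exp()
    primal_out, out_tangent = fwAD.unpack_dual(dual_out)

# d/dx exp(x) applied to the tangent is exp(x) * tangent
assert torch.allclose(out_tangent, x.exp() * tangent)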
x = torch.randn(10, device=device, requires_grad=True) - floating_dt = [dt for dt in get_all_dtypes() if dt.is_floating_point] + floating_dt = floating_types_and(torch.half, torch.bfloat16) for dt in floating_dt: y = torch.empty(10, device=device, dtype=dt) y.copy_(x) @@ -9722,6 +8459,7 @@ def fn(x1, x2): # the suppressions. from autograd.test_complex import TestAutogradComplex # noqa: F401 +from autograd.test_functional import TestAutogradFunctional # noqa: F401 # e.g., TestAutogradDeviceTypeCPU and TestAutogradDeviceTypeCUDA instantiate_device_type_tests( diff --git a/test/test_binary_ufuncs.py b/test/test_binary_ufuncs.py index f51de26948862e..8407aee05e9505 100644 --- a/test/test_binary_ufuncs.py +++ b/test/test_binary_ufuncs.py @@ -13,6 +13,7 @@ import operator from functools import partial +import torch.autograd.forward_ad as fwAD from torch._six import inf, nan from torch.testing._internal.common_utils import ( TestCase, slowTest, iter_indices, TEST_WITH_ASAN, run_tests, gradcheck, @@ -23,216 +24,29 @@ skipCUDAIfRocm, skipIf, ops, OpDTypes, skipMeta) from torch.testing import make_tensor from torch.testing._internal.common_dtype import ( - all_types_and_complex_and, integral_types_and, get_all_dtypes, get_all_int_dtypes, get_all_math_dtypes, - get_all_complex_dtypes, get_all_fp_dtypes, + all_types_and_complex_and, all_types_and, integral_types, complex_types, integral_types_and, + floating_types_and, floating_and_complex_types, get_all_math_dtypes, ) from torch.testing._internal.common_methods_invocations import ( - binary_ufuncs, _NOTHING) + binary_ufuncs, _NOTHING, + generate_elementwise_binary_tensors, + generate_elementwise_binary_small_value_tensors, + generate_elementwise_binary_large_value_tensors, + generate_elementwise_binary_extremal_value_tensors, + generate_elementwise_binary_broadcasting_tensors, + generate_elementwise_binary_with_scalar_samples +) if TEST_SCIPY: import scipy.special import scipy.integrate -# TODO: remove this -def _generate_input(shape, dtype, device, with_extremal): - if shape == (): - x = torch.tensor((), dtype=dtype, device=device) - else: - if dtype.is_floating_point or dtype.is_complex: - # work around torch.randn not being implemented for bfloat16 - if dtype == torch.bfloat16: - x = torch.randn(*shape, device=device) * random.randint(30, 100) - x = x.to(torch.bfloat16) - else: - x = torch.randn(*shape, dtype=dtype, device=device) * random.randint(30, 100) - x[torch.randn(*shape) > 0.5] = 0 - if with_extremal and dtype.is_floating_point: - # Use extremal values - x[torch.randn(*shape) > 0.5] = float('nan') - x[torch.randn(*shape) > 0.5] = float('inf') - x[torch.randn(*shape) > 0.5] = float('-inf') - elif with_extremal and dtype.is_complex: - x[torch.randn(*shape) > 0.5] = complex('nan') - x[torch.randn(*shape) > 0.5] = complex('inf') - x[torch.randn(*shape) > 0.5] = complex('-inf') - elif dtype == torch.bool: - x = torch.zeros(shape, dtype=dtype, device=device) - x[torch.randn(*shape) > 0.5] = True - else: - x = torch.randint(15, 100, shape, dtype=dtype, device=device) - - return x - -# TODO: refactor this out -# Converts half/bfloat16 dtype to float when device is cpu -def _convert_t(dtype, device): - if device == 'cpu' and dtype in {torch.half, torch.bfloat16}: - return torch.float - return dtype - -# TODO: revise the tests to use make_tensor in common_utils.py instead -# Returns a tensor of the requested shape, dtype, and device -# Requesting a half CPU tensor returns a float CPU tensor with -# values representable by a half. 
-# Initialization uses randint for non-float types and randn for float types. -def _make_tensor(shape, dtype, device, fill_ones=False) -> torch.Tensor: - # Returns a tensor filled with ones - if fill_ones: - return torch.ones(*shape, dtype=_convert_t(dtype, device), device=device) - - # Returns a tensor with random integer values - if not (dtype.is_floating_point or dtype.is_complex): - t = torch.randint(0, 10, shape, device=device) - if dtype != torch.uint8: - t = t - 5 # generate negative values also - return t.to(_convert_t(dtype, device)) - - # Populates the CPU tensor with floats representable as half/bfloat16 - if dtype == torch.half and device == 'cpu': - return torch.randn(*shape, dtype=torch.float, device=device).half().float() - if dtype == torch.bfloat16 and device == 'cpu': - return torch.randn(*shape, dtype=torch.float, device=device).bfloat16().float() - - # Default: returns a tensor with random float values - return torch.randn(shape, dtype=dtype, device=device).to(dtype=dtype) - # TODO: update to use opinfos consistently class TestBinaryUfuncs(TestCase): # Generic tests for elementwise binary (AKA binary universal (u) functions (funcs)) # TODO: below contiguous tensor results are compared with a variety of noncontiguous results. # It would be interesting to have the lhs and rhs have different discontiguities. - # Returns a pair of iterables of contiguous tensors on the requested device - # and with the requested dtype. - # - # This function is intended to test the non-vectorized and vectorized code - # paths of unary functions, as well as their handling of odd tensor - # sizes (like zero-dim tensors and tensors with zero elements). - # - # Each iterable will include an a tensor with no elements, - # zero dim (scalar) tensors, small 1D tensors, a medium 1D tensor, and - # a large 2D tensor. - def _generate_numeric_tensors(self, op, *, device, dtype, lhs_kwargs, rhs_kwargs): - lhs_tensors = [] - rhs_tensors = [] - - shapes = ((0,), # tensors with no elements - (1, 0, 3), - # zero dim (scalar) tensor - (), - # small 1D tensor - (20,), - # medium 1D tensor - (812,), - # large 2D tensor - (1029, 917)) - - for kwargs, tensors in ((lhs_kwargs, lhs_tensors), (rhs_kwargs, rhs_tensors)): - for shape in shapes: - tensors.append(make_tensor(shape, dtype=dtype, device=device, **kwargs)) - - return lhs_tensors, rhs_tensors - - # Returns a pair of iterables of contiguous tensors on the requested device and with - # the requested dtype. - # - # Unlike the previous function, the values in these tensors are specified manually. 
- def _generate_interesting_small_valued_tensors(self, device, dtype): - # defines interesting values - _unsigned_int_vals = (0, 1, 55, 127, 128, 190, 210, 220, 254, 255, 256) - _int_vals = (0, -1, 1, -55, 55, -127, 127, -128, 128) - _float_vals = (0., - -.001, .001, - -.25, .25, - -1., 1., - -math.pi / 2, math.pi / 2, - -math.pi + .00001, math.pi - .00001, - -math.pi, math.pi, - -math.pi - .00001, math.pi + .00001) - - l_vals = [] - r_vals = [] - - if dtype.is_floating_point: - prod = product(_float_vals, _float_vals) - elif dtype.is_complex: - complex_vals = product(_float_vals, _float_vals) - # Note the use of list is required here or the map generator will be - # emptied by the following product and it won't produce the desired cross-product - complex_vals = list(map(lambda x: complex(*x), complex_vals)) - prod = product(complex_vals, complex_vals) - elif dtype in (torch.int8, torch.int16, torch.int32, torch.int64): - prod = product(_int_vals, _int_vals) - elif dtype is torch.uint8: - prod = product(_unsigned_int_vals, _unsigned_int_vals) - else: - raise ValueError("Unsupported dtype!") - - for l, r in prod: - l_vals.append(l) - r_vals.append(r) - - lhs = torch.tensor(l_vals, device=device, dtype=dtype) - rhs = torch.tensor(r_vals, device=device, dtype=dtype) - - return lhs, rhs - - def _generate_interesting_large_valued_tensors(self, device, dtype): - _large_int_vals = (-1113, 1113, -10701, 10701) - _large_float16_vals = (-501, 501, -1001.2, 1001.2, -13437.7, 13437.7) - _large_float_vals = _large_float16_vals + (-4988429.2, 4988429.2, -1e20, 1e20) - - l_vals = [] - r_vals = [] - - if dtype == torch.float16: - prod = product(_large_float16_vals, _large_float16_vals) - elif dtype.is_floating_point: - prod = product(_large_float_vals, _large_float_vals) - elif dtype.is_complex: - complex_vals = product(_large_float_vals, _large_float_vals) - # Note the use of list is required here or the map generator will be - # emptied by the following product and it won't produce the desired cross-product - complex_vals = list(map(lambda x: complex(*x), complex_vals)) - prod = product(complex_vals, complex_vals) - elif dtype in (torch.int16, torch.int32, torch.int64): - prod = product(_large_int_vals, _large_int_vals) - else: - raise ValueError("Unsupported dtype!") - - for l, r in prod: - l_vals.append(l) - r_vals.append(r) - lhs = torch.tensor(l_vals, device=device, dtype=dtype) - rhs = torch.tensor(r_vals, device=device, dtype=dtype) - - return lhs, rhs - - def _generate_interesting_extremal_valued_tensors(self, device, dtype): - _float_extremals = (float('inf'), float('-inf'), float('nan')) - - l_vals = [] - r_vals = [] - - if dtype.is_floating_point: - prod = product(_float_extremals, _float_extremals) - elif dtype.is_complex: - complex_vals = product(_float_extremals, _float_extremals) - # Note the use of list is required here or the map generator will be - # emptied by the following product and it won't produce the desired cross-product - complex_vals = list(map(lambda x: complex(*x), complex_vals)) - prod = product(complex_vals, complex_vals) - else: - raise ValueError("Unsupported dtype!") - - for l, r in prod: - l_vals.append(l) - r_vals.append(r) - lhs = torch.tensor(l_vals, device=device, dtype=dtype) - rhs = torch.tensor(r_vals, device=device, dtype=dtype) - - return lhs, rhs - # Helper for comparing torch tensors and NumPy arrays # TODO: should this or assertEqual also validate that strides are equal? 
def assertEqualHelper(self, actual, expected, msg, *, dtype, exact_dtype=True, **kwargs): @@ -263,7 +77,7 @@ def assertEqualHelper(self, actual, expected, msg, *, dtype, exact_dtype=True, * # Tests that the function and its (array-accepting) reference produce the same # values on given tensors - def _test_reference_numerics(self, dtype, op, tensor_pairs, equal_nan=True): + def _test_reference_numerics(self, dtype, op, gen, equal_nan=True): def _helper_reference_numerics(expected, actual, msg, exact_dtype, equal_nan=True): if not torch.can_cast(numpy_to_torch_dtype_dict[expected.dtype.type], dtype): exact_dtype = False @@ -275,19 +89,27 @@ def _helper_reference_numerics(expected, actual, msg, exact_dtype, equal_nan=Tru else: self.assertEqualHelper(actual, expected, msg, dtype=dtype, equal_nan=equal_nan, exact_dtype=exact_dtype) - for l, r in tensor_pairs: - if dtype is torch.bfloat16: - l_numpy = l.cpu().to(torch.float32).numpy() - r_numpy = r.cpu().to(torch.float32).numpy() - else: - l_numpy = l.cpu().numpy() - r_numpy = r.cpu().numpy() + for sample in gen: + # Each sample input acquired from the generator is just one lhs tensor + # and one rhs tensor + l = sample.input + r = sample.args[0] + + np_input, np_args, np_kwargs = sample.numpy() + l_numpy = np_input + r_numpy = np_args[0] actual = op(l, r) expected = op.ref(l_numpy, r_numpy) # Crafts a custom error message for smaller, printable tensors - if l.numel() < 10 and r.numel() < 10: + def _numel(x): + if isinstance(x, torch.Tensor): + return x.numel() + # Assumes x is a scalar + return 1 + + if _numel(l) < 10 and _numel(r) < 10: msg = ("Failed to produce expected results! Input lhs tensor was" " {0}, rhs tensor was {1}, torch result is {2}, and reference result is" " {3}.").format(l, r, actual, expected) @@ -307,13 +129,8 @@ def _helper_reference_numerics(expected, actual, msg, exact_dtype, equal_nan=Tru @ops(binary_ufuncs_with_references) def test_reference_numerics(self, device, dtype, op): - lhs_tensors, rhs_tensors = self._generate_numeric_tensors(op, - device=device, - dtype=dtype, - lhs_kwargs=op.lhs_make_tensor_kwargs, - rhs_kwargs=op.rhs_make_tensor_kwargs) - - self._test_reference_numerics(dtype, op, zip(lhs_tensors, rhs_tensors), equal_nan=True) + gen = generate_elementwise_binary_tensors(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) # runtime error: 128 is outside the range of representable values of type 'signed char' @unittest.skipIf(TEST_WITH_ASAN, "Skipped under ASAN") @@ -322,8 +139,8 @@ def test_reference_numerics_small_values(self, device, dtype, op): if dtype is torch.bool: self.skipTest("Doesn't support bool!") - lhs, rhs = self._generate_interesting_small_valued_tensors(device, dtype) - self._test_reference_numerics(dtype, op, ((lhs, rhs),), equal_nan=True) + gen = generate_elementwise_binary_small_value_tensors(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) # TODO: review if this skip is necessary @unittest.skipIf(TEST_WITH_ASAN, "Skipped under ASAN") @@ -331,8 +148,8 @@ def test_reference_numerics_small_values(self, device, dtype, op): allowed_dtypes=(torch.int16, torch.int32, torch.int64, torch.float16, torch.bfloat16, torch.float32, torch.float64, torch.complex64, torch.complex128)) def test_reference_numerics_large_values(self, device, dtype, op): - lhs, rhs = self._generate_interesting_large_valued_tensors(device, dtype) - self._test_reference_numerics(dtype, op, ((lhs, rhs),), equal_nan=True) + gen = 
generate_elementwise_binary_large_value_tensors(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) # TODO: review if this skip is necessary @unittest.skipIf(TEST_WITH_ASAN, "Skipped under ASAN") @@ -340,58 +157,19 @@ def test_reference_numerics_large_values(self, device, dtype, op): allowed_dtypes=(torch.float16, torch.bfloat16, torch.float32, torch.float64, torch.complex64, torch.complex128)) def test_reference_numerics_extremal_values(self, device, dtype, op): - lhs, rhs = self._generate_interesting_extremal_valued_tensors(device, dtype) - self._test_reference_numerics(dtype, op, ((lhs, rhs),), equal_nan=True) + gen = generate_elementwise_binary_extremal_value_tensors(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) # tests broadcasting and noncontiguous broadcasting behavior @ops(binary_ufuncs_with_references, allowed_dtypes=(torch.long, torch.float32,)) def test_broadcasting(self, device, dtype, op): - shapes = ( - ((1,), ()), - ((2,), ()), - ((1,), (2,)), - ((2,), (2,)), - ((2, 1), (2,)), - ((1, 2), (2,)), - ((3, 2), (2,)), - ((3, 2), (3, 2)), - ((1, 3, 2), (2,)), - ((1, 3, 2), (3, 2)), - ((3, 1, 2), (3, 2)), - ((1, 3, 2), (1, 3, 2)), - ((2, 3, 2), ()), - ((2, 3, 2), (2, 3, 2)), - ((3, 1, 2), (1, 3, 2)), - ) - - for shape, noncontiguous in product(shapes, [True, False]): - shape_lhs, shape_rhs = shape - lhs = make_tensor(shape_lhs, device=device, dtype=dtype, - noncontiguous=noncontiguous, **op.lhs_make_tensor_kwargs) - rhs = make_tensor(shape_rhs, device=device, dtype=dtype, - noncontiguous=noncontiguous, **op.rhs_make_tensor_kwargs) - - actual = op(lhs, rhs) - expected = op.ref(lhs.cpu().numpy(), rhs.cpu().numpy()) - - self.assertEqual(actual, expected, exact_dtype=False) - - @ops(binary_ufuncs, allowed_dtypes=(torch.long, torch.float32,)) - def test_broadcast_python_scalar(self, device, dtype, op): - for shape_lhs in ((), (1,), (2,), (1, 2, 3),): - lhs = make_tensor(shape_lhs, device=device, dtype=dtype, **op.lhs_make_tensor_kwargs) + gen = generate_elementwise_binary_broadcasting_tensors(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) - rhs_tensor = make_tensor((), device=device, dtype=dtype, **op.rhs_make_tensor_kwargs) - rhs_expanded = rhs_tensor.expand_as(lhs) - rhs_scalar = rhs_tensor.item() - - expected = op(lhs, rhs_expanded) - - actual_tensor = op(lhs, rhs_tensor) - actual_scalar = op(lhs, rhs_scalar) - - self.assertEqual(actual_tensor, expected) - self.assertEqual(actual_scalar, expected) + @ops(binary_ufuncs_with_references, allowed_dtypes=(torch.long, torch.float32, torch.complex64)) + def test_scalar_support(self, device, dtype, op): + gen = generate_elementwise_binary_with_scalar_samples(op, device=device, dtype=dtype) + self._test_reference_numerics(dtype, op, gen, equal_nan=True) @ops(binary_ufuncs) def test_contig_vs_every_other(self, device, dtype, op): @@ -932,7 +710,7 @@ def test_inplace_division(self, device): id_after = id(t) self.assertEqual(id_before, id_after) - @dtypes(*get_all_dtypes(include_bool=False, include_complex=False)) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_div_rounding_modes(self, device, dtype): if dtype.is_floating_point: low, high = -10.0, 10.0 @@ -1032,8 +810,7 @@ def test_divide_by_zero_rounding(self, device, dtype): actual = torch.divide(a, zero, rounding_mode=rounding_mode) self.assertEqual(actual, expect, exact_dtype=exact_dtype) - @dtypes(*get_all_dtypes( - include_bool=False, 
include_complex=False, include_bfloat16=False)) + @dtypes(*all_types_and(torch.half)) def test_div_rounding_numpy(self, device, dtype): info = (torch.finfo(dtype) if dtype.is_floating_point else torch.iinfo(dtype)) @@ -1485,7 +1262,7 @@ def test_pow_cuda_complex_extremal_failing(self, device, dtype): self.assertEqual(cpu_out, cuda_out) @onlyNativeDeviceTypes - @dtypes(*(get_all_dtypes(include_bool=False, include_bfloat16=False))) + @dtypes(*all_types_and_complex_and(torch.half)) def test_complex_scalar_pow_tensor(self, device, dtype): complexes = [0.5j, 1. + 1.j, -1.5j, 2.2 - 1.6j, 1 + 0j] first_exp = make_tensor((100,), dtype=dtype, device=device, low=-2, high=2) @@ -1877,7 +1654,8 @@ def test_binary_ops_with_scalars(self, device): self.assertEqual(expected, python_op(first, second)) self.assertEqual(expected, torch_op(first, second)) - @dtypes(*product(get_all_dtypes(include_complex=False), get_all_dtypes(include_complex=False))) + @dtypes(*product(all_types_and(torch.half, torch.bfloat16, torch.bool), + all_types_and(torch.half, torch.bfloat16, torch.bool))) def test_maximum_minimum_type_promotion(self, device, dtypes): a = torch.tensor((0, 1), device=device, dtype=dtypes[0]) b = torch.tensor((1, 0), device=device, dtype=dtypes[1]) @@ -1885,7 +1663,7 @@ def test_maximum_minimum_type_promotion(self, device, dtypes): result = op(a, b) self.assertEqual(result.dtype, torch.result_type(a, b)) - @dtypes(*(get_all_int_dtypes() + [torch.bool])) + @dtypes(*integral_types_and(torch.bool)) def test_maximum_minimum_int_and_bool(self, device, dtype): ops = ((torch.maximum, torch.max, np.maximum), (torch.minimum, torch.min, np.minimum), (torch.fmax, None, np.fmax), (torch.fmin, None, np.fmin)) @@ -1911,7 +1689,7 @@ def test_maximum_minimum_int_and_bool(self, device, dtype): self.assertEqual(out, numpy_result) @precisionOverride({torch.bfloat16: 1e-2}) - @dtypes(*(get_all_fp_dtypes())) + @dtypes(*(floating_types_and(torch.half, torch.bfloat16))) def test_maximum_minimum_float(self, device, dtype): ops = ((torch.maximum, torch.max, np.maximum), (torch.minimum, torch.min, np.minimum), (torch.fmax, None, np.fmax), (torch.fmin, None, np.fmin)) @@ -1939,7 +1717,7 @@ def test_maximum_minimum_float(self, device, dtype): self.assertEqual(tensor_result, numpy_result, exact_dtype=False) self.assertEqual(out, numpy_result, exact_dtype=False) - @dtypes(*(get_all_fp_dtypes())) + @dtypes(*(floating_types_and(torch.half, torch.bfloat16))) def test_maximum_minimum_float_nan_and_inf(self, device, dtype): # np.maximum and np.minimum functions compare input arrays element-wisely. # if one of the elements being compared is a NaN, then that element is returned. 
@@ -1975,7 +1753,7 @@ def test_maximum_minimum_float_nan_and_inf(self, device, dtype): self.assertEqual(tensor_result, numpy_result) self.assertEqual(out, numpy_result) - @dtypes(*product(get_all_complex_dtypes(), get_all_dtypes())) + @dtypes(*product(complex_types(), all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_maximum_minimum_complex(self, device, dtypes): for torch_op in (torch.maximum, torch.minimum, torch.max, torch.min, torch.fmax, torch.fmin): with self.assertRaisesRegex(RuntimeError, '.+not implemented for.+'): @@ -2017,7 +1795,8 @@ def test_maximum_minimum_cross_device(self, device): self.assertEqual(tensor_result_1, numpy_result_1) self.assertEqual(tensor_result_2, numpy_result_2) - @dtypes(*product(get_all_fp_dtypes(), get_all_fp_dtypes())) + @dtypes(*product(floating_types_and(torch.half, torch.bfloat16), + floating_types_and(torch.half, torch.bfloat16))) def test_maximum_and_minimum_subgradient(self, device, dtypes): def run_test(f, a, b, expected_a_grad, expected_b_grad): a = torch.tensor(a, requires_grad=True, device=device, dtype=dtypes[0]) @@ -2030,6 +1809,33 @@ def run_test(f, a, b, expected_a_grad, expected_b_grad): run_test(torch.maximum, [0., 1., 2.], [1., 1., 1.], [0., 0.5, 1.], [1., 0.5, 0.]) run_test(torch.minimum, [0., 1., 2.], [1., 1., 1.], [1., 0.5, 0.], [0., 0.5, 1.]) + def test_maximum_minimum_forward_ad_float32(self, device): + # TODO: This should really be covered by OpInfo but it isn't. The problem + # is that our gradient tests test using float64 but it should also test + # float32 + x = torch.randn(3, device=device, dtype=torch.float32) + y = torch.randn(3, device=device, dtype=torch.float32) + tx = torch.randn(3, device=device, dtype=torch.float32) + ty = torch.randn(3, device=device, dtype=torch.float32) + + with fwAD.dual_level(): + x_dual = fwAD.make_dual(x, tx) + y_dual = fwAD.make_dual(y, ty) + result = torch.maximum(x_dual, y_dual) + _, result_tangent = fwAD.unpack_dual(result) + + expected = torch.where(x > y, tx, ty) + self.assertEqual(result_tangent, expected) + + with fwAD.dual_level(): + x_dual = fwAD.make_dual(x, tx) + y_dual = fwAD.make_dual(y, ty) + result = torch.minimum(x_dual, y_dual) + _, result_tangent = fwAD.unpack_dual(result) + + expected = torch.where(x < y, tx, ty) + self.assertEqual(result_tangent, expected) + # TODO: tests like this should be generic @dtypesIfCUDA(torch.half, torch.float, torch.double) @dtypes(torch.float, torch.double) @@ -2046,18 +1852,29 @@ def test_mul_intertype_scalar(self, device, dtype): self.assertEqual(x, 4.5) @onlyCPU - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_sub(self, device, dtype): - m1 = torch.tensor([2.34, 4.44], dtype=dtype, device=device) - m2 = torch.tensor([1.23, 2.33], dtype=dtype, device=device) + if dtype in integral_types(): + # Before Python 3.10, floats were implicitly converted to ints, but with + # DeprecationWarning: an integer is required (got type float). + # Implicit conversion to integers using __int__ is deprecated, + # and may be removed in a future version of Python. + # Since Python 3.10, that attempt gives an error. 
+ m1 = torch.tensor([2, 4], dtype=dtype, device=device) + m2 = torch.tensor([1, 2], dtype=dtype, device=device) + diff = torch.tensor([1, 2], dtype=dtype) + else: + m1 = torch.tensor([2.34, 4.44], dtype=dtype, device=device) + m2 = torch.tensor([1.23, 2.33], dtype=dtype, device=device) + diff = torch.tensor([1.11, 2.11], dtype=dtype) if dtype == torch.bool: self.assertRaises(RuntimeError, lambda: m1 - m2) elif (dtype == torch.bfloat16 or dtype == torch.half): # bfloat16 has a lower precision so we have to have a separate check for it - self.assertEqual(m1 - m2, torch.tensor([1.11, 2.11], dtype=dtype), atol=0.01, rtol=0) + self.assertEqual(m1 - m2, diff, atol=0.01, rtol=0) else: - self.assertEqual(m1 - m2, torch.tensor([1.11, 2.11], dtype=dtype)) + self.assertEqual(m1 - m2, diff) # TODO: what is this test testing? @onlyCPU @@ -2108,8 +1925,8 @@ def test_min_max_binary_op_nan(self, device, dtype): self.assertFalse(torch.isnan(ma[i]), "max(a, b): {}, a: {}, b: {}".format(ma[i], a[i], b[i])) self.assertFalse(torch.isnan(mi[i]), "min(a, b): {}, a: {}, b: {}".format(mi[i], a[i], b[i])) - @dtypes(*product(get_all_dtypes(include_complex=False), - get_all_dtypes(include_complex=False))) + @dtypes(*product(all_types_and(torch.half, torch.bfloat16, torch.bool), + all_types_and(torch.half, torch.bfloat16, torch.bool))) def test_copysign(self, device, dtypes): def _test_copysign_numpy(a, b): torch_result = torch.copysign(a, b) @@ -2126,7 +1943,7 @@ def _test_copysign_numpy(a, b): expected = torch.from_numpy(np.copysign(np_a, np_b)) # To handle inconsistencies of type promotion between PyTorch and Numpy # Applied for both arguments having integral precision and bfloat16 - types = [torch.bool, torch.bfloat16] + get_all_int_dtypes() + types = integral_types_and(torch.bool, torch.bfloat16) if a.dtype in types or b.dtype in types: promoted_type = torch.promote_types(torch_result.dtype, expected.dtype) torch_result = torch_result.to(promoted_type) @@ -2171,13 +1988,13 @@ def _test_copysign_numpy(a, b): for case in cases: _test_copysign_numpy(torch.tensor([case], device=device, dtype=dtypes[0]), b) - if dtypes[1] in get_all_fp_dtypes(): + if dtypes[1] in floating_types_and(torch.half, torch.bfloat16): a = make_tensor((10, 10), device=device, dtype=dtypes[0], low=-9, high=9) for case in cases: _test_copysign_numpy(a, torch.tensor([case], device=device, dtype=dtypes[1])) - @dtypes(*product(get_all_fp_dtypes(), - get_all_fp_dtypes())) + @dtypes(*product(floating_types_and(torch.half, torch.bfloat16), + floating_types_and(torch.half, torch.bfloat16))) def test_copysign_subgradient(self, device, dtypes): # Input is 0.0 x = torch.tensor([0.0, 0.0, 0.0], dtype=dtypes[0], device=device, requires_grad=True) @@ -2317,7 +2134,7 @@ def test_rdiv(self, device, dtype): z = torch.tensor([30 / v.item() for v in x], device=device) self.assertEqual(y, z, exact_dtype=False) - @dtypes(*get_all_fp_dtypes(include_bfloat16=False)) + @dtypes(*floating_types_and(torch.half)) def test_fmod_remainder_by_zero_float(self, device, dtype): fn_list = (torch.fmod, torch.remainder) for fn in fn_list: @@ -2329,7 +2146,7 @@ def test_fmod_remainder_by_zero_float(self, device, dtype): @onlyNativeDeviceTypes # Check Issue https://github.com/pytorch/pytorch/issues/48130 @skipCUDAIfRocm # Error happens on both ROCM and XLA - @dtypes(*get_all_int_dtypes()) + @dtypes(*integral_types()) def test_fmod_remainder_by_zero_integral(self, device, dtype): fn_list = (torch.fmod, torch.remainder) for fn in fn_list: @@ -2354,7 +2171,7 @@ def 
test_fmod_remainder_by_zero_integral(self, device, dtype): value = 255 if dtype == torch.uint8 else -1 self.assertTrue(torch.all(fn(x, zero) == value)) - @dtypes(*get_all_dtypes(include_bfloat16=False, include_bool=False, include_complex=False)) + @dtypes(*all_types_and(torch.half)) def test_fmod_remainder(self, device, dtype): # Use numpy as reference def _helper(x, mod, fns_list): @@ -2391,7 +2208,7 @@ def _helper(x, mod, fns_list): # Mods: Integer, Float, Tensor, Non-contiguous Tensor mods = [3, 2.3, mod, mod.t()] # mod with floating-point dtype - if dtype in get_all_int_dtypes(): + if dtype in integral_types(): mod_float = make_tensor((10, 10), device=device, dtype=torch.float, low=-9, high=9) mod[mod == 0] = 1 mods.append(mod_float) @@ -2612,7 +2429,7 @@ def test_floor_divide_zero(self, device, dtype): a // b @unittest.skipIf(TEST_WITH_ASAN, "Integer overflows are not allowed under ASAN") - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_muldiv_scalar(self, device, dtype): x = make_tensor((10, 3), dtype=dtype, device=device, low=None, high=None) s = make_tensor((1,), dtype=dtype, device="cpu", low=None, high=None).item() @@ -2622,7 +2439,38 @@ def test_muldiv_scalar(self, device, dtype): self.assertEqual(x / s, x / y) self.assertEqual(s / x, y / x) - @dtypes(*tuple(itertools.combinations_with_replacement(get_all_dtypes(), 2))) + # TODO: update make_tensor to support extremal additions and remove this in favor of make_tensor + def _generate_input(self, shape, dtype, device, with_extremal): + if shape == (): + x = torch.tensor((), dtype=dtype, device=device) + else: + if dtype.is_floating_point or dtype.is_complex: + # work around torch.randn not being implemented for bfloat16 + if dtype == torch.bfloat16: + x = torch.randn(*shape, device=device) * random.randint(30, 100) + x = x.to(torch.bfloat16) + else: + x = torch.randn(*shape, dtype=dtype, device=device) * random.randint(30, 100) + x[torch.randn(*shape) > 0.5] = 0 + if with_extremal and dtype.is_floating_point: + # Use extremal values + x[torch.randn(*shape) > 0.5] = float('nan') + x[torch.randn(*shape) > 0.5] = float('inf') + x[torch.randn(*shape) > 0.5] = float('-inf') + elif with_extremal and dtype.is_complex: + x[torch.randn(*shape) > 0.5] = complex('nan') + x[torch.randn(*shape) > 0.5] = complex('inf') + x[torch.randn(*shape) > 0.5] = complex('-inf') + elif dtype == torch.bool: + x = torch.zeros(shape, dtype=dtype, device=device) + x[torch.randn(*shape) > 0.5] = True + else: + x = torch.randint(15, 100, shape, dtype=dtype, device=device) + + return x + + @dtypes(*tuple(itertools.combinations_with_replacement(all_types_and_complex_and(torch.half, + torch.bfloat16, torch.bool), 2))) def test_comparison_ops_type_promotion_and_broadcasting(self, device, dtypes): # issue #42660 # testing all combinations of broadcasting and type promotion @@ -2658,8 +2506,8 @@ def compare_with_numpy_bin_op(torch_fn, np_fn, x, y, out=None): for size1 in input_sizes: size2 = (2,) + size1 # perform broadcasting for with_extremal in [False, True]: - a = _generate_input(size1, dtypes[0], device, with_extremal) - b = _generate_input(size2, dtypes[1], device, with_extremal) + a = self._generate_input(size1, dtypes[0], device, with_extremal) + b = self._generate_input(size2, dtypes[1], device, with_extremal) for torch_op, numpy_op in op_pairs: if (dtypes[0].is_complex or dtypes[1].is_complex) and torch_op in complex_op_denylist: continue @@ -2804,8 +2652,8 @@ def 
test_bitwise_shift_float(self, device): self.assertEqual(torch_op(a, 2.2), expected_op(a, 2.2)) @onlyNativeDeviceTypes - @dtypes(*list(product(get_all_dtypes(include_complex=False), - get_all_dtypes(include_complex=False)))) + @dtypes(*list(product(all_types_and(torch.half, torch.bfloat16, torch.bool), + all_types_and(torch.half, torch.bfloat16, torch.bool)))) def test_heaviside(self, device, dtypes): input_dtype = dtypes[0] values_dtype = dtypes[1] @@ -2864,8 +2712,7 @@ def test_heaviside_cross_device(self, device): with self.assertRaisesRegex(RuntimeError, 'Expected all tensors to be on the same device'): torch.heaviside(y, x) - @dtypes(*list(product(get_all_complex_dtypes(), - get_all_complex_dtypes()))) + @dtypes(*list(product(complex_types(), complex_types()))) def test_heaviside_complex(self, device, dtypes): input_dtype = dtypes[0] values_dtype = dtypes[1] @@ -2900,15 +2747,18 @@ def _test_logical(self, device, dtypes, op, a_, b_, expected_res_): getattr(a, op + '_')(b) self.assertEqual(expected_res, a) - @dtypes(*product(get_all_dtypes(), get_all_dtypes())) + @dtypes(*product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), + all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_logical_xor(self, device, dtypes): self._test_logical(device, dtypes, 'logical_xor', [10, 0, 1, 0], [1, 0, 0, 10], [0, 0, 1, 1]) - @dtypes(*product(get_all_dtypes(), get_all_dtypes())) + @dtypes(*product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), + all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_logical_and(self, device, dtypes): self._test_logical(device, dtypes, 'logical_and', [10, 0, 1, 0], [1, 0, 0, 10], [1, 0, 0, 0]) - @dtypes(*product(get_all_dtypes(), get_all_dtypes())) + @dtypes(*product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), + all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_logical_or(self, device, dtypes): self._test_logical(device, dtypes, 'logical_or', [10, 0, 1, 0], [1, 0, 0, 10], [1, 0, 1, 1]) @@ -3017,7 +2867,7 @@ def test_logaddexp2(self, device, dtype): self._test_logaddexp(device, dtype, base2=True) def test_add(self, device): - dtypes = [torch.float, torch.double] + get_all_complex_dtypes() + dtypes = floating_and_complex_types() for dtype in dtypes: # [res] torch.add([res,] tensor1, tensor2) m1 = torch.randn(100, 100, dtype=dtype, device=device) @@ -3219,7 +3069,7 @@ def test_bool_tensor_comparison_ops(self, device): torch.tensor([0, 1, 0, 1, 0, 1], dtype=torch.bool, device=device)) self.assertFalse(a.equal(b)) - @dtypes(*get_all_dtypes(include_complex=False)) + @dtypes(*all_types_and(torch.half, torch.bfloat16, torch.bool)) def test_logical(self, device, dtype): if dtype != torch.bool: x = torch.tensor([1, 2, 3, 4], device=device, dtype=dtype) @@ -3406,8 +3256,8 @@ def test_pow_scalar_overloads_mem_overlap(self, device, dtype): self.unary_check_input_output_mem_overlap( doubles, sz, lambda input, out: torch.pow(42, input, out=out)) - @dtypes(*list(product(get_all_dtypes(include_bool=False), - get_all_dtypes(include_bool=False)))) + @dtypes(*list(product(all_types_and_complex_and(torch.half, torch.bfloat16), + all_types_and_complex_and(torch.half, torch.bfloat16)))) def test_float_power(self, device, dtypes): def to_np(value): if isinstance(value, torch.Tensor) and value.dtype == torch.bfloat16: @@ -3503,8 +3353,8 @@ def _promo_helper(x, y): torch.Tensor.float_power_(base.clone(), exp) @skipIf(not TEST_SCIPY, "Scipy required for the 
test.") - @dtypes(*product(get_all_dtypes(include_complex=False, include_bfloat16=False), - get_all_dtypes(include_complex=False, include_bfloat16=False))) + @dtypes(*product(all_types_and(torch.half, torch.bool), + all_types_and(torch.half, torch.bool))) def test_xlogy_xlog1py(self, device, dtypes): x_dtype, y_dtype = dtypes @@ -3515,7 +3365,7 @@ def out_variant_helper(torch_fn, x, y): self.assertEqual(expected, out) def xlogy_inplace_variant_helper(x, y): - if x.dtype in get_all_int_dtypes() + [torch.bool]: + if x.dtype in integral_types_and(torch.bool): with self.assertRaisesRegex(RuntimeError, "can't be cast to the desired output type"): x.clone().xlogy_(y) @@ -3642,10 +3492,7 @@ def _compare_helper(x, y, torch_fn, reference_fn): _compare_helper(t, zeros, *xlog1py_fns) _compare_helper(t, 0., *xlog1py_fns) - @dtypes(*product(get_all_dtypes(include_complex=False, - include_half=False, include_bfloat16=False), - get_all_dtypes(include_complex=False, - include_half=False, include_bfloat16=False))) + @dtypes(*product(all_types_and(torch.bool), all_types_and(torch.bool))) @skipIf(not TEST_SCIPY, "Scipy required for the test.") @slowTest def test_zeta(self, device, dtypes): @@ -3733,20 +3580,11 @@ class UnknownType: torch.uint8 ] - # TODO: refactor to use make_tensor - def _small_2d(dtype, device, has_zeros=True, fill_ones=False, oneish=False): - t = _make_tensor((5, 5), dtype, device, fill_ones=fill_ones) - if oneish: - return t.clamp(min=_number(.99, 1, dtype), max=1.01) - if not has_zeros: - return t.clamp(min=(_number(_div_min, 1, dtype))) - return t - def create_test_func(op): @dtypes(*_types) def test(self, device, dtype): # Generate the inputs - tensor = _small_2d(dtype, device) + tensor = torch.empty((), device=device, dtype=dtype) # Runs the tensor op on the device result = getattr(tensor, op)(UnknownType()) diff --git a/test/test_complex.py b/test/test_complex.py index 9f2e0ad32401af..88404902631f7e 100644 --- a/test/test_complex.py +++ b/test/test_complex.py @@ -3,12 +3,12 @@ import torch from torch.testing._internal.common_device_type import instantiate_device_type_tests, dtypes from torch.testing._internal.common_utils import TestCase, run_tests -from torch.testing._internal.common_dtype import get_all_complex_dtypes +from torch.testing._internal.common_dtype import complex_types devices = (torch.device('cpu'), torch.device('cuda:0')) class TestComplexTensor(TestCase): - @dtypes(*get_all_complex_dtypes()) + @dtypes(*complex_types()) def test_to_list(self, device, dtype): # test that the complex float tensor has expected values and # there's no garbage value in the resultant list diff --git a/test/test_cuda.py b/test/test_cuda.py index f38101c5f0475f..c5c4c422486afb 100644 --- a/test/test_cuda.py +++ b/test/test_cuda.py @@ -3943,7 +3943,7 @@ def _test_reduce_add_coalesced(self, tensors, buffer_size): r_tensors = [comm.reduce_add(t) for t in zip(*dup_tensors)] for r, t in zip(r_tensors, tensors): self.assertEqualTypeString(r, t) - self.assertEqual(r, t * 2) + self.assertEqual(r.coalesce() if r.is_sparse else r, t * 2) rc_tensors = comm.reduce_add_coalesced(dup_tensors, buffer_size=buffer_size) self.assertEqual(r_tensors, rc_tensors) diff --git a/test/test_dataloader.py b/test/test_dataloader.py index 9a1e829bde4820..4900cd31516aa8 100644 --- a/test/test_dataloader.py +++ b/test/test_dataloader.py @@ -842,6 +842,21 @@ def __len__(self): return int(math.ceil(len(self.dataset) / float(self.batch_size))) +class TestMultiEpochDataset(IterableDataset): + def __init__(self, length): + 
self.length = length + + def __iter__(self): + worker_info = torch.utils.data.get_worker_info() + assert worker_info is not None + worker_id = worker_info.id + for idx in range(self.length // worker_info.num_workers): + yield worker_id + + def __len__(self): + return self.length + + class CustomList(list): pass @@ -1426,6 +1441,19 @@ def get_dataloader(): dataset = SynchronizedSeedDataset(num_workers, batch_size, num_workers) self.assertEqual(set(int(batch) for batch in get_dataloader()), set(int(batch) for batch in get_dataloader())) + def test_multi_epochs_reproducibility(self): + num_workers = 2 + batch_size = 10 + num_epochs = 3 + + dataset = TestMultiEpochDataset(batch_size * num_workers) + dataloader = self._get_data_loader(dataset, batch_size=batch_size, + shuffle=False, num_workers=num_workers) + + for ind in range(num_epochs): + for batch_idx, sample in enumerate(dataloader): + self.assertEqual(sample.tolist(), [batch_idx % num_workers] * batch_size) + def test_worker_init_fn(self): dataset = SeedDataset(4) dataloader = self._get_data_loader(dataset, batch_size=2, num_workers=2, @@ -2145,6 +2173,13 @@ def test_basics(self): self.assertEqual(list(dl), list(dl2)) self.assertEqual(list(dl), list(dl2_threading)) + class Sorter(IterDataPipe): + def __init__(self, datapipe): + self.datapipe = datapipe + + def __iter__(self): + return iter(sorted(self.datapipe)) + def test_shuffle(self): items = list(range(1000)) dp = IterableWrapper(items).sharding_filter().shuffle() @@ -2152,19 +2187,27 @@ def test_shuffle(self): dl = DataLoader2(dp, batch_size=None, num_workers=2, shuffle=False) self.assertEqual(items, list(dl)) - dl = DataLoader(dp, batch_size=None, num_workers=2, shuffle=False, - worker_init_fn=torch.utils.data.backward_compatibility.worker_init_fn) + dl = DataLoader2(dp, batch_size=None, num_workers=2, shuffle=False, + worker_init_fn=torch.utils.data.backward_compatibility.worker_init_fn) self.assertEqual(items, list(dl)) dl = DataLoader2(dp, batch_size=None, num_workers=2, shuffle=True) self.assertNotEqual(items, list(dl)) self.assertEqual(items, sorted(list(dl))) - dl = DataLoader(dp, batch_size=None, num_workers=2, shuffle=True, - worker_init_fn=torch.utils.data.backward_compatibility.worker_init_fn) + dl = DataLoader2(dp, batch_size=None, num_workers=2, shuffle=True, + worker_init_fn=torch.utils.data.backward_compatibility.worker_init_fn) self.assertNotEqual(items, list(dl)) self.assertEqual(items, sorted(list(dl))) + dl = DataLoader2(self.Sorter(dp), batch_size=None, num_workers=2, shuffle=True) + self.assertEqual(list(dl), items) + + dl = DataLoader2(self.Sorter(dp), batch_size=None, num_workers=2, shuffle=True, + worker_init_fn=torch.utils.data.backward_compatibility.worker_init_fn) + self.assertEqual(list(dl), items) + + @unittest.skipIf( TEST_WITH_TSAN, "Fails with TSAN with the following error: starting new threads after multi-threaded " diff --git a/test/test_datapipe.py b/test/test_datapipe.py index 25d8728be001b3..09900d21dafcc1 100644 --- a/test/test_datapipe.py +++ b/test/test_datapipe.py @@ -565,7 +565,11 @@ class TestFunctionalIterDataPipe(TestCase): def _serialization_test_helper(self, datapipe): serialized_dp = pickle.dumps(datapipe) deserialized_dp = pickle.loads(serialized_dp) - self.assertEqual(list(datapipe), list(deserialized_dp)) + try: + self.assertEqual(list(datapipe), list(deserialized_dp)) + except AssertionError as e: + print(f"{datapipe} is failing.") + raise e def _serialization_test_for_single_dp(self, dp): # 1. 
Testing for serialization before any iteration starts @@ -598,43 +602,44 @@ def _serialization_test_for_dp_with_children(self, dp1, dp2): self._serialization_test_helper(dp2) def test_serializable(self): - input_dp = dp.iter.IterableWrapper(range(10)) - picklable_datapipes: List[Tuple[Type[IterDataPipe], Tuple, Dict[str, Any]]] = [ - (dp.iter.Batcher, (3, True,), {}), - (dp.iter.Collator, (_fake_fn,), {}), - (dp.iter.Concater, (dp.iter.IterableWrapper(range(5)),), {}), - (dp.iter.Demultiplexer, (2, _fake_filter_fn), {}), - (dp.iter.FileLister, (), {}), - (dp.iter.FileOpener, (), {}), - (dp.iter.Filter, (_fake_filter_fn,), {}), - (dp.iter.Filter, (partial(_fake_filter_fn_constant, 5),), {}), - (dp.iter.Forker, (2,), {}), - (dp.iter.Grouper, (_fake_filter_fn,), {"group_size": 2}), - (dp.iter.IterableWrapper, (), {}), - (dp.iter.Mapper, (_fake_fn, ), {}), - (dp.iter.Mapper, (partial(_fake_add, 1), ), {}), - (dp.iter.Multiplexer, (input_dp,), {}), - (dp.iter.Sampler, (), {}), - (dp.iter.Shuffler, (), {}), - (dp.iter.StreamReader, (), {}), - (dp.iter.UnBatcher, (0,), {}), - (dp.iter.Zipper, (input_dp,), {}), + picklable_datapipes: List = [ + (dp.iter.Batcher, None, (3, True,), {}), + (dp.iter.Collator, None, (_fake_fn,), {}), + (dp.iter.Concater, None, (dp.iter.IterableWrapper(range(5)),), {}), + (dp.iter.Demultiplexer, None, (2, _fake_filter_fn), {}), + (dp.iter.FileLister, ".", (), {}), + (dp.iter.FileOpener, None, (), {}), + (dp.iter.Filter, None, (_fake_filter_fn,), {}), + (dp.iter.Filter, None, (partial(_fake_filter_fn_constant, 5),), {}), + (dp.iter.Forker, None, (2,), {}), + (dp.iter.Grouper, None, (_fake_filter_fn,), {"group_size": 2}), + (dp.iter.IterableWrapper, range(10), (), {}), + (dp.iter.Mapper, None, (_fake_fn, ), {}), + (dp.iter.Mapper, None, (partial(_fake_add, 1), ), {}), + (dp.iter.Multiplexer, None, (dp.iter.IterableWrapper(range(10)),), {}), + (dp.iter.Sampler, None, (), {}), + (dp.iter.Shuffler, dp.iter.IterableWrapper([0] * 10), (), {}), + (dp.iter.StreamReader, None, (), {}), + (dp.iter.UnBatcher, None, (0,), {}), + (dp.iter.Zipper, None, (dp.iter.IterableWrapper(range(10)),), {}), ] # Skipping comparison for these DataPipes - dp_skip_comparison = {dp.iter.FileLister, dp.iter.FileOpener, dp.iter.StreamReader, dp.iter.Shuffler} + dp_skip_comparison = {dp.iter.FileOpener, dp.iter.StreamReader} # These DataPipes produce multiple DataPipes as outputs and those should be compared dp_compare_children = {dp.iter.Demultiplexer, dp.iter.Forker} - for dpipe, dp_args, dp_kwargs in picklable_datapipes: + for dpipe, custom_input, dp_args, dp_kwargs in picklable_datapipes: + if custom_input is None: + custom_input = dp.iter.IterableWrapper(range(10)) if dpipe in dp_skip_comparison: # Merely make sure they are picklable and loadable (no value comparison) - datapipe = dpipe(input_dp, *dp_args, **dp_kwargs) # type: ignore[call-arg] + datapipe = dpipe(custom_input, *dp_args, **dp_kwargs) # type: ignore[call-arg] serialized_dp = pickle.dumps(datapipe) _ = pickle.loads(serialized_dp) elif dpipe in dp_compare_children: # DataPipes that have children - dp1, dp2 = dpipe(input_dp, *dp_args, **dp_kwargs) # type: ignore[call-arg] + dp1, dp2 = dpipe(custom_input, *dp_args, **dp_kwargs) # type: ignore[call-arg] self._serialization_test_for_dp_with_children(dp1, dp2) else: # Single DataPipe that requires comparison - datapipe = dpipe(input_dp, *dp_args, **dp_kwargs) # type: ignore[call-arg] + datapipe = dpipe(custom_input, *dp_args, **dp_kwargs) # type: ignore[call-arg] 
self._serialization_test_for_single_dp(datapipe) def test_serializable_with_dill(self): @@ -1402,6 +1407,10 @@ def test_shuffle_iterdatapipe(self): with self.assertRaisesRegex(TypeError, r"instance doesn't have valid length$"): len(shuffle_dp_nl) + # Test: deactivate shuffling via set_shuffle + unshuffled_dp = input_ds.shuffle().set_shuffle(False) + self.assertEqual(list(unshuffled_dp), list(input_ds)) + def test_zip_iterdatapipe(self): # Functional Test: raises TypeError when an input is not of type `IterDataPipe` @@ -1433,30 +1442,45 @@ def test_zip_iterdatapipe(self): class TestFunctionalMapDataPipe(TestCase): - def _serialization_test_helper(self, datapipe, has_two_children=False): + def _serialization_test_helper(self, datapipe): serialized_dp = pickle.dumps(datapipe) deserialized_dp = pickle.loads(serialized_dp) - if not has_two_children: + try: self.assertEqual(list(datapipe), list(deserialized_dp)) - else: - for c1, c2 in zip(list(datapipe), list(deserialized_dp)): - self.assertEqual(list(c1), list(c2)) + except AssertionError as e: + print(f"{datapipe} is failing.") + raise e + + def _serialization_test_for_single_dp(self, dp): + # 1. Testing for serialization before any iteration starts + self._serialization_test_helper(dp) + # 2. Testing for serialization after DataPipe is partially read + it = iter(dp) + _ = next(it) + self._serialization_test_helper(dp) + # 3. Testing for serialization after DataPipe is fully read + _ = list(it) + self._serialization_test_helper(dp) def test_serializable(self): - input_dp = dp.map.SequenceWrapper(range(10)) - picklable_datapipes: List[ - Tuple[Type[MapDataPipe], Tuple, Dict[str, Any]] - ] = [ - (dp.map.Mapper, (), {}), - (dp.map.Mapper, (_fake_fn, ), {}), - (dp.map.Mapper, (partial(_fake_add, 1), ), {}), + picklable_datapipes: List = [ + (dp.map.Batcher, None, (2,), {}), + (dp.map.Concater, None, (dp.map.SequenceWrapper(range(10)),), {}), + (dp.map.Mapper, None, (), {}), + (dp.map.Mapper, None, (_fake_fn, ), {}), + (dp.map.Mapper, None, (partial(_fake_add, 1), ), {}), + (dp.map.SequenceWrapper, range(10), (), {}), + (dp.map.Shuffler, dp.map.SequenceWrapper([0] * 5), (), {}), + (dp.map.Zipper, None, (dp.map.SequenceWrapper(range(10)),), {}), ] - for dpipe, dp_args, dp_kwargs in picklable_datapipes: - _ = pickle.dumps(dpipe(input_dp, *dp_args, **dp_kwargs)) # type: ignore[call-arg] - datapipe = dpipe(input_dp, *dp_args, **dp_kwargs) # type: ignore[call-arg] - self._serialization_test_helper(datapipe) + for dpipe, custom_input, dp_args, dp_kwargs in picklable_datapipes: + if custom_input is None: + custom_input = dp.map.SequenceWrapper(range(10)) + datapipe = dpipe(custom_input, *dp_args, **dp_kwargs) # type: ignore[call-arg] + self._serialization_test_for_single_dp(datapipe) def test_serializable_with_dill(self): + """Only for DataPipes that take in a function as argument""" input_dp = dp.map.SequenceWrapper(range(10)) unpicklable_datapipes: List[ Tuple[Type[MapDataPipe], Tuple, Dict[str, Any]] @@ -1655,7 +1679,7 @@ class A(IterDataPipe[P]): @skipTyping def test_subtype(self): - from torch.utils.data._typing import issubtype + from torch.utils.data.datapipes._typing import issubtype basic_type = (int, str, bool, float, complex, list, tuple, dict, set, T_co) @@ -1703,7 +1727,7 @@ def test_subtype(self): @skipTyping def test_issubinstance(self): - from torch.utils.data._typing import issubinstance + from torch.utils.data.datapipes._typing import issubinstance basic_data = (1, '1', True, 1., complex(1., 0.)) basic_type = (int, str, bool, float, 
complex) @@ -1773,7 +1797,7 @@ def __iter__(self) -> Iterator[Tuple[int, str]]: self.assertTrue(issubclass(DP1, IterDataPipe)) dp1 = DP1(10) - self.assertTrue(DP1.type.issubtype(dp1.type) and dp1.type.issubtype(DP1.type)) + self.assertTrue(DP1.type.issubtype(dp1.type) and dp1.type.issubtype(DP1.type)) # type: ignore[attr-defined] dp1_ = DP1(5) self.assertEqual(dp1.type, dp1_.type) @@ -1789,7 +1813,7 @@ def __iter__(self) -> Iterator[T_co]: self.assertTrue(issubclass(DP2, IterDataPipe)) dp2 = DP2() # type: ignore[var-annotated] - self.assertTrue(DP2.type.issubtype(dp2.type) and dp2.type.issubtype(DP2.type)) + self.assertTrue(DP2.type.issubtype(dp2.type) and dp2.type.issubtype(DP2.type)) # type: ignore[attr-defined] dp2_ = DP2() # type: ignore[var-annotated] self.assertEqual(dp2.type, dp2_.type) @@ -1805,7 +1829,7 @@ def __iter__(self) -> Iterator[Tuple[T_co, str]]: self.assertTrue(issubclass(DP3, IterDataPipe)) dp3 = DP3(range(10)) # type: ignore[var-annotated] - self.assertTrue(DP3.type.issubtype(dp3.type) and dp3.type.issubtype(DP3.type)) + self.assertTrue(DP3.type.issubtype(dp3.type) and dp3.type.issubtype(DP3.type)) # type: ignore[attr-defined] dp3_ = DP3(5) # type: ignore[var-annotated] self.assertEqual(dp3.type, dp3_.type) @@ -1827,7 +1851,7 @@ def __iter__(self) -> Iterator[str]: self.assertTrue(issubclass(DP5, IterDataPipe)) dp5 = DP5() - from torch.utils.data._typing import issubtype + from torch.utils.data.datapipes._typing import issubtype self.assertTrue(issubtype(dp5.type.param, Any) and issubtype(Any, dp5.type.param)) class DP6(IterDataPipe[int]): @@ -1844,13 +1868,13 @@ class DP7(IterDataPipe[Awaitable[T_co]]): r""" DataPipe with abstract base class""" self.assertTrue(issubclass(DP7, IterDataPipe)) - self.assertTrue(DP7.type.param == Awaitable[T_co]) + self.assertTrue(DP7.type.param == Awaitable[T_co]) # type: ignore[attr-defined] class DP8(DP7[str]): r""" DataPipe subclass from a DataPipe with abc type""" self.assertTrue(issubclass(DP8, IterDataPipe)) - self.assertTrue(DP8.type.param == Awaitable[str]) + self.assertTrue(DP8.type.param == Awaitable[str]) # type: ignore[attr-defined] @skipTyping def test_construct_time(self): @@ -1985,6 +2009,35 @@ def test_traverse_forked(self): self.assertEqual(expected, graph) +class TestCircularSerialization(TestCase): + + class CustomIterDataPipe(IterDataPipe): + def add_one(self, x): + return x + 1 + + def classify(self, x): + return 0 + + def __init__(self): + self._dp = dp.iter.IterableWrapper([1, 2, 4]).map(self.add_one).demux(2, self.classify)[0] + + def __iter__(self): + yield from self._dp + + def test_circular_reference(self): + self.assertEqual( + list(TestCircularSerialization.CustomIterDataPipe()), + list(pickle.loads(pickle.dumps(TestCircularSerialization.CustomIterDataPipe()))) + ) + _ = traverse(TestCircularSerialization.CustomIterDataPipe(), only_datapipe=True) + _ = traverse(TestCircularSerialization.CustomIterDataPipe(), only_datapipe=False) + + # TODO: Ensure this works with `dill` installed + # @skipIfNoDill + # def test_circular_serialization_with_dill(self): + # assert list(self._CustomIterDataPipe()) == list(dill.loads(dill.dumps(self._CustomIterDataPipe()))) + + class TestSharding(TestCase): def _get_pipeline(self): diff --git a/test/test_dispatch.py b/test/test_dispatch.py index 37a6054f9151e6..bf609cf50b3e3c 100644 --- a/test/test_dispatch.py +++ b/test/test_dispatch.py @@ -532,8 +532,8 @@ def test_computed_table_with_ambiguous_autogradother(self): lambda m: m.def_("foo(Tensor x) -> Tensor"), # m.impl("foo", 
torch::kCompositeImplicitAutograd, [](const Tensor & x) { return x }) lambda m: m.impl_t_t("foo", "CompositeImplicitAutograd", debug="fn_math"), - # m.impl("foo", torch::kQuantizedCPU, [](const Tensor & x) { return x }) - lambda m: m.impl_t_t("foo", "QuantizedCPU", debug="fn_quantizedcpu"), + # m.impl("foo", torch::kFPGA, [](const Tensor & x) { return x }) + lambda m: m.impl_t_t("foo", "FPGA", debug="fn_fpga"), ]) state, table = result.state, result.table self.assertExpectedInline(state, '''\ @@ -541,12 +541,12 @@ def test_computed_table_with_ambiguous_autogradother(self): schema: test::foo(Tensor x) -> (Tensor) debug: registered at /dev/null:0 alias analysis kind: FROM_SCHEMA -QuantizedCPU: fn_quantizedcpu :: (Tensor _0) -> (Tensor _0) [ boxed unboxed ] +FPGA: fn_fpga :: (Tensor _0) -> (Tensor _0) [ boxed unboxed ] CompositeImplicitAutograd[alias]: fn_math :: (Tensor _0) -> (Tensor _0) [ boxed unboxed ] ''') # computed dispatch table is too big, so we only check on a few entries we're interested in. - extracted_table = extract_dispatch_table_with_keys(table, dispatch_keys_to_check + ('QuantizedCPU',)) + extracted_table = extract_dispatch_table_with_keys(table, dispatch_keys_to_check + ('FPGA',)) self.assertExpectedInline(extracted_table, '''\ Undefined: fn_math [math kernel] @@ -557,7 +557,7 @@ def test_computed_table_with_ambiguous_autogradother(self): AutogradCPU: fn_math [math kernel] AutogradCUDA: fn_math [math kernel] AutogradXLA: fn_math [math kernel] -QuantizedCPU: fn_quantizedcpu [kernel] +FPGA: fn_fpga [kernel] ''') def test_computed_table_with_cpu_defaultbackend(self): @@ -616,7 +616,7 @@ def test_computed_table_with_cpu_autograd_defaultbackend(self): ''') # computed dispatch table is too big, so we only check on a few entries we're interested in. 
- extracted_table = extract_dispatch_table_with_keys(table, dispatch_keys_to_check + ('QuantizedCPU',)) + extracted_table = extract_dispatch_table_with_keys(table, dispatch_keys_to_check + ('FPGA',)) self.assertExpectedInline(extracted_table, '''\ Undefined: fn_defaultbackend [default backend kernel] @@ -627,7 +627,7 @@ def test_computed_table_with_cpu_autograd_defaultbackend(self): AutogradCPU: fn_autograd [autograd kernel] AutogradCUDA: fn_autograd [autograd kernel] AutogradXLA: fn_autograd [autograd kernel] -QuantizedCPU: fn_defaultbackend [default backend kernel] +FPGA: fn_defaultbackend [default backend kernel] ''') def test_computed_table_with_cpu_autograd_math_defaultbackend(self): @@ -808,7 +808,7 @@ def test_basic(self): CPU fn_CPU [kernel] XLA fn_XLA [kernel] Lazy fn_Lazy [kernel] -QuantizedCPU fn_CompositeImplicitAutograd [math kernel] +FPGA fn_CompositeImplicitAutograd [math kernel] AutogradOther fn_CompositeImplicitAutograd [math kernel] AutogradCPU fallthrough [backend fallback] AutogradXLA fallthrough [backend fallback] @@ -829,7 +829,7 @@ def test_math_autogradcpu(self): CPU fn_CPU [kernel] XLA fn_XLA [kernel] Lazy fn_Lazy [kernel] -QuantizedCPU fn_CompositeImplicitAutograd [math kernel] +FPGA fn_CompositeImplicitAutograd [math kernel] AutogradOther fn_CompositeImplicitAutograd [math kernel] AutogradCPU fn_AutogradCPU [kernel] AutogradXLA fallthrough [backend fallback] @@ -864,7 +864,7 @@ def test_defaultbackend_autogradcpu(self): CPU fn_CPU [kernel] XLA fn_XLA [kernel] Lazy fn_Lazy [kernel] -QuantizedCPU fn_CompositeExplicitAutograd [default backend kernel] +FPGA fn_CompositeExplicitAutograd [default backend kernel] AutogradOther fallthrough [backend fallback] AutogradCPU fn_AutogradCPU [kernel] AutogradXLA fallthrough [backend fallback] @@ -889,7 +889,7 @@ def test_defaultbackend_autogradcpu(self): def test_autogradother(self): dispatcher = PythonDispatcher() - dispatcher.register(["CPU", "QuantizedCPU", "CompositeImplicitAutograd"]) + dispatcher.register(["CPU", "FPGA", "CompositeImplicitAutograd"]) self.assertExpectedInline( dispatcher.dispatchTable(), '''\ @@ -900,7 +900,7 @@ def test_autogradother(self): CPU fn_CPU [kernel] XLA fn_CompositeImplicitAutograd [math kernel] Lazy fn_CompositeImplicitAutograd [math kernel] -QuantizedCPU fn_QuantizedCPU [kernel] +FPGA fn_FPGA [kernel] AutogradOther ambiguous_autogradother [ambiguous autogradother] AutogradCPU fallthrough [backend fallback] AutogradXLA fn_CompositeImplicitAutograd [math kernel] @@ -915,8 +915,8 @@ def test_autogradother(self): Registered Kernels key kernel --------------------------- +FPGA fn_FPGA CPU fn_CPU -QuantizedCPU fn_QuantizedCPU CompositeImplicitAutograd[alias] fn_CompositeImplicitAutograd ''' ) @@ -935,5 +935,20 @@ def test_defaultbackend_math(self): r"Registration to both CompositeImplicitAutograd and CompositeExplicitAutograd is not allowed"): dispatcher.register(["CompositeExplicitAutograd", "CompositeImplicitAutograd"]) + def test_quantized_structured_not_implemented(self): + x = torch.zeros([1, 1, 1]) + y = torch.zeros([1, 1, 1]) + scale, zero_point = 1.0, 0 + dtype = torch.qint8 + qx = torch.quantize_per_tensor(x, scale, zero_point, dtype) + qy = torch.quantize_per_tensor(y, scale, zero_point, dtype) + # If bmm gets quantized support you need to update this to something + # else that is not implemented + self.assertRaisesRegex( + NotImplementedError, + "Could not run 'aten::bmm.out' with arguments from the 'QuantizedCPU' backend.", + lambda: torch.bmm(qx, qy) + ) + if __name__ == '__main__': 
run_tests() diff --git a/test/test_expanded_weights.py b/test/test_expanded_weights.py index 6c697b6c721bcb..63d08fa55a6255 100644 --- a/test/test_expanded_weights.py +++ b/test/test_expanded_weights.py @@ -1,12 +1,15 @@ # Owner(s): ["module: nn"] from functools import partial -from itertools import product +from itertools import product, chain import unittest import torch import torch.nn as nn +import torch.nn.functional as F +from torch.nn import CrossEntropyLoss from torch.nn.utils._per_sample_grad import call_for_per_sample_grads +from torch.testing._internal.common_cuda import TEST_CUDA from torch.testing._internal.common_device_type import OpDTypes, instantiate_device_type_tests, ops from torch.testing._internal.common_nn import TestBase, module_tests, new_module_tests from torch.testing._internal.common_utils import TestCase, freeze_rng_state, make_tensor, run_tests @@ -159,7 +162,7 @@ def test_expanded_weight_per_sample_grad(self, device, dtype, op): for (result_grad, expected_grad) in zip(expanded_weight_grad, per_sample_grad): if result_grad is None: result_grad = torch.zeros_like(expected_grad) - assert torch.allclose(result_grad, expected_grad), f"Got {result_grad}, expected {expected_grad}" + self.assertEqual(result_grad, expected_grad) @ops(filter(lambda op: op.supports_expanded_weight, op_db), dtypes=OpDTypes.supported, allowed_dtypes=(torch.double,)) def test_unsupported_expand_weights(self, device, dtype, op): @@ -185,10 +188,16 @@ def test_unsupported_expand_weights(self, device, dtype, op): def test_expanded_weight_forward(self, device, dtype, op): sample_inputs = op.sample_inputs(device, dtype) for sample_input in supported_inputs(op, sample_inputs): + if op.name == "nn.functional.embedding": # embedding flips its argument order for autograd tests + sample_input = SampleInput(sample_input.args[0].clone(), + args=(sample_input.input.clone(),), + kwargs=sample_input.kwargs) + if "cuda" in device and "max_norm" in sample_input.kwargs and "padding_idx" in sample_input.kwargs: + self.skipTest("embedding is non-determinstic in this case, see issue #74679") batch_size = sample_input.input.shape[0] if len(sample_input.input.shape) > 1 else 1 (ew_input, ew_args, ew_kwargs) = make_expanded_weight(sample_input, batch_size) - expanded_weight_result = op(ew_input, *ew_args, **ew_kwargs) - normal_result = op(sample_input.input, *sample_input.args, **sample_input.kwargs) + expanded_weight_result = run_op(op, ew_input, *ew_args, **ew_kwargs) + normal_result = run_op(op, sample_input.input, *sample_input.args, **sample_input.kwargs) self.assertEqual(expanded_weight_result, normal_result) def test_expanded_weight_error(self, device): @@ -198,10 +207,63 @@ def test_expanded_weight_error(self, device): with self.assertRaisesRegex(RuntimeError, r"Expanded Weights encountered but cannot handle function"): torch.add(sample_input, ExpandedWeight(sample_weight, batch_size)) + def test_small_model(self, device): + def convnet(num_classes): + return nn.Sequential( + nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), + nn.ReLU(), + nn.AvgPool2d(kernel_size=2, stride=2), + nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), + nn.ReLU(), + nn.AvgPool2d(kernel_size=2, stride=2), + nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), + nn.ReLU(), + nn.AvgPool2d(kernel_size=2, stride=2), + nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), + nn.ReLU(), + nn.AdaptiveAvgPool2d((1, 1)), + nn.Flatten(start_dim=1, end_dim=-1), + nn.Linear(128, num_classes, bias=True), + ) + + batch_size = 32 + 
model = convnet(10).to(device) + input = torch.randn([batch_size, 3, 28, 28], device=device) + targets = torch.randint(0, 10, (batch_size,), device=device) + criterion = CrossEntropyLoss(reduction='sum') # use a loss that doesn't average across the batch to test in a for loop + result = call_for_per_sample_grads(model, batch_size, input) + loss = criterion(result, targets) + loss.backward() + result = [] + for weight in model.parameters(): + result.append(weight.grad_sample) + del weight.grad_sample + + expected = [] + for i in range(batch_size): + loss = criterion(model(input[i].unsqueeze(0)), targets[i].unsqueeze(0)) + expected.append(torch.autograd.grad(loss, model.parameters(), torch.ones_like(loss))) + + expected = [torch.stack(grad) for grad in zip(*expected)] + for (res, exp) in zip(result, expected): + self.assertEqual(res, exp, atol=1e-4, rtol=5e-5) + + def test_group_norm_error(self, device): + # group norm has to call native_group_norm. This checks that it hits the same errors + # that normal group norm would + + N = 3 + C = 5 + inp = torch.randn(N, C) + with self.assertRaisesRegex(RuntimeError, r"Expected number of channels in input to be divisible"): + F.group_norm(inp, 2) # 5 is not divisible by 2 class TestExpandedWeightModule(TestCase): def _do_test(self, module, input): batch_size = input.shape[0] + diff_input = input.dtype == torch.float or input.dtype == torch.double + if diff_input: + input.requires_grad_() with freeze_rng_state(): # get per sample grads with ExpandedWeights context manager actual_res = call_for_per_sample_grads(module, batch_size, input).sum() @@ -210,17 +272,25 @@ def _do_test(self, module, input): for param in module.parameters(): actual_grads.append(param.grad_sample) del param.grad_sample + if diff_input: + actual_grads.append(input.grad.clone()) + input.grad = torch.zeros_like(input.grad) # get per sample grads with a for loop - expected_res = torch.tensor(0.) 
+ expected_res = torch.tensor(0., device=input.device, dtype=torch.double) expected_grads = [] for i in range(batch_size): - res = module(input[i].unsqueeze(0)).sum() - expected_grads.append(torch.autograd.grad(res, module.parameters(), torch.ones_like(res))) + input_slice = input[i] + diff_params = module.parameters() + if diff_input: + diff_params = chain(diff_params, (input_slice,)) + res = module(input_slice.unsqueeze(0)).sum() + out_grads = torch.autograd.grad(res, diff_params, torch.ones_like(res), allow_unused=True) + expected_grads.append(out_grads) expected_res += res expected_grads = tuple(torch.stack(grad) for grad in zip(*expected_grads)) self.assertEqual(actual_res, expected_res) - assert [torch.allclose(actual, expected) for (actual, expected) in zip(actual_grads, expected_grads)] + [self.assertEqual(actual, expected) for (actual, expected) in zip(actual_grads, expected_grads)] def _do_test_multi_input(self, module, input): class TestModule(nn.Module): @@ -232,6 +302,9 @@ def forward(self, input): return self.module(input) + self.module(input) batch_size = input.shape[0] + diff_input = input.dtype == torch.float or input.dtype == torch.double + if diff_input: + input.requires_grad_() with freeze_rng_state(): # get per sample grads with ExpandedWeights context manager, calling .backward() twice test_module = TestModule(module) @@ -241,14 +314,24 @@ def forward(self, input): for param in module.parameters(): actual_grads.append(param.grad_sample) del param.grad_sample + if diff_input: + actual_grads.append(input.grad.clone()) + input.grad = torch.zeros_like(input.grad) + # get per sample grads with a for loop, running over the input twice expected_grads = [] for i in range(batch_size): - res = module(input[i].unsqueeze(0)).sum() - expected_grads.append(torch.autograd.grad(res, module.parameters(), torch.ones_like(res))) - expected_grads = tuple(torch.stack(grad) for grad in zip(*expected_grads)) - assert [torch.allclose(actual, 2 * expected) for (actual, expected) in zip(actual_grads, expected_grads)] + input_slice = input[i] + diff_params = module.parameters() + if diff_input: + diff_params = chain(diff_params, (input_slice,)) + res = module(input_slice.unsqueeze(0)).sum() + out_grads = torch.autograd.grad(res, diff_params, torch.ones_like(res), allow_unused=True) + expected_grads.append(out_grads) + expected_grads = tuple(torch.stack(grad) for grad in zip(*expected_grads)) + expected_grads = tuple(expected_grad for expected_grad in expected_grads if expected_grad is not None) + assert [self.assertEqual(actual, 2 * expected) for (actual, expected) in zip(actual_grads, expected_grads)] def test_per_sample_api_failing(self): module = nn.Linear(10, 10) @@ -266,23 +349,28 @@ def test_per_sample_api_failing(self): class ContextManagerTests(TestBase): def __init__(self, *args, **kwargs): + self.test_cpu = kwargs.get('test_cpu', True) + self.test_cuda = kwargs.get('test_cuda', True) super().__init__(*args, **kwargs) @property def constructor_args(self): return self._get_arg('constructor_args', False) - def test_context_manager(self, test_case): - module = self.constructor(*self.constructor_args) - input = self._get_input() + def test_context_manager(self, test_case, device): + kwargs = {'device': device, 'dtype': torch.double} + module = self.constructor(*self.constructor_args).to(**kwargs) + if 'Embedding' in self.get_name(): + kwargs['dtype'] = torch.long + input = self._get_input().to(**kwargs) if len(input.shape) == 0 or input.shape[0] == 0: raise unittest.SkipTest("Can't get per 
sample gradients when no batch dim or batch dim is 0") if self.constructor == torch.nn.Linear and len(input.shape) == 1: raise unittest.SkipTest("Can't get per sample gradients for input of rank 1") test_case._do_test(module, input) - def test_context_manager_multiple_inputs(self, test_case): - module = self.constructor(*self.constructor_args) + def test_context_manager_multiple_inputs(self, test_case, device): + module = self.constructor(*self.constructor_args).to(device) input = self._get_input() if len(input.shape) == 0 or input.shape[0] == 0: raise unittest.SkipTest("Can't get per sample gradients when no batch dim or batch dim is 0") @@ -292,7 +380,7 @@ def test_context_manager_multiple_inputs(self, test_case): # TODO: Once all of these use ModuleInfo, replace with ModuleInfo tests # These currently use the legacy nn tests -supported_modules = ['Linear'] +supported_modules = ['Linear', 'Conv1d', 'Conv2d', 'Conv3d', 'Embedding', 'LayerNorm', 'GroupNorm'] supported_tests = [t for t in module_tests + new_module_tests if 'module_name' in t and t['module_name'] in supported_modules] for test_param in supported_tests: if 'constructor' not in test_param: @@ -308,9 +396,14 @@ def test_context_manager_multiple_inputs(self, test_case): raise RuntimeError('Found two tests with the same name: ' + test_name) if decorator is not None: fn = decorator(fn) - setattr(TestExpandedWeightModule, test_name, lambda self, test=test: test.test_context_manager(self)) - setattr(TestExpandedWeightModule, test_name_multi_input, - lambda self, test=test: test.test_context_manager_multiple_inputs(self)) + if test.test_cpu: + setattr(TestExpandedWeightModule, test_name, lambda self, test=test: test.test_context_manager(self, 'cpu')) + setattr(TestExpandedWeightModule, test_name_multi_input, + lambda self, test=test: test.test_context_manager_multiple_inputs(self, 'cpu')) + if TEST_CUDA and test.test_cuda: + # since this checks derivatives, only use double for precision + setattr(TestExpandedWeightModule, test_name + '_cuda_double', + lambda self, test=test: test.test_context_manager(self, 'cuda')) # ------------- HELPER FUNCTIONS ----------------- @@ -340,12 +433,13 @@ def supported_inputs(op, sample_inputs, supported_inputs=True): operations that would cause inter-batch operations. 
Removes all of the cases it cannot deal with """ def filter_fn(input): + convolutions = ["nn.functional.conv1d", "nn.functional.conv2d", "nn.functional.conv3d"] if op.name == "nn.functional.linear": is_supported_input = len(input.input.shape) > 1 # input of rank 1 means no batch dim elif op.name == "nn.functional.layer_norm": normalized_shape = input.args[0] is_supported_input = input.input.shape != normalized_shape # would cause inter-batch operations - elif op.name == "nn.functional.conv2d": + elif op.name in convolutions: # currently can't deal with padding computation on Python level is_supported_input = 'padding' not in input.kwargs or not isinstance(input.kwargs['padding'], str) elif op.name == "nn.functional.embedding": diff --git a/test/test_foreach.py b/test/test_foreach.py index a04ddcebbaaecd..4da23dc66fc3b7 100644 --- a/test/test_foreach.py +++ b/test/test_foreach.py @@ -11,12 +11,13 @@ from torch.testing._comparison import default_tolerances from torch.testing._internal.common_utils import TestCase, run_tests, TEST_WITH_ROCM, TEST_WITH_SLOW from torch.testing._internal.common_device_type import \ - (instantiate_device_type_tests, dtypes, onlyCUDA, skipCUDAIfRocm, skipMeta, ops) + (instantiate_device_type_tests, dtypes, onlyCUDA, skipMeta, ops) from torch.testing._internal.common_methods_invocations import ( foreach_unary_op_db, foreach_binary_op_db, foreach_pointwise_op_db, foreach_minmax_op_db, foreach_reduce_op_db) from torch.testing._internal.common_dtype import ( - get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes, + all_types_and_complex_and, all_types_and, integral_types, complex_types, + floating_types_and, floating_types, integral_types_and, ) # Includes some values such that N * N won't be a multiple of 4, @@ -140,7 +141,7 @@ def _test_binary_op_tensorlists(self, device, dtype, opinfo, N, is_fastpath, dis self._binary_test(dtype, inplace_op, inplace_ref, inputs, is_fastpath, is_inplace=True) if opinfo.supports_alpha_param: alpha = None - if dtype in get_all_int_dtypes(): + if dtype in integral_types(): alpha = 3 elif dtype.is_complex: alpha = complex(3, 3) @@ -165,19 +166,11 @@ def _test_binary_op_tensorlists(self, device, dtype, opinfo, N, is_fastpath, dis self._binary_test( dtype, inplace_op, inplace_ref, inputs, is_fastpath and disable_fastpath, is_inplace=True) - # note(mkozuki): Why ROCm? - # ROCm is supposed to compile slow path as in - # https://github.com/pytorch/pytorch/blob/7e032f18cf1405804c4f787b05ea2de5e08a091e/aten/src/ATen/native/ForeachUtils.h#L148-L164, # noqa: E501 - # Therefore `[torch.add(*args, alpha=alpha) for args in zip(tensors1, tensors2)]` and - # `torch._foreach_add(tensors1, tensors2, alpha=alpha)` - # are expected to return the same outputs, however, the outputs look unstable for torch.bfloat16 and torch.half. 
- # log: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/2741/console - @skipCUDAIfRocm @skipMeta @ops(foreach_binary_op_db) def test_binary_op_tensorlists_fastpath(self, device, dtype, op): for N in N_values: - disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool] + disable_fastpath = op.ref == torch.div and dtype in integral_types_and(torch.bool) if op.ref == torch.add and dtype == torch.bool: disable_fastpath = True self._test_binary_op_tensorlists(device, dtype, op, N, True, disable_fastpath) @@ -194,22 +187,21 @@ def _test_binary_op_scalar(self, device, dtype, opinfo, N, scalar, is_fastpath, self._binary_test(dtype, op, ref, inputs, is_fastpath, is_inplace=False) self._binary_test(dtype, inplace_op, inplace_ref, inputs, is_fastpath, is_inplace=True) - @skipCUDAIfRocm @skipMeta @ops(foreach_binary_op_db) def test_binary_op_scalar_fastpath(self, device, dtype, op): for N, scalar in itertools.product(N_values, Scalars): - disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool] + disable_fastpath = op.ref == torch.div and dtype in integral_types_and(torch.bool) if isinstance(scalar, int): disable_fastpath |= dtype == torch.bool if isinstance(scalar, float): - disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool] + disable_fastpath |= dtype in integral_types_and(torch.bool) if isinstance(scalar, bool): disable_fastpath |= dtype == torch.bool if op.ref in (torch.add, torch.mul): disable_fastpath = False if isinstance(scalar, complex): - disable_fastpath |= dtype not in get_all_complex_dtypes() + disable_fastpath |= dtype not in complex_types() self._test_binary_op_scalar(device, dtype, op, N, scalar, True, disable_fastpath) @ops(foreach_binary_op_db) @@ -233,22 +225,21 @@ def _test_binary_op_scalarlist(self, device, dtype, opinfo, N, scalarlist, is_fa # errors depending on the order of scalarlist. To keep actual unit test impl simple, # separating mixed scalarlist tests. By setting the first element of scalarlist to bool, # they are expected to throw bool sub error even in inplace test. 
- @skipCUDAIfRocm @skipMeta @ops(foreach_binary_op_db) def test_binary_op_scalarlist_fastpath(self, device, dtype, op): for N in N_values: for type_str, scalarlist in getScalarLists(N): - bool_int_div = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool] + bool_int_div = op.ref == torch.div and dtype in integral_types_and(torch.bool) disable_fastpath = bool_int_div if type_str == "int": disable_fastpath |= dtype == torch.bool if type_str == "float": - disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool] + disable_fastpath |= dtype in integral_types_and(torch.bool) if type_str == "complex": - disable_fastpath |= dtype not in get_all_complex_dtypes() + disable_fastpath |= dtype not in complex_types() if type_str == "mixed": - disable_fastpath |= True and dtype not in get_all_complex_dtypes() + disable_fastpath |= True and dtype not in complex_types() self._test_binary_op_scalarlist(device, dtype, op, N, scalarlist, True, disable_fastpath) @ops(foreach_binary_op_db) @@ -305,7 +296,7 @@ def _test_pointwise_op(self, device, dtype, opinfo, N, is_fastpath, disable_fast @skipMeta @ops(foreach_pointwise_op_db) def test_pointwise_op_fastpath(self, device, dtype, op): - disable_fastpath = dtype in get_all_int_dtypes() + [torch.bool] + disable_fastpath = dtype in integral_types_and(torch.bool) # for N, scalar in itertools.product(N_values, Scalars): for N in N_values: self._test_pointwise_op(device, dtype, op, N, True, disable_fastpath) @@ -363,7 +354,7 @@ def _test_unary(self, device, dtype, opinfo, N, is_fastpath): op, ref, inplace_op, inplace_ref = self._get_funcs(opinfo, 1) inputs = opinfo.sample_inputs(device, dtype, N, noncontiguous=not is_fastpath), # note(mkozuki): Complex inputs for `_foreach_abs` go through slowpath. - if opinfo.name == "_foreach_abs" and dtype in get_all_complex_dtypes(): + if opinfo.name == "_foreach_abs" and dtype in complex_types(): is_fastpath = False self._regular_unary_test(dtype, op, ref, inputs, is_fastpath) self._inplace_unary_test(dtype, inplace_op, inplace_ref, inputs, is_fastpath) @@ -374,7 +365,7 @@ def test_unary_fastpath(self, device, dtype, op): for N in N_values: self._test_unary(device, dtype, op, N, is_fastpath=True) - @ops(foreach_unary_op_db, dtypes=get_all_dtypes()) + @ops(foreach_unary_op_db, dtypes=all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_unary_slowpath(self, device, dtype, op): for N in N_values: self._test_unary(device, dtype, op, N, is_fastpath=False) @@ -391,7 +382,7 @@ def test_minmax_fastpath(self, device, dtype, op): self._minmax_test(op, inputs, True, N if dtype == torch.bool else 1) @ops(foreach_minmax_op_db, - dtypes=get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False)) + dtypes=all_types_and(torch.half, torch.bfloat16, torch.bool)) def test_minmax_slowpath(self, device, dtype, op): for N in N_values: inputs = tuple(op.sample_inputs(device, dtype, N, noncontiguous=True) for _ in range(2)) @@ -399,7 +390,7 @@ def test_minmax_slowpath(self, device, dtype, op): # note(mkozuki): ForeachFuncInfo's of both `_foreach_maximum` and `_foreach_minimum` include integer types. # so, manually limit dtypes to fp types for inf&nan tests. 
- @ops(foreach_minmax_op_db, dtypes=get_all_fp_dtypes(include_bfloat16=True, include_half=True)) + @ops(foreach_minmax_op_db, dtypes=floating_types_and(torch.half, torch.bfloat16)) def test_minmax_float_inf_nan(self, device, dtype, op): inputs = ( [ @@ -424,7 +415,7 @@ def _reduce_test(self, opinfo, inputs, ord, is_fastpath, n_expected_cudaLaunchKe @ops(foreach_reduce_op_db) def test_reduce_fastpath(self, device, dtype, op): for N, ord in itertools.product(N_values, (0, 1, 2, -1, -2)): - if ord in (1, 2) and dtype in torch.testing.get_all_fp_dtypes(): + if ord in (1, 2) and dtype in floating_types_and(torch.half, torch.bfloat16): n_expected_cudaLaunchKernels = 3 else: n_expected_cudaLaunchKernels = N @@ -437,7 +428,7 @@ def test_reduce_slowpath(self, device, dtype, op): inputs = op.sample_inputs(device, dtype, N, noncontiguous=True), self._reduce_test(op, inputs, ord, False, 1) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_add_scalar_with_empty_list_and_empty_tensor(self, device, dtype): # TODO: enable empty list case for tensors in [[torch.randn([0])]]: @@ -447,7 +438,7 @@ def test_add_scalar_with_empty_list_and_empty_tensor(self, device, dtype): torch._foreach_add_(tensors, 1) self.assertEqual(res, tensors) - @ops(foreach_binary_op_db, dtypes=get_all_dtypes()) + @ops(foreach_binary_op_db, dtypes=all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_binary_op_scalar_with_overlapping_tensors(self, device, dtype, op): foreach_op, ref = op.method_variant, op.ref tensors = [torch.ones(1, 1, device=device, dtype=dtype).expand(2, 1, 3)] @@ -479,7 +470,7 @@ def test_binary_op_scalar_with_different_tensor_dtypes(self, device, dtype, op): runtime_error = e self.assertIsNone(runtime_error) - @ops(foreach_binary_op_db, dtypes=get_all_dtypes()) + @ops(foreach_binary_op_db, dtypes=all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_binary_op_list_error_cases(self, device, dtype, op): foreach_op, foreach_op_, ref, ref_ = op.method_variant, op.inplace_variant, op.ref, op.ref_inplace tensors1 = [] @@ -534,7 +525,7 @@ def test_binary_op_list_error_cases(self, device, dtype, op): return with self.assertRaisesRegex(RuntimeError, "Expected all tensors to be on the same device"): foreach_op([tensor1], [tensor2]) - if dtype in get_all_int_dtypes() + [torch.bool] and foreach_op == torch._foreach_div: + if dtype in integral_types_and(torch.bool) and foreach_op == torch._foreach_div: with self.assertRaisesRegex(RuntimeError, "result type"): foreach_op_([tensor1], [tensor2]) else: @@ -543,7 +534,7 @@ def test_binary_op_list_error_cases(self, device, dtype, op): @skipMeta @unittest.skipIf(not torch.cuda.is_available(), "CUDA not found") - @ops(foreach_binary_op_db, dtypes=get_all_dtypes()) + @ops(foreach_binary_op_db, dtypes=all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_binary_op_list_slow_path(self, device, dtype, op): # note(mkozuki): why `n_expected_cudaLaunchKernels=0`? 
# In this test, foreach functions don't go through fast path, @@ -635,7 +626,7 @@ def test_binary_op_tensors_on_different_devices(self, device, dtype, op): self.assertEqual(actual, tensors1) @onlyCUDA - @ops(foreach_pointwise_op_db, allowed_dtypes=get_all_fp_dtypes(include_half=False, include_bfloat16=False)) + @ops(foreach_pointwise_op_db, allowed_dtypes=floating_types()) def test_pointwise_op_tensors_on_different_devices(self, device, dtype, op): # tensors1: ['cuda', 'cpu] # tensors2: ['cuda', 'cpu] @@ -653,6 +644,27 @@ def test_pointwise_op_tensors_on_different_devices(self, device, dtype, op): foreach_op_(tensors1, tensors2, tensors3) self.assertEqual(expected, tensors1) + # note: BFloat16 has the same number of exponent bits as FP32 + # so if squared L2 norm overflows in BF16, then it also overflows in FP32. + @onlyCUDA + @ops(foreach_reduce_op_db, allowed_dtypes=(torch.half, torch.bfloat16)) + def test_foreach_l2_large_value_input(self, device, dtype, op): + ord, N = 2, 10 + max_value = torch.finfo(dtype).max + scaler = torch.tensor([max_value]).sqrt().to(device=device, dtype=dtype) + inputs = [t * scaler for t in op.sample_inputs(device, dtype, N, noncontiguous=False, low=1)], + # make sure that the min. of squared L2 norm value per tensor is greater than the max value of `dtype`. + self.assertTrue(scaler * scaler * N > max_value) + fn, ref_fn, *_ = self._get_funcs(op, 3) + actual = fn(inputs, is_cuda=True, is_fastpath=True, ord=ord) + expect = ref_fn(inputs, ord=ord) + if dtype == torch.float16: + # making sure the reference L2 norm values are in the range of FP16. + self.assertFalse(any(torch.isinf(e) for e in expect)) + else: + self.assertTrue(all(torch.isinf(e) for e in expect)) + self.assertEqual(expect, actual, equal_nan=False) + instantiate_device_type_tests(TestForeach, globals()) diff --git a/test/test_functionalization.py b/test/test_functionalization.py index 28476ff259576f..1b6bb88acf24f9 100644 --- a/test/test_functionalization.py +++ b/test/test_functionalization.py @@ -3,6 +3,9 @@ import torch from torch.testing._internal.common_utils import TestCase, run_tests from torch.testing._internal.logging_tensor import LoggingTensor, capture_logs, log_input +from torch.utils._pytree import tree_map + +import logging def are_aliased(x, y): if x._base is None and y._base is None: @@ -13,6 +16,45 @@ def are_aliased(x, y): return y._base is x return x._base is y._base +# Just for testing: a logging tensor that also transforms out-of-place ops into inplace ops. +# That way even if the outer wrapper is functionalized, the inner wrapper will also need functionalization. +class InplaceLoggingTensor(LoggingTensor): + @staticmethod + def __new__(cls, e): + r = torch.Tensor._make_wrapper_subclass(cls, e.shape, dtype=e.dtype, requires_grad=False) + r.elem = e + return r + + __torch_function__ = torch._C._disabled_torch_function_impl + + def __str__(self): + return f'InplaceLoggingTensor({self.elem})' + + @classmethod + def __torch_dispatch__(cls, func, types, args=(), kwargs=None): + def unwrap(e): + if isinstance(e, InplaceLoggingTensor): + return e.elem + else: + return e + + def wrap(e): + if isinstance(e, torch.Tensor): + return InplaceLoggingTensor(e) + else: + return e + f = func + # this subclass converts all `add()` ops into `add_()` ops + if f is torch.ops.aten.add.Tensor: + f = torch.ops.aten.add_.Tensor + + rs = tree_map(wrap, f(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))) + # after running the (potentially transformed) op, + # log the original op that we saw. 
+ logging.getLogger("LoggingTensor").info(f"{func.__module__}.{func.__name__}", args, kwargs, rs) + return rs + + class TestFunctionalization(TestCase): @@ -61,13 +103,13 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.view($0, [4, 2]) -$2 = torch._ops.aten.add($1, tensor([[1., 1.], +$1 = torch._ops.aten.view.default($0, [4, 2]) +$2 = torch._ops.aten.add.Tensor($1, tensor([[1., 1.], [1., 1.], [1., 1.], [1., 1.]])) -$3 = torch._ops.aten.view($2, [4, 2]) -$4 = torch._ops.aten.mul($3, $3)""") +$3 = torch._ops.aten.view.default($2, [4, 2]) +$4 = torch._ops.aten.mul.Tensor($3, $3)""") def test_inplace_on_non_view(self): def f(x): @@ -81,8 +123,8 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.view($0, [4, 2]) -$2 = torch._ops.aten.add($0, tensor([[1., 1.], +$1 = torch._ops.aten.view.default($0, [4, 2]) +$2 = torch._ops.aten.add.Tensor($0, tensor([[1., 1.], [1., 1.], [1., 1.], [1., 1.]]))""") @@ -101,9 +143,9 @@ def f(x): # We can update the output of this test if/when these tests eventually use LoggingTensor with PythonMode self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.copy_(tensor([[1., 1.], +$1 = torch._ops.aten.copy_.default(tensor([[1., 1.], [1., 1.]]), $0) -$2 = torch._ops.aten.copy_(tensor([[1., 1.], +$2 = torch._ops.aten.copy_.default(tensor([[1., 1.], [1., 1.]]), $0)""") def test_diagonal(self): @@ -118,10 +160,10 @@ def f(x): logs = self.get_logs(f, torch.ones(2, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.diagonal($0) -$2 = torch._ops.aten.add($1, tensor([1., 1.])) -$3 = torch._ops.aten.diagonal_scatter($0, $2) -$4 = torch._ops.aten.mul($3, $3)""") +$1 = torch._ops.aten.diagonal.default($0) +$2 = torch._ops.aten.add.Tensor($1, tensor([1., 1.])) +$3 = torch._ops.aten.diagonal_scatter.default($0, $2) +$4 = torch._ops.aten.mul.Tensor($3, $3)""") def test_diagonal_mutated_input(self): def f(x): @@ -146,13 +188,13 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1, $2 = torch._ops.aten.split($0, 2) -$3 = torch._ops.aten.diagonal($2) -$4 = torch._ops.aten.add($3, tensor([1., 1.])) -$5, $6 = torch._ops.aten.split($0, 2) -$7 = torch._ops.aten.diagonal_scatter($6, $4) -$8 = torch._ops.aten.slice_scatter($0, $7, 0, 2, 4) -$9 = torch._ops.aten.mul($8, $8)""") +$1, $2 = torch._ops.aten.split.Tensor($0, 2) +$3 = torch._ops.aten.diagonal.default($2) +$4 = torch._ops.aten.add.Tensor($3, tensor([1., 1.])) +$5, $6 = torch._ops.aten.split.Tensor($0, 2) +$7 = torch._ops.aten.diagonal_scatter.default($6, $4) +$8 = torch._ops.aten.slice_scatter.default($0, $7, 0, 2, 4) +$9 = torch._ops.aten.mul.Tensor($8, $8)""") def test_view_inplace(self): def f(x): @@ -166,9 +208,9 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.transpose($0, 1, 0) -$2 = torch._ops.aten.select($1, 0, 0) -$3 = torch._ops.aten.add($2, tensor([1., 1., 1., 1.]))""") +$1 = torch._ops.aten.transpose.int($0, 1, 0) +$2 = torch._ops.aten.select.int($1, 0, 0) +$3 = torch._ops.aten.add.Tensor($2, tensor([1., 1., 1., 1.]))""") def test_scalars(self): def f(x): @@ -183,10 +225,10 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = 
input('input') -$1 = torch._ops.aten.view($0, [4, 2]) -$2 = torch._ops.aten.add($1, tensor(1)) -$3 = torch._ops.aten.mul($2, tensor(2)) -$4 = torch._ops.aten.div($3, tensor(1))""") +$1 = torch._ops.aten.view.default($0, [4, 2]) +$2 = torch._ops.aten.add.Tensor($1, tensor(1)) +$3 = torch._ops.aten.mul.Tensor($2, tensor(2)) +$4 = torch._ops.aten.div.Tensor($3, tensor(1))""") def test_everything(self): def f(x): @@ -205,39 +247,39 @@ def f(x): logs = self.get_logs(f, torch.ones(4, 2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.view($0, [8]) -$2 = torch._ops.aten._reshape_alias($1, [2, 4], [4, 1]) -$3 = torch._ops.aten.transpose($2, 1, 0) -$4 = torch._ops.aten.view($0, [8]) -$5 = torch._ops.aten._reshape_alias($4, [2, 4], [4, 1]) -$6 = torch._ops.aten.transpose($5, 1, 0) -$7 = torch._ops.aten.unsqueeze($6, 0) -$8 = torch._ops.aten.view($0, [8]) -$9 = torch._ops.aten._reshape_alias($8, [2, 4], [4, 1]) -$10 = torch._ops.aten.transpose($9, 1, 0) -$11 = torch._ops.aten.unsqueeze($10, 0) -$12 = torch._ops.aten.squeeze($11) -$13, $14 = torch._ops.aten.split($12, 2) -$15 = torch._ops.aten.add($13, tensor([[1., 1.], +$1 = torch._ops.aten.view.default($0, [8]) +$2 = torch._ops.aten._reshape_alias.default($1, [2, 4], [4, 1]) +$3 = torch._ops.aten.transpose.int($2, 1, 0) +$4 = torch._ops.aten.view.default($0, [8]) +$5 = torch._ops.aten._reshape_alias.default($4, [2, 4], [4, 1]) +$6 = torch._ops.aten.transpose.int($5, 1, 0) +$7 = torch._ops.aten.unsqueeze.default($6, 0) +$8 = torch._ops.aten.view.default($0, [8]) +$9 = torch._ops.aten._reshape_alias.default($8, [2, 4], [4, 1]) +$10 = torch._ops.aten.transpose.int($9, 1, 0) +$11 = torch._ops.aten.unsqueeze.default($10, 0) +$12 = torch._ops.aten.squeeze.default($11) +$13, $14 = torch._ops.aten.split.Tensor($12, 2) +$15 = torch._ops.aten.add.Tensor($13, tensor([[1., 1.], [1., 1.]])) -$16 = torch._ops.aten.select($2, 0, 0) -$17 = torch._ops.aten.clone($15, memory_format=0) -$18 = torch._ops.aten._unsafe_view($17, [4]) -$19 = torch._ops.aten.view($0, [8]) -$20 = torch._ops.aten._reshape_alias($19, [2, 4], [4, 1]) -$21 = torch._ops.aten.transpose($20, 1, 0) -$22 = torch._ops.aten.unsqueeze($21, 0) -$23 = torch._ops.aten.squeeze($22) -$24 = torch._ops.aten.slice_scatter($23, $15, 0, 0, 2) -$25 = torch._ops.aten.unsqueeze($24, 0) -$26 = torch._ops.aten.squeeze($25, 0) -$27 = torch._ops.aten.transpose($26, 1, 0) -$28 = torch._ops.aten._reshape_alias($27, [8], [1]) -$29 = torch._ops.aten.view($28, [4, 2]) -$30 = torch._ops.aten.view($29, [8]) -$31 = torch._ops.aten._reshape_alias($30, [2, 4], [4, 1]) -$32 = torch._ops.aten.select($31, 0, 0) -$33 = torch._ops.aten.add($32, $18)""") +$16 = torch._ops.aten.select.int($2, 0, 0) +$17 = torch._ops.aten.clone.default($15, memory_format=0) +$18 = torch._ops.aten._unsafe_view.default($17, [4]) +$19 = torch._ops.aten.view.default($0, [8]) +$20 = torch._ops.aten._reshape_alias.default($19, [2, 4], [4, 1]) +$21 = torch._ops.aten.transpose.int($20, 1, 0) +$22 = torch._ops.aten.unsqueeze.default($21, 0) +$23 = torch._ops.aten.squeeze.default($22) +$24 = torch._ops.aten.slice_scatter.default($23, $15, 0, 0, 2) +$25 = torch._ops.aten.unsqueeze.default($24, 0) +$26 = torch._ops.aten.squeeze.dim($25, 0) +$27 = torch._ops.aten.transpose.int($26, 1, 0) +$28 = torch._ops.aten._reshape_alias.default($27, [8], [1]) +$29 = torch._ops.aten.view.default($28, [4, 2]) +$30 = torch._ops.aten.view.default($29, [8]) +$31 = torch._ops.aten._reshape_alias.default($30, [2, 4], [4, 1]) +$32 = 
torch._ops.aten.select.int($31, 0, 0) +$33 = torch._ops.aten.add.Tensor($32, $18)""") def test_aliases_maintained_after_pass(self): def f(x): @@ -279,34 +321,34 @@ def f(x): logs = self.get_logs(f, torch.ones(2)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.expand($0, [2]) -$2 = torch._ops.aten.add($1, $0)""") +$1 = torch._ops.aten.expand.default($0, [2]) +$2 = torch._ops.aten.add.Tensor($1, $0)""") # Test 2: copy_() with same dtype, different shape self.assert_functionalization(f, torch.ones(1)) logs = self.get_logs(f, torch.ones(1)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten.expand($0, [2]) -$2 = torch._ops.aten.add($1, $0)""") +$1 = torch._ops.aten.expand.default($0, [2]) +$2 = torch._ops.aten.add.Tensor($1, $0)""") # Test 3: copy_() with different dtype, same shape self.assert_functionalization(f, torch.ones(2, dtype=torch.long)) logs = self.get_logs(f, torch.ones(2, dtype=torch.long)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten._to_copy($0, dtype=6, layout=0, device=device(type='cpu'), pin_memory=False) -$2 = torch._ops.aten.expand($1, [2]) -$3 = torch._ops.aten.add($2, $0)""") +$1 = torch._ops.aten._to_copy.default($0, dtype=6, layout=0, device=device(type='cpu'), pin_memory=False) +$2 = torch._ops.aten.expand.default($1, [2]) +$3 = torch._ops.aten.add.Tensor($2, $0)""") # Test 4: copy_() with different dtype, different shape self.assert_functionalization(f, torch.ones(1, dtype=torch.long)) logs = self.get_logs(f, torch.ones(1, dtype=torch.long)) self.assertExpectedInline('\n'.join(logs), """\ $0 = input('input') -$1 = torch._ops.aten._to_copy($0, dtype=6, layout=0, device=device(type='cpu'), pin_memory=False) -$2 = torch._ops.aten.expand($1, [2]) -$3 = torch._ops.aten.add($2, $0)""") +$1 = torch._ops.aten._to_copy.default($0, dtype=6, layout=0, device=device(type='cpu'), pin_memory=False) +$2 = torch._ops.aten.expand.default($1, [2]) +$3 = torch._ops.aten.add.Tensor($2, $0)""") def test_nested_functions_propagate_updates(self): def g(x): @@ -324,5 +366,77 @@ def f(x): self.assert_functionalization(f, torch.ones(2, 2)) + def test_mixed_wrappers_valid(self): + def f(x, y): + z = x + y + z.add_(1) + return z + + x1_not_functional = LoggingTensor(torch.ones(4)) + x2_functional = torch._to_functional_tensor(LoggingTensor(torch.ones(4))) + + with capture_logs() as logs: + y = f(x1_not_functional, x2_functional) + + # I think the alias trace is coming from the fact that x2 is technically *not* + # a LoggingTensor (instead it *contains* a LoggingTensor), but x1 *is* a LoggingTensor. + # The important thing here though is that functionalization ran the "+" kernel + # with a functional + non-functional tensor, and wrapped the output appropriately. + self.assertExpectedInline('\n'.join(logs), """\ +$2 = torch._ops.aten.add.Tensor($0, $1) +$3 = torch._ops.aten.alias.default($2) +$4 = torch._ops.aten.add.Tensor($3, tensor(1))""") + + def test_mixed_wrappers_invalid(self): + x1_not_functional = torch.ones(4) + x2_functional = torch._to_functional_tensor(torch.ones(4)) + + # When dealing with mixed functional + nonfunctional tensors, + # normal_tensor.add_(functional_tensor) is not valid + # because normal_tensor would need to be "promoted" to a functional tensor. + with self.assertRaises(RuntimeError): + x1_not_functional.add_(x2_functional) + + # This tests the behavior of functionalization with multiple layers of wrapped tensor subclasses. 
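# --- Illustrative aside (not part of the diff above): a minimal sketch of the wrapping rule that
# test_mixed_wrappers_valid / test_mixed_wrappers_invalid exercise, reusing the private
# torch._to_functional_tensor helper from this file. Out-of-place ops may mix functional and
# non-functional inputs (the output is wrapped appropriately), but an in-place op cannot mutate a
# plain tensor with a functional argument, since the plain tensor would have to be "promoted".
import torch

plain = torch.ones(4)
functional = torch._to_functional_tensor(torch.ones(4))
try:
    plain.add_(functional)  # invalid direction: mutating the non-functional side
except RuntimeError:
    pass  # expected; wrap `plain` with torch._to_functional_tensor first if it must be mutated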
+ def test_multiple_levels_of_wrapping(self): + def f(x): + # call an inplace op and have it get logged twice (by the outer + inner wrapper) + x.add_(1) + + # Test 1: both the inner and outer wrapper are "functionalized" + x_inner_and_outer_functional = torch._to_functional_tensor( + InplaceLoggingTensor(torch._to_functional_tensor(LoggingTensor(torch.ones(4))))) + + with capture_logs() as logs: + f(x_inner_and_outer_functional) + + # Since both wrappers were functionalized, they both log "add" + self.assertExpectedInline('\n'.join(logs), """\ +$1 = torch._ops.aten.add.Tensor($0, tensor(1)) +$3 = torch._ops.aten.add.Tensor($2, tensor(1))""") + + # Test 2: only the inner wrapper is "functionalized" + x_only_inner_functional = InplaceLoggingTensor(torch._to_functional_tensor(LoggingTensor(torch.ones(4)))) + + with capture_logs() as logs: + f(x_only_inner_functional) + + # Since only the inner wrapper is functionalized, only the inner (first) log is functionalized + self.assertExpectedInline('\n'.join(logs), """\ +$1 = torch._ops.aten.add.Tensor($0, tensor(1)) +$3 = torch._ops.aten.add_.Tensor($2, tensor(1))""") + + # Test 3: only the outer wrapper is "functionalized" + x_only_outer_functional = torch._to_functional_tensor(InplaceLoggingTensor(LoggingTensor(torch.ones(4)))) + + with capture_logs() as logs: + f(x_only_outer_functional) + + # Only the outer add_ is functionalized + # Since only the outer wrapper is functionalized, only the outer (second) log is functionalized + self.assertExpectedInline('\n'.join(logs), """\ +$1 = torch._ops.aten.add_.Tensor($0, tensor(1)) +$3 = torch._ops.aten.add.Tensor($2, tensor(1))""") + if __name__ == '__main__': run_tests() diff --git a/test/test_fx.py b/test/test_fx.py index f72dbd21266974..4a5436b7968339 100644 --- a/test/test_fx.py +++ b/test/test_fx.py @@ -7,6 +7,7 @@ import inspect import math import numbers +import io import operator import os import pickle @@ -17,6 +18,7 @@ import types import warnings import unittest +import torch.nn.utils._stateless as _stateless from math import sqrt from torch.multiprocessing import Process from torch.testing import FileCheck @@ -141,6 +143,7 @@ def __init__(self, a, b): class TestFX(JitTestCase): def setUp(self): + super().setUp() # Checking for mutable operations whil tracing is feature flagged # Enable it in testing but not by default self.orig_tracer_mutable_flag = torch.fx.proxy.TracerBase.check_mutable_operations @@ -151,6 +154,7 @@ def setUp(self): torch.ops.load_library(str(lib_file_path)) def tearDown(self): + super().tearDown() torch.fx.proxy.TracerBase.check_mutable_operations = self.orig_tracer_mutable_flag def checkGraphModule(self, m: torch.nn.Module, args, kwargs=None): @@ -457,6 +461,19 @@ def forward(self, a, b): gm.graph.lint() self.assertEqual(gm(3, 4), 14) + def test_concrete_arg_none_assert(self): + class Foo(torch.nn.Module): + def forward(self, x, val=None): + return x if val is None else x + val + + f = Foo() + traced = torch.fx.symbolic_trace(f, concrete_args={'val' : None}) + with self.assertRaisesRegex(AssertionError, 'val has been specialized to have value None'): + traced(torch.randn(5), torch.randn(5)) + + x = torch.randn(5) + torch.testing.assert_close(traced(x), f(x)) + def test_graph_unique_names(self): class M(torch.nn.Module): def forward(self, a, b): @@ -686,6 +703,7 @@ def forward(self, a): for node in m_g.graph.nodes: self.assertTrue(node.name != "getattr") + @unittest.skip("Hotfix for SEV remediation") def test_trace_buffer_slice(self): bs, d_hid = 10, 23 @@ -1026,6 
+1044,24 @@ def forward(self, x): traced_scripted = torch.jit.script(traced) self.assertEqual(traced_scripted(torch.rand(4)), 2) + def test_tuple_no_subscript(self): + def foo(x : Tuple): + return x[0] + + traced = torch.fx.symbolic_trace(foo) + x = (torch.randn(5, 3),) + torch.testing.assert_allclose(traced(x), x[0]) + + bio = io.BytesIO() + + torch.save(traced, bio) + + bio.seek(0) + + loaded = torch.load(bio) + + torch.testing.assert_allclose(loaded(x), x[0]) + def test_torch_fx_len(self): class FXLenTest(torch.nn.Module): def forward(self, x): @@ -1096,6 +1132,24 @@ def forward(self, a): out = gm(input) self.assertEqual(out, ref_out) + def test_torch_op_overloads(self): + class M(torch.nn.Module): + def forward(self, a): + b = torch.ops.aten.add.Tensor(a, a) + return b + m = M() + input = torch.randn(3) + ref_out = m(input) + gm = symbolic_trace(m) + gm.graph.lint() + out = gm(input) + self.assertEqual(out, ref_out) + + for node in gm.graph.nodes: + if node.op == 'call_function': + assert isinstance(node.target, torch._ops.OpOverload) + assert node.target.__name__ == 'add.Tensor' + def test_pickle_torch_custom_ops(self): class M(torch.nn.Module): def forward(self, a): @@ -2661,7 +2715,7 @@ def to_trace(y): def test_profiler_ranges_side_effect(self): g = torch.fx.Graph() - handle = g.call_function(torch.ops.profiler._record_function_enter, ('test_range',)) + handle = g.call_function(torch.ops.profiler._record_function_enter_new, ('test_range',)) g.call_function(torch.ops.profiler._record_function_exit, (handle,)) g.output(None) @@ -2671,7 +2725,7 @@ def test_profiler_ranges_side_effect(self): found_targets.setdefault(node.target) self.assertEqual( list(found_targets.keys()), - [torch.ops.profiler._record_function_enter, torch.ops.profiler._record_function_exit] + [torch.ops.profiler._record_function_enter_new, torch.ops.profiler._record_function_exit] ) g.eliminate_dead_code() @@ -2681,7 +2735,7 @@ def test_profiler_ranges_side_effect(self): found_targets.setdefault(node.target) self.assertEqual( list(found_targets.keys()), - [torch.ops.profiler._record_function_enter, torch.ops.profiler._record_function_exit] + [torch.ops.profiler._record_function_enter_new, torch.ops.profiler._record_function_exit] ) def test_ast_rewriter_wrapped_via_decorator(self): @@ -2917,6 +2971,35 @@ def is_leaf_module(self, m: torch.nn.Module, module_qualified_name : str) -> boo gm2.delete_all_unused_submodules() torch.testing.assert_allclose(gm2(inputs), model(inputs)) + def test_fx_stateless(self): + class MockModule(torch.nn.Module): + def __init__(self): + super().__init__() + self.l1 = torch.nn.Linear(1, 1) + self.register_buffer('buffer', torch.ones(1)) + + def forward(self, x): + return self.l1(x) + self.buffer + + module = MockModule() + x = torch.rand((1, 1)) + weight = torch.tensor([[1.0]], requires_grad=True) + bias = torch.tensor([0.0], requires_grad=True) + buffer = torch.tensor([0.0]) + parameters = {'l1.weight': weight, + 'l1.bias': bias, + 'buffer': buffer} + fx_module = torch.fx.symbolic_trace(module) + res = _stateless.functional_call(fx_module, parameters, x) + res.backward() + self.assertIsNotNone(weight.grad) + self.assertIsNotNone(bias.grad) + self.assertIsNone(buffer.grad) + # Gradients were not calculated for the module's own parameters and buffers + self.assertIsNone(module.l1.weight.grad) + self.assertIsNone(module.l1.bias.grad) + self.assertIsNone(module.buffer.grad) + def test_tracing_graphmodules_as_leaf_submodules(self): class A(torch.nn.Module): def forward(self, t): @@ -3310,6 
def f(a, b): ts_f = torch.jit.script(nf) self.assertEqual(nf(vals), ts_f(vals)) + def test_custom_codegen_with_transformer(self): + class ListCodeGen(CodeGen): + def gen_fn_def(self, free_vars, maybe_return_annotation): + lst_unpack = f""" +def forward(self, args_list: List[torch.Tensor]){maybe_return_annotation}: + {', '.join(free_vars)} = args_list""" + return lst_unpack + + def additional_globals(self): + return [('List', typing.List)] + + def process_inputs(self, *inputs): + assert(len(inputs) == 1) + return inputs[0] + + def f(a, b): + return a + b + + nf = symbolic_trace(f) + vals = [torch.randn(3), torch.randn(3)] + self.assertEqual(nf(*vals), f(*vals)) + + nf.graph.set_codegen(ListCodeGen()) + nf.recompile() + self.assertEqual(nf(vals), f(*vals)) + + transformed_gm = Transformer(nf).transform() + self.assertEqual(nf(vals), transformed_gm(vals)) + + def test_interpreter_with_codegen(self): + class ListCodeGen(CodeGen): + def gen_fn_def(self, free_vars, maybe_return_annotation): + lst_unpack = f""" +def forward(self, args_list: List[torch.Tensor]){maybe_return_annotation}: + {', '.join(free_vars)} = args_list""" + return lst_unpack + + def additional_globals(self): + return [('List', typing.List)] + + def process_inputs(self, *inputs): + assert(len(inputs) == 1) + return inputs[0] + + def generate_output(self, output_args): + return f'return list({repr(output_args)})' + + def process_outputs(self, outputs): + return list(outputs) + + def f(a, b): + a = a + b + b = a + b + return a, b + + nf = symbolic_trace(f) + vals = [torch.randn(3), torch.randn(3)] + nf.graph.set_codegen(ListCodeGen()) + nf.recompile() + self.assertEqual(Interpreter(nf).run(vals), nf(vals)) def test_imul_code_print(self): graph = torch.fx.Graph() @@ -3368,6 +3511,7 @@ def test_get_torch_func_signature_exhaustive(self, device, dtype, op): class TestFXAPIBackwardCompatibility(JitTestCase): def setUp(self): + super().setUp() self.maxDiff = None # Checking for mutable operations whil tracing is feature flagged @@ -3376,6 +3520,7 @@ def setUp(self): torch.fx.proxy.TracerBase.check_mutable_operations = True def tearDown(self): + super().tearDown() torch.fx.proxy.TracerBase.check_mutable_operations = self.orig_tracer_mutable_flag @@ -3614,12 +3759,14 @@ def check_symbols_have_bc_designation(m, prefix): class TestFunctionalTracing(JitTestCase): def setUp(self): + super().setUp() # Checking for mutable operations whil tracing is feature flagged # Enable it in testing but not by default self.orig_tracer_mutable_flag = torch.fx.proxy.TracerBase.check_mutable_operations torch.fx.proxy.TracerBase.check_mutable_operations = True def tearDown(self): + super().tearDown() torch.fx.proxy.TracerBase.check_mutable_operations = self.orig_tracer_mutable_flag IGNORE_FUNCS = ("has_torch_function", "has_torch_function_unary", diff --git a/test/test_fx_experimental.py b/test/test_fx_experimental.py index 37569198347844..53798776eb91f6 100644 --- a/test/test_fx_experimental.py +++ b/test/test_fx_experimental.py @@ -814,6 +814,29 @@ def mod_partition(node: Node): self.assertEqual(orig_out, submodules_out) + def test_split_module_kwargs_expansion(self): + class ModuleWithKwargsExpansion(torch.nn.Module): + def forward(self, x, **kwargs): + return x + kwargs['foo'] + + mod = ModuleWithKwargsExpansion() + traced = torch.fx.symbolic_trace(mod) + + seen_getitem = False + + def split_callback(n): + nonlocal seen_getitem + split_idx = int(seen_getitem) + if n.target == operator.getitem: + seen_getitem = True + return split_idx + + split = 
split_module(traced, mod, split_callback) + + x = torch.randn(5, 3) + foo = torch.randn(5, 3) + torch.testing.assert_allclose(split(x, foo=foo), traced(x, foo=foo)) + @skipIfNoTorchVision def test_subgraph_trivial_resnet(self): # Smoke test trivially splitting resnet into 1 partition works @@ -1516,6 +1539,7 @@ def test_normalize_operator_exhaustive(self, device, dtype, op): "igamma", "igammac", "index_put", + "linalg_pinv_singular", # Implemented with a lambda (only the singular variant) "nn.functional.conv2d", "nn.functional.dropout", "nn.functional.dropout2d", @@ -1587,6 +1611,9 @@ def test_normalize_operator_exhaustive(self, device, dtype, op): if op.name in op_skip: return + if op.formatted_name in op_skip: + return + if op.name.startswith('_masked.'): return diff --git a/test/test_hub.py b/test/test_hub.py new file mode 100644 index 00000000000000..662a2cf9771ee2 --- /dev/null +++ b/test/test_hub.py @@ -0,0 +1,256 @@ +# Owner(s): ["module: hub"] + +import unittest +from unittest.mock import patch +import os +import tempfile +import warnings + +import torch +import torch.hub as hub +from torch.testing._internal.common_utils import retry, IS_SANDCASTLE, TestCase + + +def sum_of_state_dict(state_dict): + s = 0 + for _, v in state_dict.items(): + s += v.sum() + return s + + +SUM_OF_HUB_EXAMPLE = 431080 +TORCHHUB_EXAMPLE_RELEASE_URL = 'https://github.com/ailzhang/torchhub_example/releases/download/0.1/mnist_init_ones' + + +@unittest.skipIf(IS_SANDCASTLE, 'Sandcastle cannot ping external') +class TestHub(TestCase): + + def setUp(self): + super().setUp() + self.previous_hub_dir = torch.hub.get_dir() + self.tmpdir = tempfile.TemporaryDirectory('hub_dir') + torch.hub.set_dir(self.tmpdir.name) + self.trusted_list_path = os.path.join(torch.hub.get_dir(), "trusted_list") + + def tearDown(self): + super().tearDown() + torch.hub.set_dir(self.previous_hub_dir) # probably not needed, but can't hurt + self.tmpdir.cleanup() + + def _assert_trusted_list_is_empty(self): + with open(self.trusted_list_path) as f: + assert not f.readlines() + + def _assert_in_trusted_list(self, line): + with open(self.trusted_list_path) as f: + assert line in (l.strip() for l in f.readlines()) + + @retry(Exception, tries=3) + def test_load_from_github(self): + hub_model = hub.load('ailzhang/torchhub_example', 'mnist', source='github', pretrained=True, verbose=False) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_load_from_local_dir(self): + local_dir = hub._get_cache_or_reload( + 'ailzhang/torchhub_example', + force_reload=False, + trust_repo=True, + calling_fn=None + ) + hub_model = hub.load(local_dir, 'mnist', source='local', pretrained=True, verbose=False) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_load_from_branch(self): + hub_model = hub.load('ailzhang/torchhub_example:ci/test_slash', 'mnist', pretrained=True, verbose=False) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_get_set_dir(self): + previous_hub_dir = torch.hub.get_dir() + with tempfile.TemporaryDirectory('hub_dir') as tmpdir: + torch.hub.set_dir(tmpdir) + self.assertEqual(torch.hub.get_dir(), tmpdir) + self.assertNotEqual(previous_hub_dir, tmpdir) + + hub_model = hub.load('ailzhang/torchhub_example', 'mnist', pretrained=True, verbose=False) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + 
assert os.path.exists(os.path.join(tmpdir, 'ailzhang_torchhub_example_master')) + + # Test that set_dir properly calls expanduser() + # non-regression test for https://github.com/pytorch/pytorch/issues/69761 + new_dir = os.path.join("~", "hub") + torch.hub.set_dir(new_dir) + self.assertEqual(torch.hub.get_dir(), os.path.expanduser(new_dir)) + + @retry(Exception, tries=3) + def test_list_entrypoints(self): + entry_lists = hub.list('ailzhang/torchhub_example', trust_repo=True) + self.assertObjectIn('mnist', entry_lists) + + @retry(Exception, tries=3) + def test_download_url_to_file(self): + with tempfile.TemporaryDirectory() as tmpdir: + f = os.path.join(tmpdir, 'temp') + hub.download_url_to_file(TORCHHUB_EXAMPLE_RELEASE_URL, f, progress=False) + loaded_state = torch.load(f) + self.assertEqual(sum_of_state_dict(loaded_state), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_load_state_dict_from_url(self): + loaded_state = hub.load_state_dict_from_url(TORCHHUB_EXAMPLE_RELEASE_URL) + self.assertEqual(sum_of_state_dict(loaded_state), SUM_OF_HUB_EXAMPLE) + + # with name + file_name = "the_file_name" + loaded_state = hub.load_state_dict_from_url(TORCHHUB_EXAMPLE_RELEASE_URL, file_name=file_name) + expected_file_path = os.path.join(torch.hub.get_dir(), 'checkpoints', file_name) + self.assertTrue(os.path.exists(expected_file_path)) + self.assertEqual(sum_of_state_dict(loaded_state), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_load_legacy_zip_checkpoint(self): + with warnings.catch_warnings(record=True) as ws: + warnings.simplefilter("always") + hub_model = hub.load('ailzhang/torchhub_example', 'mnist_zip', pretrained=True, verbose=False) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + assert any("will be deprecated in favor of default zipfile" in str(w) for w in ws) + + # Test the default zipfile serialization format produced by >=1.6 release. 
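# --- Illustrative aside (not part of the diff above): a hedged sketch of the public torch.hub
# calls these tests exercise. The repo/entrypoint names are the same example repo used throughout
# this file, and trust_repo is the argument the tests below cover; TORCHHUB_EXAMPLE_RELEASE_URL is
# the module-level constant defined near the top of test_hub.py.
import torch

torch.hub.set_dir('/tmp/hub_cache')                   # where repos and checkpoints are cached
model = torch.hub.load('ailzhang/torchhub_example',   # 'owner/repo', optionally 'owner/repo:branch'
                       'mnist',                       # entrypoint defined in the repo's hubconf.py
                       pretrained=True,               # extra kwargs are forwarded to the entrypoint
                       trust_repo=True)               # skip the interactive trust prompt
state = torch.hub.load_state_dict_from_url(TORCHHUB_EXAMPLE_RELEASE_URL,
                                           file_name='mnist_init_ones')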
+ @retry(Exception, tries=3) + def test_load_zip_1_6_checkpoint(self): + hub_model = hub.load( + 'ailzhang/torchhub_example', + 'mnist_zip_1_6', + pretrained=True, + verbose=False, + trust_repo=True + ) + self.assertEqual(sum_of_state_dict(hub_model.state_dict()), SUM_OF_HUB_EXAMPLE) + + @retry(Exception, tries=3) + def test_hub_parse_repo_info(self): + # If the branch is specified we just parse the input and return + self.assertEqual( + torch.hub._parse_repo_info('a/b:c'), + ('a', 'b', 'c') + ) + # For torchvision, the default branch is main + self.assertEqual( + torch.hub._parse_repo_info('pytorch/vision'), + ('pytorch', 'vision', 'main') + ) + # For the torchhub_example repo, the default branch is still master + self.assertEqual( + torch.hub._parse_repo_info('ailzhang/torchhub_example'), + ('ailzhang', 'torchhub_example', 'master') + ) + + @retry(Exception, tries=3) + def test_load_commit_from_forked_repo(self): + with self.assertRaisesRegex(ValueError, 'If it\'s a commit from a forked repo'): + torch.hub.load('pytorch/vision:4e2c216', 'resnet18') + + @retry(Exception, tries=3) + @patch('builtins.input', return_value='') + def test_trust_repo_false_emptystring(self, patched_input): + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + self._assert_trusted_list_is_empty() + patched_input.assert_called_once() + + patched_input.reset_mock() + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + self._assert_trusted_list_is_empty() + patched_input.assert_called_once() + + @retry(Exception, tries=3) + @patch('builtins.input', return_value='no') + def test_trust_repo_false_no(self, patched_input): + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + self._assert_trusted_list_is_empty() + patched_input.assert_called_once() + + patched_input.reset_mock() + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + self._assert_trusted_list_is_empty() + patched_input.assert_called_once() + + @retry(Exception, tries=3) + @patch('builtins.input', return_value='y') + def test_trusted_repo_false_yes(self, patched_input): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + self._assert_in_trusted_list("ailzhang_torchhub_example") + patched_input.assert_called_once() + + # Loading a second time with "check", we don't ask for user input + patched_input.reset_mock() + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo="check") + patched_input.assert_not_called() + + # Loading again with False, we still ask for user input + patched_input.reset_mock() + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=False) + patched_input.assert_called_once() + + @retry(Exception, tries=3) + @patch('builtins.input', return_value='no') + def test_trust_repo_check_no(self, patched_input): + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo="check") + self._assert_trusted_list_is_empty() + patched_input.assert_called_once() + + patched_input.reset_mock() + with self.assertRaisesRegex(Exception, 'Untrusted repository.'): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', 
trust_repo="check") + patched_input.assert_called_once() + + @retry(Exception, tries=3) + @patch('builtins.input', return_value='y') + def test_trust_repo_check_yes(self, patched_input): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo="check") + self._assert_in_trusted_list("ailzhang_torchhub_example") + patched_input.assert_called_once() + + # Loading a second time with "check", we don't ask for user input + patched_input.reset_mock() + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo="check") + patched_input.assert_not_called() + + @retry(Exception, tries=3) + def test_trust_repo_true(self): + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=True) + self._assert_in_trusted_list("ailzhang_torchhub_example") + + @retry(Exception, tries=3) + def test_trust_repo_builtin_trusted_owners(self): + torch.hub.load('pytorch/vision', 'resnet18', trust_repo="check") + self._assert_trusted_list_is_empty() + + @retry(Exception, tries=3) + def test_trust_repo_none(self): + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=None) + assert len(w) == 1 + assert issubclass(w[-1].category, UserWarning) + assert "You are about to download and run code from an untrusted repository" in str(w[-1].message) + + self._assert_trusted_list_is_empty() + + @retry(Exception, tries=3) + def test_trust_repo_legacy(self): + # We first download a repo and then delete the allowlist file + # Then we check that the repo is indeed trusted without a prompt, + # because it was already downloaded in the past. + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo=True) + os.remove(self.trusted_list_path) + + torch.hub.load('ailzhang/torchhub_example', 'mnist_zip_1_6', trust_repo="check") + + self._assert_trusted_list_is_empty() diff --git a/test/test_indexing.py b/test/test_indexing.py index 42ffa8ab24e8f2..4f0e7e4bf74bb3 100644 --- a/test/test_indexing.py +++ b/test/test_indexing.py @@ -692,7 +692,7 @@ def test_bool_indices(self, device): self.assertEqual(v[boolIndices].shape, v[uint8Indices].shape) self.assertEqual(v[boolIndices], v[uint8Indices]) self.assertEqual(v[boolIndices], tensor([True], dtype=torch.bool, device=device)) - self.assertEquals(len(w), 2) + self.assertEqual(len(w), 2) def test_bool_indices_accumulate(self, device): mask = torch.zeros(size=(10, ), dtype=torch.bool, device=device) @@ -713,7 +713,7 @@ def test_byte_mask(self, device): with warnings.catch_warnings(record=True) as w: self.assertEqual(v[mask].shape, (3, 7, 3)) self.assertEqual(v[mask], torch.stack([v[0], v[2], v[3]])) - self.assertEquals(len(w), 2) + self.assertEqual(len(w), 2) v = torch.tensor([1.], device=device) self.assertEqual(v[v == 0], torch.tensor([], device=device)) @@ -725,7 +725,7 @@ def test_byte_mask_accumulate(self, device): warnings.simplefilter("always") y.index_put_((mask, ), y[mask], accumulate=True) self.assertEqual(y, torch.ones(size=(10, 10), device=device)) - self.assertEquals(len(w), 2) + self.assertEqual(len(w), 2) def test_index_put_accumulate_large_tensor(self, device): # This test is for tensors with number of elements >= INT_MAX (2^31 - 1). 
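# --- Illustrative aside (not part of the diff above): why the hunks above swap assertEquals for
# assertEqual. unittest keeps assertEquals only as a deprecated alias, so it still performs the
# same comparison but also emits a DeprecationWarning, which breaks suites that treat warnings as
# errors. AliasDemo below is a hypothetical example, not part of the test suite.
import unittest

class AliasDemo(unittest.TestCase):
    def test_alias(self):
        self.assertEqual(1 + 1, 2)       # canonical spelling
        # self.assertEquals(1 + 1, 2)    # deprecated alias: same check, plus a DeprecationWarning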
@@ -876,7 +876,7 @@ def test_multiple_byte_mask(self, device): with warnings.catch_warnings(record=True) as w: warnings.simplefilter("always") self.assertEqual(v[mask1, :, mask2].shape, (3, 7)) - self.assertEquals(len(w), 2) + self.assertEqual(len(w), 2) def test_byte_mask2d(self, device): v = torch.randn(5, 7, 3, device=device) @@ -1130,7 +1130,7 @@ def test_byte_tensor_assignment(self, device): with warnings.catch_warnings(record=True) as w: x[b] = value - self.assertEquals(len(w), 1) + self.assertEqual(len(w), 1) self.assertEqual(x[0], value) self.assertEqual(x[1], torch.arange(4., 8, device=device)) diff --git a/test/test_jit.py b/test/test_jit.py index f60857058318da..876e6cbdf1e790 100644 --- a/test/test_jit.py +++ b/test/test_jit.py @@ -17,6 +17,7 @@ from jit.test_data_parallel import TestDataParallel # noqa: F401 from jit.test_models import TestModels # noqa: F401 from jit.test_modules import TestModules # noqa: F401 +from jit.test_autodiff import TestAutodiffJit # noqa: F401 from jit.test_autodiff_subgraph_slicing import TestAutodiffSubgraphSlicing # noqa: F401 from jit.test_custom_operators import TestCustomOperators # noqa: F401 from jit.test_export_modes import TestExportModes # noqa: F401 @@ -25,12 +26,13 @@ from jit.test_builtins import TestBuiltins, TestTensorBuiltins # noqa: F401 from jit.test_ignore_context_manager import TestIgnoreContextManager # noqa: F401 from jit.test_symbolic_shape_analysis import TestSymbolicShapeAnalysis # noqa: F401 +from jit.test_op_decompositions import TestOpDecompositions # noqa: F401 from jit.test_if_hoisting import TestIfHoisting # noqa: F401 from jit.test_unsupported_ops import TestUnsupportedOps # noqa: F401 from jit.test_freezing import TestFreezing, TestFrozenOptimizations, TestMKLDNNReinplacing # noqa: F401 from jit.test_peephole import TestPeephole # noqa: F401 from jit.test_alias_analysis import TestAliasAnalysis # noqa: F401 -from jit.test_save_load import TestSaveLoad # noqa: F401 +from jit.test_save_load import TestSaveLoad, TestSaveLoadFlatbuffer # noqa: F401 from jit.test_save_load_for_op_version import TestSaveLoadForOpVersion # noqa: F401 from jit.test_module_containers import TestModuleContainers # noqa: F401 from jit.test_python_bindings import TestPythonBindings # noqa: F401 @@ -76,6 +78,7 @@ from jit.test_device_analysis import TestDeviceAnalysis # noqa: F401 from jit.test_dce import TestDCE # noqa: F401 from jit.test_sparse import TestSparse # noqa: F401 +from jit.test_tensor_methods import TestTensorMethods # noqa: F401 # Torch from torch import Tensor @@ -203,11 +206,6 @@ def doAutodiffCheck(testname): # TODO: enable TE in PE when all tests are fixed torch._C._jit_set_texpr_fuser_enabled(GRAPH_EXECUTOR == ProfilingMode.PROFILING) torch._C._jit_set_profiling_executor(GRAPH_EXECUTOR != ProfilingMode.LEGACY) -# even though FULL_PROFILER should be our default -# we haven't tested every single test in this file -# but we enable FULL_PROFILER for a large subset -# of the tests with "with enable_profiling_mode_for_profiling_tests" -torch._C._jit_set_profiling_mode(False) def LSTMCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None): hx, cx = hidden @@ -969,6 +967,56 @@ def forward(self, input): m_dropout.eval() self.assertEqual(dropout(input) + 1, m_dropout(input)) + def test_nn_lp_pool2d(self): + class Mod(torch.nn.Module): + def __init__(self): + super().__init__() + self.l = torch.nn.LPPool2d(2, 3) + self.n = torch.nn.LPPool2d(2, (7, 1)) + + def forward(self, x): + return (self.l(x), + self.n(x), + 
torch.nn.functional.lp_pool2d(x, float(2), 3), + torch.nn.functional.lp_pool2d(x, 2, 3), + torch.nn.functional.lp_pool2d(x, float(2), (7, 1))) + + self.checkModule(Mod(), (torch.rand(1, 3, 7, 7),)) + + def test_nn_lp_pool1d(self): + class Mod(torch.nn.Module): + def __init__(self): + super().__init__() + self.l = torch.nn.LPPool1d(2, 3) + self.n = torch.nn.LPPool1d(2, 7) + + def forward(self, x): + return (self.l(x), + self.n(x), + torch.nn.functional.lp_pool1d(x, float(2), 3), + torch.nn.functional.lp_pool1d(x, 2, 3), + torch.nn.functional.lp_pool1d(x, float(2), 7)) + + self.checkModule(Mod(), (torch.rand(1, 3, 7),)) + + def test_nn_padding_functional(self): + class Mod(nn.Module): + def __init__(self, *pad): + super().__init__() + self.pad = pad + + def forward(self, x): + return F.pad(x, self.pad, mode='constant', value=3.5) + + inputs = [ + (Mod(1, 2), torch.randn(1, 3, 4)), # 1D + (Mod(1, 2, 3, 4), torch.randn(1, 3, 4)), # 2D + (Mod(1, 2, 3, 4, 5, 6), torch.randn(1, 3, 4)), # 3D + ] + + for m, inp in inputs: + self.checkModule(m, (inp,)) + def test_nn_padding(self): class Mod(nn.Module): def __init__(self, padding): @@ -5715,12 +5763,7 @@ def test_fuser_double_float_codegen(self): 'frac'] def lookup_c_equivalent_fn(aten_fn): - if aten_fn == 'min': - return 'fmin' - elif aten_fn == 'max': - return 'fmax' - else: - return aten_fn + return aten_fn def test_dispatch(op, expects, dtype, binary=False): if dtype == torch.double: @@ -5754,7 +5797,9 @@ def test_dispatch(op, expects, dtype, binary=False): test_dispatch(fn, lookup_c_equivalent_fn(fn) + '(', torch.double) test_dispatch(fn, lookup_c_equivalent_fn(fn) + 'f(', torch.float) - binary_fns = ['min', 'max', 'pow'] + # 'min', 'max' were previously tested but are now replaced with ternary expressions + # instead of fmin() and fmax() + binary_fns = ['pow'] for fn in binary_fns: test_dispatch(fn, lookup_c_equivalent_fn(fn) + '(', torch.double, binary=True) test_dispatch(fn, lookup_c_equivalent_fn(fn) + 'f(', torch.float, binary=True) @@ -7312,7 +7357,7 @@ def test_as_tensor_tensor_input(input): g = test_as_tensor_tensor_input.graph_for(torch.ones(3, 4)) FileCheck().check("Tensor = aten::as_tensor").check("Float(*, *, requires_grad=0, device=cpu) = aten::as_tensor").run(g) - + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.LEGACY, "testing legacy behavior") def test_tensor_requires_grad(self): @torch.jit.script def test(b): @@ -8218,6 +8263,44 @@ def test_irparser(self): """ FileCheck().run(graph_str, parse_ir(graph_str)) + def test_parse_tensor_constants(self): + def foo(): + return torch.zeros([4, 4]) + + foo_s = torch.jit.script(foo) + torch._C._jit_pass_constant_propagation(foo_s.graph) + + g = str(foo_s.graph) + g_parsed = parse_ir(g, parse_tensor_constants=True) + self.assertEqual(str(canonical(g_parsed)), str(canonical(foo_s.graph))) + func = torch._C._create_function_from_graph("forward", g_parsed) + + out_parsed = func() + out_func = foo() + # not checking data, just dtype, size etc + out_parsed[:] = 0 + out_func[:] = 0 + self.assertEqual(out_func, out_parsed) + + with self.assertRaises(RuntimeError): + parse_ir(g, parse_tensor_constants=False) + + def test_parse_nested_names(self): + g_str = """ + graph(%x.1 : Tensor): + %3 : int = prim::Constant[value=1]() + %2 : int = prim::Constant[value=2]() + %hi.submod.value.5 : Tensor = aten::add(%x.1, %2, %3) + return (%hi.submod.value.5) + """ + g = parse_ir(g_str) + round_trip_g = parse_ir(str(g)) + self.assertEqual(canonical(g), canonical(round_trip_g)) + + func1 = 
torch._C._create_function_from_graph("forward", g) + func2 = torch._C._create_function_from_graph("forward", round_trip_g) + self.assertEqual(func1(torch.ones([2])), func2(torch.ones([2]))) + def test_is_after_use(self): def sorted_input_use(g): uses = list(next(g.inputs()).uses()) @@ -11047,6 +11130,26 @@ def randint(): FileCheck().check("Double(*, *, requires_grad=0, device=cpu)") \ .check_not("Float(*, *, requires_grad=0, device=cpu)").run(randint.graph_for()) + @unittest.skipIf(not RUN_CUDA, "no CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "skip if profiling isn't enabled") + def test_autodiff_complex(self): + def foo(x: torch.Tensor, y: torch.Tensor, W: torch.Tensor): + return torch.exp(torch.mm(torch.complex(x, y), W.cfloat())) + + @torch.jit.script + def jitted_foo(x: torch.Tensor, y: torch.Tensor, W: torch.Tensor): + return torch.exp(torch.mm(torch.complex(x, y), W.cfloat())) + + x = torch.randn(128, 16, dtype=torch.float32, device='cuda:0') + y = torch.randn(128, 16, dtype=torch.float32, device='cuda:0') + W = torch.randn(16, 1, dtype=torch.float32, device='cuda:0', requires_grad=True) + W.data /= 4 + + with enable_profiling_mode_for_profiling_tests(): + for i in range(4): + self.assertTrue((foo(x, y, W).grad_fn is None) == (jitted_foo(x, y, W).grad_fn is None)) + + def test_linear_grad(self): with enable_profiling_mode_for_profiling_tests(): def t(x: torch.Tensor, w: torch.Tensor, b: Optional[torch.Tensor]): @@ -14820,6 +14923,12 @@ def forward(self, x): with self.assertRaisesRegex(Exception, "Overloads are not useable when a module"): a = torch.jit.script(W2()) + def test_narrow_copy(self): + def foo(a): + return a.narrow_copy(0, 0, 5) + + self.checkScript(foo, [torch.rand(10)]) + def test_select_after_chunk(self): def foo(x): chunked = torch.chunk(x, 1) diff --git a/test/test_jit_autocast.py b/test/test_jit_autocast.py index cec8acfe7e8542..37acb003e94778 100644 --- a/test/test_jit_autocast.py +++ b/test/test_jit_autocast.py @@ -659,6 +659,55 @@ def forward(self, x, y): # isn't enabled self.assertRaises(RuntimeError, lambda: scripted_thing1.forward(x, y)) + @unittest.skipIf(not TEST_CUDA, "No cuda") + def test_jit_freeze_autocast_basic(self): + class TestModule(torch.nn.Module): + def __init__(self): + super(TestModule, self).__init__() + + def forward(self, x, y): + with torch.cuda.amp.autocast(): + return torch.mm(x, y) + + x = torch.rand((3, 4), dtype=torch.float).cuda() + y = torch.rand((4, 5), dtype=torch.float).cuda() + + mod = TestModule().eval() + + # sanity check + self._test_autocast(mod, "aten::_autocast_to_reduced_precision", x, y) + + frozen_mod = torch.jit.freeze(torch.jit.script(mod).eval()) + FileCheck().check_count("aten::_autocast_to_reduced_precision", 2, True).run(frozen_mod.graph) + + # make sure that the runtime pass doesn't duplicate autocast nodes + frozen_mod(x, y) + optimized_graph = frozen_mod.graph_for(x, y) + FileCheck().check_count("aten::_autocast_to_reduced_precision", 2, True).run(optimized_graph) + + @unittest.skipIf(not TEST_CUDA, "No cuda") + def test_jit_freeze_autocast_constants(self): + class TestModule(torch.nn.Module): + def __init__(self): + super(TestModule, self).__init__() + self.x = torch.rand((3, 4), dtype=torch.float).cuda() + + def forward(self, y): + with torch.cuda.amp.autocast(): + return torch.mm(self.x, y) + + y = torch.rand((4, 5), dtype=torch.float).cuda() + mod = TestModule().eval() + + frozen_mod = torch.jit.freeze(torch.jit.script(mod).eval()) + # freezing should pre-cast the constant self.x to 
remove one autocast call + FileCheck().check_count("aten::_autocast_to_reduced_precision", 1, True).run(frozen_mod.graph) + + # the runtime autocasting pass will re-insert the second autocast call, + # but constant propagation will merge it with the constant that it's casting. + frozen_mod(y) + optimized_graph = frozen_mod.graph_for(y) + FileCheck().check_count("aten::_autocast_to_reduced_precision", 1, True).run(optimized_graph) if __name__ == "__main__": run_tests() diff --git a/test/test_jit_cuda_fuser.py b/test/test_jit_cuda_fuser.py index 299c738c570ab0..734e0d238294a9 100644 --- a/test/test_jit_cuda_fuser.py +++ b/test/test_jit_cuda_fuser.py @@ -10,14 +10,18 @@ import torch from torch.nn import functional +from torch.profiler import profile, ProfilerActivity -from torch.testing._internal.common_utils import run_tests, ProfilingMode, GRAPH_EXECUTOR # TEST_WITH_ROCM -from torch.testing._internal.common_cuda import TEST_MULTIGPU from torch.testing._internal.codegen.random_topo_test import runDefaultTestWithSeed +from torch.testing._internal.common_cuda import TEST_MULTIGPU +from torch.testing._internal.common_device_type import instantiate_device_type_tests, ops, OpDTypes +from torch.testing._internal.common_jit import JitCommonTestCase +from torch.testing._internal.common_methods_invocations import op_db +from torch.testing._internal.common_utils import run_tests, ProfilingMode, GRAPH_EXECUTOR, TEST_WITH_ROCM, IS_WINDOWS, slowTest +from torch.testing._internal.jit_utils import clone_inputs, get_traced_sample_variant_pairs, JitTestCase, RUN_CUDA +from torch.testing._internal.jit_metaprogramming_utils import create_traced_fn from torch.testing import FileCheck -from test_jit import JitTestCase, RUN_CUDA - from jit.test_fuser_common import TestFuserCommon # noqa: F401 import itertools @@ -28,7 +32,11 @@ from typing import List -CUDA_MAJOR, CUDA_MINOR = (int(x) for x in torch.version.cuda.split('.')) +RUN_NVFUSER = RUN_CUDA and not TEST_WITH_ROCM and not IS_WINDOWS +CUDA_MAJOR, CUDA_MINOR = 0, 0 + +if RUN_NVFUSER and torch.version.cuda is not None: + CUDA_MAJOR, CUDA_MINOR = (int(x) for x in torch.version.cuda.split('.')) os.environ['PYTORCH_NVFUSER_DISABLE_FALLBACK'] = '1' os.environ['PYTORCH_NVFUSER_DISABLE_FMA'] = '1' @@ -63,38 +71,36 @@ def nvfuser_horizontal_fusion(flag): torch._C._jit_set_nvfuser_horizontal_mode(old_value) def is_pre_volta(): + if not RUN_NVFUSER: + return False prop = torch.cuda.get_device_properties(torch.cuda.current_device()) return prop.major < 7 -TEST_BF16 = torch.cuda.is_bf16_supported() +TEST_BF16 = RUN_NVFUSER and torch.cuda.is_bf16_supported() -class TestCudaFuser(JitTestCase): +class CudaFuserTestOptions(): + def __init__(self): + self.old_cpu_fuse = torch._C._jit_can_fuse_on_cpu() + self.old_gpu_fuse = torch._C._jit_can_fuse_on_gpu() + torch._C._jit_override_can_fuse_on_cpu(False) + torch._C._jit_override_can_fuse_on_gpu(False) + self.old_guard = torch._C._jit_set_nvfuser_guard_mode(False) + torch._C._debug_set_autodiff_subgraph_inlining(False) + self.old_value = torch._C._jit_set_autocast_mode(True) - special_values = torch.tensor( - [float("-inf"), -10, -math.pi, - -1, -0.5, 0, 1, 0.5, - math.pi, 10, float("inf"), - float("nan")], dtype=torch.float, device='cuda') - - int_types = [ - torch.int8, - torch.uint8, - torch.int16, - torch.int32, - torch.int64 - ] - - support_tensor_dtypes = [ - torch.int32, - torch.int64, - torch.float16, - torch.float32, - torch.float64, - torch.bool - ] - if TEST_BF16: - support_tensor_dtypes.append(torch.bfloat16) + 
if(RUN_CUDA): + self.old_nvfuser = torch._C._jit_set_nvfuser_enabled(True) + + def restore(self): + if(RUN_CUDA): + torch._C._jit_set_nvfuser_enabled(self.old_nvfuser) + torch._C._jit_override_can_fuse_on_cpu(self.old_cpu_fuse) + torch._C._jit_override_can_fuse_on_gpu(self.old_gpu_fuse) + torch._C._jit_set_nvfuser_guard_mode(self.old_guard) + torch._C._debug_set_autodiff_subgraph_inlining(True) + torch._C._jit_set_autocast_mode(self.old_value) +class TestCudaFuser(JitTestCase): def _getSubgraphInFusion(self, graph): num_node = 0 subgraph = None @@ -114,6 +120,34 @@ def count(block, ret): def setUp(self): super(TestCudaFuser, self).setUp() + + # cpu backup to avoid errors in case this is run on a CPU-only machine + dev = 'cuda' if RUN_NVFUSER else 'cpu' + self.special_values = torch.tensor( + [float("-inf"), -10, -math.pi, + -1, -0.5, 0, 1, 0.5, + math.pi, 10, float("inf"), + float("nan")], dtype=torch.float, device=dev) + + self.int_types = [ + torch.int8, + torch.uint8, + torch.int16, + torch.int32, + torch.int64 + ] + + self.support_tensor_dtypes = [ + torch.int32, + torch.int64, + torch.float16, + torch.float32, + torch.float64, + torch.bool + ] + if TEST_BF16: + self.support_tensor_dtypes.append(torch.bfloat16) + self.old_cpu_fuse = torch._C._jit_can_fuse_on_cpu() self.old_gpu_fuse = torch._C._jit_can_fuse_on_gpu() torch._C._jit_override_can_fuse_on_cpu(False) @@ -122,17 +156,12 @@ def setUp(self): torch._C._debug_set_autodiff_subgraph_inlining(False) self.old_value = torch._C._jit_set_autocast_mode(True) - if(RUN_CUDA): - self.old_nvfuser = torch._C._jit_set_nvfuser_enabled(True) + if(RUN_NVFUSER): + self.cuda_fuser_options = CudaFuserTestOptions() def tearDown(self): - if(RUN_CUDA): - torch._C._jit_set_nvfuser_enabled(self.old_nvfuser) - torch._C._jit_override_can_fuse_on_cpu(self.old_cpu_fuse) - torch._C._jit_override_can_fuse_on_gpu(self.old_gpu_fuse) - torch._C._jit_set_nvfuser_guard_mode(self.old_guard) - torch._C._debug_set_autodiff_subgraph_inlining(True) - torch._C._jit_set_autocast_mode(self.old_value) + if(RUN_NVFUSER): + self.cuda_fuser_options.restore() super(TestCudaFuser, self).tearDown() def _run_helper(self, jit_op, op, *args): @@ -168,7 +197,7 @@ def _run_training_helper(self, jit_op, op, grads, *args): )[0].graph self.assertGraphContainsExactly(bwd_graph, FUSION_GUARD, 1, consider_subgraphs=True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_half(self): @@ -194,7 +223,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor, alpha: float): self.assertGraphContains(t_jit.graph_for(x, y, z, alpha), FUSION_GUARD) @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_bfloat(self): @@ -219,7 +248,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor, alpha: float): self.assertEqual(oo, jit_oo) self.assertGraphContains(t_jit.graph_for(x, y, z, alpha), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_const(self): @@ -236,7 +265,7 @@ def t(x, y): self.assertEqual(o, 
jit_o) self.assertGraphContains(t_jit.graph_for(x, y), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_chunk(self): @@ -260,14 +289,14 @@ def t(x, y, z, q): self.assertGraphContains(t_jit.graph_for(x, y, z, q), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_reduction_dtypes_axis(self): - for op in [torch.sum, torch.mean, torch.amax]: + for op in [torch.sum, torch.mean, torch.amax, torch.var, torch.std]: for dtype in [torch.float16, torch.float32, torch.double]: - for axis in [-1, 2]: + for axis in [-1, 2, 0]: def make_func(op): def func(x: torch.Tensor): o = torch.mul(x, 2.0) @@ -285,7 +314,34 @@ def func(x: torch.Tensor): self.assertTrue(self._compare("comparing output failed", o, jit_o, 1e-4)) self.assertGraphContains(t_jit.graph_for(x), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_variance(self): + + for op in [torch.var, torch.std]: + for dtype in [torch.float16, torch.float32, torch.double]: + for axis in [-2, -1, 2, 1]: + for unbiased in [False, True]: + def make_func(op): + def func(x: torch.Tensor): + o = torch.mul(x, 2.0) + o = op(o, dim=[axis]) + return o + return func + + x = torch.randn(8, 4, 16, dtype=dtype, device="cuda") + t = make_func(op) + t_jit = torch.jit.trace(t, x) + jit_o = t_jit(x) + jit_o = t_jit(x) + o = t(x) + self.assertEqual(o.dtype, jit_o.dtype) + self.assertTrue(self._compare("comparing output failed", o, jit_o, 1e-4)) + self.assertGraphContains(t_jit.graph_for(x), FUSION_GUARD) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_scalar_input(self): @@ -303,7 +359,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): self.assertEqual(o, jit_o) self.assertGraphContains(t_jit.graph_for(x, y, 2.0), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_0(self): @@ -322,7 +378,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): subgraph = self._getSubgraphInFusion(t_jit.graph_for(x, y, 2.0)) self.assertGraphContainsExactly(subgraph, 'aten::add', 2, consider_subgraphs=False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_1(self): @@ -341,7 +397,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): subgraph = self._getSubgraphInFusion(t_jit.graph_for(x, y, 2.0)) self.assertGraphContainsExactly(subgraph, 'aten::add', 2, consider_subgraphs=False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not 
RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_2(self): @@ -360,7 +416,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): subgraph = self._getSubgraphInFusion(t_jit.graph_for(x, y, 2.0)) self.assertGraphContainsExactly(subgraph, 'aten::add', 2, consider_subgraphs=False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_3(self): @@ -382,7 +438,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): # test_broadcasting_partition_logic_X # Testing partition logic that is capable to avoid creating unsupported # broadcasting semantics in CudaFusionGroup - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_partition_logic_0(self): @@ -404,7 +460,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): subgraph = self._getSubgraphInFusion(t_jit.graph_for(x, y, z)) self.assertGraphContainsExactly(subgraph, 'aten::add', 4, consider_subgraphs=False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_partition_logic_1(self): @@ -427,7 +483,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): self.assertGraphContainsExactly(subgraph, 'aten::add', 4, consider_subgraphs=False) @unittest.skipIf(True, "Broadcast with different output not supported yet") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_multiple_output_shape(self): @@ -449,7 +505,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) @unittest.skipIf(True, "broadcast on branches can't be resolved yet") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_broadcasting_multiple_output(self): @@ -510,7 +566,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("failing case {}\n{}\n{}\n{}".format(dtype, operation, x, y), o, jit_o, 1e-2)) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_unary_ops(self): @@ -561,7 +617,7 @@ def test_unary_ops(self): self._unary_test_helper(op, dtype, False) # test special numbers self._unary_test_helper(op, dtype, True) # test random data - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_category_rule(self): @@ -621,7 +677,7 @@ def t(x: torch.Tensor, z: float): z = torch.tensor(3., 
dtype=torch.double) run_scalar(x, z) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_unary_bitwise(self): @@ -650,53 +706,173 @@ def bool_not(x: torch.Tensor, y: torch.Tensor): jitted.graph_for(x, y) # Shows up in second instance, not first self.assertGraphContains(jitted.graph_for(x, y), FUSION_GUARD) - def _binary_test_helper(self, operation, dtypes, random_data): - if isinstance(dtypes, tuple): - dtype_arg1, dtype_arg2 = dtypes - else: - dtype_arg1 = dtype_arg2 = dtypes + def _get_scalar_binary_test_fn(self, category_and_type1, category_and_type2, operation): + category1, dtype_arg1 = category_and_type1 + category2, dtype_arg2 = category_and_type2 - def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): + def t_intx_tensory(x: int, y: torch.Tensor): o = operation(x, y) - o = o + z + o = 2 + o return o - def t_int(x: torch.Tensor, y: torch.Tensor): + def t_doublex_tensory(x: float, y: torch.Tensor): o = operation(x, y) o = 2 + o return o + # Omit both scalar cases and swap cases + assert category1 == "scalar" and category2 != "scalar" + if dtype_arg1.is_floating_point: + return t_doublex_tensory + if dtype_arg1 == torch.int64 or dtype_arg1 == torch.int32: + return t_intx_tensory + raise NotImplementedError + + def _binary_test_helper(self, operation, dtypes, random_data, categories="ndim"): + if isinstance(dtypes, tuple): + dtype_arg1, dtype_arg2 = dtypes + else: + dtype_arg1 = dtype_arg2 = dtypes - def t_float(x: torch.Tensor, y: torch.Tensor): + if isinstance(categories, tuple) and random_data: + category1, category2 = categories + elif not random_data: + category1 = category2 = "ndim" + else: + category1 = category2 = categories + + def is_cpu_category(x): + return x == "0dimcpu" or x == "scalar" + + # skip unsupported cases + if is_cpu_category(category1) and is_cpu_category(category2): + return + + # only test cases with first operand as scalar + if category2 == "scalar": + return + + # skip ops that don't support scalar inputs in eager + if operation in [ + torch.atan2, + torch.max, + torch.min, + torch.remainder, # unsupported in nvfuser + ]: + if category1 == "scalar" or category2 == "scalar": + return + + if operation in [ + torch.fmod, + torch.eq, + torch.ne, + torch.ge, + torch.gt, + torch.le, + torch.lt + ]: + if category1 == "scalar": + return + + # operators that do not support bfloat16 + if operation in [torch.fmod]: + if dtype_arg1 == torch.bfloat16 or dtype_arg2 == torch.bfloat16: + return + + def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): o = operation(x, y) - o = 2. 
+ o + o = o + z return o shape = (4, 32, 32) + + shapex = shape if category1 == "ndim" else () + shapey = shape if category2 == "ndim" else () + if random_data: - x = (torch.randn(shape, dtype=torch.float, device="cuda") * 5).to(dtype_arg1) - y = (torch.randn(shape, dtype=torch.float, device="cuda") * 5).to(dtype_arg2) + x = (torch.randn(shapex, dtype=torch.float, device="cuda") * 5).to(dtype_arg1) + y = (torch.randn(shapey, dtype=torch.float, device="cuda") * 5).to(dtype_arg2) else: x = self.special_values.to(dtype=dtype_arg1) y = (torch.rand_like(self.special_values) * 5).to(dtype_arg2) + + r""" + Category conversion + """ + has_scalar = False + if category1 == "scalar": + has_scalar = True + x = x.item() + + if category1 == "0dimcpu": + x = x.to(device="cpu") + + if category2 == "scalar": + has_scalar = True + y = y.item() + + if category2 == "0dimcpu": + y = y.to(device="cpu") + z = torch.tensor([2], device="cuda").to(dtype_arg1) + is_dtype_arg1_int = dtype_arg1 == torch.int32 or dtype_arg1 == torch.int64 + is_dtype_arg2_int = dtype_arg2 == torch.int32 or dtype_arg2 == torch.int64 + + if operation in [torch.pow]: + if is_dtype_arg1_int and is_dtype_arg2_int: + if category2 == "scalar": + # RuntimeError: Integers to negative integer powers are not allowed + y = abs(y) + if category2 == "0dimcpu" and y == -1: + # https://github.com/pytorch/pytorch/issues/73196 + y = y - 1 + if category2 == "0dimcpu" and y == -2: + # avoid pow(0, -2), which gives inconsistent results on integer tensor + y = y - 1 # Avoid division by zero for integer tensors div_like = [torch.div, torch.fmod, torch.remainder] if operation in div_like and (dtype_arg2 == torch.int32 or dtype_arg2 == torch.int64): y[y == 0] = 1 - for test_fn in [t, t_int, t_float]: - o = t(x, y, z) - t_jit = torch.jit.script(t) - jit_o = t_jit(x, y, z) - jit_o = t_jit(x, y, z) - jit_o = t_jit(x, y, z) + test_value = True + if dtype_arg1 == torch.half or dtype_arg2 == torch.half: + test_value = False + if dtype_arg1 == torch.bfloat16 or dtype_arg2 == torch.bfloat16: + test_value = False - self.assertEqual(o.dtype, jit_o.dtype) - self.assertEqual(o, jit_o) - self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) + try: + if not has_scalar: + o = t(x, y, z) + t_jit = torch.jit.script(t) + jit_o = t_jit(x, y, z) + jit_o = t_jit(x, y, z) + jit_o = t_jit(x, y, z) + + self.assertEqual(o.dtype, jit_o.dtype) + if test_value: + self.assertEqual(o, jit_o) + self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) + + elif category2 != "scalar": # only test the case where first is scalar + test_fn = self._get_scalar_binary_test_fn((category1, dtype_arg1), (category2, dtype_arg2), operation) + o = test_fn(x, y) + t_jit = torch.jit.script(test_fn) + jit_o = t_jit(x, y) + jit_o = t_jit(x, y) + jit_o = t_jit(x, y) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + self.assertEqual(o.dtype, jit_o.dtype) + if test_value: + self.assertEqual(o, jit_o) + self.assertGraphContains(t_jit.graph_for(x, y), FUSION_GUARD) + except Exception as e: + print("failing test for op: ", operation.__name__) + print("with input\n\tx: ", x) + print("\ty: ", y) + print("\tz: ", z) + raise e + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_binary_ops(self): @@ -704,14 +880,12 @@ def test_binary_ops(self): data_types = [ torch.int32, torch.int64, - # torch.float16, + torch.float16, torch.float32, torch.float64 ] - ''' if TEST_BF16: 
data_types.append(torch.bfloat16) - ''' operations = [torch.mul, torch.div, torch.atan2, @@ -726,12 +900,24 @@ def test_binary_ops(self): torch.gt, torch.le, torch.lt] - binary_dtype_combinations = itertools.combinations(data_types, 2) + + category_types = [ + "scalar", + "0dim", + "0dimcpu", + "ndim" + ] + + binary_dtype_combinations = list(itertools.combinations(data_types, 2)) + category_combinations = list(itertools.combinations(category_types, 2)) + + for op, dtypes, categories in itertools.product(operations, binary_dtype_combinations, category_combinations): + self._binary_test_helper(op, dtypes, True, categories) # random data + for op, dtypes in itertools.product(operations, binary_dtype_combinations): - self._binary_test_helper(op, dtypes, True) # random data self._binary_test_helper(op, dtypes, False) # special numbers - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_binary_bitwise(self): @@ -778,7 +964,7 @@ def jit_xor(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): self.assertEqual(o, jit_o) self.assertGraphContains(jitted.graph_for(x, y, z), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_type_as_op(self): @@ -835,7 +1021,7 @@ def threshold(x: torch.Tensor, th: int, val: int): threshold_jit = torch.jit.script(threshold) self._run_helper(threshold_jit, threshold, x, arg2, arg3) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_ternary_ops_integer_compatibility(self): @@ -888,7 +1074,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor, alpha: torch.Tensor): self.assertEqual(o, jit_o) self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_ternary_ops_type_promotion(self): @@ -910,7 +1096,7 @@ def test_ternary_ops_type_promotion(self): self._ternary_test_helper(op, dtypes, False) # special numbers # We can't test the scalar version of rsub from python - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_rsub(self): x = torch.randn(4, 8, 32, 32, dtype=torch.float, device="cuda") @@ -924,7 +1110,7 @@ def rsub(x: torch.Tensor, y: torch.Tensor): rsub_jit = torch.jit.script(rsub) self._run_helper(rsub_jit, rsub, x, y) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") # legacy fuser does not work for rand_like, see issue #34361 @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_ternary_ops(self): @@ -976,7 +1162,7 @@ def lerp_scale(x: torch.Tensor, y: torch.Tensor, z: float): lerp_scale_jit = torch.jit.script(lerp_scale) self._run_helper(lerp_scale_jit, lerp_scale, x, y, 0.5) - @unittest.skipIf(not RUN_CUDA, "requires 
CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires profiling node to run cuda fuser") def test_addcmul_ops(self): x = torch.randn(4, 8, 32, 32, dtype=torch.float, device="cuda") @@ -1004,7 +1190,7 @@ def addcmul_const_alpha(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): addcmul_const_alpha_jit = torch.jit.script(addcmul_const_alpha) self._run_helper(addcmul_const_alpha_jit, addcmul_const_alpha, x, y, z) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dynamic_size(self): @@ -1044,7 +1230,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: float): self.assertGraphContains(t_jit.graph_for(x, y, 2.0), FUSION_GUARD) torch._C._jit_set_nvfuser_guard_mode(old_guard) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_random_topo(self): @@ -1095,7 +1281,7 @@ def t(x: torch.Tensor, y: torch.Tensor): # we are testing inputs with all combination of permutation order, just to # ensure that integration would be able to generate functionally correct # kernels - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_binary_ops_permutation(self): @@ -1109,7 +1295,7 @@ def test_binary_ops_permutation(self): x = [7, 8, 12] self._permutation_helper(x, b_axis, torch.float32, "cuda", perm0, perm1) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_binary_ops_channels_last_with_bcast(self): @@ -1160,7 +1346,7 @@ def forward(self, x: torch.Tensor, y: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, y), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_reduction(self): @@ -1210,7 +1396,7 @@ def _layer_norm_autodiff_helper(self, model, grad, shapes, args): FileCheck().check(FUSION_GUARD).run(v2.graph) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_layer_norm_autodiff(self): @@ -1252,7 +1438,7 @@ def t(shapes: List[int], x, eps: float, cudnn: bool): self._layer_norm_autodiff_helper(m, grad, shapes, args) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_layer_norm_parser(self): @@ -1312,7 +1498,7 @@ def forward(self, x: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x), 
FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_native_layer_norm(self): @@ -1326,7 +1512,7 @@ def test_native_layer_norm(self): self._native_layer_norm_helper(input_shape, norm_shape, torch.float32, "cuda", 1e-4, affine) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_native_layer_norm_half(self): @@ -1339,7 +1525,7 @@ def test_native_layer_norm_half(self): self._native_layer_norm_helper(input_shape, norm_shape, torch.float16, "cuda", 5e-3) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -1352,7 +1538,15 @@ def test_native_layer_norm_bfloat(self): norm_shape = [input_shape[idx] for idx in range(dims - offset, dims)] self._native_layer_norm_helper(input_shape, norm_shape, torch.bfloat16, "cuda", 1e-1) - def _norm_helper(self, shape, dtype, device, error, is_batch_norm_else_instance_norm, memory_format=torch.contiguous_format): + def _norm_helper(self, + shape, + dtype, + device, + error, + is_batch_norm_else_instance_norm, + memory_format=torch.contiguous_format, + *, + layer_dtype=torch.float32): class MyBatchNorm(torch.nn.Module): def __init__(self): super(MyBatchNorm, self).__init__() @@ -1374,8 +1568,8 @@ def forward(self, x: torch.Tensor, r_mean: torch.Tensor, r_var: torch.Tensor): t = MyBatchNorm() if is_batch_norm_else_instance_norm else MyInstanceNorm() x = torch.randn(shape, dtype=dtype, device=device).to(memory_format=memory_format) - running_mean = torch.zeros(shape[1], dtype=torch.float32, device=device) - running_var = torch.ones(shape[1], dtype=torch.float32, device=device) + running_mean = torch.zeros(shape[1], dtype=layer_dtype, device=device) + running_var = torch.ones(shape[1], dtype=layer_dtype, device=device) t_jit = torch.jit.script(t) eager_running_mean = running_mean.clone() @@ -1400,7 +1594,38 @@ def forward(self, x: torch.Tensor, r_mean: torch.Tensor, r_var: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, running_mean, running_var), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_layer_norm_trivial_reduce_dim(self): + def t_wb(shapes: List[int], x, w, b, eps: float, cudnn: bool): + o = torch.layer_norm(x, shapes, w, b, eps, cudnn) + o = torch.relu(o) + return o + + batch = [1] + shapes = [2, 7, 3] + + grad = torch.randn(batch + shapes, dtype=torch.float32, device="cuda") + args = [torch.randn(batch + shapes, dtype=torch.float32, device="cuda").requires_grad_()] + args.append(torch.randn(shapes, dtype=torch.float32, device="cuda").requires_grad_()) + 
args.append(torch.randn(shapes, dtype=torch.float32, device="cuda").requires_grad_()) + self._layer_norm_autodiff_helper(t_wb, grad, shapes, args) + + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_norm_half_layer(self): + size = [2, 4, 2, 2] + + for is_batch_norm_else_instance_norm in [False, True]: + for mf in [torch.channels_last, torch.contiguous_format]: + self._norm_helper(size, torch.float16, "cuda", 1e-3, is_batch_norm_else_instance_norm, + memory_format=mf, layer_dtype=torch.float16) + + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_norm_channels_last(self): @@ -1412,7 +1637,7 @@ def test_norm_channels_last(self): self._norm_helper(size, torch.float32, "cuda", 1e-4, is_batch_norm_else_instance_norm, memory_format=mf) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_norm(self): @@ -1429,7 +1654,7 @@ def test_norm(self): self._norm_helper(x, torch.float32, "cuda", 1e-4, is_batch_norm_else_instance_norm) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_norm_large(self): @@ -1445,7 +1670,7 @@ def test_norm_large(self): self._norm_helper(x, torch.float32, "cuda", 1e-4, is_batch_norm_else_instance_norm) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_norm_half(self): @@ -1462,7 +1687,7 @@ def test_norm_half(self): self._norm_helper(x, torch.float16, "cuda", 5e-3, is_batch_norm_else_instance_norm) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -1479,7 +1704,7 @@ def test_norm_bfloat(self): x[1] = C self._norm_helper(x, torch.bfloat16, "cuda", 1e-1, is_batch_norm_else_instance_norm) - def _softmax_helper(self, shape, reduction_axis, dtype, device, error): + def _softmax_helper(self, shape, reduction_axis, is_log_softmax, dtype, device, error): class MySoftmax(torch.nn.Module): __constants__ = ['reduction_axis'] @@ -1492,22 +1717,40 @@ def forward(self, x: torch.Tensor, y: torch.Tensor): o = torch.nn.functional.softmax(o, dim=self.reduction_axis) return o - t = MySoftmax() + class MyLogSoftmax(torch.nn.Module): + __constants__ = ['reduction_axis'] - x = torch.randn(shape, dtype=dtype, device=device) - y = 
torch.randn(shape, dtype=dtype, device=device) + def __init__(self): + super(MyLogSoftmax, self).__init__() + self.reduction_axis = reduction_axis + + def forward(self, x: torch.Tensor, y: torch.Tensor): + o = torch.add(x, y) + o = torch.nn.functional.log_softmax(o, dim=self.reduction_axis) + return o + + gradient_check = (dtype == torch.float64) + t = MyLogSoftmax() if is_log_softmax else MySoftmax() + + x = torch.randn(shape, dtype=dtype, device=device, requires_grad=gradient_check) + y = torch.randn(shape, dtype=dtype, device=device, requires_grad=gradient_check) t_jit = torch.jit.script(t) jit_o = t_jit(x, y) jit_o = t_jit(x, y) - o = t(x, y) - self.assertEqual(o.dtype, jit_o.dtype) - # numerical issues here due to our scheduling. - # can't use `self.assertEqual(o, jit_o)` - self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) - self.assertGraphContains(t_jit.graph_for(x, y), FUSION_GUARD) + jit_o = t_jit(x, y) + + if gradient_check: + gradcheck(t_jit.forward, [x, y], nondet_tol=1e-5) + else: + o = t(x, y) + self.assertEqual(o.dtype, jit_o.dtype) + # numerical issues here due to our scheduling. + # can't use `self.assertEqual(o, jit_o)` + self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) + self.assertGraphContains(t_jit.graph_for(x, y), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_softmax_dtype(self): @@ -1549,7 +1792,7 @@ def t(x: torch.Tensor, y: torch.Tensor): FileCheck().check(FUSION_GUARD).run(bwd_graph) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test__softmax_function(self): @@ -1573,7 +1816,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x, y), FUSION_GUARD, 1, consider_subgraphs=True) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test__softmax_function_half_to_float(self): @@ -1597,7 +1840,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x, y), FUSION_GUARD, 1, consider_subgraphs=True) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_softmax(self): @@ -1606,14 +1849,21 @@ def test_softmax(self): output_size = int(pow(output_size, 1. 
/ dims)) reduction_sizes = [67, 256, 1024, 4096] + # gradient check + for reduction_dim in range(dims): + for is_log_softmax in [False, True]: + shape = [output_size for idx in range(dims)] + self._softmax_helper(shape, reduction_dim, is_log_softmax, torch.float64, "cuda", 1e-4) + for reduction_dim in range(dims): for reduction_size in reduction_sizes: x = [output_size for idx in range(dims)] x[reduction_dim] = reduction_size - self._softmax_helper(x, reduction_dim, torch.float32, "cuda", 1e-4) + for is_log_softmax in [False, True]: + self._softmax_helper(x, reduction_dim, is_log_softmax, torch.float32, "cuda", 1e-4) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_softmax_half(self): @@ -1626,10 +1876,11 @@ def test_softmax_half(self): for reduction_size in reduction_sizes: x = [output_size for idx in range(dims)] x[reduction_dim] = reduction_size - self._softmax_helper(x, reduction_dim, torch.float16, "cuda", 5e-3) + for is_log_softmax in [False, True]: + self._softmax_helper(x, reduction_dim, is_log_softmax, torch.float16, "cuda", 5e-3) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -1643,10 +1894,11 @@ def test_softmax_bfloat(self): for reduction_size in reduction_sizes: x = [output_size for idx in range(dims)] x[reduction_dim] = reduction_size - self._softmax_helper(x, reduction_dim, torch.bfloat16, "cuda", 1e-1) + for is_log_softmax in [False, True]: + self._softmax_helper(x, reduction_dim, is_log_softmax, torch.bfloat16, "cuda", 1e-1) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_reduction_permutation(self): @@ -1660,7 +1912,7 @@ def test_reduction_permutation(self): self._reduction_helper(x, axes, torch.float32, "cuda", perm0, perm1) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_reduction_multiple_output(self): @@ -1699,7 +1951,7 @@ def t(x: torch.Tensor, y: torch.Tensor, scale: float, z: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, y, scale, z), FUSION_GUARD) torch._C._jit_set_nvfuser_guard_mode(old_guard) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_channels_last_with_broadcast(self): @@ -1805,7 +2057,7 @@ def t(x: torch.Tensor, y: torch.Tensor): ''' @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, 
"requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_pw_single_reduction_partition(self): @@ -1830,7 +2082,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_permutation_preservation(self): @@ -1868,7 +2120,7 @@ def t(x: torch.Tensor): self.assertTrue(jit_o.is_contiguous(memory_format=torch.channels_last)) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_normalization_partition(self): @@ -1896,7 +2148,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor, r_mean: torch.Tensor, r self.assertGraphContains(t_jit.graph_for(x, y, z, r_m, r_v), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_sum_to_one(self): @@ -1917,7 +2169,7 @@ def t(x: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_single_reduction_broadcast(self): @@ -1941,7 +2193,7 @@ def t(x: torch.Tensor, y: torch.Tensor, z: torch.Tensor): self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_trivial_reduction(self): @@ -1962,7 +2214,7 @@ def t(x: torch.Tensor): self.assertEqual(o, jit_o) self.assertGraphContains(t_jit.graph_for(x), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_profiling_node(self): @@ -1978,7 +2230,7 @@ def repro(x: torch.Tensor, alpha: float): self._run_helper(repro_jit, repro, x, 0.6) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_reduction_sizes_op(self): @@ -2002,7 +2254,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x, y), FUSION_GUARD, 0) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta 
device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_profile_ivalue(self): @@ -2025,7 +2277,7 @@ def t(x: torch.Tensor, y: torch.Tensor, dim: List[int], keepdim: bool): self.assertGraphContains(t_jit.graph_for(x, y, (0, 1), False), FUSION_GUARD) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_sum_to_size(self): @@ -2059,7 +2311,7 @@ def t(x: torch.Tensor, y: torch.Tensor, new_size: List[int]): self.assertEqual(o, jit_o) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_grad_sum_to_size(self): @@ -2118,7 +2370,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertEqual(x.grad, ref_x.grad) self.assertEqual(y.grad, ref_y.grad) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dropout_inference_fusion(self): @@ -2135,7 +2387,7 @@ def t(x: torch.Tensor, p: float, train: bool): self._run_helper(t_jit, t, x, 0.15, False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dropout_train_nograd_fusion(self): @@ -2152,7 +2404,7 @@ def t(x: torch.Tensor, p: float, train: bool): self._run_helper(t_jit, t, x, 0.0, True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dropout_train_nograd_prob_check(self): @@ -2183,7 +2435,7 @@ def t(x: torch.Tensor, p: float, train: bool): self.assertGraphContainsExactly(t_jit.graph_for(x, prob, True), FUSION_GUARD, 1, consider_subgraphs=True) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dropout_training_fusion(self): @@ -2214,7 +2466,7 @@ def t2(x: torch.Tensor, p: float, train: bool): # numbers between eager mode and the jit is different self._run_training_helper(t2_jit, t2, grads, x, 0.0, True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_gelu(self): @@ -2234,7 +2486,7 @@ def t(x: torch.Tensor, mode : str): self._run_training_helper(t_jit, t, grads, x, 'tanh') torch._C._jit_set_nvfuser_guard_mode(old_guard) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") 
@unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_dropout_training_prob_check(self): @@ -2267,13 +2519,15 @@ def t(x: torch.Tensor, p: float, train: bool): self.assertTrue((percent_zeros >= (prob - 0.01)) and (percent_zeros <= (prob + 0.01))) self.assertGraphContainsExactly(t_jit.graph_for(x, prob, True), FUSION_GUARD, 1, consider_subgraphs=True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_linear(self): in_feature = 2 out_feature = 8 - x = torch.randn(4, in_feature, dtype=torch.float32, device='cuda') + # Changing the input dims to be 3-D to avoid eager mode bias fusion + # The bias fusion causes some precision issues with TF-32 + x = torch.randn(2, 4, in_feature, dtype=torch.float32, device='cuda') weight = torch.randn(out_feature, in_feature, dtype=torch.float32, device='cuda') bias = torch.randn(out_feature, dtype=torch.float32, device='cuda') @@ -2292,7 +2546,7 @@ def t(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor): # have been optimized away self.assertGraphContainsExactly(t_jit.graph_for(x, weight, bias), FUSION_GUARD, 1) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_backward_type(self): @@ -2335,7 +2589,7 @@ def test1(x: torch.Tensor, y: torch.Tensor): self.assertEqual(y.grad.dtype, y.dtype) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_autocast_1(self): @@ -2372,7 +2626,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertEqual(y.grad.dtype, y.dtype) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_autocast_2(self): @@ -2408,7 +2662,7 @@ def t(x: torch.Tensor): self.assertEqual(x.grad.dtype, x.dtype) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -2446,7 +2700,7 @@ def t(x: torch.Tensor, y: torch.Tensor): self.assertEqual(y.grad.dtype, y.dtype) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -2482,7 +2736,7 @@ def t(x: torch.Tensor): self.assertEqual(jit_o.dtype, torch.float) self.assertEqual(x.grad.dtype, x.dtype) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires 
CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_to_dtype_fp32_to_fp16(self): @@ -2501,7 +2755,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.half) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_to_dtype_fp16_to_fp32(self): @@ -2520,7 +2774,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.float) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_to_dtype_fp16_to_fp16(self): @@ -2539,7 +2793,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.half) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -2559,7 +2813,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.bfloat16) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -2579,7 +2833,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.float) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @unittest.skipIf(not TEST_BF16, "device does not support BFloat16") @@ -2599,7 +2853,7 @@ def t(x: torch.Tensor): self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1) self.assertEqual(jit_o.dtype, torch.bfloat16) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(not TEST_MULTIGPU, "requires multiple CUDA device") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @@ -2621,7 +2875,7 @@ def t(x): x = x.to("cuda:1") jit_o = t_jit(x) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_graph_for_with_missing_optimized_engine(self): @@ -2648,7 +2902,7 @@ def t(x: torch.Tensor, flag: bool): # have been optimized away self.assertGraphContainsExactly(t_jit.graph_for(x, True), FUSION_GUARD, 1, True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_branches(self): @@ -2678,7 +2932,7 @@ def t(x: torch.Tensor, weight: 
torch.Tensor, bias: torch.Tensor, flag: bool): # have been optimized away self.assertGraphContainsExactly(t_jit.graph_for(x, weight, bias, True), FUSION_GUARD, 1) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_scalar_tensor(self): @@ -2701,7 +2955,7 @@ def t(x: torch.Tensor): @unittest.skipIf(os.environ.get('PYTORCH_NO_CUDA_MEMORY_CACHING') is not None, "skipping graph_rng when caching allocator is disabled") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(CUDA_MAJOR < 11, "requires CUDA11 or above") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") @@ -2858,7 +3112,7 @@ def forward(self, x): e0)) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_batch_norm_half(self): @@ -2873,7 +3127,25 @@ def test_batch_norm_half(self): self._test_batch_norm_impl_index_helper(4, 8, 5, affine, track_running_stats, training, torch.half) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_batch_norm_impl_index_inner_bcast(self): + # the repro + self._test_batch_norm_impl_index_helper(2, 1, 1, False, True, True) + + # running the full set + setups = [ + [True, True], + [False, False], + [True, False], + [False, True]] + for training_and_track, affine in itertools.product(setups, [True, False]): + training, track_running_stats = training_and_track + self._test_batch_norm_impl_index_helper(2, 1, 1, affine, track_running_stats, training) + + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_batch_norm_impl_index_correctness(self): @@ -2897,7 +3169,7 @@ def test_batch_norm_impl_index_correctness(self): training, track_running_stats = training_and_track self._test_batch_norm_impl_index_helper(b, c, hw, affine, track_running_stats, training) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_softplus_fuser(self): @@ -2923,7 +3195,7 @@ def shifted_softplus(x: torch.Tensor, shift: float): assert torch.allclose(jit_grad, aten_grad) self.assertGraphContains(jitted.graph_for(inp, 0.693147), FUSION_GROUP, True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_inplace_removal(self): @@ -2943,7 +3215,7 @@ def t(x: torch.Tensor): self.assertGraphContains(graph, 'aten::add', True) self.assertGraphContains(graph, 'aten::relu', True) - @unittest.skipIf(not RUN_CUDA, 
"requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_conv2d_bias(self): @@ -2984,11 +3256,11 @@ def t_bias(x: torch.Tensor, w: torch.Tensor, bias: torch.Tensor): jit_o = jitted_bias(inp, weight, bias) graph = jitted_bias.graph_for(inp) - self.assertGraphContainsExactly(graph, FUSION_GROUP, 0) + self.assertGraphContains(graph, FUSION_GROUP, True) self.assertGraphContains(graph, 'prim::add_optional', True) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_remove_output_used_only_in_dtype(self): @@ -3021,7 +3293,7 @@ def forward(self, x, y): self.assertGraphContains(graph, FUSION_GROUP, True) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_fix_shape_expression_bn(self): @@ -3053,31 +3325,6 @@ def forward(self, x, y): graph = jitted.graph_for(x, y) self.assertGraphContains(graph, FUSION_GROUP, True) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") - @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, - "Requires fusion optimization pass to be effective") - def test_linear_1d_weight_mismatch_bias_dtype(self): - def t(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor): - o = torch.nn.functional.linear(x, w, b) - return o.relu() - - device = "cuda" - jitted = torch.jit.script(t) - x = torch.randn(2, 5, 5, dtype=torch.half, device=device) - w = torch.randn(5, dtype=torch.half, device=device) - b = torch.randn(5, dtype=torch.float32, device=device) - - for i in range(3): - jit_o = jitted(x, w, b) - jit_o = jitted(x, w, b) - o = t(x, w, b) - self.assertEqual(o, jit_o) - self.assertEqual(o.dtype, jit_o.dtype) - self.assertEqual(o.size(), jit_o.size()) - graph = jitted.graph_for(x, w, b) - self.assertGraphContains(graph, FUSION_GROUP, True) - self.assertGraphContains(graph, 'aten::matmul', True) - def _run_fwd_helper(self, func, ops, *args): jitted = torch.jit.script(func) for i in range(3): @@ -3093,7 +3340,7 @@ def _run_fwd_helper(self, func, ops, *args): self.assertGraphContainsExactly(graph, op, 0) @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_sibling_fusion(self): @@ -3114,7 +3361,7 @@ def t2(x: torch.Tensor, y: torch.Tensor): return o1, o2 self._run_fwd_helper(t2, ['aten::sum', 'aten::mul'], x, y) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_clean_profile_ivalue(self): @@ -3136,7 +3383,7 @@ def t(x: torch.Tensor, flag: bool): graph = jit_t.graph_for(x, True) out = jit_t(x, False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != 
ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_sibling_fusion_no_scalar_inputs(self): @@ -3187,7 +3434,9 @@ def forward(self, inputs : torch.Tensor, view_shape : List[int]): self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) graph = t_jit.graph_for(x, output_shape) - has_inferred_dimension = any([dim == -1 for dim in output_shape]) + # TODO: revert disabled aten::view + # has_inferred_dimension = any([dim == -1 for dim in output_shape]) + has_inferred_dimension = True if has_inferred_dimension: # prohibit fusing when view_shape contains an inferred dimension self.assertGraphContainsExactly(graph, FUSION_GROUP, 0) @@ -3204,27 +3453,28 @@ def __init__(self): with torch.no_grad(): self.bias.fill_(10) - def forward(self, inputs : torch.Tensor, view_shape : List[int]): + def forward(self, inputs : torch.Tensor, bias : torch.Tensor, view_shape : List[int]): o = inputs.view(view_shape) - inputs = inputs * self.bias + inputs.add_(bias) return torch.relu(o) t = BiasViewRelu() x = torch.randn(shape, dtype=dtype, device=device, requires_grad=False) + bias = torch.randn(shape, dtype=dtype, device=device, requires_grad=False) t_jit = torch.jit.script(t) # profiling - jit_o = t_jit(x, output_shape) + jit_o = t_jit(x.clone(), bias, output_shape) # optimization - jit_o = t_jit(x, output_shape) + jit_o = t_jit(x.clone(), bias, output_shape) # final - jit_o = t_jit(x, output_shape) + jit_o = t_jit(x.clone(), bias, output_shape) # eager - baseline - o = t(x, output_shape) + o = t(x.clone(), bias, output_shape) self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) - graph = t_jit.graph_for(x, output_shape) + graph = t_jit.graph_for(x, bias, output_shape) self.assertGraphContainsExactly(graph, FUSION_GUARD, 0) self.assertGraphContainsExactly(graph, 'prim::view_copy', 0) @@ -3334,7 +3584,7 @@ def _view_test_generator(self, ndims, test_fn): total += 1 test_fn(all_views[idx], all_views[jdx], torch.float, 'cuda', 1e-6) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_view(self): @@ -3344,6 +3594,47 @@ def test_view(self): self._view_test_generator(ndims, self._bias_view_relu_helper) self._alias_bias_view_relu_helper([2, 3, 4, 5], [1, 6, 1, 2, 2, 5, 1], torch.float, 'cuda', 1e-6) + def _ltc_helper(self, shape, dtype, device, error, approximate=True): + # modeled after LTC linear layer + class LTC(torch.nn.Module): + def __init__(self): + super(LTC, self).__init__() + self.weight = torch.nn.Parameter(torch.randn([1024, 1024], dtype=dtype, device=device), requires_grad=False) + self.bias = torch.nn.Parameter(torch.randn([1, 1024], dtype=dtype, device=device), requires_grad=False) + + def forward(self, inputs : torch.Tensor): + o = inputs.view([32768, 1024]) + o = torch.mm(o, self.weight) + o = o.view([256, 128, 1024]) + o = o + self.bias + o = o.view([32768, 1024]) + o = o.view([256, 128, 1024]) + return torch.nn.functional.gelu(o) + + t = LTC() + x = torch.randn(shape, dtype=dtype, device=device, requires_grad=False) + t_jit = torch.jit.script(t) + + # profile/optimization runs + for i in range(3): + jit_o = t_jit(x) + o = t(x) + + self.assertEqual(o.dtype, jit_o.dtype) + self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) + graph = t_jit.graph_for(x) + # TODO: revert disabled aten::view + # 
self.assertGraphContains(graph, FUSION_GUARD) + # self.assertGraphContains(graph, 'prim::view_copy', True) + self.assertGraphContainsExactly(graph, FUSION_GUARD, 0) + self.assertGraphContainsExactly(graph, 'prim::view_copy', 0, True) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_nested_view(self): + self._ltc_helper([256, 128, 1024], torch.float, 'cuda', 1e-6) + def _bias_squeeze_relu_helper(self, shape, dtype, device, error): class BiasSqueezeRelu(torch.nn.Module): def __init__(self): @@ -3366,7 +3657,7 @@ def forward(self, inputs : torch.Tensor, bias : torch.Tensor): self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) - graph = t_jit.graph_for(x) + graph = t_jit.graph_for(x, bias) self.assertGraphContains(graph, FUSION_GUARD) self.assertGraphContains(graph, 'prim::squeeze_copy', True) @@ -3377,7 +3668,7 @@ def __init__(self): def forward(self, inputs : torch.Tensor, bias : torch.Tensor): o = torch.squeeze(inputs) - inputs = inputs * bias + inputs.add_(bias) return torch.relu(o) t = BiasSqueezeRelu() @@ -3385,10 +3676,10 @@ def forward(self, inputs : torch.Tensor, bias : torch.Tensor): bias = torch.randn(shape, dtype=dtype, device=device, requires_grad=False) t_jit = torch.jit.script(t) - jit_o = t_jit(x, bias) - jit_o = t_jit(x, bias) - jit_o = t_jit(x, bias) - o = t(x, bias) + jit_o = t_jit(x.clone(), bias) + jit_o = t_jit(x.clone(), bias) + jit_o = t_jit(x.clone(), bias) + o = t(x.clone(), bias) self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) @@ -3396,13 +3687,37 @@ def forward(self, inputs : torch.Tensor, bias : torch.Tensor): self.assertGraphContainsExactly(graph, FUSION_GUARD, 0) self.assertGraphContainsExactly(graph, 'prim::squeeze_copy', 0) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_squeeze(self): self._bias_squeeze_relu_helper([1, 6, 1, 2, 2, 5, 1], torch.float, 'cuda', 1e-6) self._alias_bias_squeeze_relu_helper([1, 6, 1, 2, 2, 5, 1], torch.float, 'cuda', 1e-6) + # remove this after opinfo tests are enabled + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_squeeze_zero(self): + x = torch.tensor(1.0, dtype=torch.float, device="cuda") + + def squeeze_0(x: torch.Tensor): + o = x + 1. + o = torch.squeeze(o, 0) + o = o * 2. + return o + + def squeeze_1(x: torch.Tensor): + o = x + 1. 
+ o = torch.squeeze(o, -1) + o = o + .5 + return o + + squeeze_0_jit = torch.jit.script(squeeze_0) + self._run_helper(squeeze_0_jit, squeeze_0, x) + squeeze_1_jit = torch.jit.script(squeeze_1) + self._run_helper(squeeze_1_jit, squeeze_1, x) + def _bias_unsqueeze_relu_helper(self, shape, dtype, device, error): class BiasUnsqueezeRelu(torch.nn.Module): def __init__(self): @@ -3425,7 +3740,7 @@ def forward(self, inputs : torch.Tensor, bias : torch.Tensor): self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) - graph = t_jit.graph_for(x) + graph = t_jit.graph_for(x, bias) self.assertGraphContains(graph, FUSION_GUARD) self.assertGraphContains(graph, 'prim::unsqueeze_copy', True) @@ -3435,9 +3750,8 @@ def __init__(self): super(BiasUnsqueezeRelu, self).__init__() def forward(self, inputs : torch.Tensor, bias : torch.Tensor): - o = torch.squeeze(inputs) o = torch.unsqueeze(inputs, 0) - inputs = inputs * bias + inputs.add_(bias) return torch.relu(o) t = BiasUnsqueezeRelu() @@ -3445,25 +3759,25 @@ def forward(self, inputs : torch.Tensor, bias : torch.Tensor): bias = torch.randn(shape, dtype=dtype, device=device, requires_grad=False) t_jit = torch.jit.script(t) - jit_o = t_jit(x, bias) - jit_o = t_jit(x, bias) - jit_o = t_jit(x, bias) - o = t(x, bias) + jit_o = t_jit(x.clone(), bias) + jit_o = t_jit(x.clone(), bias) + jit_o = t_jit(x.clone(), bias) + o = t(x.clone(), bias) self.assertEqual(o.dtype, jit_o.dtype) self.assertTrue(self._compare("comparing output failed", o, jit_o, error)) - graph = t_jit.graph_for(x) + graph = t_jit.graph_for(x, bias) self.assertGraphContainsExactly(graph, FUSION_GUARD, 0) self.assertGraphContainsExactly(graph, 'prim::unsqueeze_copy', 0) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_unsqueeze(self): self._bias_unsqueeze_relu_helper([2, 3, 4, 5], torch.float, 'cuda', 1e-6) self._alias_bias_unsqueeze_relu_helper([2, 3, 4, 5], torch.float, 'cuda', 1e-6) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_alias_pass_fix(self): @@ -3479,7 +3793,7 @@ def t(x, w, b): t_jit = torch.jit.script(t) self._run_helper(t_jit, t, x, w, b) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_squeeze_negative_dim(self): @@ -3494,7 +3808,7 @@ def t(x): t_jit = torch.jit.script(t) self._run_helper(t_jit, t, x) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_singleton_fusion(self): @@ -3507,7 +3821,32 @@ def t(x): t_jit = torch.jit.script(t) self._run_helper(t_jit, t, x) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_issue1445_fusion(self): + def f(t0, t1, t2, t3): + masked_input = torch.where(t1, t2, t3) + total = masked_input.sum([0, 1, 2, 3]) + sizes : 
List[int] = [] + t10 = torch.reshape(t0, sizes) + t7 = total / t10 + t4 = t7.to(dtype=torch.float) + return t4 + + x = torch.randn(1, 1, 1, 1, device='cuda').to(dtype=torch.long) + y = torch.randn(3, 2, 1, 1, device='cuda').to(dtype=torch.bool).expand([3, 2, 1, 2]) + z = torch.randn(3, 2, 1, 2, device='cuda') + w = torch.tensor(1.5, device='cuda') + + f_jit = torch.jit.script(f) + for i in range(5): + out_jit = f_jit(x, y, z, w) + out = f(x, y, z, w) + self.assertEqual(out, out_jit) + self.assertGraphContainsExactly(f_jit.graph_for(x, y, z, w), FUSION_GROUP, 1) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_disable_sibling_fuse(self): @@ -3528,7 +3867,7 @@ def t(x, y, s): # sibling fusion should be disabled with the flag self.assertGraphContainsExactly(t_jit.graph_for(x, y, s), FUSION_GUARD, 0) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_build_shape_expression_native_dropout(self): @@ -3550,7 +3889,7 @@ def t(x): self.assertEqual(oo, jit_oo) self.assertGraphContains(t_jit.graph_for(x), FUSION_GUARD) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_scalar_tensor_permuted(self): @@ -3564,7 +3903,7 @@ def t(x, y): t_jit = torch.jit.script(t) self._run_helper(t_jit, t, x, y) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_cpu_scalar(self): @@ -3609,7 +3948,7 @@ def t3(x, y, z): self.assertGraphContainsExactly(t3.graph_for(x, y, z), FUSION_GUARD, 1) self.assertGraphContainsExactly(t3.graph_for(x, y, z), 'aten::add', 1) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_shape_expression(self): @@ -3660,9 +3999,410 @@ def run(fn): for t in [t_unsqueeze, t_squeeze, t_squeeze_dim, t_squeeze_dim_no_op]: run(t) + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_scalar_cuda_tensor(self): + x = torch.tensor(2.0, device="cuda") + + with nvfuser_singleton_fusion(True): + def t(x): + return x + 1.0 + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x) + + @torch.jit.script + def t_jitted(x): + return x.sum(0) + + for i in range(5): + t_jitted(x) + self.assertGraphContainsExactly(t_jitted.graph_for(x), FUSION_GUARD, 0) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_overlapped_input(self): + x = torch.randn(8, device="cuda").as_strided((2, 4), (1, 1)) + + with nvfuser_singleton_fusion(True): + def t(x): + return x + 1.0 + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires 
fusion optimization pass to be effective") + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + def test_reduction_empty_axes(self): + x = torch.randn(4, 2, 3, device="cuda").permute([1, 2, 0]) + + with nvfuser_singleton_fusion(True): + def t(x): + sizes : List[int] = [] + return x.sum(sizes) + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + def test_int_tensor_input(self): + x = torch.randn(4, 2, device="cuda").to(dtype=torch.int) + + with nvfuser_singleton_fusion(True): + def t(x): + return x.amax(dim=0) + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_to_boolean(self): + x = torch.randn(4, 2, device="cuda") + + with nvfuser_singleton_fusion(True): + def t(x): + return x.to(dtype=torch.bool) + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_view_copy_graph_guard(self): + x = torch.randn(4, 2, 3, device="cuda").permute([1, 2, 0]) + y = [4, 6] + + with nvfuser_singleton_fusion(True): + def t(x, y : List[int]): + t1 = x + 1.0 + t2 = t1 * 1.0 + out = t2.reshape(y) + return out.relu() + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x, y) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_view_copy_graph_guard_double_fusion(self): + x = torch.randn(2, 2, 5, device="cuda") + w = torch.randn(5, 5, device="cuda") + + with nvfuser_singleton_fusion(True): + def t(x, w): + o = x.view([4, x.size()[-1]]) + o = torch.matmul(o, w) + o = o.view([2, 2, o.size()[1]]) + return o + + t_jit = torch.jit.script(t) + for i in range(3): + jit_o = t_jit(x, w) + o = t(x, w) + self.assertEqual(jit_o, o) + # TODO: revert disabled aten::view + # self.assertGraphContainsExactly(t_jit.graph_for(x, w), FUSION_GUARD, 2, consider_subgraphs=True) + self.assertGraphContainsExactly(t_jit.graph_for(x, w), FUSION_GUARD, 0, consider_subgraphs=True) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_input_output_passthrough(self): + def t(t0, t1, t2): + mask = t1.to(dtype=torch.bool) + masked_input = torch.where(t0, mask, t2) + return masked_input, mask + + t_jit = torch.jit.script(t) + # stick to integers, this avoid the numerical difference due to our + # promotion + x = torch.randn(4, 4, device='cuda').to(dtype=torch.bool) + y = torch.randn(4, 4, device='cuda').to(dtype=torch.bool) + z = torch.tensor(1.0, device='cuda').to(dtype=torch.bool) + jit_o = t_jit(x, y, z) + jit_o = t_jit(x, y, z) + o = t(x, y, z) + for oo, jit_oo in zip(o, jit_o): + self.assertEqual(oo.dtype, jit_oo.dtype) + self.assertEqual(oo, jit_oo) + self.assertGraphContains(t_jit.graph_for(x, y, z), FUSION_GUARD) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != 
ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_pointwise_reference_tensor(self): + def t(input1, input2, scalar): + _unsafe_view = torch.ops.aten._unsafe_view(input1, [2, 4, 16]) + add_ = torch.ops.aten.add_(_unsafe_view, input2) + gelu_ = torch.ops.aten.gelu(add_) + view_ = torch.ops.aten.view(gelu_, [8, 16]) + mul_ = torch.ops.aten.mul(add_, scalar) + return [view_, mul_] + + x = torch.randn(8, 16, device="cuda") + bias = torch.randn(16, device="cuda") + scalar = torch.ones(torch.Size([]), device="cuda") + + t_jit = torch.jit.script(t) + for i in range(3): + jit_o = t_jit(x, bias, scalar) + o = t(x, bias, scalar) + self.assertEqual(jit_o, o) + self.assertGraphContains(t_jit.graph_for(x, bias, scalar), FUSION_GUARD) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + @unittest.skipIf(is_pre_volta(), "reduction not supported in pre volta device") + def test_native_batch_norm_backward(self): + grad_output = torch.randn(4, 2, 3, device="cuda") + input = torch.randn(4, 2, 3, device="cuda") + weight = torch.randn(2, device="cuda") + + r_m = torch.randn(2, device="cuda") + r_v = torch.randn(2, device="cuda").abs() + + save_mean = torch.randn(2, device="cuda") + save_invstd = torch.randn(2, device="cuda").abs() + + with nvfuser_singleton_fusion(True): + def t(grad_out, input, weight, r_m, r_v, save_mean, save_invstd, train: bool, eps: float, mask: List[bool]): + return torch.ops.aten.native_batch_norm_backward(grad_out, input, weight, r_m, r_v, save_mean, + save_invstd, train, eps, mask) + + t_jit = torch.jit.script(t) + for i in range(4): + jit_o = t_jit(grad_output, input, weight, r_m.clone(), r_v.clone(), + save_mean, save_invstd, True, 1e-5, [True, True, True]) + + ref_m = r_m.clone() + ref_v = r_v.clone() + jit_o = t_jit(grad_output, input, weight, r_m, r_v, save_mean, save_invstd, True, 1e-5, [True, True, True]) + o = t(grad_output, input, weight, ref_m, ref_v, save_mean, save_invstd, True, 1e-5, [True, True, True]) + for oo, jit_oo in zip(o, jit_o): + self.assertEqual(oo.dtype, jit_oo.dtype) + self.assertEqual(oo, jit_oo) + self.assertEqual(ref_m.dtype, r_m.dtype) + self.assertEqual(ref_m, r_m) + self.assertEqual(ref_v.dtype, r_v.dtype) + self.assertEqual(ref_v, r_v) + self.assertGraphContains(t_jit.graph_for(grad_output, input, weight, r_m.clone(), r_v.clone, save_mean, + save_invstd, True, 1e-5, [True, True, True]), FUSION_GUARD) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_contiguous_on_broadcasted(self): + x = torch.randn(4, 1, device="cuda") + y = torch.randn(4, 128, device="cuda") + + with nvfuser_singleton_fusion(True): + def t(x, y): + t1 = x.expand([4, 128]) + t2 = t1 * y + return t2 + + t_jit = torch.jit.script(t) + self._run_helper(t_jit, t, x, y) + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_skip_parser(self): + x = torch.randn(4, 12, device="cuda") + + with nvfuser_singleton_fusion(True): + def fn(x): + t1 = x + 1.0 + return t1.relu() + + fn_jit = torch.jit.script(fn) + self._run_helper(fn_jit, fn, x) + + # add node should have been merged into fusion + self.assertGraphContains(fn_jit.graph_for(x), FUSION_GUARD) + 
self.assertGraphContainsExactly(fn_jit.graph_for(x), 'aten::add', 0)
+
+            # flips skip parse for `aten::add`, following fusion should skip the
+            # add node
+            self.assertFalse(torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True))
+
+            def fn_1(x):
+                t1 = x + 2.0  # change const value so we'll not reuse plan
+                return t1.relu()
+
+            fn_1_jit = torch.jit.script(fn_1)
+            self._run_helper(fn_1_jit, fn_1, x)
+
+            # add node should not have been merged into fusion
+            self.assertGraphContains(fn_1_jit.graph_for(x), FUSION_GUARD)
+            self.assertGraphContainsExactly(fn_1_jit.graph_for(x), 'aten::add', 1)
+
+            # flips skip parse for `aten::add`, next fusion should fuse add node
+            self.assertTrue(torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True))
+
+            def fn_2(x):
+                t1 = x + 2.0  # change const value so we'll not reuse plan
+                return t1.relu()
+
+            fn_2_jit = torch.jit.script(fn_2)
+            self._run_helper(fn_2_jit, fn_2, x)
+
+            # add node should have been merged into fusion
+            self.assertGraphContains(fn_2_jit.graph_for(x), FUSION_GUARD)
+            self.assertGraphContainsExactly(fn_2_jit.graph_for(x), 'aten::add', 0)
+
+    @unittest.skipIf(not RUN_NVFUSER, "requires CUDA")
+    @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING,
+                     "Requires fusion optimization pass to be effective")
+    def test_cuda_fusion_guard(self):
+        old_guard = torch._C._jit_set_nvfuser_guard_mode(True)
+
+        class ConvModule(torch.nn.Module):
+            def __init__(self):
+                super().__init__()
+
+            def forward(self, x):
+                return x.sin().sigmoid()
+
+        mod = ConvModule().to(device="cuda")
+
+        inputs = [torch.randn(20, 16, 50, 100, device="cuda", requires_grad=True)]
+
+        def reduce_scalar(temp):
+            return temp.sum()
+
+        scripted = torch.jit.script(mod)
+        with torch.no_grad():
+            scripted(*inputs)
+        res = scripted(*inputs)
+        reduce_scalar(res).backward()
+        torch._C._jit_set_nvfuser_guard_mode(old_guard)
+
+    @unittest.skipIf(not RUN_NVFUSER, "requires CUDA")
+    @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING,
+                     "Requires fusion optimization pass to be effective")
+    def test_nvfuser_comparison_callbacks_with_fallback(self):
+        try:
+            fused_result = None
+            unfused_result = None
+            graph_ir = None
+
+            def callback(fused_outputs, unfused_outputs, graph_str):
+                nonlocal unfused_result
+                nonlocal fused_result
+                nonlocal graph_ir
+                unfused_result = unfused_outputs[-1]
+                fused_result = fused_outputs[-1]
+                graph_ir = graph_str
+            torch._C._jit_nvfuser_set_comparison_callback(True, callback)
+
+            def fn(x, y):
+                z = torch.add(x, y)
+                return torch.relu(z)
+
+            x = torch.rand((4, 4)).cuda() - 0.5
+            y = torch.rand((4, 4)).cuda() - 0.5
+
+            fn_s = torch.jit.script(fn)
+            fn_s(x, y)
+            fn_s(x, y)
+            fn_s(x, y)
+
+            expected = fn(x, y)
+
+            self.assertEqual(expected, fused_result)
+            self.assertEqual(expected, unfused_result)
+            FileCheck().check("aten::add").run(graph_ir)
+        finally:
+            torch._C._jit_nvfuser_clear_comparison_callback()
+
+    @unittest.skipIf(not RUN_NVFUSER, "requires CUDA")
+    @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING,
+                     "Requires fusion optimization pass to be effective")
+    def test_nvfuser_comparison_callbacks_without_fallback(self):
+        try:
+            fused_result = None
+            unfused_result = None
+            graph_ir = None
+
+            def callback(fused_outputs, unfused_outputs, graph_str):
+                nonlocal unfused_result
+                nonlocal fused_result
+                nonlocal graph_ir
+                if len(unfused_outputs) > 0:
+                    unfused_result = unfused_outputs[-1]
+                fused_result = fused_outputs[-1]
+                graph_ir = graph_str
+            torch._C._jit_nvfuser_set_comparison_callback(False, callback)
+
+            def fn(x, y):
+                z =
torch.add(x, y) + return torch.relu(z) + + x = torch.rand((4, 4)).cuda() - 0.5 + y = torch.rand((4, 4)).cuda() - 0.5 + + fn_s = torch.jit.script(fn) + fn_s(x, y) + fn_s(x, y) + fn_s(x, y) + + expected = fn(x, y) + + self.assertEqual(expected, fused_result) + self.assertEqual(None, unfused_result) + FileCheck().check("aten::add").run(graph_ir) + finally: + torch._C._jit_nvfuser_clear_comparison_callback() + + @unittest.skipIf(not RUN_NVFUSER, "requires NVFuser") + @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, + "Requires fusion optimization pass to be effective") + def test_cuda_fusion_guard_backward(self): + old_guard = torch._C._jit_set_nvfuser_guard_mode(True) + + inp = torch.randn(10, device="cuda", requires_grad=True) + grad = torch.randn(10, device="cuda") + + def f(x): + a = x.cos().cos() + return a + scripted = torch.jit.script(f) + + with profile(activities=[ProfilerActivity.CPU]) as prof: + for _ in range(5): + inp.grad = None + out = scripted(inp) + out.backward(grad) + + # check that we do not have fallback triggered + self.assertEqual(prof.events().table().find("fallback"), -1) + torch._C._jit_set_nvfuser_guard_mode(old_guard) + class TestPassManagerCudaFuser(JitTestCase): + def setUp(self): + super().setUp() + if RUN_NVFUSER: + self.is_enabled = torch._C._jit_set_nvfuser_enabled(False) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + def tearDown(self): + if RUN_NVFUSER: + torch._C._jit_set_nvfuser_enabled(self.is_enabled) + super().tearDown() + + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") @unittest.skipIf(GRAPH_EXECUTOR != ProfilingMode.PROFILING, "Requires fusion optimization pass to be effective") def test_context_manager_test(self): @@ -3698,7 +4438,7 @@ def t3(x, y): t_jit_3(x, y) self.assertGraphContainsExactly(t_jit_3.graph_for(x, y), FUSION_GUARD, 0) - @unittest.skipIf(not RUN_CUDA, "requires CUDA") + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") def test_register_fuser(self): self.assertFalse(torch._C._jit_set_nvfuser_enabled(True)) self.assertTrue(torch._C._jit_nvfuser_enabled()) @@ -3708,5 +4448,41 @@ def test_register_fuser(self): self.assertFalse(torch._C._jit_nvfuser_enabled()) +class TestCudaFuserOpInfo(JitCommonTestCase): + def setUp(self): + if RUN_NVFUSER: + self.cuda_fuser_options = CudaFuserTestOptions() + self.nvfuser_single_node_mode = torch._C._jit_set_nvfuser_single_node_mode(True) + + def tearDown(self): + if RUN_NVFUSER: + self.cuda_fuser_options.restore() + torch._C._jit_set_nvfuser_single_node_mode(self.nvfuser_single_node_mode) + + @slowTest + @unittest.skipIf(not RUN_NVFUSER, "requires CUDA") + @ops(op_db, dtypes=OpDTypes.supported) + def test_nvfuser_correctness(self, device, dtype, op): + variant_sample_pairs = get_traced_sample_variant_pairs(device, dtype, op) + + for variant, sample in variant_sample_pairs: + trace = create_traced_fn(self, variant) + ref = variant(*clone_inputs((sample.input, *sample.args)), **sample.kwargs) + + trace(*clone_inputs((sample.input, *sample.args)), **sample.kwargs) + + val = trace(*clone_inputs((sample.input, *sample.args)), **sample.kwargs) + + self.assertEqual(ref, val) + + # https://github.com/pytorch/pytorch/issues/35600 + # each torch.jit.trace adds state to the _python_cu compilation unit + # since this test traces a lot of functions, out-of-memory can occur + # if the CU is not cleared. 
+ torch.jit._state._python_cu.drop_all_functions() + +instantiate_device_type_tests(TestCudaFuserOpInfo, globals(), only_for=("cuda")) + + if __name__ == '__main__': run_tests() diff --git a/test/test_jit_fuser_te.py b/test/test_jit_fuser_te.py index ab2b85c6bb3ba0..ac0718269c1907 100644 --- a/test/test_jit_fuser_te.py +++ b/test/test_jit_fuser_te.py @@ -18,7 +18,7 @@ # inferred erroneously runs or skips # some tests torch._C._jit_set_profiling_executor(True) -torch._C._jit_set_profiling_mode(True) +torch._C._get_graph_executor_optimize(True) from torch.testing._internal.common_utils import run_tests, ProfilingMode, GRAPH_EXECUTOR, \ enable_profiling_mode_for_profiling_tests, slowTest @@ -82,6 +82,7 @@ def inline_fusion_groups(): class TestTEFuser(JitTestCase): def setUp(self): + super().setUp() self.tensorexpr_options = TensorExprTestOptions() # note: `self.dynamic_shapes` instatiated in specialization of class @@ -109,6 +110,7 @@ def setUp(self): def tearDown(self): self.tensorexpr_options.restore() torch._C._jit_set_fusion_strategy(self.old_fusion_strategy) + super().tearDown() def assertAllFused(self, graph, except_for=None): except_for = except_for if except_for is not None else set() @@ -1353,79 +1355,80 @@ def apply(fn): ) def test_unary_ops(self): - def apply(fn): - return lambda x: fn(x) - - unary_ops = [ - torch.lgamma, - torch.sigmoid, - torch.reciprocal, - torch.neg, - torch.relu, - F.relu6, - torch.log, - torch.log10, - torch.log1p, - torch.log2, - torch.exp, - torch.expm1, - torch.erf, - torch.erfc, - torch.cos, - torch.sin, - torch.tan, - torch.acos, - torch.asin, - torch.cosh, - torch.sinh, - torch.atan, - torch.tanh, - F.hardtanh, - F.hardsigmoid, - F.hardswish, - F.softplus, - torch.sqrt, - torch.rsqrt, - torch.abs, - torch.ceil, - torch.floor, - torch.round, - torch.trunc, - torch.frac, - # TODO: broken on ROCm? - # F.hardshrink, - F.leaky_relu, - lambda x: torch.threshold(x, 0, -10), - lambda x: torch.clamp(x, -10, 10), - ] - gpu_only = {torch.erf, torch.erfc} - sizes = [(1,), (2,), (4, 4)] - for dtype, op, device, size in product(self.dtypes, unary_ops, self.devices, sizes): - # TODO: Add back when https://github.com/pytorch/pytorch/issues/55905 is closed - if dtype in [torch.float16, torch.bfloat16] and device == "cpu": - continue - # todo - re-enable. fails with .500 - if dtype == torch.bfloat16 and op == torch.round: - continue - if op in gpu_only and device == "cpu": - continue - try: - x = self.data_for(dtype, device, size=size) - fn = apply(op) - ref = fn(x) - except Exception: - # If eager mode doesn't support a dtype/op/device combo, - # neither does the fuser. Catch everything to avoid needing to - # guess what errors might be thrown by eager. 
- continue - try: - t = torch.jit.trace(fn, (x,)) - torch.testing.assert_close(ref, t(x)) - self.assertAllFused(t.graph_for(x)) - except Exception as e: - raise RuntimeError( - " ".join(["Failed:", str(dtype), op.__name__, device, str(size)]) - ) + with torch._jit_internal._disable_emit_hooks(): + def apply(fn): + return lambda x: fn(x) + + unary_ops = [ + torch.lgamma, + torch.sigmoid, + torch.reciprocal, + torch.neg, + torch.relu, + F.relu6, + torch.log, + torch.log10, + torch.log1p, + torch.log2, + torch.exp, + torch.expm1, + torch.erf, + torch.erfc, + torch.cos, + torch.sin, + torch.tan, + torch.acos, + torch.asin, + torch.cosh, + torch.sinh, + torch.atan, + torch.tanh, + F.hardtanh, + F.hardsigmoid, + F.hardswish, + F.softplus, + torch.sqrt, + torch.rsqrt, + torch.abs, + torch.ceil, + torch.floor, + torch.round, + torch.trunc, + torch.frac, + # TODO: broken on ROCm? + # F.hardshrink, + F.leaky_relu, + lambda x: torch.threshold(x, 0, -10), + lambda x: torch.clamp(x, -10, 10), + ] + gpu_only = {torch.erf, torch.erfc} + sizes = [(1,), (2,), (4, 4)] + for dtype, op, device, size in product(self.dtypes, unary_ops, self.devices, sizes): + # TODO: Add back when https://github.com/pytorch/pytorch/issues/55905 is closed + if dtype in [torch.float16, torch.bfloat16] and device == "cpu": + continue + # todo - re-enable. fails with .500 + if dtype == torch.bfloat16 and op == torch.round: + continue + if op in gpu_only and device == "cpu": + continue + try: + x = self.data_for(dtype, device, size=size) + fn = apply(op) + ref = fn(x) + except Exception: + # If eager mode doesn't support a dtype/op/device combo, + # neither does the fuser. Catch everything to avoid needing to + # guess what errors might be thrown by eager. + continue + try: + t = torch.jit.trace(fn, (x,)) + torch.testing.assert_close(ref, t(x)) + self.assertAllFused(t.graph_for(x)) + except Exception as e: + raise RuntimeError( + " ".join(["Failed:", str(dtype), op.__name__, device, str(size)]) + ) def test_binary_ops(self): def apply(fn): @@ -1592,47 +1595,48 @@ def fn(x, y): ) def test_binary_tensor_scalar_ops(self): - def apply_with_scalar(fn, scalar): - return lambda x: fn(x, scalar) - - # FIXME: Fails in IR Eval: torch.int64 and_ cpu - binary_ops = [ - operator.__and__, - operator.__or__, - operator.__xor__, - torch.add, - torch.sub, - torch.mul, - torch.eq, - torch.ne, - torch.ge, - torch.lt, - torch.gt, - ] - devices = self.devices - # Maybe we should split this into separate tests to speed it up by - # only using scalar values relevant to particular ops - scalars = [1.5, 3, 0, -2.0, -1] - for dtype, op, device, scalar in product(self.dtypes, binary_ops, devices, scalars): - if dtype in [torch.float16, torch.bfloat16] and device == "cpu": - continue - try: - x = self.data_for(dtype, device) - fn = apply_with_scalar(op, scalar) - ref = fn(x) - except Exception: - # If eager mode doesn't support a dtype/op/device combo, - # neither does the fuser. Catch everything to avoid needing to - # guess what errors might be thrown by eager. 
- continue - try: - t = torch.jit.trace(fn, (x)) - self.assertEqual(ref, t(x)) - self.assertAllFused(t.graph_for(x)) - except Exception as e: - raise RuntimeError( - " ".join(["Failed:", str(dtype), op.__name__, device]) - ) + with torch._jit_internal._disable_emit_hooks(): + def apply_with_scalar(fn, scalar): + return lambda x: fn(x, scalar) + + # FIXME: Fails in IR Eval: torch.int64 and_ cpu + binary_ops = [ + operator.__and__, + operator.__or__, + operator.__xor__, + torch.add, + torch.sub, + torch.mul, + torch.eq, + torch.ne, + torch.ge, + torch.lt, + torch.gt, + ] + devices = self.devices + # Maybe we should split this into separate tests to speed it up by + # only using scalar values relevant to particular ops + scalars = [1.5, 3, 0, -2.0, -1] + for dtype, op, device, scalar in product(self.dtypes, binary_ops, devices, scalars): + if dtype in [torch.float16, torch.bfloat16] and device == "cpu": + continue + try: + x = self.data_for(dtype, device) + fn = apply_with_scalar(op, scalar) + ref = fn(x) + except Exception: + # If eager mode doesn't support a dtype/op/device combo, + # neither does the fuser. Catch everything to avoid needing to + # guess what errors might be thrown by eager. + continue + try: + t = torch.jit.trace(fn, (x)) + self.assertEqual(ref, t(x)) + self.assertAllFused(t.graph_for(x)) + except Exception as e: + raise RuntimeError( + " ".join(["Failed:", str(dtype), op.__name__, device]) + ) def test_binary_div_ops(self): def apply_with_scalar(fn, scalar): @@ -2473,12 +2477,21 @@ def get_name(op): l.append(op.variant_test_name) return '.'.join(l) -class TestNNCOpInfo(JitCommonTestCase): +# Purpose of this class is to allow super() calls. +# super() [with no arguments] fails, presumably because of how instantiate_device_type_tests works. +# super(TestNNCOpInfo, self) fails because TestNNCOpInfo gets deleted from global scope. +# super(JitCommonTestCase, self).fn() would skip JitCommonTestCase.fn() implementation +class TestNNCOpInfoParent(JitCommonTestCase): + pass + +class TestNNCOpInfo(TestNNCOpInfoParent): def setUp(self): + super(TestNNCOpInfoParent, self).setUp() self.tensorexpr_options = TensorExprTestOptions() def tearDown(self): self.tensorexpr_options.restore() + super(TestNNCOpInfoParent, self).tearDown() def te_compile(self, device, dtype, op): if op.name in skip_ops: @@ -2578,9 +2591,13 @@ def test_nnc_correctness(self, device, dtype, op): only_for = ("cpu", "cuda") instantiate_device_type_tests(TestNNCOpInfo, globals(), only_for=only_for) +# Purpose of this class is to allow super() calls. 
(See TestNNCOpInfoParent) +class TestLoopnestRandomizationParent(JitTestCase): + pass -class TestLoopnestRandomization(JitTestCase): +class TestLoopnestRandomization(TestLoopnestRandomizationParent): def setUp(self): + super(TestLoopnestRandomizationParent, self).setUp() self.old_cpu_fuser_state = torch._C._jit_can_fuse_on_cpu() self.old_must_use_cpu_state = torch._C._jit_get_te_must_use_llvm_cpu() self.old_gpu_fuser_state = torch._C._jit_can_fuse_on_gpu() @@ -2591,7 +2608,7 @@ def setUp(self): torch._C._jit_override_can_fuse_on_gpu(True) self.old_profiling_executor = torch._C._jit_set_profiling_executor(True) - self.old_profiling_mode = torch._C._jit_set_profiling_mode(True) + self.old_profiling_mode = torch._C._get_graph_executor_optimize(True) self.old_fusion_inlining = torch._C._debug_get_fusion_group_inlining() torch._C._debug_set_fusion_group_inlining(False) @@ -2608,7 +2625,7 @@ def setUp(self): def tearDown(self): torch._C._jit_set_profiling_executor(self.old_profiling_executor) - torch._C._jit_set_profiling_mode(self.old_profiling_mode) + torch._C._get_graph_executor_optimize(self.old_profiling_mode) torch._C._jit_override_can_fuse_on_gpu(self.old_gpu_fuser_state) torch._C._jit_override_can_fuse_on_cpu(self.old_cpu_fuser_state) @@ -2620,6 +2637,7 @@ def tearDown(self): # Set it back to 0. os.environ["PYTORCH_TENSOREXPR_RANDOM_TRANSFORM_SEED"] = "0" + super(TestLoopnestRandomizationParent, self).tearDown() @onlyCPU @unittest.skipIf(not LLVM_ENABLED, "Compiles with TensorExprKernel") diff --git a/test/test_linalg.py b/test/test_linalg.py index a7e9ccc2bddfb2..0fcc3006b4715f 100644 --- a/test/test_linalg.py +++ b/test/test_linalg.py @@ -25,8 +25,8 @@ onlyCUDA, skipCUDAVersionIn, skipMeta, skipCUDAIfNoCusolver) from torch.testing import make_tensor from torch.testing._internal.common_dtype import ( - all_types, floating_and_complex_types, get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, - get_all_fp_dtypes, + all_types, all_types_and_complex_and, floating_and_complex_types, integral_types, + floating_and_complex_types_and, floating_types_and, complex_types, ) from torch.testing._internal.common_cuda import SM53OrLater, tf32_on_and_off, CUDA11OrLater, CUDA9 from torch.distributions.binomial import Binomial @@ -101,7 +101,7 @@ def check(a_sizes_, b_sizes_): # Tests torch.outer, and its alias, torch.ger, vs. 
NumPy @precisionOverride({torch.bfloat16: 1e-1}) - @dtypes(*(get_all_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_outer(self, device, dtype): def run_test_case(a, b): if dtype == torch.bfloat16: @@ -264,7 +264,8 @@ def numpy_ref(a, b): else: # driver == 'gelsy' # QR based algorithm; setting the value too high might lead to non-unique solutions and flaky tests - rcond = 1e-4 + # so we skip this case + continue # specifying rcond value has no effect for gels driver so no need to run the tests again if driver == 'gels' and rcond is not None: @@ -744,7 +745,7 @@ def check(m, a, b, beta, alpha): check(m_scalar, a, b, beta, alpha) # test nans and infs are not propagated to the output when beta == 0 - float_and_complex_dtypes = get_all_fp_dtypes() + get_all_complex_dtypes() + float_and_complex_dtypes = floating_and_complex_types_and(torch.half, torch.bfloat16) if beta == 0 and dtype in float_and_complex_dtypes: m[0][10] = m[10][10] = m[20][20] = float('inf') m[1][10] = m[11][10] = m[21][20] = float('nan') @@ -757,7 +758,7 @@ def test_addr_bool(self, device, dtype): self._test_addr_vs_numpy(device, dtype, beta=False, alpha=False) self._test_addr_vs_numpy(device, dtype, beta=True, alpha=True) - @dtypes(*(get_all_int_dtypes())) + @dtypes(*integral_types()) def test_addr_integral(self, device, dtype): with self.assertRaisesRegex(RuntimeError, 'argument beta must not be a floating point number.'): @@ -778,7 +779,7 @@ def test_addr_integral(self, device, dtype): self._test_addr_vs_numpy(device, dtype, beta=2, alpha=2) @precisionOverride({torch.bfloat16: 1e-1}) - @dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes())) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) def test_addr_float_and_complex(self, device, dtype): with self.assertRaisesRegex(RuntimeError, 'Boolean beta only supported for Boolean results.'): @@ -791,11 +792,11 @@ def test_addr_float_and_complex(self, device, dtype): self._test_addr_vs_numpy(device, dtype, beta=0., alpha=2) # when beta is not zero self._test_addr_vs_numpy(device, dtype, beta=0.5, alpha=2) - if dtype in get_all_complex_dtypes(): + if dtype in complex_types(): self._test_addr_vs_numpy(device, dtype, beta=(0 + 0.1j), alpha=(0.2 - 0.2j)) - @dtypes(*itertools.product(get_all_dtypes(), - get_all_dtypes())) + @dtypes(*itertools.product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), + all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_outer_type_promotion(self, device, dtypes): a = torch.randn(5).to(device=device, dtype=dtypes[0]) b = torch.randn(5).to(device=device, dtype=dtypes[1]) @@ -805,7 +806,7 @@ def test_outer_type_promotion(self, device, dtypes): # don't use @dtypes decorator to avoid generating ~1700 tests per device def test_addr_type_promotion(self, device): - for dtypes0, dtypes1, dtypes2 in product(get_all_dtypes(), repeat=3): + for dtypes0, dtypes1, dtypes2 in product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), repeat=3): a = make_tensor((5,), device=device, dtype=dtypes0, low=-2, high=2) b = make_tensor((5,), device=device, dtype=dtypes1, low=-2, high=2) m = make_tensor((5, 5), device=device, dtype=dtypes2, low=-2, high=2) @@ -2936,7 +2937,7 @@ def run_test_singular_input(batch_dim, n): @skipCPUIfNoLapack @onlyNativeDeviceTypes # TODO: XLA doesn't raise exception @skipCUDAIfRocm - @skipCUDAVersionIn([(11, 3), (11, 5)]) # https://github.com/pytorch/pytorch/issues/57482 + @skipCUDAVersionIn([(11, 3), (11, 5), (11, 
6)]) # https://github.com/pytorch/pytorch/issues/57482 @dtypes(*floating_and_complex_types()) def test_inverse_errors_large(self, device, dtype): # Test batched inverse of singular matrices reports errors without crashing (gh-51930) @@ -3240,6 +3241,27 @@ def run_test_singular_input(batch_dim, n): with self.assertRaisesRegex(RuntimeError, "tensors to be on the same device"): torch.linalg.solve(a, b, out=out) + @skipCUDAIfNoMagma + @skipCPUIfNoLapack + @dtypes(*floating_and_complex_types()) + def test_solve_batched_broadcasting(self, device, dtype): + from numpy.linalg import solve + + def run_test(A_dims, B_dims): + A_matrix_size = A_dims[-1] + A_batch_dims = A_dims[:-2] + B, A = self.solve_test_helper(A_batch_dims + (A_matrix_size, A_matrix_size), B_dims, device, dtype) + actual = torch.linalg.solve(A, B) + expected = solve(A.cpu().numpy(), B.cpu().numpy()) + self.assertEqual(actual, expected) + + # test against numpy.linalg.solve + run_test((5, 5), (2, 0, 5, 3)) # broadcasting with 0 batch dim + run_test((2, 0, 5, 5), (5, 3)) # broadcasting with 0 batch dim + run_test((2, 1, 3, 4, 4), (4, 6)) # broadcasting B + run_test((4, 4), (2, 1, 3, 4, 2)) # broadcasting A + run_test((1, 3, 1, 4, 4), (2, 1, 3, 4, 5)) # broadcasting A & B + @skipCUDAIfNoMagma @skipCPUIfNoLapack @dtypes(*floating_and_complex_types()) @@ -3678,6 +3700,9 @@ def test_matrix_rank_atol_rtol(self, device, dtype): result = torch.linalg.matrix_rank(a, atol=tol_value, rtol=tol_value) self.assertEqual(result, 2) # there are 2 singular values above max(0.81, 1.5*0.81) + # CUDA 11.6 issue failure https://github.com/pytorch/pytorch/issues/75391 + @skipCUDAIf(torch.version.cuda is not None + and torch.version.cuda.split(".") == ["11", "6"], "There's a bug in CUDA 11.6") @skipCUDAIfNoMagma @skipCPUIfNoLapack @dtypes(*floating_and_complex_types()) @@ -4405,7 +4430,7 @@ def test_linalg_solve_triangular(self, device, dtype): @onlyCUDA @skipCUDAIfNoMagma # Magma needed for the PLU decomposition @skipCUDAIfRocm # There is a memory access bug in rocBLAS in the (non-batched) solve_triangular - @skipCUDAVersionIn([(11, 3), (11, 5)]) # Tracked in https://github.com/pytorch/pytorch/issues/70111 + @skipCUDAVersionIn([(11, 3), (11, 5), (11, 6)]) # Tracked in https://github.com/pytorch/pytorch/issues/70111 @dtypes(*floating_and_complex_types()) @precisionOverride({torch.float32: 1e-2, torch.complex64: 1e-2, torch.float64: 1e-8, torch.complex128: 1e-8}) @@ -5050,9 +5075,11 @@ def call_torch_fn(*args, **kwargs): A_LU, pivots = fn(torch.lu, (2, 0, 0)) self.assertEqual([(2, 0, 0), (2, 0)], [A_LU.shape, pivots.shape]) - @dtypesIfCUDA(torch.cfloat, torch.cdouble, - *get_all_fp_dtypes(include_half=not CUDA9, include_bfloat16=(CUDA11OrLater and SM53OrLater))) - @dtypes(*(set(get_all_dtypes()) - {torch.half, torch.bool})) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.half] if not CUDA9 else [], + *[torch.bfloat16] if CUDA11OrLater and SM53OrLater else [] + )) + @dtypes(*all_types_and_complex_and(torch.bfloat16)) def test_blas_alpha_beta_empty(self, device, dtype): # This test is disabled on CUDA 9 due to: # See: https://github.com/pytorch/pytorch/issues/31006 @@ -5088,7 +5115,7 @@ def test_blas_alpha_beta_empty(self, device, dtype): self.assertEqual(torch.full((2, 3), beta * value, dtype=dtype, device=device), torch.addmm(input=input, mat1=mat, mat2=mat2, alpha=alpha, beta=beta, out=out)) - @dtypes(*(get_all_complex_dtypes() + get_all_fp_dtypes())) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) def 
test_blas_nan_out(self, device, dtype): # These functions should work correctly with NaN filled outputs, # but need special handling, see [NOTE: cpu_zero] @@ -5674,7 +5701,7 @@ def tracker(worker): ---(input size: {:4}, eigenpairs:{:2}, units: relative error, maxiter={:4})--- '''.format(tol, eq_err, eq_err_general, iters1, eq_err_scipy, eq_err_general_scipy, iters2, m, k, niter)) - def _test_addmm_addmv(self, f, t, m, v, *, alpha=None, beta=None, transpose_out=False): + def _test_addmm_addmv(self, f, t, m, v, *, alpha=None, beta=None, transpose_out=False, activation=None): dtype = t.dtype numpy_dtype = dtype if dtype in {torch.bfloat16}: @@ -5693,15 +5720,19 @@ def _test_addmm_addmv(self, f, t, m, v, *, alpha=None, beta=None, transpose_out= res3 = alpha * (m.to(numpy_dtype).cpu().numpy() @ v.to(numpy_dtype).cpu().numpy()) if beta != 0: res3 += (beta * t).to(numpy_dtype).cpu().numpy() + if activation == "relu": + res3 = res3 * (res3 > 0) + else: + assert activation is None, f"unsupported activation {activation}" res3 = torch.from_numpy(res3).to(dtype) self.assertEqual(res1, res2) self.assertEqual(res1, res3) @precisionOverride({torch.bfloat16: 1e-0, torch.half: 5e-4, torch.float: 1e-4, torch.double: 1e-8, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) - @dtypesIfCUDA(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)), - include_half=(not TEST_WITH_ROCM))) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.bfloat16] if TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater) else [], + *[torch.half] if not TEST_WITH_ROCM else [])) @dtypes(torch.bfloat16, torch.float, torch.double, torch.cfloat, torch.cdouble) def test_addmv(self, device, dtype): # have to use torch.randn(...).to(bfloat16) instead of @@ -5736,7 +5767,8 @@ def test_addmv(self, device, dtype): for m, v in itertools.product(ms, vs): self._test_addmm_addmv(torch.addmv, t, m, v, beta=0) - @dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)))) + @dtypesIfCUDA(*floating_types_and(*[torch.bfloat16] if TEST_WITH_ROCM or (CUDA11OrLater and + SM53OrLater) else [])) @dtypes(torch.float, torch.double) def test_addmv_rowmajor_colmajor_incx_incy_lda(self, device, dtype): # tests (o, s)*(s). o is output size, s is summed size. 
@@ -5765,29 +5797,23 @@ def _test(row_major, incx, incy, lda_tail): for row_major, incx, incy, lda_tail in itertools.product((False, True), (1, 2), (1, 2), (0, 1)): _test(row_major, incx, incy, lda_tail) - @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6, - torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) - @dtypesIfCUDA(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)))) - @dtypes(*get_all_complex_dtypes(), *get_all_fp_dtypes()) - @tf32_on_and_off(0.05) - def test_addmm(self, device, dtype): + def _test_addmm_impl(self, func, activation, device, dtype): M = torch.randn(10, 25, device=device).to(dtype) m1 = torch.randn(10, 50, device=device).to(dtype) m2 = torch.randn(50, 25, device=device).to(dtype) - self._test_addmm_addmv(torch.addmm, M, m1, m2) + self._test_addmm_addmv(func, M, m1, m2, activation=activation) # Test 0-strided M = torch.randn(10, 1, device=device).to(dtype).expand(10, 25) m1 = torch.randn(10, 1, device=device).to(dtype).expand(10, 50) m2 = torch.randn(50, 25, device=device).to(dtype) - self._test_addmm_addmv(torch.addmm, M, m1, m2) + self._test_addmm_addmv(func, M, m1, m2, activation=activation) # Test beta=0, M=nan M = torch.full((10, 25), math.nan, device=device).to(dtype) m1 = torch.randn(10, 50, device=device).to(dtype) m2 = torch.randn(50, 25, device=device).to(dtype) - self._test_addmm_addmv(torch.addmm, M, m1, m2, beta=0) + self._test_addmm_addmv(func, M, m1, m2, beta=0, activation=activation) # Test transpose for t1, t2, t3, t4 in itertools.product([True, False], repeat=4): @@ -5799,10 +5825,28 @@ def maybe_transpose(cond, m): M = maybe_transpose(t1, torch.randn(10, 25, device=device).to(dtype)) m1 = maybe_transpose(t2, torch.randn(10, 50, device=device).to(dtype)) m2 = maybe_transpose(t3, torch.randn(50, 25, device=device).to(dtype)) - self._test_addmm_addmv(torch.addmm, M, m1, m2, transpose_out=t4) + self._test_addmm_addmv(func, M, m1, m2, transpose_out=t4, activation=activation) + + @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6, + torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.bfloat16] if TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater) else [])) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) + @tf32_on_and_off(0.05) + def test_addmm(self, device, dtype): + self._test_addmm_impl(torch.addmm, None, device, dtype) + + @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6, + torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) + @dtypesIfCUDA(*floating_types_and( + *[torch.bfloat16] if TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater) else [])) + @dtypes(*floating_types_and(torch.bfloat16)) + @tf32_on_and_off(0.05) + def test_addmm_activation(self, device, dtype): + self._test_addmm_impl(torch._addmm_activation, "relu", device, dtype) @dtypes(torch.float, torch.double) - @dtypesIfCUDA(*([torch.float, torch.double] + get_all_complex_dtypes())) + @dtypesIfCUDA(*floating_and_complex_types()) @tf32_on_and_off(0.005) def test_addmm_sizes(self, device, dtype): for m in [0, 1, 25]: @@ -5855,7 +5899,8 @@ def test_matmul_45724(self, device): @slowTest @onlyNativeDeviceTypes - @dtypes(torch.float32, torch.float64, torch.bfloat16, torch.int32, torch.int64, torch.cfloat, torch.cdouble) + # bfloat16 doesn't have sufficient precision to pass this test + @dtypes(torch.float32, torch.float64, torch.int32, 
torch.int64, torch.cfloat, torch.cdouble) @dtypesIfCUDA(torch.float32, torch.float64, torch.cfloat, torch.cdouble) @tf32_on_and_off(0.01) def test_mm(self, device, dtype): @@ -6000,7 +6045,7 @@ def test_strided_mm_bmm(self, device, dtype): @precisionOverride({torch.half: 0.05, torch.bfloat16: 0.05}) @skipCUDAIf(torch.version.cuda == "10.1", "flaky on CUDA 10.1") @onlyNativeDeviceTypes - @dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes()) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) @tf32_on_and_off(0.05) def test_bmm(self, device, dtype): if self.device_type == 'cuda' and dtype is torch.bfloat16 and CUDA11OrLater and not SM53OrLater: @@ -6112,7 +6157,7 @@ def _test_addbmm_baddbmm(self, func, b1, b2, ref, out_tensor): @precisionOverride({torch.half: 0.05, torch.bfloat16: 0.05}) @onlyNativeDeviceTypes - @dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes()) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) @tf32_on_and_off(0.05) def test_addbmm(self, device, dtype): if self.device_type == 'cuda' and dtype is torch.bfloat16 and CUDA11OrLater and not SM53OrLater: @@ -6185,7 +6230,7 @@ def generate_tensor(): @precisionOverride({torch.half: 0.1, torch.bfloat16: 0.5}) @onlyNativeDeviceTypes - @dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes()) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) @tf32_on_and_off(0.05) def test_baddbmm(self, device, dtype): if self.device_type == 'cuda' and dtype is torch.bfloat16 and CUDA11OrLater and not SM53OrLater: diff --git a/test/test_logging.py b/test/test_logging.py index 4bb057fd157a8b..01fdd3f8edd838 100644 --- a/test/test_logging.py +++ b/test/test_logging.py @@ -12,10 +12,10 @@ def testApiUsage(self): subprocess """ s = TestCase.runWithPytorchAPIUsageStderr("import torch") - self.assertRegexpMatches(s, "PYTORCH_API_USAGE.*import") + self.assertRegex(s, "PYTORCH_API_USAGE.*import") # import the shared library directly - it triggers static init but doesn't call anything s = TestCase.runWithPytorchAPIUsageStderr("from ctypes import CDLL; CDLL('{}')".format(torch._C.__file__)) - self.assertNotRegexpMatches(s, "PYTORCH_API_USAGE") + self.assertNotRegex(s, "PYTORCH_API_USAGE") if __name__ == '__main__': diff --git a/test/test_masked.py b/test/test_masked.py index fa192086c89a16..1b9b4b075f7c09 100644 --- a/test/test_masked.py +++ b/test/test_masked.py @@ -10,11 +10,11 @@ import unittest from torch.testing._internal.common_utils import \ - (TestCase, suppress_warnings, _TestParametrizer) + (TestCase, parametrize, suppress_warnings, _TestParametrizer) from torch.testing._internal.common_methods_invocations import \ (op_db, SampleInput) from torch.testing._internal.common_device_type import \ - (instantiate_device_type_tests, ops, onlyNativeDeviceTypes) + (instantiate_device_type_tests, ops, onlyNativeDeviceTypes, precisionOverride) def apply_masked_reduction_along_dim(op, input, *args, **kwargs): @@ -113,7 +113,10 @@ def apply_masked_reduction_along_dim(op, input, *args, **kwargs): output = input.new_full(shape, float('nan') if dtype.is_floating_point else 0, dtype=dtype) # apply op to all elementary slices: - inpmask = torch._masked._input_mask(input, mask=mask) + if mask is None: + inpmask = input.new_ones([], dtype=torch.bool).expand(input.shape) + else: + inpmask = torch._masked._input_mask(input, mask=mask) for s in itertools.product(*ranges): # data of an elementary slice is 1D sequence and has only # masked-in elements: @@ -142,7 +145,10 @@ def 
apply_masked_normalization_along_dim(op, input, *args, **kwargs): dim = args[dim_pos] args0 = args[:dim_pos] + (0,) + args[dim_pos + 1:] output = torch.zeros_like(input, dtype=dtype) - inpmask = torch._masked._input_mask(input, mask=mask) + if mask is None: + inpmask = input.new_ones([], dtype=torch.bool).expand(input.shape) + else: + inpmask = torch._masked._input_mask(input, mask=mask) dim_ = dim % input.ndim left_ranges = tuple(map(range, input.shape[:dim_])) right_ranges = tuple(map(range, input.shape[dim_ + 1:])) @@ -155,6 +161,7 @@ def apply_masked_normalization_along_dim(op, input, *args, **kwargs): reference_functions = dict( norm=lambda *args, **kwargs: apply_masked_reduction_along_dim(torch.linalg.vector_norm, *args, **dict(kwargs, dim_position=1)), var=lambda *args, **kwargs: apply_masked_reduction_along_dim(torch.var, *args, **dict(kwargs, dim_position=0)), + std=lambda *args, **kwargs: apply_masked_reduction_along_dim(torch.std, *args, **dict(kwargs, dim_position=0)), softmax=lambda *args, **kwargs: apply_masked_normalization_along_dim(torch.softmax, *args, **kwargs), log_softmax=lambda *args, **kwargs: apply_masked_normalization_along_dim(torch.log_softmax, *args, **kwargs), softmin=lambda *args, **kwargs: apply_masked_normalization_along_dim(torch.nn.functional.softmin, *args, **kwargs), @@ -236,14 +243,7 @@ def sample_inputs_generator(): kwargs=sample_input_kwargs) if layout != torch.sparse_coo and op.supports_sparse: sample_input_kwargs = sample_input.kwargs.copy() - if mask.layout == torch.sparse_csr: - # TODO: remove this if-block when sparse csr supports to_sparse - mask = torch.sparse_coo_tensor( - torch._convert_indices_from_csr_to_coo(mask.crow_indices(), mask.col_indices()), - mask.values(), mask.shape)._coalesced_(True) - sample_input_kwargs.update(mask=mask) - else: - sample_input_kwargs.update(mask=mask.to_sparse()) + sample_input_kwargs.update(mask=mask.to_sparse()) yield SampleInput(sample_input.input.clone(), args=sample_input.args, kwargs=sample_input_kwargs) @@ -264,31 +264,37 @@ class TestMasked(TestCase): def assertEqualMasked(self, actual, expected, mask): strided = to_strided(actual) - strided = torch.where(mask, strided, strided.new_zeros([])) - expected = torch.where(mask, expected, expected.new_zeros([])) + if mask is not None: + strided = torch.where(mask, strided, strided.new_zeros([])) + expected = torch.where(mask, expected, expected.new_zeros([])) self.assertEqual(strided, expected, exact_device=False) @onlyNativeDeviceTypes @suppress_warnings @ops(masked_ops_with_references) + @precisionOverride({torch.bfloat16: 5e-4, torch.float16: 5e-4}) def test_reference_masked(self, device, dtype, op): op_name = op.name.rsplit('.', 1)[-1] ref_op = reference_functions[op_name] sample_inputs = op.sample_inputs(device, dtype) for sample_input in sample_inputs: t_inp, t_args, t_kwargs = sample_input.input, sample_input.args, sample_input.kwargs - if op_name == 'var' and not (t_inp.dtype.is_floating_point or t_inp.dtype.is_complex): - # torch.var does not support integer inputs + if op_name in {'var', 'std'} and not (t_inp.dtype.is_floating_point or t_inp.dtype.is_complex): + # torch.var/torch.std does not support integer inputs continue actual = op.op(t_inp, *t_args, **t_kwargs) expected = ref_op(t_inp, *t_args, **t_kwargs) - outmask = torch._masked._output_mask(op.op, t_inp, *t_args, **t_kwargs) + if t_kwargs.get('mask') is None: + outmask = None + else: + outmask = torch._masked._output_mask(op.op, t_inp, *t_args, **t_kwargs) self.assertEqualMasked(actual, 
expected, outmask) @mask_layouts() @onlyNativeDeviceTypes @suppress_warnings @ops(masked_ops_with_non_strided_support) + @precisionOverride({torch.bfloat16: 5e-3, torch.float16: 5e-3}) def test_mask_layout(self, layout, device, dtype, op, sample_inputs): for sample in sample_inputs: t_inp, t_args, t_kwargs = sample.input, sample.args, sample.kwargs @@ -300,9 +306,124 @@ def test_mask_layout(self, layout, device, dtype, op, sample_inputs): # op(inp, mask).to_dense() == op(inp.to_dense(), mask.to_dense()) at outmask # r_inp, r_args, r_kwargs = to_strided((t_inp, t_args, t_kwargs)) - outmask = torch._masked._output_mask(op.op, r_inp, *r_args, **r_kwargs) + if r_kwargs.get('mask') is None: + outmask = None + else: + outmask = torch._masked._output_mask(op.op, r_inp, *r_args, **r_kwargs) expected = op.op(r_inp, *r_args, **r_kwargs) self.assertEqualMasked(actual, expected, outmask) + @parametrize("sparse_kind,fill_value", [('coo', 0), ('hybrid_coo', 0), + ('coo', 123), ('hybrid_coo', 123), + ('csr', 0), ('csr', 123)], + name_fn=lambda sparse_kind, fill_value: f'{sparse_kind}_fill_value_{fill_value}') + def test_where(self, sparse_kind, fill_value): + + is_hybrid = False + if sparse_kind == 'coo': + + def to_sparse(dense): + return dense.to_sparse(2) + + def set_values(sparse, index, value): + sparse._values()[index] = value + + elif sparse_kind == 'hybrid_coo': + is_hybrid = True + + def to_sparse(dense): + return dense.to_sparse(1) + + def set_values(sparse, index, value): + sparse._values()[index] = value + + elif sparse_kind == 'csr': + + def to_sparse(dense): + return dense.to_sparse_csr() + + def set_values(sparse, index, value): + sparse.values()[index] = value + + else: + assert 0, sparse_kind + + mask = torch.tensor([[1, 0, 1, 0, 0], + [1, 1, 1, 1, 0], + [0, 1, 0, 1, 0], + [0, 0, 0, 0, 0], + [0, 0, 1, 1, 0], + [1, 1, 0, 0, 0]]).to(dtype=bool) + mask = to_sparse(mask) + # make some specified mask elements as explicit masked-out masks: + if is_hybrid: + set_values(mask, (1, 1), False) + set_values(mask, (-2, -2), False) + else: + set_values(mask, 3, False) + set_values(mask, -3, False) + + input = torch.tensor([[1, 0, 0, 0, -1], + [2, 3, 0, 0, -2], + [0, 4, 5, 0, -3], + [0, 0, 6, 7, 0], + [0, 8, 9, 0, -3], + [10, 11, 0, 0, -5]]) + input = to_sparse(input) + # make specified input elements have zero values: + if is_hybrid: + set_values(input, (1, 1), 0) + set_values(input, (-1, 0), 0) + F = fill_value + else: + set_values(input, 3, 0) + set_values(input, -3, 0) + F = 0 + + # expected where result: + Z = 99 + # Z value corresponds to masked-in elements that are not + # specified in the input and it will be replaced with a zero + tmp = torch.tensor([[1, F, Z, F, F], + [2, F, Z, Z, F], + [F, 4, F, Z, F], + [0, 0, 0, 0, 0], + [F, F, 9, F, F], + [Z, 11, F, F, F]]) + tmp = to_sparse(tmp) + + + sparse = torch._masked._where(mask, input, + torch.tensor(fill_value, dtype=input.dtype, device=input.device)) + + if tmp.layout == torch.sparse_coo: + expected_sparse = torch.sparse_coo_tensor( + tmp.indices(), + torch.where(tmp.values() != Z, tmp.values(), tmp.values().new_full([], 0)), + input.shape) + outmask = torch.sparse_coo_tensor(sparse.indices(), + sparse.values().new_full(sparse.values().shape, 1).to(dtype=bool), + sparse.shape)._coalesced_(True) + elif tmp.layout == torch.sparse_csr: + expected_sparse = torch.sparse_csr_tensor( + tmp.crow_indices(), + tmp.col_indices(), + torch.where(tmp.values() != Z, tmp.values(), tmp.values().new_full([], 0)), + input.shape) + outmask = 
torch.sparse_csr_tensor(sparse.crow_indices(), sparse.col_indices(), + sparse.values().new_full(sparse.values().shape, 1).to(dtype=bool), + sparse.shape) + else: + assert 0 + + self.assertEqual(sparse, expected_sparse) + + # check invariance: + # torch.where(mask.to_dense(), input.to_dense(), fill_value) + # == where(mask, input, fill_value).to_dense(fill_value) + expected = torch.where(mask.to_dense(), input.to_dense(), torch.full(input.shape, F)) + dense = torch.where(outmask.to_dense(), sparse.to_dense(), torch.full(sparse.shape, F)) + self.assertEqual(dense, expected) + instantiate_device_type_tests(TestMasked, globals(), except_for='meta') diff --git a/test/test_module_init.py b/test/test_module_init.py index 589db4b71622e3..fa0ac8f79dcee1 100644 --- a/test/test_module_init.py +++ b/test/test_module_init.py @@ -166,6 +166,9 @@ def build_constructor_arg_db(): torch.nn.UpsamplingBilinear2d: ((), {}), torch.nn.UpsamplingNearest2d: ((), {}), torch.nn.ZeroPad2d: ((0,), {}), + torch.nn.qat.Conv1d: ((3, 3, 3), { + 'qconfig': torch.ao.quantization.default_qconfig, + }), torch.nn.qat.Conv2d: ((3, 3, 3), { 'qconfig': torch.ao.quantization.default_qconfig, }), @@ -206,7 +209,7 @@ def build_constructor_arg_db(): torch.nn.quantized.EmbeddingBag: ((10, 3), { 'factory_kwargs': {}, }), - torch.nn.quantized.GroupNorm: ((2, 3, torch.nn.Parameter(torch.tensor(2.)), + torch.nn.quantized.GroupNorm: ((2, 4, torch.nn.Parameter(torch.tensor(2.)), torch.nn.Parameter(torch.tensor(2.)), 0.1, 0), {}), torch.nn.quantized.Hardswish: ((0.1, 0,), {}), torch.nn.quantized.InstanceNorm1d: ((2, torch.nn.Parameter(torch.tensor(2.)), diff --git a/test/test_multiprocessing.py b/test/test_multiprocessing.py index cdadf6aec001b5..515121586fcfd5 100644 --- a/test/test_multiprocessing.py +++ b/test/test_multiprocessing.py @@ -258,7 +258,7 @@ def test_fill(): self.assertTrue(e.is_set()) self.assertTrue(data[0].eq(4).all()) self.assertTrue(data[1].eq(4).all()) - p.join(1) + p.join(100) self.assertFalse(p.is_alive()) def test_receive(): @@ -280,7 +280,7 @@ def test_receive(): # collect them properly del t1, t2 e.set() - p.join(1) + p.join(100) self.assertFalse(p.is_alive()) with leak_checker(self) as lc: @@ -587,6 +587,7 @@ def _test_event_multiprocess_child(event, p2c, c2p): event.synchronize() c2p.put(1) # notify parent synchronization is done + @unittest.skip("Skipped as this test fails on ROCm") @unittest.skipIf(NO_MULTIPROCESSING_SPAWN, "Disabled for environments that \ don't support multiprocessing with spawn start method") @unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available') @@ -645,6 +646,7 @@ def _test_event_handle_importer_consumer(handle, p2c, c2p): c2p.put(1) # nofity synchronization is done in child p2c.get() # wait for parent to finish before destructing child event + @unittest.skip("Skipped as this test fails on ROCm") @unittest.skipIf(NO_MULTIPROCESSING_SPAWN, "Disabled for environments that \ don't support multiprocessing with spawn start method") @unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available') @@ -684,6 +686,7 @@ def _test_event_handle_exporter_consumer(handle, p2c, c2p): # destructing e1 p2c.get() + @unittest.skip("Skipped as this test fails on ROCm") @unittest.skipIf(NO_MULTIPROCESSING_SPAWN, "Disabled for environments that \ don't support multiprocessing with spawn start method") @unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available') @@ -753,7 +756,7 @@ def hook(*unused): self.assertEqual(var.data, torch.ones(5, 5, device=device)) self.assertEqual(var.grad.data, torch.ones(5, 5, 
device=device) * 4) - p.join(1) + p.join(100) self.assertFalse(p.is_alive()) # Check sharing a cudaMalloc allocation with different types of storage. diff --git a/test/test_namedtuple_return_api.py b/test/test_namedtuple_return_api.py index ddc23e45f276e5..c0d8c6489aa8f7 100644 --- a/test/test_namedtuple_return_api.py +++ b/test/test_namedtuple_return_api.py @@ -18,7 +18,8 @@ 'triangular_solve', 'cummax', 'cummin', 'linalg_eigh', "_unpack_dual", 'linalg_qr', 'linalg_svd', '_linalg_svd', 'linalg_slogdet', 'fake_quantize_per_tensor_affine_cachemask', 'fake_quantize_per_channel_affine_cachemask', 'linalg_lstsq', 'linalg_eig', 'linalg_cholesky_ex', - 'frexp', 'lu_unpack', 'histogram', '_fake_quantize_per_tensor_affine_cachemask_tensor_qparams', + 'frexp', 'lu_unpack', 'histogram', 'histogramdd', + '_fake_quantize_per_tensor_affine_cachemask_tensor_qparams', '_fused_moving_avg_obs_fq_helper', 'linalg_lu_factor', 'linalg_lu_factor_ex', '_det_lu_based_helper', '_lu_with_info', @@ -100,6 +101,7 @@ def test_namedtuple_return(self): input=(torch.tensor([3, 2, 1, 4, 5], dtype=torch.int32), True, True), names=('P', 'L', 'U'), hasout=True), op(operators=['histogram'], input=(1,), names=('hist', 'bin_edges'), hasout=True), + op(operators=['histogramdd'], input=(1,), names=('hist', 'bin_edges'), hasout=False), op(operators=['_fake_quantize_per_tensor_affine_cachemask_tensor_qparams'], input=(torch.tensor([1.0]), torch.tensor([0], dtype=torch.int), torch.tensor([1]), 0, 255), names=('output', 'mask',), hasout=False), diff --git a/test/test_nestedtensor.py b/test/test_nestedtensor.py index cf868f2761794c..eeaa51b24d66ce 100644 --- a/test/test_nestedtensor.py +++ b/test/test_nestedtensor.py @@ -133,14 +133,15 @@ def test_numel(self): RuntimeError, "numel is disabled", lambda: a1.numel(), ) - @unittest.skipIf(IS_FBCODE, "size is not virtual in fbcode.") @torch.inference_mode() def test_size(self): for constructor in _iter_constructors(): a1 = constructor([]) self.assertRaisesRegex( RuntimeError, - "NestedTensorImpl doesn't support sizes", + "Tensors of type NestedTensorImpl do not have sizes" + if IS_FBCODE + else "NestedTensorImpl doesn't support sizes", lambda: a1.size(), ) @@ -182,3 +183,12 @@ def test_repr_string(self): ) self.assertEqual(str(a), expected) self.assertEqual(repr(a), expected) + + @torch.inference_mode() + def test_activations(self): + for func in (torch.nn.functional.relu, torch.nn.functional.relu_, torch.nn.functional.gelu, torch._C._nn.gelu_): + t = torch.tensor([-1, 0, 1], dtype=torch.float) + nt = nested_tensor([t]) + nested_result = func(nt) + self.assertTrue(nested_result.is_nested) + self.assertEqual(func(t), nested_result.unbind()[0]) diff --git a/test/test_nn.py b/test/test_nn.py index 30c4d136e1b907..809cd8b455191d 100644 --- a/test/test_nn.py +++ b/test/test_nn.py @@ -35,12 +35,13 @@ from torch.nn import Parameter from torch.nn.parameter import UninitializedParameter, UninitializedBuffer from torch.nn.parallel._functions import Broadcast -from torch.testing._internal.common_dtype import integral_types, get_all_fp_dtypes, get_all_math_dtypes +from torch.testing._internal.common_dtype import integral_types, floating_types_and, get_all_math_dtypes, \ + floating_and_complex_types_and from torch.testing._internal.common_utils import freeze_rng_state, run_tests, TestCase, skipIfNoLapack, skipIfRocm, \ skipIfRocmVersionLessThan, skipIfNotMiopenSuggestNHWC, TEST_NUMPY, TEST_SCIPY, TEST_WITH_ROCM, download_file, \ get_function_arglist, load_tests, \ suppress_warnings, TemporaryFileName, 
TEST_WITH_UBSAN, IS_PPC, \ - parametrize as parametrize_test, subtest, instantiate_parametrized_tests + parametrize as parametrize_test, subtest, instantiate_parametrized_tests, set_default_dtype from torch.testing._internal.common_cuda import TEST_CUDA, TEST_MULTIGPU, TEST_CUDNN, TEST_CUDNN_VERSION from torch.testing._internal.common_nn import NNTestCase, NewModuleTest, CriterionTest, \ module_tests, criterion_tests, loss_reference_fns, \ @@ -53,6 +54,7 @@ from torch.nn import MultiheadAttention from hypothesis import given +from torch.testing import make_tensor import torch.testing._internal.hypothesis_utils as hu from torch.testing._internal.common_utils import _assertGradAndGradgradChecks, gradcheck, gradgradcheck, \ GRADCHECK_NONDET_TOL @@ -69,6 +71,7 @@ if TEST_SCIPY: from scipy import stats + import scipy.signal import scipy.ndimage if TEST_NUMPY: @@ -892,7 +895,7 @@ def test_no_grad(self): self.assertRaises(RuntimeError, lambda: output2.backward(torch.ones(1, 5, 10, 10))) def test_invalid_conv1d(self): - for dtype in [torch.bfloat16, torch.float, torch.double]: + for dtype in [torch.bfloat16, torch.float, torch.double, torch.cfloat, torch.cdouble]: module = nn.Conv1d(in_channels=3, out_channels=33, kernel_size=10, stride=1, bias=True).to(dtype) input = torch.randn(1, 3, 4).to(dtype) with self.assertRaisesRegex(RuntimeError, @@ -907,30 +910,32 @@ def test_invalid_conv1d(self): module(input) def test_mismatch_shape_conv2d(self): - x = torch.randn(1, 10, 1, 28, 28) - w = torch.randn(6, 1, 5, 5) + for dtype in (torch.float, torch.cfloat): + x = torch.randn(1, 10, 1, 28, 28, dtype=dtype) + w = torch.randn(6, 1, 5, 5, dtype=dtype) - with self.assertRaisesRegex(RuntimeError, - r'Expected 3D \(unbatched\) or 4D \(batched\) input to conv2d, but got ' + - r'input of size: \[1, 10, 1, 28, 28\]'): + with self.assertRaisesRegex(RuntimeError, + r'Expected 3D \(unbatched\) or 4D \(batched\) input to conv2d, but got ' + + r'input of size: \[1, 10, 1, 28, 28\]'): - F.conv2d(x, w) + F.conv2d(x, w) def test_conv2d_discontiguous_weight(self): - # Test for https://github.com/pytorch/pytorch/issues/55781 - x = torch.ones(64, 16, 16, 16) - weight = torch.arange(0, 1.0, 1 / 2.0 ** 10).reshape(32, 16, 1, 2)[:, :, :, ::2] - self.assertFalse(weight.is_contiguous()) - y = torch.nn.functional.conv2d(x, weight, None) - if torch.backends.mkldnn.is_available(): - # Disable MKLDNN explicitly, so that either NNPACK or THCNN will be used - with torch.backends.mkldnn.flags(enabled=False): - y_ = torch.nn.functional.conv2d(x, weight, None) - self.assertEqual(y, y_) - self.assertEqual(y.sum(), 4186112.) + for dtype in (torch.float, torch.cfloat): + # Test for https://github.com/pytorch/pytorch/issues/55781 + x = torch.ones(64, 16, 16, 16, dtype=dtype) + weight = torch.arange(0, 1.0, 1 / 2.0 ** 10).reshape(32, 16, 1, 2).to(dtype)[:, :, :, ::2] + self.assertFalse(weight.is_contiguous()) + y = torch.nn.functional.conv2d(x, weight, None) + if torch.backends.mkldnn.is_available(): + # Disable MKLDNN explicitly, so that either NNPACK or THCNN will be used + with torch.backends.mkldnn.flags(enabled=False): + y_ = torch.nn.functional.conv2d(x, weight, None) + self.assertEqual(y, y_) + self.assertEqual(y.sum(), 4186112.) 
def test_invalid_conv2d(self): - for dtype in [torch.bfloat16, torch.float, torch.double]: + for dtype in [torch.bfloat16, torch.float, torch.double, torch.cfloat, torch.cdouble]: module = torch.nn.Conv2d(1, 1, kernel_size=3, dilation=2, stride=2).to(dtype) input = torch.empty(1, 1, 4, 4).to(dtype) self.assertRaises(RuntimeError, lambda: module(input)) @@ -955,7 +960,7 @@ def test_invalid_conv2d(self): module(input) def test_invalid_conv3d(self): - for dtype in [torch.bfloat16, torch.float, torch.double]: + for dtype in [torch.bfloat16, torch.float, torch.double, torch.cfloat, torch.cdouble]: module = torch.nn.Conv3d(1, 1, kernel_size=3, dilation=2, stride=2).to(dtype) input = torch.empty(1, 1, 4, 4, 4).to(dtype) self.assertRaises(RuntimeError, lambda: module(input)) @@ -3169,6 +3174,40 @@ def forward(self, X): Y = model.weight self.assertEqual(id(X), id(Y)) + # FIXME: Rewrite this test using functions not depending on LAPACK + # and remove the `@skipIfNoLapack` (see #70995) + @skipIfNoLapack + def test_caching_parametrization_with_transfer_parametrizations_and_params(self): + r"""Test that transferring parametrizations doesn't cause issues with caching""" + class Skew(nn.Module): + def forward(self, X): + X = X.tril(-1) + return X - X.T + + class Orthogonal(nn.Module): + def forward(self, X): + Id = torch.eye(X.size(0), device=X.device) + return torch.linalg.solve(Id + X, Id - X) + + model = nn.Linear(5, 5) + parametrize.register_parametrization(model, "weight", Skew()) + parametrize.register_parametrization(model, "weight", Orthogonal()) + + to_model = nn.Linear(5, 5) + parametrize.transfer_parametrizations_and_params(model, to_model) + + with parametrize.cached(): + X = model.weight + Y = model.weight + self.assertEqual(id(X), id(Y)) + + A = to_model.weight + B = to_model.weight + self.assertEqual(id(A), id(B)) + + # test that the results are distinct objects for each module + self.assertNotEqual(id(A), id(X)) + def test_parametrization_same_training_mode(self): r"""Test training mode updated on parametrization registration""" class Identity(nn.Module): @@ -3184,6 +3223,220 @@ def forward(self, X): self.assertTrue(module.parametrizations.weight[0].training) self.assertTrue(module.parametrizations.weight[1].training) + def test_type_before_parametrizations(self): + r"""Test that type_before_parametrizations always retrieves original type""" + + class Identity(nn.Module): + def forward(self, X): + return X + + model = nn.Linear(5, 5) + original_type = type(model) + self.assertTrue( + parametrize.type_before_parametrizations(model) == original_type + ) + parametrize.register_parametrization(model, "weight", Identity()) + self.assertTrue( + parametrize.type_before_parametrizations(model) == original_type + ) + + def test_transfer_parametrizations_and_params(self): + r"""Test that all parametrizations and their associated parameters are transferred.""" + + class AddOne(nn.Module): + def forward(self, x): + return x + 1.0 + + class Double(nn.Module): + def forward(self, x): + return 2.0 * x + + def right_inverse(self, x): + return 0.5 * x + + class MinusOne(nn.Module): + def forward(self, x): + return x - 1.0 + + model = nn.Linear(5, 5) + parametrize.register_parametrization(model, "weight", AddOne()) + parametrize.register_parametrization(model, "weight", Double()) + parametrize.register_parametrization(model, "weight", MinusOne()) + hold_weight = model.weight + + to_model = nn.qat.Linear( + 5, 5, qconfig=torch.ao.quantization.get_default_qconfig() + ) + 
parametrize.transfer_parametrizations_and_params(model, to_model) + + # checks that final and original value are correct and the to_model is parametrized + self.assertTrue(torch.nn.utils.parametrize.is_parametrized(to_model, "weight")) + self.assertEqual(model.weight, to_model.weight) + self.assertEqual( + model.parametrizations.weight.original, + to_model.parametrizations.weight.original, + ) + + # check that the transfer didn't affect the original value + self.assertEqual(hold_weight, model.weight) + + # testing that changes to one set of parametrizations do not affect the other + parametrize.remove_parametrizations(to_model, "weight") + self.assertFalse(torch.nn.utils.parametrize.is_parametrized(to_model, "weight")) + self.assertTrue(torch.nn.utils.parametrize.is_parametrized(model, "weight")) + + # also test that parameters that don't exist in to_model get transferred + model.test_param = Parameter(torch.randn(5, 5)) + + self.assertTrue(not hasattr(to_model, "test_param")) + parametrize.register_parametrization(model, "test_param", Double()) + hold_test_param = model.test_param + parametrize.transfer_parametrizations_and_params(model, to_model, "test_param") + + # check that previously missing params got transferred correctly + self.assertEqual(model.test_param, to_model.test_param) + self.assertEqual( + model.parametrizations.test_param.original, + to_model.parametrizations.test_param.original, + ) + + # check that the new transfer didn't change the value for the from_module + self.assertEqual(hold_test_param, model.test_param) + + def test_transfer_parametrizations_and_params_right_inverse(self): + r"""Test that all parametrizations and their associated parameters are transferred.""" + + class Double(nn.Module): + def forward(self, x): + return 2.0 * x + + def right_inverse(self, x): + return 0.5 * x + + model = nn.Linear(5, 5) + parametrize.register_parametrization(model, "weight", Double()) + hold_weight = model.weight + + to_model = nn.qat.Linear( + 5, 5, qconfig=torch.ao.quantization.get_default_qconfig() + ) + parametrize.transfer_parametrizations_and_params(model, to_model) + + # check that transfer occurs successfully + self.assertEqual(model.weight, to_model.weight) + self.assertEqual( + model.parametrizations.weight.original, + to_model.parametrizations.weight.original, + ) + + # check that transfer doesn't affect the from_model weight + self.assertEqual(hold_weight, model.weight) + + def test_transfer_parametrizations_and_params_single_param(self): + r"""Test that all parametrizations and their associated parameters are transferred.""" + + class AddOne(nn.Module): + def forward(self, x): + return x + 1.0 + + class Double(nn.Module): + def forward(self, x): + return 2.0 * x + + class MinusOne(nn.Module): + def forward(self, x): + return x - 1.0 + + model = nn.Linear(5, 5, bias=True) + parametrize.register_parametrization(model, "weight", AddOne()) + parametrize.register_parametrization(model, "weight", Double()) + parametrize.register_parametrization(model, "weight", MinusOne()) + parametrize.register_parametrization(model, "bias", AddOne()) + parametrize.register_parametrization(model, "bias", Double()) + parametrize.register_parametrization(model, "bias", MinusOne()) + + to_model = nn.qat.Linear( + 5, 5, bias=True, qconfig=torch.ao.quantization.get_default_qconfig() + ) + parametrize.transfer_parametrizations_and_params(model, to_model, "weight") + + # check that weight and only weight was transferred + self.assertEqual(model.weight, to_model.weight) + self.assertEqual( + 
model.parametrizations.weight.original, + to_model.parametrizations.weight.original, + ) + self.assertTrue("bias" not in to_model.parametrizations) + + # FIXME: Rewrite this test using functions not depending on LAPACK + # and remove the `@skipIfNoLapack` (see #70995) + @skipIfNoLapack + def test_transfer_parametrizations_and_params_many_to_one(self): + # A parametrization with several outputs + class RankOne(nn.Module): + def forward(self, x, y): + # Form a rank-1 matrix from a pair of vectors + return x.unsqueeze(-1) @ y.unsqueeze(-2) + + def right_inverse(self, Y): + # We project the given matrix onto the rank 1 matrices + U, S, Vh = torch.linalg.svd(Y, full_matrices=False) + # S is ordered in a decreasing way. + s0_sqrt = S[0].sqrt().unsqueeze(-1) + return U[..., :, 0] * s0_sqrt, Vh[..., 0, :] * s0_sqrt + + class Double(nn.Module): + def forward(self, x): + return 2.0 * x + + model = nn.Linear(3, 3) + parametrize.register_parametrization(model, "weight", RankOne()) + parametrize.register_parametrization(model, "weight", Double()) + hold_weight = model.weight + + to_model = nn.qat.Linear( + 3, 3, qconfig=torch.ao.quantization.get_default_qconfig() + ) + + parametrize.transfer_parametrizations_and_params(model, to_model) + + # checks that final and original value are correct and the to_model is parametrized + self.assertTrue(torch.nn.utils.parametrize.is_parametrized(to_model, "weight")) + self.assertEqual(model.weight, to_model.weight) + self.assertEqual( + model.parametrizations.weight.original0, + to_model.parametrizations.weight.original0, + ) + self.assertEqual( + model.parametrizations.weight.original1, + to_model.parametrizations.weight.original1, + ) + + # check that the transfer didn't affect the original value + self.assertEqual(hold_weight, model.weight) + + # testing that changes to one set of parametrizations do not affect the other + model.test_param = Parameter(torch.randn(3, 3)) + + self.assertTrue(not hasattr(to_model, "test_param")) + parametrize.register_parametrization(model, "test_param", RankOne()) + hold_test_param = model.test_param + parametrize.transfer_parametrizations_and_params(model, to_model, "test_param") + + # also check that previously missing params got transferred correctly + self.assertEqual(model.test_param, to_model.test_param) + self.assertEqual( + model.parametrizations.test_param.original0, + to_model.parametrizations.test_param.original0, + ) + self.assertEqual( + model.parametrizations.test_param.original1, + to_model.parametrizations.test_param.original1, + ) + + # check that the new transfer didn't change the value for the from_module + self.assertEqual(hold_test_param, model.test_param) + # torch/nn/utils/prune.py @unittest.skipIf(not TEST_NUMPY, "numpy not found") def test_validate_pruning_amount_init(self): @@ -4823,7 +5076,7 @@ def assert_weight_allclose_Q(weight, W): (torch.float32, torch.complex64), (True, False)): # Conv2d does not support complex yet - if not use_linear and dtype.is_complex: + if not use_linear: continue if use_linear: @@ -6305,7 +6558,7 @@ def test(should_raise, module, input_size, dtype): # just run it to ensure no exception raised. 
module(input) - for dtype in [torch.bfloat16, torch.float, torch.double]: + for dtype in [torch.bfloat16, torch.float, torch.double, torch.cfloat, torch.cdouble]: # Conv1d test(True, nn.Conv1d(1, 1, 3).to(dtype), (1, 2), dtype) test(True, nn.Conv1d(1, 1, 3, stride=2).to(dtype), (1, 2), dtype) @@ -6381,8 +6634,6 @@ def test_ConvTranspose2d_half_cublas_gemm(self): output = deconv(inputs) output.mean().backward() - - @skipIfRocm # For https://github.com/pytorch/pytorch/pull/1273 # Almost identical to the above `test_Conv2d_naive_groups` def test_Conv2d_groups_nobias(self): @@ -6422,7 +6673,6 @@ def test_Conv2d_groups_nobias(self): # Covering special case when group > 1, input-channel / group < 16 and output-channel is multiple of 16 # See also https://github.com/pytorch/pytorch/pull/18463#issuecomment-476563686 # and https://github.com/pytorch/pytorch/pull/18463#issuecomment-477001024 - @skipIfRocm def test_Conv2d_groups_nobias_v2(self): torch.manual_seed(123) dev_dtypes = [("cpu", torch.float)] @@ -9978,10 +10228,10 @@ def test_grid_sample_error_checking(self): with self.assertRaisesRegex(ValueError, "but got: 'garbage'"): F.grid_sample(input, grid, padding_mode='garbage', align_corners=False) - with self.assertRaisesRegex(RuntimeError, "expected 4D or 5D input"): + with self.assertRaisesRegex(RuntimeError, "expected grid to have size 1 in last dimension"): F.grid_sample(input[0], grid, align_corners=False) - with self.assertRaisesRegex(RuntimeError, "grid with same number of dimensions"): + with self.assertRaisesRegex(RuntimeError, "expected grid to have size 2 in last dimension"): F.grid_sample(input, torch.empty(1, 1, 1, 1, 3), align_corners=False) with self.assertRaisesRegex(RuntimeError, "expected grid and input to have same batch size"): @@ -9997,7 +10247,7 @@ def test_grid_sample_error_checking(self): F.grid_sample(torch.empty(1, 1, 2, 2, 2), torch.empty(1, 1, 1, 1, 3), mode='bicubic') if TEST_CUDA: - with self.assertRaisesRegex(RuntimeError, "expected input and grid to be on same device"): + with self.assertRaisesRegex(RuntimeError, "Expected all tensors to be on the same device"): F.grid_sample(input.cuda(), grid, align_corners=False) def test_affine_grid_error_checking(self): @@ -11390,6 +11640,12 @@ def test_cross_entropy_loss_precision(self): outd = loss_cpu(inputd, target) self.assertEqual(outf, outd, exact_dtype=False) + def test_cross_entropy_loss_zero_div(self): + # Test for issue #73165 + input_1 = torch.rand([5, 0], dtype=torch.float32) + input_2 = torch.rand([5, 0], dtype=torch.float32) + torch.nn.CrossEntropyLoss()(input_1, input_2) + @unittest.skipIf(not torch.cuda.is_available(), "CUDA not available") def test_convert_sync_batchnorm(self): module = torch.nn.Sequential( @@ -12826,9 +13082,8 @@ def _test_GroupNorm_general(self, device, dtype=torch.float): (2, 6, 4, 2, 2): 4, } for shape, g in bad_shape_g.items(): - gn = nn.GroupNorm(g, shape[1]) - input = torch.empty(*shape, device=device, dtype=dtype).uniform_(0, 10) - self.assertRaises(RuntimeError, lambda: gn(input)) + with self.assertRaises(ValueError): + gn = nn.GroupNorm(g, shape[1]) def _test_GroupNorm_cuda_half(self): input = torch.zeros(2, 4, 3, 2, requires_grad=True).cuda().half().random_(1, 10) @@ -13099,7 +13354,7 @@ def test_affine_3d_rotateRandom(self, device): @onlyCUDA @skipCUDAIfNoCudnn - @dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM)) + @dtypes(*floating_and_complex_types_and(torch.half, *[torch.bfloat16] if AMPERE_OR_ROCM else [])) def test_Conv2d_deterministic_cudnn(self, device, dtype): 
inputs = torch.randn(2, 3, 5, 5, device=device, dtype=dtype, requires_grad=True) with cudnn.flags(enabled=True, benchmark=True, deterministic=True): @@ -13118,7 +13373,7 @@ def test_Conv2d_deterministic_cudnn(self, device, dtype): @onlyCUDA - @dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM)) + @dtypes(*floating_types_and(torch.half, *[torch.bfloat16] if AMPERE_OR_ROCM else [])) def test_Conv2d_large_workspace(self, device, dtype): # These sizes require huge cuDNN workspaces. Make sure we choose a # reasonable algorithm that does not run out of memory @@ -13243,7 +13498,7 @@ def test_Conv3d_depthwise_naive_groups(self, device, dtype): @onlyCUDA - @dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM)) + @dtypes(*floating_types_and(torch.half, *[torch.bfloat16] if AMPERE_OR_ROCM else [])) def test_noncontig_conv_grad(self, device, dtype): # FIXME: remove after adding non-contiguous grad tests for all modules module = nn.Conv2d(3, 5, kernel_size=3, padding=1).to(device, dtype) @@ -13359,8 +13614,8 @@ def test_conv_double_backward_stride(self): batch_size, inp_size, dilation, no_weight) - - def test_conv1d_same_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv1d_same_padding(self, device, dtype): # Test padding='same' outputs the correct shape test_args = [ # in_size @@ -13373,22 +13628,22 @@ def test_conv1d_same_padding(self, device): [1], ] for in_size, k_size, dilation, stride in itertools.product(*test_args): - x = torch.rand(1, 1, in_size, device=device) - y = torch.rand(1, 1, k_size, device=device) + x = torch.rand(1, 1, in_size, device=device, dtype=dtype) + y = torch.rand(1, 1, k_size, device=device, dtype=dtype) z = F.conv1d(x, y, padding='same', dilation=dilation, stride=stride) self.assertEqual(z.size(2), int(math.ceil(in_size / stride))) # Compare F.conv1d padding='same' output against manual padding # Without strides/dilation - x = torch.rand(1, 1, 12, device=device) - y = torch.rand(1, 1, 3, device=device) + x = torch.rand(1, 1, 12, device=device, dtype=dtype) + y = torch.rand(1, 1, 3, device=device, dtype=dtype) expect = F.conv1d(x, y, padding=1) actual = F.conv1d(x, y, padding='same') self.assertEqual(expect, actual) # With dilation - x = torch.rand(1, 1, 12, device=device) - y = torch.rand(1, 1, 4, device=device) + x = torch.rand(1, 1, 12, device=device, dtype=dtype) + y = torch.rand(1, 1, 4, device=device, dtype=dtype) expect = F.conv1d(x, y, padding=3, dilation=2) actual = F.conv1d(x, y, padding='same', dilation=2) self.assertEqual(expect, actual) @@ -13398,76 +13653,89 @@ def test_conv1d_same_padding(self, device): actual = F.conv1d(x, y, padding='same', dilation=3) self.assertEqual(expect, actual) - - def test_conv2d_same_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv2d_same_padding(self, device, dtype): + if dtype is torch.cfloat: + rtol, atol = 2e-6, 2e-6 + else: + rtol, atol = None, None # Compare F.conv2d padding='same' output against manual padding # Without strides/dilation - x = torch.rand(1, 1, 10, 11, device=device) - y = torch.rand(1, 1, 4, 5, device=device) + x = torch.rand(1, 1, 10, 11, device=device, dtype=dtype) + y = torch.rand(1, 1, 4, 5, device=device, dtype=dtype) expect = F.conv2d(x, y, padding=(2, 2))[..., 1:, :] actual = F.conv2d(x, y, padding='same') - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) # With dilation - y = torch.rand(1, 1, 3, 4, device=device) + y = torch.rand(1, 1, 3, 4, device=device, dtype=dtype) expect = F.conv2d(x, y, 
padding=(2, 3), dilation=2) actual = F.conv2d(x, y, padding='same', dilation=2) - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) # Dilation with asymmetric padding - y = torch.rand(1, 1, 4, 4, device=device) + y = torch.rand(1, 1, 4, 4, device=device, dtype=dtype) expect = F.conv2d(x, y, padding=5, dilation=3)[..., 1:, 1:] actual = F.conv2d(x, y, padding='same', dilation=3) - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) - def test_conv3d_same_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv3d_same_padding(self, device, dtype): + if dtype is torch.cfloat: + rtol, atol = 2e-6, 2e-6 + else: + rtol, atol = None, None # Compare F.conv3d padding='same' output against manual padding # Without strides/dilation - x = torch.rand(1, 1, 10, 11, 12, device=device) - y = torch.rand(1, 1, 1, 2, 5, device=device) + x = torch.rand(1, 1, 10, 11, 12, device=device, dtype=dtype) + y = torch.rand(1, 1, 1, 2, 5, device=device, dtype=dtype) expect = F.conv3d(x, y, padding=(0, 1, 2))[..., :, 1:, :] actual = F.conv3d(x, y, padding='same') - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) # With dilation expect = F.conv3d(x, y, padding=(0, 1, 4), dilation=2) actual = F.conv3d(x, y, padding='same', dilation=2) - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) # Dilation with asymmetric padding - y = torch.rand(1, 1, 4, 4, 4, device=device) + y = torch.rand(1, 1, 4, 4, 4, device=device, dtype=dtype) expect = F.conv3d(x, y, padding=5, dilation=3)[..., 1:, 1:, 1:] actual = F.conv3d(x, y, padding='same', dilation=3) - self.assertEqual(expect, actual) + self.assertEqual(expect, actual, rtol=rtol, atol=atol) - def test_conv1d_valid_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv1d_valid_padding(self, device, dtype): # Test F.conv1d padding='valid' is the same as no padding - x = torch.rand(1, 1, 10, device=device) - y = torch.rand(1, 1, 4, device=device) + x = torch.rand(1, 1, 10, device=device, dtype=dtype) + y = torch.rand(1, 1, 4, device=device, dtype=dtype) expect = F.conv1d(x, y) actual = F.conv1d(x, y, padding='valid') self.assertEqual(expect, actual) - def test_conv2d_valid_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv2d_valid_padding(self, device, dtype): # Test F.conv2d padding='valid' is the same as no padding - x = torch.rand(1, 1, 1, 10, device=device) - y = torch.rand(1, 1, 1, 4, device=device) + x = torch.rand(1, 1, 1, 10, device=device, dtype=dtype) + y = torch.rand(1, 1, 1, 4, device=device, dtype=dtype) expect = F.conv2d(x, y) actual = F.conv2d(x, y, padding='valid') self.assertEqual(expect, actual) - def test_conv3d_valid_padding(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv3d_valid_padding(self, device, dtype): # Test F.conv3d padding='valid' is the same as no padding - x = torch.rand(1, 1, 1, 1, 10, device=device) - y = torch.rand(1, 1, 1, 1, 4, device=device) + x = torch.rand(1, 1, 1, 1, 10, dtype=dtype, device=device) + y = torch.rand(1, 1, 1, 1, 4, dtype=dtype, device=device) expect = F.conv3d(x, y) actual = F.conv3d(x, y, padding='valid') self.assertEqual(expect, actual) - def test_conv1d_same_padding_backward(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv1d_same_padding_backward(self, device, dtype): # Test F.conv1d gradients work with padding='same' - x = torch.rand(1, 1, 12, device=device, 
requires_grad=True) - y = torch.rand(1, 1, 4, device=device, requires_grad=True) + x = torch.rand(1, 1, 12, dtype=dtype, device=device, requires_grad=True) + y = torch.rand(1, 1, 4, dtype=dtype, device=device, requires_grad=True) # Symmetric padding z = F.conv1d(x, y, padding=3, dilation=2) @@ -13492,10 +13760,11 @@ def test_conv1d_same_padding_backward(self, device): self.assertEqual(gx_expect, x.grad) self.assertEqual(gy_expect, y.grad) - def test_conv2d_same_padding_backward(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv2d_same_padding_backward(self, device, dtype): # Test F.conv2d gradients work with padding='same' - x = torch.rand(1, 1, 10, 11, device=device, requires_grad=True) - y = torch.rand(1, 1, 4, 5, device=device, requires_grad=True) + x = torch.rand(1, 1, 10, 11, device=device, dtype=dtype, requires_grad=True) + y = torch.rand(1, 1, 4, 5, device=device, dtype=dtype, requires_grad=True) # Symmetric padding z = F.conv2d(x, y, padding=(3, 4), dilation=2) @@ -13510,7 +13779,7 @@ def test_conv2d_same_padding_backward(self, device): x.grad, y.grad = None, None # Asymmetric padding - y = torch.rand(1, 1, 4, 4, device=device, requires_grad=True) + y = torch.rand(1, 1, 4, 4, device=device, dtype=dtype, requires_grad=True) z = F.conv2d(x, y, padding=2)[..., 1:, 1:] z.sum().backward() gx_expect, gy_expect = x.grad, y.grad @@ -13521,12 +13790,13 @@ def test_conv2d_same_padding_backward(self, device): self.assertEqual(gx_expect, x.grad) self.assertEqual(gy_expect, y.grad) - def test_conv3d_same_padding_backward(self, device): + @dtypes(torch.double, torch.cdouble) + def test_conv3d_same_padding_backward(self, device, dtype): check_forward_ad = torch.device(device).type != 'xla' # Test F.conv3d gradients work with padding='same' - x = torch.rand(1, 1, 1, 11, 12, device=device, requires_grad=True) - y = torch.rand(1, 1, 1, 2, 5, device=device, requires_grad=True) + x = torch.rand(1, 1, 1, 11, 12, dtype=dtype, device=device, requires_grad=True) + y = torch.rand(1, 1, 1, 2, 5, dtype=dtype, device=device, requires_grad=True) # Symmetric padding z = F.conv3d(x, y, padding=(0, 1, 4), dilation=2) @@ -13548,7 +13818,7 @@ def test_conv3d_same_padding_backward(self, device): check_fwd_over_rev=True) # Asymmetric padding - y = torch.rand(1, 1, 1, 4, 4, device=device, requires_grad=True) + y = torch.rand(1, 1, 1, 4, 4, dtype=dtype, device=device, requires_grad=True) z = F.conv3d(x, y, padding=2)[..., 1:, 1:] z.sum().backward() gx_expect, gy_expect = x.grad, y.grad @@ -13566,10 +13836,11 @@ def test_conv3d_same_padding_backward(self, device): gradgradcheck(lambda x, y: F.conv3d(x, y, padding='same'), (x, y), check_fwd_over_rev=True) - def test_conv1d_valid_padding_backward(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv1d_valid_padding_backward(self, device, dtype): # Test F.conv1d gradients work with padding='valid' - x = torch.rand(1, 1, 10, device=device, requires_grad=True) - y = torch.rand(1, 1, 4, device=device, requires_grad=True) + x = torch.rand(1, 1, 10, dtype=dtype, device=device, requires_grad=True) + y = torch.rand(1, 1, 4, dtype=dtype, device=device, requires_grad=True) F.conv1d(x, y, padding=0).sum().backward() gx_expect, gy_expect = x.grad, y.grad x.grad, y.grad = None, None @@ -13579,10 +13850,132 @@ def test_conv1d_valid_padding_backward(self, device): self.assertEqual(gx_expect, gx_actual) self.assertEqual(gy_expect, gy_actual) - def test_conv2d_valid_padding_backward(self, device): + @unittest.skipIf(not TEST_SCIPY, "Scipy required for the 
test.") + @dtypes(torch.float, torch.cfloat) + @parametrize_test("mode", ('valid', 'same')) + def test_conv1d_vs_scipy(self, device, dtype, mode): + t = make_tensor((1, 10), device=device, dtype=dtype) + feat_dim = t.shape[1] + weight_even = make_tensor((1, 1, 4), device=device, dtype=dtype) + weight_odd = make_tensor((1, 1, 5), device=device, dtype=dtype) + + def _test(t, weight, mode): + # SciPy expects two 1-D inputs. + t_a = t.view(-1).cpu().numpy() + w_a = weight.view(-1).cpu().numpy() + expected = scipy.signal.convolve(t_a, w_a, mode=mode) + + kwargs = {'padding': mode} + if mode == 'same': + # `same` padding in PyTorch conv1d is different + # from SciPy + p = weight.shape[2] // 2 + t = torch.nn.functional.pad(t, (p, p)) + # We have already taken care of padding + kwargs.pop("padding") + + # second input is flipped in SciPy's convolve + weight_flipped = torch.flip(weight, (2,)) + actual = torch.nn.functional.conv1d(t, weight_flipped, **kwargs).squeeze(0) + if mode == 'same': + actual = actual[:feat_dim] + + self.assertEqual(actual, expected) + + # Global dtype for this test suite is torch.double + # This leads to change in type-promotion + # and conv1d outputs `complex128` for `complex64` input. + with set_default_dtype(torch.float): + _test(t, weight_even, mode) + _test(t, weight_odd, mode) + + @unittest.skipIf(not TEST_SCIPY, "Scipy required for the test.") + @dtypes(torch.float, torch.cfloat) + @parametrize_test("mode", ('valid', 'same')) + def test_conv2d_vs_scipy(self, device, dtype, mode): + t = make_tensor((1, 5, 10), device=device, dtype=dtype) + weight_even = make_tensor((1, 1, 2, 4), device=device, dtype=dtype) + weight_odd = make_tensor((1, 1, 3, 5), device=device, dtype=dtype) + + def _test(t, weight, mode): + # SciPy expects two 2-D inputs. + t_a = t.squeeze(0).cpu().numpy() + w_a = weight.squeeze(0).squeeze(0).cpu().numpy() + expected = scipy.signal.convolve2d(t_a, w_a, mode=mode) + + kwargs = {'padding': mode} + if mode == 'same': + # `same` padding in PyTorch conv2d is different + # from SciPy + left_right_pad = weight.shape[3] // 2 + top_bottom_pad = weight.shape[2] // 2 + p = (left_right_pad, left_right_pad, top_bottom_pad, top_bottom_pad) + t = torch.nn.functional.pad(t, p) + # We have already taken care of padding + kwargs.pop("padding") + + # second input is flipped in SciPy's convolve2d + weight_flipped = torch.flip(weight, (2, 3)) + actual = torch.nn.functional.conv2d(t, weight_flipped, **kwargs).squeeze(0) + if mode == 'same': + actual = actual[:5, :10] + + self.assertEqual(actual, expected, rtol=2e-5, atol=5e-6) + + # Global dtype for this test suite is torch.double + # This leads to change in type-promotion + # and conv1d outputs `complex128` for `complex64` input. + with set_default_dtype(torch.float): + _test(t, weight_even, mode) + _test(t, weight_odd, mode) + + @unittest.skipIf(not TEST_SCIPY, "Scipy required for the test.") + @dtypes(torch.float, torch.cfloat) + @parametrize_test("mode", ('valid', 'same')) + def test_conv3d_vs_scipy(self, device, dtype, mode): + t = make_tensor((1, 5, 5, 10), device=device, dtype=dtype) + weight_even = make_tensor((1, 1, 2, 2, 4), device=device, dtype=dtype) + weight_odd = make_tensor((1, 1, 2, 3, 5), device=device, dtype=dtype) + + def _test(t, weight, mode): + # SciPy expects two 3-D inputs. 
+ t_a = t.squeeze(0).cpu().numpy() + w_a = weight.squeeze(0).squeeze(0).cpu().numpy() + expected = scipy.signal.convolve(t_a, w_a, mode=mode) + + kwargs = {'padding': mode} + if mode == 'same': + # `same` padding in PyTorch conv3d is different + # from SciPy + left_right_pad = weight.shape[4] // 2 + top_bottom_pad = weight.shape[3] // 2 + front_back_pad = weight.shape[2] // 2 + p = (left_right_pad, left_right_pad, top_bottom_pad, top_bottom_pad, + front_back_pad, front_back_pad) + t = torch.nn.functional.pad(t, p) + # We have already taken care of padding + kwargs.pop("padding") + + # second input is flipped in SciPy's convolve + weight_flipped = torch.flip(weight, (2, 3, 4)) + actual = torch.nn.functional.conv3d(t, weight_flipped, **kwargs).squeeze(0) + if mode == 'same': + actual = actual[:5, :5, :10] + + self.assertEqual(actual, expected, rtol=2e-5, atol=5e-6) + + # Global dtype for this test suite is torch.double + # This leads to change in type-promotion + # and conv1d outputs `complex128` for `complex64` input. + with set_default_dtype(torch.float): + _test(t, weight_even, mode) + _test(t, weight_odd, mode) + + @dtypes(torch.float, torch.complex64) + def test_conv2d_valid_padding_backward(self, device, dtype): # Test F.conv2d gradients work with padding='valid' - x = torch.rand(1, 1, 1, 10, device=device, requires_grad=True) - y = torch.rand(1, 1, 1, 4, device=device, requires_grad=True) + x = torch.rand(1, 1, 1, 10, device=device, dtype=dtype, requires_grad=True) + y = torch.rand(1, 1, 1, 4, device=device, dtype=dtype, requires_grad=True) F.conv2d(x, y, padding=0).sum().backward() gx_expect, gy_expect = x.grad, y.grad x.grad, y.grad = None, None @@ -13592,12 +13985,13 @@ def test_conv2d_valid_padding_backward(self, device): self.assertEqual(gx_expect, gx_actual) self.assertEqual(gy_expect, gy_actual) - def test_conv3d_valid_padding_backward(self, device): + @dtypes(torch.double, torch.cdouble) + def test_conv3d_valid_padding_backward(self, device, dtype): check_forward_ad = torch.device(device).type != 'xla' # Test F.conv3d gradients work with padding='valid' - x = torch.rand(1, 1, 1, 1, 10, device=device, requires_grad=True) - y = torch.rand(1, 1, 1, 1, 4, device=device, requires_grad=True) + x = torch.rand(1, 1, 1, 1, 10, dtype=dtype, device=device, requires_grad=True) + y = torch.rand(1, 1, 1, 1, 4, dtype=dtype, device=device, requires_grad=True) F.conv3d(x, y, padding=0).sum().backward() gx_expect, gy_expect = x.grad, y.grad x.grad, y.grad = None, None @@ -13800,6 +14194,20 @@ def _make_noncontiguous(inp): if layout is torch._mkldnn: return + if backend_actual != torch._C._ConvBackend.Empty: # FIXME: forward AD fails + # Forward AD and forward-over-reverse AD smoke test in float32 + # TODO: remove this if we introduce per-op gradient tests for float32 + with fwAD.dual_level(): + dual_inputs = [(fwAD.make_dual(i, torch.rand_like(i)) if isinstance(i, torch.Tensor) else i) for i in inputs] + # Forward AD + output = convolution(*dual_inputs) + # Forward over reverse AD + grad_output_d = fwAD.make_dual(torch.rand_like(output), torch.rand_like(output)) + if has_bias: + torch.autograd.grad(output, [x, weight, bias], grad_output_d) + else: + torch.autograd.grad(output, [x, weight], grad_output_d) + # Convert to float64 for gradcheck. 
x = x.to(torch.float64).detach().requires_grad_(True) weight = weight.to(torch.float64).detach().requires_grad_(True) @@ -14623,26 +15031,27 @@ def test_BatchNorm_empty(self, device): self.assertEqual(mod.weight.grad, torch.tensor([0., 0, 0], device=device)) self.assertEqual(mod.bias.grad, torch.tensor([0., 0, 0], device=device)) - def test_conv_empty_channel(self, device): + @dtypes(torch.float, torch.cfloat) + def test_conv_empty_channel(self, device, dtype): in_channels = 0 - mod = torch.nn.Conv1d(in_channels, 8, 2, stride=2).to(device) - inp = torch.randn(2, 0, 15, device=device) + mod = torch.nn.Conv1d(in_channels, 8, 2, stride=2, dtype=dtype).to(device) + inp = torch.randn(2, 0, 15, device=device, dtype=dtype) self._test_module_empty_input(mod, inp, check_size=False) with self.assertRaisesRegex(RuntimeError, "Given groups=1, weight"): inp = torch.randn(2, 1, 0, device=device) mod(inp) - mod = torch.nn.Conv2d(in_channels, 33, 3, stride=2).to(device) - inp = torch.randn(2, 0, 50, 100, device=device) + mod = torch.nn.Conv2d(in_channels, 33, 3, stride=2, dtype=dtype).to(device) + inp = torch.randn(2, 0, 50, 100, device=device, dtype=dtype) self._test_module_empty_input(mod, inp, check_size=False) with self.assertRaisesRegex(RuntimeError, "Given groups=1, weight"): inp = torch.randn(2, 1, 40, 0, device=device) mod(inp) - mod = torch.nn.Conv3d(in_channels, 33, 3, stride=2).to(device) - inp = torch.randn(2, 0, 50, 20, 40, device=device) + mod = torch.nn.Conv3d(in_channels, 33, 3, stride=2, dtype=dtype).to(device) + inp = torch.randn(2, 0, 50, 20, 40, device=device, dtype=dtype) self._test_module_empty_input(mod, inp, check_size=False) with self.assertRaisesRegex(RuntimeError, "Given groups=1, weight"): @@ -14918,6 +15327,31 @@ def test_unequal_when_beta_is_greater_than_one(): test_unequal_when_beta_is_less_than_one() test_unequal_when_beta_is_greater_than_one() + @onlyCPU + def test_smooth_l1_loss_bfloat16(self, device): + def test_dtype(fn, input, target, dtype): + input = input.detach().clone().to(dtype=dtype).requires_grad_(True) + input2 = input.detach().clone().float().requires_grad_(True) + target = target.detach().clone().to(dtype=dtype) + target2 = target.detach().clone().float() + out = fn(input, target) + out.sum().backward() + out2 = fn(input2, target2) + out2.sum().backward() + self.assertEqual(out.dtype, dtype) + self.assertEqual(input.grad.dtype, dtype) + self.assertEqual(out, out2, exact_dtype=False) + self.assertEqual(input.grad, input2.grad, exact_dtype=False) + + def func(device): + return nn.SmoothL1Loss().to(device=device) + + shapes = [[1, 3, 1, 6], [1, 3, 1, 128], [1, 3, 128, 128]] + for shape in shapes: + x = torch.randn(shape, device=device, requires_grad=True) + t = torch.randn(shape, device=device) + test_dtype(func(device), x, t, torch.bfloat16) + # We don't want to make propagating NaN a hard requirement on ops, but for # these easy ones, we should make them do so. 
def test_nonlinearity_propagate_nan(self, device): @@ -15693,9 +16127,7 @@ def test_upsamplingBicubic2d(self, device, antialias, align_corners): # for scale_factor in [0.5, 1, 1.5, 2]: for scale_factor in [2, ]: in_t = torch.ones(2, 3, 8, 8, device=device) - print("dtype: ", in_t.dtype) out_t = F.interpolate(in_t, scale_factor=scale_factor, **kwargs) - print(out_t) out_size = int(math.floor(in_t.shape[-1] * scale_factor)) expected_out = torch.ones(2, 3, out_size, out_size, device=device) self.assertEqual(expected_out, out_t, atol=1e-5, rtol=0) @@ -16153,6 +16585,7 @@ def test_masked_softmax(self, device): mask = mask.cuda() mask = mask.reshape(B, 1, 1, L).expand(B, num_heads, L, L).bool() native_res = torch._masked_softmax(input, mask) + mask = ~mask mask = mask.float() def slow_masked_softmax(input, mask): @@ -16176,6 +16609,7 @@ def test_masked_softmax_transformer_layout(self, device): mask = mask.bool() native_res = torch._masked_softmax(input, mask) mask = mask.reshape(B, 1, 1, L).expand(B, num_heads, L, L) + mask = ~mask mask = mask.float() def slow_masked_softmax(input, mask): @@ -17059,7 +17493,7 @@ def test_embedding_bag_empty_input(self, device, dtypes): output = Embed(input=x, offsets=torch.tensor([0, 0], device=device, dtype=dtypes[1])) self.assertEqual(output, torch.zeros_like(output)) - @skipCUDAIf(True, "cuda assert is not recovarable.") + @skipCUDAIf(True, "no out-of-bounds check on CUDA for perf.") @dtypes(*itertools.product((torch.float, torch.double), (torch.int, torch.long))) @parametrize_test("padding_idx", [None, 0]) @parametrize_test("mode", ["sum", "mean", "max"]) @@ -17178,15 +17612,15 @@ def _embedding_bag_reference_impl(self, input, weight, offsets=None, mode='sum', bags.append(embeddings.narrow(0, offset, length).max(0)[0]) return torch.stack(bags) - @dtypesIfCUDA(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) - @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double))) + @skipMeta + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.half, torch.float, torch.double))) def test_EmbeddingBag_empty_per_sample_weights_and_offsets(self, device, dtypes): # Test empty input and per sample weight, and backward pass. 
There was a CUDA # invalid configuration bug (more context in #46572) def test_per_sample_weights(mode, trainable_scale): es = nn.EmbeddingBag(5, 2, mode=mode).to(dtype=dtypes[2], device=device) es.weight.data.copy_( - torch.arange(1, 11, device=device, dtype=dtypes[2]).view_as(es.weight)) + torch.arange(1, 11, device=device).view_as(es.weight).to(dtypes[2])) input = torch.tensor([], device=device, dtype=dtypes[0]) offsets = torch.tensor([0, 0, 0, 0, 0], device=device, dtype=dtypes[1]) per_sample_weights = torch.randn_like(input, dtype=dtypes[2]) \ @@ -17217,13 +17651,13 @@ def test_per_sample_weights(mode, trainable_scale): for mode, trainable in itertools.product(modes, trainable_scale): test_per_sample_weights(mode, trainable) - @dtypesIfCUDA(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) - @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double))) + @skipMeta + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) def test_EmbeddingBag_per_sample_weights_and_offsets(self, device, dtypes): def test_per_sample_weights(mode, trainable_scale): es = nn.EmbeddingBag(5, 2, mode=mode).to(dtype=dtypes[2], device=device) es.weight.data.copy_( - torch.arange(1, 11, device=device, dtype=dtypes[2]).view_as(es.weight)) + torch.arange(1, 11, device=device).view_as(es.weight).to(dtypes[2])) input = torch.tensor([3, 1, 1, 1, 4, 0], device=device, dtype=dtypes[0]) offsets = torch.tensor([0, 0, 3, 3, 6], device=device, dtype=dtypes[1]) per_sample_weights = torch.randn_like(input, dtype=dtypes[2]) \ @@ -17251,13 +17685,13 @@ def test_per_sample_weights(mode, trainable_scale): for mode, trainable in itertools.product(modes, trainable_scale): test_per_sample_weights(mode, trainable) - @dtypesIfCUDA(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) - @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double))) + @skipMeta + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) def test_EmbeddingBag_per_sample_weights_and_new_offsets(self, device, dtypes): def test_per_sample_weights_new_offsets(mode, trainable_scale, include_last_offset, has_weight=True): es = nn.EmbeddingBag(5, 2, mode=mode, include_last_offset=include_last_offset).to(dtype=dtypes[2], device=device) es.weight.data.copy_( - torch.arange(1, 11, device=device, dtype=dtypes[2]).view_as(es.weight)) + torch.arange(1, 11, device=device).view_as(es.weight).to(dtypes[2])) input = torch.tensor([3, 1, 1, 1, 4, 0], device=device, dtype=dtypes[0]) offsets = torch.tensor([0, 0, 3, 3, 6], device=device, dtype=dtypes[1]) @@ -17413,7 +17847,7 @@ def _test_EmbeddingBag( ): # check a known test example es = nn.EmbeddingBag(5, 2, mode=mode, sparse=sparse).to(device, wdtype) - es.weight.data.copy_(torch.arange(1, 11, device=device, dtype=wdtype).view_as(es.weight)) + es.weight.data.copy_(torch.arange(1, 11, device=device).view_as(es.weight).to(wdtype)) input = torch.tensor([3, 1, 1, 1, 4, 0], device=device, dtype=dtype) offsets = torch.tensor([0, 0, 3, 3, 6], device=device, dtype=odtype) @@ -17516,8 +17950,8 @@ def _test_EmbeddingBag( offset[-1] = 100 self.assertRaises(RuntimeError, lambda: es(input.view(-1), offset)) - @dtypesIfCUDA(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, 
torch.double, torch.half))) - @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double))) + @skipMeta + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) def test_embedding_bag_device(self, device, dtypes): self._test_EmbeddingBag(device, 'sum', False, wdtype=dtypes[2], dtype=dtypes[0], odtype=dtypes[1]) self._test_EmbeddingBag(device, 'mean', False, wdtype=dtypes[2], dtype=dtypes[0], odtype=dtypes[1]) @@ -17530,7 +17964,7 @@ def test_embedding_bag_device(self, device, dtypes): elif self.device_type == 'cpu': # TODO: figure out why precision on sparse embeddings isn't the # same as for dense. - test_backward = dtypes[2] is not torch.float + test_backward = dtypes[2] is not torch.float and dtypes[2] is not torch.float16 self._test_EmbeddingBag( device, @@ -17551,8 +17985,8 @@ def test_embedding_bag_device(self, device, dtypes): test_backward=test_backward, ) - @dtypesIfCUDA(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) - @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double))) + @skipMeta + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long), (torch.float, torch.double, torch.half))) def test_embedding_bag_non_contiguous_weight(self, device, dtypes): weight_tensor = torch.randn(3, 4, dtype=dtypes[2], device=device) @@ -17582,6 +18016,11 @@ def test_embedding_bag_bfloat16(self, device, dtypes): self._test_EmbeddingBag(device, 'sum', True, wdtype=torch.bfloat16, dtype=dtypes[0], odtype=dtypes[1], test_backward=True) self._test_EmbeddingBag(device, 'mean', True, wdtype=torch.bfloat16, dtype=dtypes[0], odtype=dtypes[1], test_backward=True) + @onlyNativeDeviceTypes # currently fails on XLA + @dtypes(*itertools.product((torch.int, torch.long), (torch.int, torch.long))) + def test_embedding_bag_half(self, device, dtypes): + self._test_EmbeddingBag(device, 'sum', True, wdtype=torch.float16, dtype=dtypes[0], odtype=dtypes[1], test_backward=True) + @onlyCUDA @dtypes(torch.half, torch.float, torch.double) def test_multihead_attention_dtype(self, device, dtype): @@ -17597,7 +18036,7 @@ def test_multihead_attention_dtype(self, device, dtype): self.assertEqual(q.size(), out[0].size()) self.assertEqual(dtype, out[0].dtype) - @dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM)) + @dtypesIfCUDA(*floating_types_and(torch.half, *[torch.bfloat16] if AMPERE_OR_ROCM else [])) @dtypes(torch.float) def test_Conv2d_naive_groups(self, device, dtype): # Check that grouped convolutions matches two half convolutions @@ -17632,7 +18071,7 @@ def test_Conv2d_naive_groups(self, device, dtype): torch.cat([m1.weight.grad.data, m2.weight.grad.data], 0), atol=dtype2prec_DONTUSE[dtype], rtol=0) - @dtypes(torch.double) + @dtypes(torch.double, torch.cdouble) def test_Conv2d_backward_depthwise(self, device, dtype): x = torch.randn(2, 2, 4, 20, device=device, dtype=dtype, requires_grad=True) weight = torch.randn(2, 1, 3, 5, device=device, dtype=dtype, requires_grad=True) @@ -17965,37 +18404,37 @@ def expected_output(dim): self.assertEqual(output[0, 0, 0, 0], float("-inf")) self.assertEqual(indices[0, 0, 0, 0], 0) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_MaxPool1d_indices(self, device, dtype): self._test_maxpool_indices(1, device=device, dtype=dtype) - 
@dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_MaxPool2d_indices(self, device, dtype): self._test_maxpool_indices(2, device=device, dtype=dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_MaxPool3d_indices(self, device, dtype): self._test_maxpool_indices(3, device=device, dtype=dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_AdaptiveMaxPool1d_indices(self, device, dtype): self._test_maxpool_indices(1, adaptive=True, device=device, dtype=dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_AdaptiveMaxPool2d_indices(self, device, dtype): self._test_maxpool_indices(2, adaptive=True, device=device, dtype=dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_AdaptiveMaxPool3d_indices(self, device, dtype): self._test_maxpool_indices(3, adaptive=True, device=device, dtype=dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_maxpool_indices_no_batch_dim(self, device, dtype): """Check that indices with no batch dim is consistent with a single batch.""" @@ -18160,7 +18599,7 @@ def test_pooling_zero_stride(self, device): self.assertRaisesRegex(RuntimeError, r"stride should not be zero|stride must be greater than zero", lambda: fn_module(x)) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_pool_large_size(self, device, dtype): for op in ('max', 'avg'): @@ -18174,7 +18613,7 @@ def test_pool_large_size(self, device, dtype): # check if the output shape was still computed correctly self.assertEqual(x.shape[2], res.shape[2]) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float) def test_pool_invalid_size(self, device, dtype): for op in ('max', 'avg'): @@ -18418,6 +18857,35 @@ def test_multi_margin_loss_errors(self, device): lambda: nn.functional.multi_margin_loss(torch.randn(5, device=device), torch.zeros(3, device=device))) + @onlyCPU + def test_activations_bfloat16_cpu(self, device): + def test_bfloat16(fn, device, inp_dims, prec): + # bfloat16 compute + input = torch.randn(inp_dims, dtype=torch.bfloat16, device=device, requires_grad=True) + out = fn(input) + grad_input = torch.randn_like(out, dtype=torch.bfloat16, device=device) + out.backward(grad_input) + + # fp32 compute + input2 = input.detach().clone().float().requires_grad_(True) + out2 = fn(input2) + grad_input2 = grad_input.detach().clone().float() + out2.backward(grad_input2) + + self.assertEqual(out.dtype, torch.bfloat16) + self.assertEqual(input.grad.dtype, torch.bfloat16) + self.assertEqual(out, out2, atol=prec, rtol=0, exact_dtype=False) + self.assertEqual(input.grad.data, input2.grad.data, atol=prec, rtol=0, exact_dtype=False) + + shapes = [[1, 3, 1, 6], [1, 3, 1, 128], [1, 3, 256, 256]] + for shape in shapes: + test_bfloat16(torch.nn.LogSigmoid(), device, shape, prec=2e-2) + test_bfloat16(torch.nn.Hardsigmoid(), device, shape, prec=1e-2) + test_bfloat16(torch.nn.Hardshrink(), device, shape, prec=1e-2) + test_bfloat16(torch.nn.Softshrink(), device, shape, prec=1e-2) + 
test_bfloat16(torch.nn.Hardswish(), device, shape, prec=2e-2) + test_bfloat16(torch.nn.Softplus(), device, shape, prec=1e-2) + def _test_bfloat16_ops(self, op, device, inp_dims=(), prec=1e-2, scale_factor=None): # fp32 compute input1 = torch.randn(inp_dims, dtype=torch.float32, device=device, requires_grad=True) @@ -18467,7 +18935,7 @@ def test_softmax_bfloat16(self, device): @skipCUDAIfRocmVersionLessThan((4, 3)) @skipCUDAIfNotMiopenSuggestNHWC @skipCUDAIfCudnnVersionLessThan(7603) - @dtypes(torch.half, torch.float) + @dtypes(torch.half, torch.float, torch.cfloat) def test_conv_cudnn_nhwc(self, device, dtype): def helper(n, c, h, w, out_channels, kernel_size, groups): input = torch.randint(-3, 3, (n, c, h, w), dtype=dtype, device=device)\ @@ -19350,6 +19818,32 @@ def test_leaky_relu_inplace_with_zero_slope(self, device): expected_bf16 = torch.tensor([0., 0., 1.], device=device, dtype=torch.bfloat16) self.assertEqual(a_bf16.grad, expected_bf16) + @onlyCPU + def test_softshrink(self, device): + x = torch.tensor([[1.21, 0.56, 0.5001, 0.4999, 1.2357, -0.4999, -0.5001, -1.154, + 0.254, -0.24, -0.225, 0.104, 0.002, -0.001, 0.0574, 1.2344, + 0.1748, -0.1797, -0.8125, 0.2051, -1.1328, 1.2344, -0.1562, 2.3554, + -0.1953, 0.0304, -0.3613, -1.3047, 1.0312, 0.1436, -0.6953, 0.5664, + -0.5820, -0.3301, 0.8203, 0.6133, 0.5938], + [-0.8203, -1.2344, -0.5234, 2.5312, -0.4551, -0.6875, -1.5547, -0.2217, + -0.3027, 2.6406, 1.3047, 0.2344, -1.6719, 0.2773, -1.3516, 3.4575, + 0.4414, 0.2656, 2.1094, -1.5156, 1.2344, -0.4336, 0.6797, -3.5486, + 0.9766, -0.4062, 1.4844, 0.7500, -1.7578, 0.7461, 1.6094, 8.5458, + 0.3730, -0.3477, -1.0625, 0.3848, 0.0557]], device=device) + expected = torch.tensor([[0.71, 0.06, 0.0001, 0., 0.7357, 0., -0.0001, -0.654, + 0., 0., 0., 0., 0., 0., 0., 0.7344, + 0., 0., -0.3125, 0., -0.6328, 0.7344, 0., 1.8554, + 0., 0., 0., -0.8047, 0.5312, 0., -0.1953, 0.0664, + -0.0820, 0.0, 0.3203, 0.1133, 0.0938], + [-0.3203, -0.7344, -0.0234, 2.0312, 0.0, -0.1875, -1.0547, 0., + 0.0, 2.1406, 0.8047, 0., -1.1719, 0., -0.8516, 2.9575, + 0., 0., 1.6094, -1.0156, 0.7344, 0., 0.1797, -3.0486, + 0.4766, 0., 0.9844, 0.2500, -1.2578, 0.2461, 1.1094, 8.0458, + 0., 0., -0.5625, 0., 0.]]) + softshrink = torch.nn.Softshrink() + out = softshrink(x) + self.assertEqual(out, expected, atol=1e-2, rtol=0) + def test_threshold_inplace_overlap(self, device): # Inplace threshold is okay, because it is idempotent x = torch.randn((1, 6), device=device).expand((6, 6)) diff --git a/test/test_numpy_interop.py b/test/test_numpy_interop.py index 2c1395a19ac8ea..96c1016c2dbb3d 100644 --- a/test/test_numpy_interop.py +++ b/test/test_numpy_interop.py @@ -9,7 +9,7 @@ (TestCase, run_tests) from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, onlyCPU, dtypes, skipMeta) -from torch.testing._internal.common_dtype import get_all_dtypes +from torch.testing._internal.common_dtype import all_types_and_complex_and # For testing handling NumPy objects and sending tensors to / accepting # arrays from NumPy. 
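
The import change above is part of a pattern repeated throughout this diff: the broad get_all_dtypes()/get_all_fp_dtypes() helpers are swapped for explicit constructors such as all_types_and_complex_and() and floating_types_and(), so each test spells out exactly which extra dtypes it opts into. A minimal usage sketch, assuming the torch.testing._internal.common_dtype module from this tree:

```python
import torch
from torch.testing._internal.common_dtype import (
    all_types_and_complex_and,
    floating_types_and,
)

# Extra dtypes (half, bfloat16, bool) are now opted into explicitly
# rather than being returned implicitly by a catch-all helper.
full_set = all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)
fp_set = floating_types_and(torch.half, torch.bfloat16)

print(torch.cfloat in full_set)   # True: complex types are part of the base set
print(torch.bfloat16 in fp_set)   # True: only because it was passed explicitly
```
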
@@ -234,6 +234,28 @@ def test_from_list_of_ndarray_warning(self, device): with self.assertWarnsOnceRegex(UserWarning, warning_msg): torch.tensor([np.array([0]), np.array([1])], device=device) + def test_ctor_with_invalid_numpy_array_sequence(self, device): + # Invalid list of numpy array + with self.assertRaisesRegex(ValueError, "expected sequence of length"): + torch.tensor([np.random.random(size=(3, 3)), np.random.random(size=(3, 0))], device=device) + + # Invalid list of list of numpy array + with self.assertRaisesRegex(ValueError, "expected sequence of length"): + torch.tensor([[np.random.random(size=(3, 3)), np.random.random(size=(3, 2))]], device=device) + + with self.assertRaisesRegex(ValueError, "expected sequence of length"): + torch.tensor([[np.random.random(size=(3, 3)), np.random.random(size=(3, 3))], + [np.random.random(size=(3, 3)), np.random.random(size=(3, 2))]], device=device) + + # expected shape is `[1, 2, 3]`, hence we try to iterate over 0-D array + # leading to type error : not a sequence. + with self.assertRaisesRegex(TypeError, "not a sequence"): + torch.tensor([[np.random.random(size=(3)), np.random.random()]], device=device) + + # list of list or numpy array. + with self.assertRaisesRegex(ValueError, "expected sequence of length"): + torch.tensor([[1, 2, 3], np.random.random(size=(2,)), ], device=device) + @onlyCPU def test_ctor_with_numpy_scalar_ctor(self, device) -> None: dtypes = [ @@ -397,7 +419,7 @@ def test_has_storage_numpy(self, device): self.assertIsNotNone(torch.tensor(arr, device=device, dtype=torch.long).storage()) self.assertIsNotNone(torch.tensor(arr, device=device, dtype=torch.uint8).storage()) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_numpy_scalar_cmp(self, device, dtype): if dtype.is_complex: tensors = (torch.tensor(complex(1, 3), dtype=dtype, device=device), diff --git a/test/test_ops.py b/test/test_ops.py index 5791f93fb53aca..cbf6629862547b 100644 --- a/test/test_ops.py +++ b/test/test_ops.py @@ -1,30 +1,29 @@ -# Owner(s): ["high priority"] +# Owner(s): ["module: unknown"] from collections.abc import Sequence -from functools import partial, wraps +from functools import partial import warnings import unittest import itertools - import torch -from torch.testing import FileCheck, make_tensor -from torch.testing._internal.common_dtype import floating_and_complex_types_and, get_all_dtypes +from torch.testing import make_tensor +from torch.testing._internal.common_dtype import floating_and_complex_types_and, all_types_and_complex_and from torch.testing._internal.common_utils import \ (TestCase, is_iterable_of_tensors, run_tests, IS_SANDCASTLE, clone_input_helper, - gradcheck, gradgradcheck, IS_IN_CI, suppress_warnings, noncontiguous_like, + IS_IN_CI, suppress_warnings, noncontiguous_like, TEST_WITH_ASAN, IS_WINDOWS, IS_FBCODE, first_sample) from torch.testing._internal.common_methods_invocations import \ (op_db, _NOTHING, UnaryUfuncInfo, ReductionOpInfo, SpectralFuncInfo) from torch.testing._internal.common_device_type import \ (deviceCountAtLeast, instantiate_device_type_tests, ops, onlyCPU, onlyCUDA, onlyNativeDeviceTypes, OpDTypes, skipMeta) -from torch.testing._internal.common_jit import JitCommonTestCase, check_against_reference -from torch.testing._internal.jit_metaprogramming_utils import create_script_fn, create_traced_fn, \ - check_alias_annotation -from torch.testing._internal.jit_utils import disable_autodiff_subgraph_inlining, is_lambda + + import 
torch.testing._internal.opinfo_helper as opinfo_helper -from torch.testing._internal.composite_compliance import _check_composite_compliance +from torch.testing._internal import composite_compliance + +TEST_ROCM = torch.cuda.is_available() and torch.version.hip is not None # TODO: fixme https://github.com/pytorch/pytorch/issues/68972 torch.set_default_dtype(torch.float32) @@ -68,8 +67,13 @@ def tearDownClass(cls): @onlyNativeDeviceTypes @ops(op_db, dtypes=OpDTypes.none) def test_dtypes(self, device, op): + # Check complex32 support only if the op claims. + # TODO: Once the complex32 support is better, we should add check for complex32 unconditionally. + include_complex32 = ((torch.complex32,) if op.supports_dtype(torch.complex32, device) else ()) + # dtypes to try to backward in - allowed_backward_dtypes = floating_and_complex_types_and(torch.bfloat16, torch.float16) + allowed_backward_dtypes = floating_and_complex_types_and( + *((torch.half, torch.bfloat16) + include_complex32)) # lists for (un)supported dtypes supported_dtypes = [] @@ -82,7 +86,8 @@ def unsupported(dtype): if dtype in allowed_backward_dtypes: unsupported_backward_dtypes.append(dtype) - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and( + *((torch.half, torch.bfloat16, torch.bool) + include_complex32)): # tries to acquire samples - failure indicates lack of support requires_grad = (dtype in allowed_backward_dtypes and op.supports_autograd) try: @@ -204,6 +209,7 @@ def test_multiple_devices(self, devices, dtype, op): # This test runs in double and complex double precision because # NumPy does computation internally using double precision for many functions # resulting in possible equality check failures. + @unittest.skipIf(TEST_WITH_ASAN, "Skipped under ASAN") @onlyNativeDeviceTypes @suppress_warnings @ops(_ref_test_ops, allowed_dtypes=(torch.float64, torch.long, torch.complex128)) @@ -212,8 +218,8 @@ def test_reference_testing(self, device, dtype, op): # Sets the default dtype to NumPy's default dtype of double cur_default = torch.get_default_dtype() torch.set_default_dtype(torch.double) - sample_inputs = op.sample_inputs(device, dtype) - for sample_input in sample_inputs: + reference_inputs = op.reference_inputs(device, dtype) + for sample_input in reference_inputs: self.compare_with_reference(op, op.ref, sample_input, exact_dtype=(dtype is not torch.long)) finally: torch.set_default_dtype(cur_default) @@ -680,24 +686,6 @@ def _test_inplace_preserve_storage(samples, variants): inplace_samples = list(filter(lambda sample: not sample.broadcasts_input, samples)) _test_inplace_preserve_storage(inplace_samples, inplace_variants) - # Checks if the operator (if it is composite) is written to support most - # backends and Tensor subclasses. See "CompositeImplicitAutograd Compliance" - # in aten/src/ATen/native/README.md for more details - # - # NB: onlyCPU because CompositeImplicitAutograd ops go through the same - # codepath on all devices. Ideally we'd use a meta device here but coverage - # for that is not good yet. 
- @unittest.skipIf(IS_FBCODE or IS_SANDCASTLE, '__torch_dispatch__ does not work in fbcode') - @onlyCPU - @ops(op_db, allowed_dtypes=(torch.float,)) - def test_composite_compliance(self, device, dtype, op): - samples = op.sample_inputs(device, dtype, requires_grad=False) - - for sample in samples: - args = [sample.input] + list(sample.args) - kwargs = sample.kwargs - _check_composite_compliance(op, args, kwargs) - @onlyCPU @ops(op_db, allowed_dtypes=(torch.float,)) def test_floating_inputs_are_differentiable(self, device, dtype, op): @@ -722,465 +710,48 @@ def check_tensor_floating_is_differentiable(t): for arg in sample.kwargs.values(): check_tensor_floating_is_differentiable(arg) + # Reference testing for operations in complex32 against complex64. + # NOTE: We test against complex64 as NumPy doesn't have a complex32 equivalent dtype. + @ops(op_db, allowed_dtypes=(torch.complex32,)) + def test_complex_half_reference_testing(self, device, dtype, op): + if not op.supports_dtype(torch.complex32, device): + unittest.skip("Does not support complex32") -# gradcheck requires double precision -_gradcheck_ops = partial(ops, dtypes=OpDTypes.supported, - allowed_dtypes=[torch.double, torch.cdouble]) - - -class TestGradients(TestCase): - exact_dtype = True - - # Copies inputs to inplace operations to avoid inplace modifications - # to leaves requiring gradient - def _get_safe_inplace(self, inplace_variant): - @wraps(inplace_variant) - def _fn(t, *args, **kwargs): - return inplace_variant(t.clone(), *args, **kwargs) - - return _fn - - def _check_helper(self, device, dtype, op, variant, check, *, check_forward_ad=False, check_backward_ad=True, - check_batched_grad=None, check_batched_forward_grad=False): - assert check in ('gradcheck', 'bwgrad_bwgrad', 'fwgrad_bwgrad') - # NB: check_backward_ad does not affect gradgradcheck (always True) - if variant is None: - self.skipTest("Skipped! Variant not implemented.") - if not op.supports_dtype(dtype, torch.device(device).type): - self.skipTest(f"Skipped! {op.name} does not support dtype {str(dtype)}") - - def is_inplace(variant): - if hasattr(variant, "__wrapped__"): - return variant.__wrapped__ is op.get_inplace() - return variant is op.get_inplace() + for sample in op.sample_inputs(device, dtype): + actual = op(sample.input, *sample.args, **sample.kwargs) + (inp, args, kwargs) = sample.transform(lambda x: x.to(torch.complex64)) + expected = op(inp, *args, **kwargs) + self.assertEqual(actual, expected, exact_dtype=False) - include_conjugated_inputs = op.test_conjugated_samples and dtype.is_complex - samples = op.sample_inputs(device, dtype, requires_grad=True, include_conjugated_inputs=include_conjugated_inputs) +class TestCompositeCompliance(TestCase): + # Checks if the operator (if it is composite) is written to support most + # backends and Tensor subclasses. 
See "CompositeImplicitAutograd Compliance" + # in aten/src/ATen/native/README.md for more details + @unittest.skipIf(IS_FBCODE or IS_SANDCASTLE, '__torch_dispatch__ does not work in fbcode') + @ops(op_db, allowed_dtypes=(torch.float,)) + def test_operator(self, device, dtype, op): + samples = op.sample_inputs(device, dtype, requires_grad=False) for sample in samples: - if sample.broadcasts_input and is_inplace(variant): - continue + args = [sample.input] + list(sample.args) + kwargs = sample.kwargs + composite_compliance.check_with_mode(op, args, kwargs) + composite_compliance.check_all_permutations(op, args, kwargs) - # Note on TensorList inputs - # - # gradcheck does not support TensorList inputs so here we pass TensorList - # inputs of size n as n single Tensor inputs to gradcheck and wrap the op - # in a function that puts the n Tensor inputs back into a TensorList - def fn(*inputs): - # Put tensors back into TensorList since we splat them when passing to gradcheck - if is_iterable_of_tensors(sample.input): - n = len(sample.input) - inputs = (inputs[:n], *inputs[n:]) - output = op.gradcheck_wrapper(variant, *inputs, **sample.kwargs) - if sample.output_process_fn_grad is not None: - return sample.output_process_fn_grad(output) - return output - - # Splat TensorList inputs into single Tensor inputs - gradcheck_args = (sample.input,) if isinstance(sample.input, torch.Tensor) else tuple(sample.input) - gradcheck_args += sample.args - - if check == 'gradcheck': - if check_batched_grad is None: - check_batched_grad = op.check_batched_grad - self.assertTrue(gradcheck(fn, gradcheck_args, - check_batched_grad=check_batched_grad, - check_grad_dtypes=True, - nondet_tol=op.gradcheck_nondet_tol, - fast_mode=op.gradcheck_fast_mode, - check_forward_ad=check_forward_ad, - check_backward_ad=check_backward_ad, - check_undefined_grad=True, - check_batched_forward_grad=check_batched_forward_grad)) - elif check in ('bwgrad_bwgrad', 'fwgrad_bwgrad'): # gradgrad check - self.assertFalse(check_forward_ad, msg="Cannot run forward AD check for gradgradcheck") - for gen_non_contig_grad_outputs in (False, True): - kwargs = { - "gen_non_contig_grad_outputs": gen_non_contig_grad_outputs, - "check_batched_grad": op.check_batched_gradgrad, - "check_grad_dtypes": True, - "nondet_tol": op.gradcheck_nondet_tol, - "fast_mode": op.gradcheck_fast_mode - } - if check == "fwgrad_bwgrad": - kwargs["check_fwd_over_rev"] = True - kwargs["check_rev_over_rev"] = False - kwargs["check_batched_grad"] = False - kwargs["check_undefined_grad"] = False - - self.assertTrue(gradgradcheck(fn, gradcheck_args, **kwargs)) - else: - self.assertTrue(False, msg="Unknown check requested!") - - def _grad_test_helper(self, device, dtype, op, variant, *, check_forward_ad=False, check_backward_ad=True, - check_batched_grad=None, check_batched_forward_grad=False): - return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad, - check_backward_ad=check_backward_ad, check_batched_grad=check_batched_grad, - check_batched_forward_grad=check_batched_forward_grad) - - def _skip_helper(self, op, device, dtype): - if not op.supports_autograd and not op.supports_forward_ad: - self.skipTest("Skipped! autograd not supported.") - if not op.supports_complex_autograd(torch.device(device).type) and dtype.is_complex: - self.skipTest("Skipped! 
Complex autograd not supported.") - - # Tests that gradients are computed correctly - @_gradcheck_ops(op_db) - def test_fn_grad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - self._grad_test_helper(device, dtype, op, op.get_op()) - - # Method grad (and gradgrad, see below) tests are disabled since they're - # costly and redundant with function grad (and gradgad) tests - # @_gradcheck_ops(op_db) - # def test_method_grad(self, device, dtype, op): - # self._skip_helper(op, device, dtype) - # self._grad_test_helper(device, dtype, op, op.get_method()) - - @_gradcheck_ops(op_db) - def test_inplace_grad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - if not op.inplace_variant or not op.supports_inplace_autograd: - self.skipTest("Skipped! Operation does not support inplace autograd.") - self._grad_test_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace())) - - # Test that gradients of gradients are computed correctly - @_gradcheck_ops(op_db) - def test_fn_gradgrad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - if not op.supports_gradgrad: - self.skipTest("Skipped! Operation does not support gradgrad") - self._check_helper(device, dtype, op, op.get_op(), 'bwgrad_bwgrad') - - # Test that forward-over-reverse gradgrad is computed correctly - @_gradcheck_ops(op_db) - def test_fn_fwgrad_bwgrad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - - if op.supports_fwgrad_bwgrad: - self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad") - else: - err_msg = r"Trying to use forward AD with .* that does not support it\." - hint_msg = ("Running forward-over-backward gradgrad for an OP that has does not support it did not " - "raise any error. If your op supports forward AD, you should set supports_fwgrad_bwgrad=True.") - with self.assertRaisesRegex(NotImplementedError, err_msg, msg=hint_msg): - self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad") - - # Test that gradients of gradients are properly raising - @_gradcheck_ops(op_db) - def test_fn_fail_gradgrad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - if op.supports_gradgrad: - self.skipTest("Skipped! Operation does support gradgrad") - - err_msg = r"derivative for .* is not implemented" - with self.assertRaisesRegex(RuntimeError, err_msg): - self._check_helper(device, dtype, op, op.get_op(), 'bwgrad_bwgrad') - - # Method gradgrad (and grad, see above) tests are disabled since they're - # costly and redundant with function gradgrad (and grad) tests - # @_gradcheck_ops(op_db) - # def test_method_gradgrad(self, device, dtype, op): - # self._skip_helper(op, device, dtype) - # self._gradgrad_test_helper(device, dtype, op, op.get_method()) - - @_gradcheck_ops(op_db) - def test_inplace_gradgrad(self, device, dtype, op): - self._skip_helper(op, device, dtype) - if not op.inplace_variant or not op.supports_inplace_autograd: - self.skipTest("Skipped! 
Operation does not support inplace autograd.") - self._check_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace()), "bwgrad_bwgrad") - - def _forward_grad_helper(self, device, dtype, op, variant, is_inplace): - # TODO: clean up how attributes are passed to gradcheck from OpInfos - def call_grad_test_helper(): - check_batched_forward_grad = ((op.check_batched_forward_grad and not is_inplace) or - (op.check_inplace_batched_forward_grad and is_inplace)) - self._grad_test_helper(device, dtype, op, variant, check_forward_ad=True, check_backward_ad=False, - check_batched_grad=False, check_batched_forward_grad=check_batched_forward_grad) - if op.supports_forward_ad: - call_grad_test_helper() - else: - err_msg = r"Trying to use forward AD with .* that does not support it\." - hint_msg = ("Running forward AD for an OP that has does not support it did not " - "raise any error. If your op supports forward AD, you should set supports_forward_ad=True") - with self.assertRaisesRegex(NotImplementedError, err_msg, msg=hint_msg): - call_grad_test_helper() - - @_gradcheck_ops(op_db) - def test_forward_mode_AD(self, device, dtype, op): - self._skip_helper(op, device, dtype) - - self._forward_grad_helper(device, dtype, op, op.get_op(), is_inplace=False) - - @_gradcheck_ops(op_db) - def test_inplace_forward_mode_AD(self, device, dtype, op): - self._skip_helper(op, device, dtype) - - if not op.inplace_variant or not op.supports_inplace_autograd: - self.skipTest("Skipped! Operation does not support inplace autograd.") - - self._forward_grad_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace()), is_inplace=True) - - # Functions that do not support autograd should not fail in forward mode - # Inplace functions (such as "resize_") are expected to fail in forward mode and should be skipped - # Test only when supports_autograd=False and for double dtype - @ops(filter(lambda op: not op.supports_autograd, op_db), dtypes=OpDTypes.supported, allowed_dtypes=(torch.double,)) - def test_nondifferentiable(self, device, dtype, op): - # Expecting no errors + # There are some weird unexpected successe here that imply rocm goes down + # a different path than CUDA sometimes. There's not an easy way to describe + # this in OpInfo so we're just going to skip all ROCM tests... + @unittest.skipIf(TEST_ROCM, "The CUDA tests give sufficient signal") + @unittest.skipIf(IS_FBCODE or IS_SANDCASTLE, '__torch_dispatch__ does not work in fbcode') + @ops([op for op in op_db if op.supports_autograd], allowed_dtypes=(torch.float,)) + def test_backward(self, device, dtype, op): samples = op.sample_inputs(device, dtype, requires_grad=True) - sample = first_sample(self, samples) - result = op(sample.input, *sample.args, **sample.kwargs) - - -# Tests operators for consistency between JIT and eager, also checks -# correctness of JIT specific alias schemas and intended -# autodifferentiation behavior. -# Inherits from JitCommonTestCase instead of TestCase directly to share -# functionality with original test_jit.py method operator tests -class TestJit(JitCommonTestCase): - exact_dtype = True - # Tests that the forward and backward passes of operations produce the - # same values for the cross-product of op variants (function, method, inplace) - # and runtimes (eager, traced, scripted). 
- # TODO WARNING: inplace x {traced, scripted} not currently tested - @_variant_ops(op_db) - def test_variant_consistency_jit(self, device, dtype, op): - _requires_grad = op.supports_autograd and (dtype.is_floating_point or - op.supports_complex_autograd(torch.device(device).type)) - - include_conjugated_inputs = op.test_conjugated_samples and dtype.is_complex - samples = op.sample_inputs(device, dtype, requires_grad=_requires_grad, include_conjugated_inputs=include_conjugated_inputs) - - # Acquires variants to test - func = op.get_op() - method = op.get_method() - variants = { - # TODO: inplace tests currently fail, fix and add inplace variant - 'function': func, 'method': method, - } - - # TODO: find better way to standardize on op registration itself.. - has_fake_function = op.name in ["resize_", 'resize_as_'] - - if has_fake_function: - variants = {'method': getattr(torch.Tensor, op.name)} - samples = op.sample_inputs(device, dtype, requires_grad=False) - - support_script = op.supports_scripting - - tested = False for sample in samples: - # Test traced and scripted consistency - for func_type, variant in variants.items(): - if variant is None: - continue - - # scripting and check_alias_analysis do not work with lambdas - # lambdas are typically used as a way to simulate methods without - # functional variants, so rely on the other variant for testing - # for now - if is_lambda(variant): - continue - - tested = True - - # Create accessor for script function variant - name = op.name + '_' if func_type == 'inplace' else op.name - - # run with disable_autodiff_subgraph_inlining(True) to test - # autodiff support. Context manager forces the graph to contain - # DifferentiableGraph nodes if they are present - with disable_autodiff_subgraph_inlining(): - # Check scripted forward, grad, and grad grad - if support_script: - script_fn = create_script_fn(self, name, func_type) - - def out_fn(output): - # Processes the output for autograd - if sample.output_process_fn_grad is not None: - return sample.output_process_fn_grad(output) - return output - - def get_sample(): - return clone_input_helper(sample.input) if op.name[-1] == '_' else sample.input - - if support_script: - check_against_reference(self, - script_fn, - func, - out_fn, - (get_sample(),) + sample.args, - sample.kwargs, - no_grad=not _requires_grad, no_gradgrad=not op.supports_gradgrad) - - # Check traced forward, grad, and grad grad - # TODO: fix tracing here - supports_tracing = not has_fake_function - if op.assert_jit_shape_analysis: - self.assertTrue(supports_tracing) - - if supports_tracing: - traced_fn = create_traced_fn(self, variant) - check_against_reference(self, - traced_fn, - func, - out_fn, - (get_sample(),) + sample.args, - sample.kwargs, - no_grad=not _requires_grad, no_gradgrad=not op.supports_gradgrad) - - # Check alias annotation schema for correctness (make - # sure inputs that aren't supposed to be modified aren't) - # Note: only runs in float32 because schema isn't affected by dtype, - # so running it on all dtypes is would be excessive - if dtype == torch.float32: - # TODO: no reason why we cant run this with tracing graph - if support_script and op.name != "rsub": - check_alias_annotation(name, (get_sample(),) + sample.args, sample.kwargs, - func_type=func_type, aten_name=op.aten_name) - - # TODO: use script graph as well - checked_shape_analysis = False - if supports_tracing: - out = variant(get_sample(), *sample.args, **sample.kwargs) - - # right now, tuple of outputs and tensor output supported - # TODO: list 
of tensor outputs - tuple_of_tensors = isinstance(out, tuple) and all([isinstance(elem, torch.Tensor) for elem in out]) - - if isinstance(out, torch.Tensor) or tuple_of_tensors: - if tuple_of_tensors: - sizes = [elem.size() for elem in out] - else: - sizes = out.size() - self.checkShapeAnalysis(sizes, traced_fn.graph, op.assert_jit_shape_analysis) - checked_shape_analysis = True - if op.assert_jit_shape_analysis: - self.assertTrue(checked_shape_analysis) - - # Check autodifferentiation of nodes for traced and scripted graphs, only need to check once per sample - if dtype is torch.float32: - # Sandcastle doesn't fuse nodes - if IS_SANDCASTLE: - # fusible nodes are expected to be found in FusionGroups in the DifferentiableGraphs - nonfusible_nodes = op.autodiff_nonfusible_nodes + op.autodiff_fusible_nodes - fusible_nodes = [] - else: - nonfusible_nodes = op.autodiff_nonfusible_nodes - fusible_nodes = op.autodiff_fusible_nodes - - if supports_tracing: - self.assertAutodiffNode(traced_fn.last_graph, op.assert_autodiffed, nonfusible_nodes, fusible_nodes) - if support_script: - self.assertAutodiffNode(script_fn.last_graph, op.assert_autodiffed, nonfusible_nodes, fusible_nodes) - assert tested, "JIT Test does not execute any logic" - - # alias testing is only done with torch.float for the same reason - _alias_ops = partial(ops, dtypes=OpDTypes.supported, - allowed_dtypes=(torch.float,)) - - @_alias_ops((op for op in op_db if op.aliases)) - def test_jit_alias_remapping(self, device, dtype, op): - # Required to avoid undefined value: tensor error in JIT compilation of the function template - tensor = torch.tensor - - # NOTE: only tests on first sample - samples = op.sample_inputs(device, dtype, requires_grad=True) - sample = first_sample(self, samples) - - # [Scripting Data Preparation] - # Prepare data for test scripting - # Below we prepare strings of args/kwargs with and without type annotations. - # These strings are inserted into function template strings which is then torch scripted. 
- # - args string is ["t0"] corresponding to the "input" tensor required by the op - # - args_kw is the value of args and strings of kwargs used to call the op (without type annotations), for example, - # ["to", "1.0", "(1,)", "True", "tensor(1.0)"] -> def fn(t0): return variant(t0, 1.0, (1,), True, tensor(1.0)) - args = ["t0"] - - def quote_strs(v): - if isinstance(v, str): - return f"'{v}'" - - return str(v) - - args_kw = args + \ - [f"{v}" for v in sample.args] + \ - [f"{k}={quote_strs(v)}" for k, v in sample.kwargs.items()] - - # Prepare data for test tracing - sample_args_kwargs = () - if len(sample.args) > 0: - sample_args_kwargs += (sample.args, ) - if len(sample.kwargs) > 0: - sample_args_kwargs += (sample.kwargs, ) - - original_name = op.aten_name - original_name_inplace = original_name + "_" - expected_dtype = op(sample.input, *sample.args, **sample.kwargs).dtype + args = [sample.input] + list(sample.args) + kwargs = sample.kwargs + composite_compliance.check_backward_formula(op, args, kwargs) - for a_op in op.aliases: - inplace = a_op.inplace_variant - method_or_inplace = [a_op.inplace_variant, a_op.method_variant] - variants = (v for v in (a_op.op, a_op.method_variant, a_op.inplace_variant) if v is not None) - - # Test scripting: - for variant in variants: - variant_name = variant.__name__ - op_name = original_name_inplace if variant is inplace else original_name - - if variant in method_or_inplace: - fn_template = ''' - def _fn(t0{c}): - return t0.{alias_name}({args_kw}) - ''' - # remove the first input tensor - script = fn_template.format( - c=", " if len(args_kw[1:]) > 1 else "", - args_kw=", ".join(args_kw[1:]), - alias_name=variant_name, - ) - else: - fn_template = ''' - def _fn({args}): - return variant({args_kw}) - ''' - script = fn_template.format( - args=", ".join(args), - args_kw=", ".join(args_kw), - ) - scripted = torch.jit.CompilationUnit(script)._fn - - if (variant is inplace and not torch.can_cast(expected_dtype, dtype)): - try: - inp = clone_input_helper(sample.input) - scripted(inp) - except Exception as e: - continue - self.fail("Inplace operation on integer tensor that should be promoted to float didn't fail!") - - inp = clone_input_helper(sample.input) - scripted(inp) - inp = clone_input_helper(sample.input) - graph = scripted.graph_for(inp) - FileCheck().check(op.aten_name).check_not(variant_name).run(graph) - - # Test tracing: - for variant in variants: - variant_name = variant.__name__ - op_name = original_name_inplace if variant is inplace else original_name - - def _fn(*sample_args, **sample_kwargs): - return variant(*sample_args, **sample_kwargs) - - inp = (clone_input_helper(sample.input),) + sample_args_kwargs - traced = torch.jit.trace(_fn, *inp) - inp = (clone_input_helper(sample.input),) + sample_args_kwargs - traced(*inp) - inp = (clone_input_helper(sample.input),) + sample_args_kwargs - graph = traced.graph_for(*inp) - FileCheck().check(op_name).check_not(variant_name).run(graph) class TestMathBits(TestCase): # Tests that @@ -1313,8 +884,7 @@ def is_bit_set(x): instantiate_device_type_tests(TestCommon, globals()) -instantiate_device_type_tests(TestGradients, globals()) -instantiate_device_type_tests(TestJit, globals()) +instantiate_device_type_tests(TestCompositeCompliance, globals()) instantiate_device_type_tests(TestMathBits, globals()) if __name__ == '__main__': diff --git a/test/test_ops_gradients.py b/test/test_ops_gradients.py new file mode 100644 index 00000000000000..6d8037fb44d0fd --- /dev/null +++ b/test/test_ops_gradients.py @@ -0,0 
+1,228 @@ +# Owner(s): ["module: unknown"] + +from functools import partial, wraps +import torch + +from torch.testing._internal.common_utils import \ + (TestCase, is_iterable_of_tensors, run_tests, gradcheck, gradgradcheck, first_sample) +from torch.testing._internal.common_methods_invocations import op_db +from torch.testing._internal.common_device_type import \ + (instantiate_device_type_tests, ops, OpDTypes) + +# TODO: fixme https://github.com/pytorch/pytorch/issues/68972 +torch.set_default_dtype(torch.float32) + +# gradcheck requires double precision +_gradcheck_ops = partial(ops, dtypes=OpDTypes.supported, + allowed_dtypes=[torch.double, torch.cdouble]) + +class TestGradients(TestCase): + exact_dtype = True + + # Copies inputs to inplace operations to avoid inplace modifications + # to leaves requiring gradient + def _get_safe_inplace(self, inplace_variant): + @wraps(inplace_variant) + def _fn(t, *args, **kwargs): + return inplace_variant(t.clone(), *args, **kwargs) + + return _fn + + def _check_helper(self, device, dtype, op, variant, check, *, check_forward_ad=False, check_backward_ad=True, + check_batched_grad=None, check_batched_forward_grad=False): + assert check in ('gradcheck', 'bwgrad_bwgrad', 'fwgrad_bwgrad') + # NB: check_backward_ad does not affect gradgradcheck (always True) + if variant is None: + self.skipTest("Skipped! Variant not implemented.") + if not op.supports_dtype(dtype, torch.device(device).type): + self.skipTest(f"Skipped! {op.name} does not support dtype {str(dtype)}") + + def is_inplace(variant): + if hasattr(variant, "__wrapped__"): + return variant.__wrapped__ is op.get_inplace() + return variant is op.get_inplace() + + include_conjugated_inputs = op.test_conjugated_samples and dtype.is_complex + samples = op.sample_inputs(device, dtype, requires_grad=True, include_conjugated_inputs=include_conjugated_inputs) + + for sample in samples: + if sample.broadcasts_input and is_inplace(variant): + continue + + # Note on TensorList inputs + # + # gradcheck does not support TensorList inputs so here we pass TensorList + # inputs of size n as n single Tensor inputs to gradcheck and wrap the op + # in a function that puts the n Tensor inputs back into a TensorList + def fn(*inputs): + # Put tensors back into TensorList since we splat them when passing to gradcheck + if is_iterable_of_tensors(sample.input): + n = len(sample.input) + inputs = (inputs[:n], *inputs[n:]) + output = op.gradcheck_wrapper(variant, *inputs, **sample.kwargs) + if sample.output_process_fn_grad is not None: + return sample.output_process_fn_grad(output) + return output + + # Splat TensorList inputs into single Tensor inputs + gradcheck_args = (sample.input,) if isinstance(sample.input, torch.Tensor) else tuple(sample.input) + gradcheck_args += sample.args + + if check == 'gradcheck': + if check_batched_grad is None: + check_batched_grad = op.check_batched_grad + self.assertTrue(gradcheck(fn, gradcheck_args, + check_batched_grad=check_batched_grad, + check_grad_dtypes=True, + nondet_tol=op.gradcheck_nondet_tol, + fast_mode=op.gradcheck_fast_mode, + check_forward_ad=check_forward_ad, + check_backward_ad=check_backward_ad, + check_undefined_grad=True, + check_batched_forward_grad=check_batched_forward_grad)) + elif check in ('bwgrad_bwgrad', 'fwgrad_bwgrad'): # gradgrad check + self.assertFalse(check_forward_ad, msg="Cannot run forward AD check for gradgradcheck") + for gen_non_contig_grad_outputs in (False, True): + kwargs = { + "gen_non_contig_grad_outputs": gen_non_contig_grad_outputs, + 
"check_batched_grad": op.check_batched_gradgrad, + "check_grad_dtypes": True, + "nondet_tol": op.gradcheck_nondet_tol, + "fast_mode": op.gradcheck_fast_mode + } + if check == "fwgrad_bwgrad": + kwargs["check_fwd_over_rev"] = True + kwargs["check_rev_over_rev"] = False + kwargs["check_batched_grad"] = False + kwargs["check_undefined_grad"] = False + + self.assertTrue(gradgradcheck(fn, gradcheck_args, **kwargs)) + else: + self.assertTrue(False, msg="Unknown check requested!") + + def _grad_test_helper(self, device, dtype, op, variant, *, check_forward_ad=False, check_backward_ad=True, + check_batched_grad=None, check_batched_forward_grad=False): + return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad, + check_backward_ad=check_backward_ad, check_batched_grad=check_batched_grad, + check_batched_forward_grad=check_batched_forward_grad) + + def _skip_helper(self, op, device, dtype): + if not op.supports_autograd and not op.supports_forward_ad: + self.skipTest("Skipped! autograd not supported.") + if not op.supports_complex_autograd(torch.device(device).type) and dtype.is_complex: + self.skipTest("Skipped! Complex autograd not supported.") + + # Tests that gradients are computed correctly + @_gradcheck_ops(op_db) + def test_fn_grad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + self._grad_test_helper(device, dtype, op, op.get_op()) + + # Method grad (and gradgrad, see below) tests are disabled since they're + # costly and redundant with function grad (and gradgad) tests + # @_gradcheck_ops(op_db) + # def test_method_grad(self, device, dtype, op): + # self._skip_helper(op, device, dtype) + # self._grad_test_helper(device, dtype, op, op.get_method()) + + @_gradcheck_ops(op_db) + def test_inplace_grad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + if not op.inplace_variant or not op.supports_inplace_autograd: + self.skipTest("Skipped! Operation does not support inplace autograd.") + self._grad_test_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace())) + + # Test that gradients of gradients are computed correctly + @_gradcheck_ops(op_db) + def test_fn_gradgrad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + if not op.supports_gradgrad: + self.skipTest("Skipped! Operation does not support gradgrad") + self._check_helper(device, dtype, op, op.get_op(), 'bwgrad_bwgrad') + + # Test that forward-over-reverse gradgrad is computed correctly + @_gradcheck_ops(op_db) + def test_fn_fwgrad_bwgrad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + + if op.supports_fwgrad_bwgrad: + self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad") + else: + err_msg = r"Trying to use forward AD with .* that does not support it" + hint_msg = ("Running forward-over-backward gradgrad for an OP that has does not support it did not " + "raise any error. If your op supports forward AD, you should set supports_fwgrad_bwgrad=True.") + with self.assertRaisesRegex(NotImplementedError, err_msg, msg=hint_msg): + self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad") + + # Test that gradients of gradients are properly raising + @_gradcheck_ops(op_db) + def test_fn_fail_gradgrad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + if op.supports_gradgrad: + self.skipTest("Skipped! 
Operation does support gradgrad") + + err_msg = r"derivative for .* is not implemented" + with self.assertRaisesRegex(RuntimeError, err_msg): + self._check_helper(device, dtype, op, op.get_op(), 'bwgrad_bwgrad') + + # Method gradgrad (and grad, see above) tests are disabled since they're + # costly and redundant with function gradgrad (and grad) tests + # @_gradcheck_ops(op_db) + # def test_method_gradgrad(self, device, dtype, op): + # self._skip_helper(op, device, dtype) + # self._gradgrad_test_helper(device, dtype, op, op.get_method()) + + @_gradcheck_ops(op_db) + def test_inplace_gradgrad(self, device, dtype, op): + self._skip_helper(op, device, dtype) + if not op.inplace_variant or not op.supports_inplace_autograd: + self.skipTest("Skipped! Operation does not support inplace autograd.") + self._check_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace()), "bwgrad_bwgrad") + + def _forward_grad_helper(self, device, dtype, op, variant, is_inplace): + # TODO: clean up how attributes are passed to gradcheck from OpInfos + def call_grad_test_helper(): + check_batched_forward_grad = ((op.check_batched_forward_grad and not is_inplace) or + (op.check_inplace_batched_forward_grad and is_inplace)) + self._grad_test_helper(device, dtype, op, variant, check_forward_ad=True, check_backward_ad=False, + check_batched_grad=False, check_batched_forward_grad=check_batched_forward_grad) + if op.supports_forward_ad: + call_grad_test_helper() + else: + err_msg = r"Trying to use forward AD with .* that does not support it" + hint_msg = ("Running forward AD for an OP that has does not support it did not " + "raise any error. If your op supports forward AD, you should set supports_forward_ad=True") + with self.assertRaisesRegex(NotImplementedError, err_msg, msg=hint_msg): + call_grad_test_helper() + + @_gradcheck_ops(op_db) + def test_forward_mode_AD(self, device, dtype, op): + self._skip_helper(op, device, dtype) + + self._forward_grad_helper(device, dtype, op, op.get_op(), is_inplace=False) + + @_gradcheck_ops(op_db) + def test_inplace_forward_mode_AD(self, device, dtype, op): + self._skip_helper(op, device, dtype) + + if not op.inplace_variant or not op.supports_inplace_autograd: + self.skipTest("Skipped! 
Operation does not support inplace autograd.") + + self._forward_grad_helper(device, dtype, op, self._get_safe_inplace(op.get_inplace()), is_inplace=True) + + # Functions that do not support autograd should not fail in forward mode + # Inplace functions (such as "resize_") are expected to fail in forward mode and should be skipped + # Test only when supports_autograd=False and for double dtype + @ops(filter(lambda op: not op.supports_autograd, op_db), dtypes=OpDTypes.supported, allowed_dtypes=(torch.double,)) + def test_nondifferentiable(self, device, dtype, op): + # Expecting no errors + samples = op.sample_inputs(device, dtype, requires_grad=True) + sample = first_sample(self, samples) + result = op(sample.input, *sample.args, **sample.kwargs) + + + +instantiate_device_type_tests(TestGradients, globals()) + +if __name__ == '__main__': + run_tests() diff --git a/test/test_ops_jit.py b/test/test_ops_jit.py new file mode 100644 index 00000000000000..f74587955cf3c7 --- /dev/null +++ b/test/test_ops_jit.py @@ -0,0 +1,280 @@ +# Owner(s): ["module: unknown"] + +from functools import partial + +import torch + +from torch.testing import FileCheck +from torch.testing._internal.common_utils import \ + (run_tests, IS_SANDCASTLE, clone_input_helper, first_sample) +from torch.testing._internal.common_methods_invocations import op_db +from torch.testing._internal.common_device_type import instantiate_device_type_tests, ops, OpDTypes +from torch.testing._internal.common_jit import JitCommonTestCase, check_against_reference +from torch.testing._internal.jit_metaprogramming_utils import create_script_fn, create_traced_fn, check_alias_annotation +from torch.testing._internal.jit_utils import disable_autodiff_subgraph_inlining, is_lambda + + +# TODO: fixme https://github.com/pytorch/pytorch/issues/68972 +torch.set_default_dtype(torch.float32) + +# variant testing is only done with torch.float and torch.cfloat to avoid +# excessive test times and maximize signal to noise ratio +_variant_ops = partial(ops, dtypes=OpDTypes.supported, + allowed_dtypes=(torch.float, torch.cfloat)) + + + +# Tests operators for consistency between JIT and eager, also checks +# correctness of JIT specific alias schemas and intended +# autodifferentiation behavior. +# Inherits from JitCommonTestCase instead of TestCase directly to share +# functionality with original test_jit.py method operator tests +class TestJit(JitCommonTestCase): + exact_dtype = True + + # Tests that the forward and backward passes of operations produce the + # same values for the cross-product of op variants (function, method, inplace) + # and runtimes (eager, traced, scripted). + # TODO WARNING: inplace x {traced, scripted} not currently tested + @_variant_ops(op_db) + def test_variant_consistency_jit(self, device, dtype, op): + _requires_grad = op.supports_autograd and (dtype.is_floating_point or + op.supports_complex_autograd(torch.device(device).type)) + + include_conjugated_inputs = op.test_conjugated_samples and dtype.is_complex + samples = op.sample_inputs(device, dtype, requires_grad=_requires_grad, include_conjugated_inputs=include_conjugated_inputs) + + # Acquires variants to test + func = op.get_op() + method = op.get_method() + variants = { + # TODO: inplace tests currently fail, fix and add inplace variant + 'function': func, 'method': method, + } + + # TODO: find better way to standardize on op registration itself.. 
+ has_fake_function = op.name in ["resize_", 'resize_as_'] + + if has_fake_function: + variants = {'method': getattr(torch.Tensor, op.name)} + samples = op.sample_inputs(device, dtype, requires_grad=False) + + support_script = op.supports_scripting + + tested = False + for sample in samples: + # Test traced and scripted consistency + for func_type, variant in variants.items(): + if variant is None: + continue + + # scripting and check_alias_analysis do not work with lambdas + # lambdas are typically used as a way to simulate methods without + # functional variants, so rely on the other variant for testing + # for now + if is_lambda(variant): + continue + + tested = True + + # Create accessor for script function variant + name = op.name + '_' if func_type == 'inplace' else op.name + + # run with disable_autodiff_subgraph_inlining(True) to test + # autodiff support. Context manager forces the graph to contain + # DifferentiableGraph nodes if they are present + with disable_autodiff_subgraph_inlining(): + # Check scripted forward, grad, and grad grad + if support_script: + script_fn = create_script_fn(self, name, func_type) + + def out_fn(output): + # Processes the output for autograd + if sample.output_process_fn_grad is not None: + return sample.output_process_fn_grad(output) + return output + + def get_sample(): + return clone_input_helper(sample.input) if op.name[-1] == '_' else sample.input + + if support_script: + check_against_reference(self, + script_fn, + func, + out_fn, + (get_sample(),) + sample.args, + sample.kwargs, + no_grad=not _requires_grad, no_gradgrad=not op.supports_gradgrad) + + # Check traced forward, grad, and grad grad + # TODO: fix tracing here + supports_tracing = not has_fake_function + if op.assert_jit_shape_analysis: + self.assertTrue(supports_tracing) + + if supports_tracing: + traced_fn = create_traced_fn(self, variant) + check_against_reference(self, + traced_fn, + func, + out_fn, + (get_sample(),) + sample.args, + sample.kwargs, + no_grad=not _requires_grad, no_gradgrad=not op.supports_gradgrad) + + # Check alias annotation schema for correctness (make + # sure inputs that aren't supposed to be modified aren't) + # Note: only runs in float32 because schema isn't affected by dtype, + # so running it on all dtypes is would be excessive + if dtype == torch.float32: + # TODO: no reason why we cant run this with tracing graph + if support_script and op.name != "rsub": + check_alias_annotation(name, (get_sample(),) + sample.args, sample.kwargs, + func_type=func_type, aten_name=op.aten_name) + + # TODO: use script graph as well + checked_shape_analysis = False + if supports_tracing: + out = variant(get_sample(), *sample.args, **sample.kwargs) + + # right now, tuple of outputs and tensor output supported + # TODO: list of tensor outputs + tuple_of_tensors = isinstance(out, tuple) and all([isinstance(elem, torch.Tensor) for elem in out]) + + if isinstance(out, torch.Tensor) or tuple_of_tensors: + if tuple_of_tensors: + sizes = [elem.size() for elem in out] + else: + sizes = out.size() + self.checkShapeAnalysis(sizes, traced_fn.graph, op.assert_jit_shape_analysis) + checked_shape_analysis = True + if op.assert_jit_shape_analysis: + self.assertTrue(checked_shape_analysis) + + # Check autodifferentiation of nodes for traced and scripted graphs, only need to check once per sample + if dtype is torch.float32: + # Sandcastle doesn't fuse nodes + if IS_SANDCASTLE: + # fusible nodes are expected to be found in FusionGroups in the DifferentiableGraphs + nonfusible_nodes = 
op.autodiff_nonfusible_nodes + op.autodiff_fusible_nodes + fusible_nodes = [] + else: + nonfusible_nodes = op.autodiff_nonfusible_nodes + fusible_nodes = op.autodiff_fusible_nodes + + if supports_tracing: + self.assertAutodiffNode(traced_fn.last_graph, op.assert_autodiffed, nonfusible_nodes, fusible_nodes) + if support_script: + self.assertAutodiffNode(script_fn.last_graph, op.assert_autodiffed, nonfusible_nodes, fusible_nodes) + assert tested, "JIT Test does not execute any logic" + + # alias testing is only done with torch.float for the same reason + _alias_ops = partial(ops, dtypes=OpDTypes.supported, + allowed_dtypes=(torch.float,)) + + @_alias_ops((op for op in op_db if op.aliases)) + def test_jit_alias_remapping(self, device, dtype, op): + # Required to avoid undefined value: tensor error in JIT compilation of the function template + tensor = torch.tensor + + # NOTE: only tests on first sample + samples = op.sample_inputs(device, dtype, requires_grad=True) + sample = first_sample(self, samples) + + # [Scripting Data Preparation] + # Prepare data for test scripting + # Below we prepare strings of args/kwargs with and without type annotations. + # These strings are inserted into function template strings which is then torch scripted. + # - args string is ["t0"] corresponding to the "input" tensor required by the op + # - args_kw is the value of args and strings of kwargs used to call the op (without type annotations), for example, + # ["to", "1.0", "(1,)", "True", "tensor(1.0)"] -> def fn(t0): return variant(t0, 1.0, (1,), True, tensor(1.0)) + args = ["t0"] + + def quote_strs(v): + if isinstance(v, str): + return f"'{v}'" + + return str(v) + + args_kw = args + \ + [f"{v}" for v in sample.args] + \ + [f"{k}={quote_strs(v)}" for k, v in sample.kwargs.items()] + + # Prepare data for test tracing + sample_args_kwargs = () + if len(sample.args) > 0: + sample_args_kwargs += (sample.args, ) + if len(sample.kwargs) > 0: + sample_args_kwargs += (sample.kwargs, ) + + original_name = op.aten_name + original_name_inplace = original_name + "_" + expected_dtype = op(sample.input, *sample.args, **sample.kwargs).dtype + + for a_op in op.aliases: + inplace = a_op.inplace_variant + method_or_inplace = [a_op.inplace_variant, a_op.method_variant] + variants = (v for v in (a_op.op, a_op.method_variant, a_op.inplace_variant) if v is not None) + + # Test scripting: + for variant in variants: + variant_name = variant.__name__ + op_name = original_name_inplace if variant is inplace else original_name + + if variant in method_or_inplace: + fn_template = ''' + def _fn(t0{c}): + return t0.{alias_name}({args_kw}) + ''' + # remove the first input tensor + script = fn_template.format( + c=", " if len(args_kw[1:]) > 1 else "", + args_kw=", ".join(args_kw[1:]), + alias_name=variant_name, + ) + else: + fn_template = ''' + def _fn({args}): + return variant({args_kw}) + ''' + script = fn_template.format( + args=", ".join(args), + args_kw=", ".join(args_kw), + ) + scripted = torch.jit.CompilationUnit(script)._fn + + if (variant is inplace and not torch.can_cast(expected_dtype, dtype)): + try: + inp = clone_input_helper(sample.input) + scripted(inp) + except Exception as e: + continue + self.fail("Inplace operation on integer tensor that should be promoted to float didn't fail!") + + inp = clone_input_helper(sample.input) + scripted(inp) + inp = clone_input_helper(sample.input) + graph = scripted.graph_for(inp) + FileCheck().check(op.aten_name).check_not(variant_name).run(graph) + + # Test tracing: + for variant in 
variants: + variant_name = variant.__name__ + op_name = original_name_inplace if variant is inplace else original_name + + def _fn(*sample_args, **sample_kwargs): + return variant(*sample_args, **sample_kwargs) + + inp = (clone_input_helper(sample.input),) + sample_args_kwargs + traced = torch.jit.trace(_fn, *inp) + inp = (clone_input_helper(sample.input),) + sample_args_kwargs + traced(*inp) + inp = (clone_input_helper(sample.input),) + sample_args_kwargs + graph = traced.graph_for(*inp) + FileCheck().check(op_name).check_not(variant_name).run(graph) + + +instantiate_device_type_tests(TestJit, globals()) + +if __name__ == '__main__': + run_tests() diff --git a/test/test_optim.py b/test/test_optim.py index c59d6a49bb4918..7ec98caeebe484 100644 --- a/test/test_optim.py +++ b/test/test_optim.py @@ -20,7 +20,7 @@ _LRScheduler, CyclicLR, CosineAnnealingWarmRestarts, OneCycleLR, ChainedScheduler, \ EPOCH_DEPRECATION_WARNING from torch.optim.swa_utils import AveragedModel, SWALR, update_bn -from torch.testing._internal.common_utils import TestCase, run_tests, TEST_WITH_UBSAN, load_tests, \ +from torch.testing._internal.common_utils import TestCase, run_tests, TEST_WITH_ROCM, TEST_WITH_UBSAN, load_tests, \ skipIfRocm # load_tests from common_utils is used to automatically filter tests for # sharding on sandcastle. This line silences flake warnings @@ -228,6 +228,12 @@ def fn_base(optimizer, weight, bias): # Make sure state dict wasn't modified self.assertEqual(state_dict, state_dict_c) + # Make sure that device of state['step'] is still CPU + new_state_dict = optimizer_cuda.state_dict() + if 'step' in state_dict['state'][0] and torch.is_tensor(state_dict['state'][0]['step']): + for state in new_state_dict['state'].values(): + self.assertEqual(state['step'].device.type, 'cpu') + for _i in range(20): optimizer.step(fn) optimizer_cuda.step(fn_cuda) @@ -620,20 +626,24 @@ def test_adadelta(self): self.rel_tol = 4e-3 for optimizer in [optim.Adadelta, optim_mt.Adadelta]: self._test_basic_cases( - lambda weight, bias: optimizer([weight, bias]) + lambda weight, bias, maximize: optimizer([weight, bias], maximize=maximize), + constructor_accepts_maximize=True ) self._test_basic_cases( - lambda weight, bias: optimizer( - self._build_params_dict(weight, bias, rho=0.95)) + lambda weight, bias, maximize: optimizer( + self._build_params_dict(weight, bias, rho=0.95), maximize=maximize), + constructor_accepts_maximize=True ) self._test_basic_cases( - lambda weight, bias: optimizer( - self._build_params_dict(weight, bias, rho=0.95)), + lambda weight, bias, maximize: optimizer( + self._build_params_dict(weight, bias, rho=0.95), maximize=maximize), [lambda opt: StepLR(opt, gamma=0.9, step_size=10), - lambda opt: ReduceLROnPlateau(opt)] + lambda opt: ReduceLROnPlateau(opt)], + constructor_accepts_maximize=True ) self._test_basic_cases( - lambda weight, bias: optimizer([weight, bias], weight_decay=1) + lambda weight, bias, maximize: optimizer([weight, bias], weight_decay=1, maximize=maximize), + constructor_accepts_maximize=True ) with self.assertRaisesRegex(ValueError, "Invalid rho value: 1.1"): optimizer(None, lr=1e-2, rho=1.1) @@ -653,6 +663,8 @@ def test_adadelta_complex(self): ) def test_nadam(self): + if TEST_WITH_ROCM: + self.rel_tol = 1e-5 for optimizer in [optim.NAdam, optim_mt.NAdam]: self._test_basic_cases( lambda weight, bias: optimizer([weight, bias], lr=1e-3) diff --git a/test/test_overrides.py b/test/test_overrides.py index e3a7e2b13eed70..34eac8081db0af 100644 --- a/test/test_overrides.py +++ 
b/test/test_overrides.py @@ -1,4 +1,4 @@ -# Owner(s): ["high priority"] +# Owner(s): ["module: __torch_function__"] import torch import numpy as np @@ -7,6 +7,7 @@ import pprint import pickle import collections +import unittest from torch.testing._internal.common_utils import TestCase, run_tests from torch.overrides import ( @@ -14,8 +15,10 @@ has_torch_function, get_overridable_functions, get_testing_overrides, - is_tensor_method_or_property + is_tensor_method_or_property, + TorchFunctionMode ) +from functools import partial Tensor = torch.Tensor @@ -28,7 +31,7 @@ def foo(a, b, c=None): """A function multiple arguments and an optional argument""" - if any(type(t) is not Tensor for t in (a, b, c)) and has_torch_function((a, b, c)): + if has_torch_function((a, b, c)): return handle_torch_function(foo, (a, b, c), a, b, c=c) if c: return a + b + c @@ -36,19 +39,19 @@ def foo(a, b, c=None): def bar(a): """A function with one argument""" - if type(a) is not Tensor and has_torch_function((a,)): + if has_torch_function((a,)): return handle_torch_function(bar, (a,), a) return a def baz(a, b): """A function with multiple arguments""" - if type(a) is not Tensor or type(b) is not Tensor and has_torch_function((a, b)): + if has_torch_function((a, b)): return handle_torch_function(baz, (a, b), a, b) return a + b def quux(a): """Used to test that errors raised in user implementations get propagated""" - if type(a) is not Tensor and has_torch_function((a,)): + if has_torch_function((a,)): return handle_torch_function(quux, (a,), a) return a @@ -621,6 +624,9 @@ def instance_gen(): func_args.append(torch.float32) elif t == 'c10::string_view': func_args.append('') + elif t == 'SymInt': + # TODO: generate actual SymbolicInt + func_args.append(1) else: raise RuntimeError(f"Unsupported argument type {t} for {arg['name']} of function {func}") else: @@ -690,7 +696,10 @@ def test(self): test_method.__name__ = name setattr(cls, name, test_method) -# generate_tensor_like_override_tests(TestTorchFunctionOverride) +generate_tensor_like_override_tests(TestTorchFunctionOverride) +TestTorchFunctionOverride.test_torch_functional_histogramdd = unittest.skip( + "histogramdd is missing __torch_function__ support")( + TestTorchFunctionOverride.test_torch_functional_histogramdd) class Wrapper: "Basic data container that knows how to unwrap itself" @@ -1056,14 +1065,151 @@ def __torch_function__(self, *args, **kwargs): pass a = Bad1() - with self.assertWarnsRegex(DeprecationWarning, "as a plain method is deprecated"): - # This needs to be a function that handle torch_function on the python side - torch.split(a, (2)) - - a = Bad2() - with self.assertWarnsRegex(DeprecationWarning, "as a plain method is deprecated"): - # This needs to be a function that handle torch_function on the python side - torch.split(a, (2)) + for a in (Bad1(), Bad2()): + with self.assertWarnsRegex(DeprecationWarning, "as a plain method is deprecated"): + # Function that handles torch_function on the python side + torch.nn.functional.dropout(a) + + with self.assertWarnsRegex(UserWarning, "as a plain method is deprecated"): + # Function that handles torch_function in C++ + torch.abs(a) + +class TestTorchFunctionMode(TestCase): + def test_basic(self): + class A(TorchFunctionMode): + def __torch_function__(self, *args, **kwargs): + return -1 + # NB: factory functions get overridden too! 
+ x = torch.randn(1) + with torch.overrides.push_torch_function_mode(A): + self.assertEqual(torch.randn(3), -1) + self.assertEqual(torch.add(x, x), -1) + self.assertEqual(torch.split(None, [2]), -1) # python side + self.assertEqual(bar(x), -1) + + def test_enable_torch_function_mode_with_tensor_subclass(self): + x = torch.randn(1) + with torch.overrides.enable_torch_function_mode(SubTensor): + self.assertEqual(torch.mm(x, x), -1) + + def test_modes_handle_first(self): + class A(TorchFunctionMode): + def __torch_function__(self, *args, **kwargs): + return -40 + + x = SubTensor() + with torch.overrides.push_torch_function_mode(A): + self.assertEqual(torch.neg(x), -40) + self.assertEqual(torch.mean(x), -40) + self.assertEqual(torch.mm(x, x), -40) + self.assertEqual(bar(x), -40) + + def test_modes_return_notimplemented(self): + class MyMode(TorchFunctionMode): + def __torch_function__(self, *args, **kwargs): + return NotImplemented + + x = SubTensor() + with torch.overrides.push_torch_function_mode(MyMode): + self.assertEqual(torch.mean(x), 0) + self.assertEqual(torch.mm(x, x), -1) + self.assertEqual(bar(x), 1) + self.assertRaisesRegex( + TypeError, r'SubTensor.+MyMode', + lambda: self.assertEqual(torch.max(x, x))) + + def test_mode_stack(self): + logs = [] + + class Logger(TorchFunctionMode): + def __init__(self, name): + self.name = name + + def __torch_function__(self, func, types, args=(), kwargs=None): + if kwargs is None: + kwargs = {} + logs.append(self.name) + return func(*args, **kwargs) + + x = torch.randn(1) + with torch.overrides.push_torch_function_mode(partial(Logger, "A")): + with torch.overrides.push_torch_function_mode(partial(Logger, "B")): + torch.mean(x) + + self.assertEqual(logs, ["B", "A"]) + + def test_push_mode_instance_errors(self): + class A(TorchFunctionMode): + pass + with self.assertRaisesRegex(ValueError, 'instance of TorchFunctionMode'): + with torch.overrides.push_torch_function_mode(A(inner=None)): + pass + + def test_push_mode_returns_unrelated(self): + with self.assertRaisesRegex(ValueError, 'return a TorchFunctionMode'): + with torch.overrides.push_torch_function_mode(lambda *, inner: None): + pass + + def test_missing_inner_mode_ctor(self): + self.assertRaisesRegex(TypeError, 'push_torch_function_mode', lambda: TorchFunctionMode()) + + def test_enable_torch_function_mode_trivial(self): + class A(TorchFunctionMode): + def __torch_function__(self, *args, **kwargs): + return -40 + a = A(inner=None) + with torch.overrides.enable_torch_function_mode(a): + with torch.overrides.enable_torch_function_mode(a): + self.assertEqual(bar(None), -40) + + def test_enable_torch_function_mode_replace(self): + class A(TorchFunctionMode): + def __init__(self, val): + self.val = val + + def __torch_function__(self, *args, **kwargs): + return self.val + a1 = A(-40, inner=None) + a2 = A(-41, inner=None) + with torch.overrides.enable_torch_function_mode(a1): + with torch.overrides.enable_torch_function_mode(a2, replace=a1): + self.assertEqual(bar(None), -41) + + def test_enable_torch_function_mode_ignore_preexisting(self): + class A(TorchFunctionMode): + def __init__(self, val): + self.val = val + + def __torch_function__(self, *args, **kwargs): + return self.val + a1 = A(-40, inner=None) + a2 = A(-41, inner=None) + with torch.overrides.enable_torch_function_mode(a1): + with torch.overrides.enable_torch_function_mode(a2, ignore_preexisting=True): + self.assertEqual(bar(None), -41) + + def test_reentrant_mode_idiom(self): + log = [] + + class A(TorchFunctionMode): + def 
__torch_function__(self, func, types, args=(), kwargs=None): + if kwargs is None: + kwargs = {} + log.append(func) + if func is torch.sub: + with torch.overrides.enable_torch_function_mode(self, replace=self.inner): + input, other = args + assert not kwargs + return torch.add(input, other, alpha=-1) + return func(*args, **kwargs) + + x = torch.randn(1) + y = torch.randn(1) + with torch.overrides.push_torch_function_mode(A): + torch.sub(x, y) + # add hits the torch function again! + self.assertEqual(log, [torch.sub, torch.add]) + if __name__ == '__main__': run_tests() diff --git a/test/test_per_overload_api.py b/test/test_per_overload_api.py index cb949180320d4e..cdb2b79835121a 100644 --- a/test/test_per_overload_api.py +++ b/test/test_per_overload_api.py @@ -10,8 +10,8 @@ def test_basics_opoverloadpacket(self): add_packet = torch.ops.aten.add # class attributes - self.assertEqual(add_packet.op_name, 'add') - self.assertEqual(add_packet.qualified_op_name, 'aten.add') + self.assertEqual(add_packet.__name__, 'add') + self.assertEqual(str(add_packet), 'aten.add') # callable self.assertEqual(add_packet(torch.tensor(2), torch.tensor(3)), torch.tensor(5)) @@ -27,7 +27,7 @@ def test_basics_opoverloadpacket(self): self.assertEqual(id(add_packet), id(copy.deepcopy(add_packet))) # pretty print - self.assertEqual(str(add_packet), "OpOverloadPacket(op='aten.add')") + self.assertEqual(repr(add_packet), "") self.assertRaises(AttributeError, lambda: add_packet.foo) @@ -36,9 +36,9 @@ def test_basics_opoverload(self): add_tensoroverload = add_packet.Tensor # class attributes - self.assertEqual(add_tensoroverload.name, 'aten.add') - self.assertEqual(add_tensoroverload.overload_name, 'Tensor') - self.assertEqual(add_tensoroverload.overload_packet, add_packet) + self.assertEqual(str(add_tensoroverload), 'aten.add.Tensor') + self.assertEqual(add_tensoroverload.__name__, 'add.Tensor') + self.assertEqual(add_tensoroverload.overloadpacket, add_packet) # deepcopy is a no-op self.assertEqual(id(add_tensoroverload), id(copy.deepcopy(add_tensoroverload))) @@ -48,7 +48,7 @@ def test_basics_opoverload(self): self.assertEqual(id(add_tensoroverload), id(another_add_tensoroverload)) # pretty print - self.assertEqual(str(add_tensoroverload), "OpOverload(op='aten.add', overload='Tensor')") + self.assertEqual(repr(add_tensoroverload), "") # callable self.assertEqual(add_tensoroverload(torch.tensor(2), torch.tensor(3)), torch.tensor(5)) diff --git a/test/test_profiler.py b/test/test_profiler.py index 30a9452735cd35..cb2e4a0e5d3157 100644 --- a/test/test_profiler.py +++ b/test/test_profiler.py @@ -64,6 +64,31 @@ def test_mem_leak(self): self.assertTrue(not (is_increasing and max_diff > 100 * 1024), msg='memory usage is increasing, {}'.format(str(last_rss))) + def test_custom_module_input_op_ids(self): + class MyFunc(torch.autograd.Function): + @staticmethod + def forward(ctx, x): + ctx.save_for_backward(x) + return x + + @staticmethod + def backward(ctx, gO): + x, = ctx.saved_tensors + return x + + def custom_layer(input_ten): + return MyFunc.apply(input_ten) + + # Only testing that emit_nvtx runs when + # record_shapes option is enabled. 
+ with torch.autograd.profiler.emit_nvtx(record_shapes=True) as prof: + x = torch.randn(10, 10, requires_grad=True) + y = torch.randn(10, 10, requires_grad=True) + z = x + y + s = custom_layer(z) + q = s.sum() + q.backward() + class TestRecordFunction(TestCase): def _record_function_with_param(self): u = torch.randn(3, 4, 5, requires_grad=True) diff --git a/test/test_public_bindings.py b/test/test_public_bindings.py index 769e2315974732..260a3ac783cd72 100644 --- a/test/test_public_bindings.py +++ b/test/test_public_bindings.py @@ -138,6 +138,7 @@ def test_no_new_bindings(self): "InterfaceType", "IntStorageBase", "IntType", + "SymIntType", "IODescriptor", "is_anomaly_enabled", "is_autocast_cache_enabled", diff --git a/test/test_python_dispatch.py b/test/test_python_dispatch.py index 555e76965a8b7c..4a743d44f88ec5 100644 --- a/test/test_python_dispatch.py +++ b/test/test_python_dispatch.py @@ -1,4 +1,4 @@ -# Owner(s): ["high priority"] +# Owner(s): ["module: __torch_dispatch__"] import tempfile import torch @@ -31,11 +31,11 @@ def test_basic(self) -> None: # self.assertEqual(saved_x._version, x._version) self.assertExpectedInline('\n'.join(logs), '''\ $0 = input('x') -$1 = torch._ops.aten.mul($0, $0) +$1 = torch._ops.aten.mul.Tensor($0, $0) $2 = input('grad_y') -$3 = torch._ops.aten.mul($2, $0) -$4 = torch._ops.aten.mul($2, $0) -$5 = torch._ops.aten.add($4, $3)''') +$3 = torch._ops.aten.mul.Tensor($2, $0) +$4 = torch._ops.aten.mul.Tensor($2, $0) +$5 = torch._ops.aten.add.Tensor($4, $3)''') def test_out(self) -> None: with capture_logs() as logs: @@ -51,7 +51,7 @@ def test_out(self) -> None: self.assertExpectedInline('\n'.join(logs), '''\ $0 = input('x') $1 = input('y') -$2 = torch._ops.aten.abs($0, out=$1)''') +$2 = torch._ops.aten.abs.out($0, out=$1)''') def test_kwarg_only(self) -> None: @@ -74,11 +74,11 @@ def test_kwarg_only(self) -> None: $0 = input('x') $1 = input('y') $2 = input('z') -$3 = torch._ops.aten.addmv($0, $1, $2) -$4 = torch._ops.aten.addmv($0, $1, $2) -$5 = torch._ops.aten.addmv($0, $1, $2, beta=2) -$6 = torch._ops.aten.addmv($0, $1, $2, alpha=2) -$7 = torch._ops.aten.addmv($0, $1, $2, beta=2, alpha=2)''') +$3 = torch._ops.aten.addmv.default($0, $1, $2) +$4 = torch._ops.aten.addmv.default($0, $1, $2) +$5 = torch._ops.aten.addmv.default($0, $1, $2, beta=2) +$6 = torch._ops.aten.addmv.default($0, $1, $2, alpha=2) +$7 = torch._ops.aten.addmv.default($0, $1, $2, beta=2, alpha=2)''') def test_kwarg_only_and_positional_default(self) -> None: with capture_logs() as logs: @@ -96,10 +96,10 @@ def test_kwarg_only_and_positional_default(self) -> None: self.assertExpectedInline('\n'.join(logs), '''\ $0 = input('x') $1 = input('y') -$2 = torch._ops.aten.kl_div($0, $1) -$3 = torch._ops.aten.kl_div($0, $1, 2) -$4 = torch._ops.aten.kl_div($0, $1, log_target=True) -$5 = torch._ops.aten.kl_div($0, $1, 2, log_target=True)''') +$2 = torch._ops.aten.kl_div.default($0, $1) +$3 = torch._ops.aten.kl_div.default($0, $1, 2) +$4 = torch._ops.aten.kl_div.default($0, $1, log_target=True) +$5 = torch._ops.aten.kl_div.default($0, $1, 2, log_target=True)''') def test_list_ret(self) -> None: # test all sequence types are permissible returns @@ -111,7 +111,7 @@ def __new__(cls, elem): @classmethod def __torch_dispatch__(cls, func, types, args=(), kwargs=None): - if func == torch.ops.aten.split: + if func.overloadpacket == torch.ops.aten.split: with no_dispatch(): return list_type(torch.split(*args)) else: @@ -134,7 +134,7 @@ def __torch_dispatch__(cls, func, types, args=(), kwargs=None): return "arf" # 
Wobbles depending on NDEBUG mode of pybind11 - self.assertRaisesRegexp( + self.assertRaisesRegex( RuntimeError, "Unable to cast", lambda: A(torch.zeros(1)).neg(), ) self.assertRaisesRegexp( @@ -152,8 +152,8 @@ def test_detach_appears_twice_when_called_once(self) -> None: # would be bad if calling .detach() once emits 3+ detaches). self.assertExpectedInline('\n'.join(logs), '''\ $0 = input('x') -$1 = torch._ops.aten.detach($0) -$2 = torch._ops.aten.detach($1)''') +$1 = torch._ops.aten.detach.default($0) +$2 = torch._ops.aten.detach.default($1)''') def test_metadata_change_not_allowed(self) -> None: x = LoggingTensor(torch.ones(1)) @@ -264,11 +264,11 @@ def backward(ctx, grad_output): self.assertExpectedInline('\n'.join(logs), '''\ $0 = input('x') $1 = input('x.grad') -$2 = torch._ops.aten.pow($0, 2) +$2 = torch._ops.aten.pow.Tensor_Scalar($0, 2) $3 = input('grad_output') -$4 = torch._ops.aten.mul($3, tensor(2)) -$5 = torch._ops.aten.mul($4, $0) -$6 = torch._ops.aten.add_($1, $5)''') +$4 = torch._ops.aten.mul.Tensor($3, tensor(2)) +$5 = torch._ops.aten.mul.Tensor($4, $0) +$6 = torch._ops.aten.add_.Tensor($1, $5)''') def test_subclass_creation(self): # Make sure these statements runs without error @@ -376,7 +376,7 @@ def __new__(cls, elem, *args, **kwargs): @classmethod def __torch_dispatch__(cls, func, types, args=(), kwargs=None): - if func.__name__ == "clone": + if func.overloadpacket.__name__ == "clone": # Return a plain tensor from clone(). return args[0].elem.clone() raise RuntimeError("NYI") @@ -444,7 +444,7 @@ def __torch_dispatch__(cls, func, types, args=(), kwargs=None): idxs = (MyTensor(torch.tensor(0)),) v = torch.randn(1) res = x.index_put_(idxs, v) - self.assertEqual(called_funcs, [torch.ops.aten.index_put_]) + self.assertEqual(called_funcs, [torch.ops.aten.index_put_.default]) def test_enable_python_mode_error(self) -> None: with self.assertRaisesRegex(ValueError, "__torch_dispatch__"): @@ -594,7 +594,7 @@ def wrap(e): # It prevents infinite recursion. 
with no_dispatch(): rs = tree_map(wrap, func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))) - if func.__name__ == "add": + if func.overloadpacket.__name__ == "add": return None else: return rs @@ -659,7 +659,26 @@ def __torch_dispatch__(cls, func, types, args=(), kwargs=None): x = torch.randn(2) y = torch.randn(2) self.assertEqual(SubTensor(x) + SubTensor(y), x + y) - self.assertEqual(called, [torch.ops.aten.add]) + self.assertEqual(called, [torch.ops.aten.add.Tensor]) + + def test_dispatch_super_call_list_arg(self): + called = [] + + class SubTensorWithListArg(torch.Tensor): + @staticmethod + def __new__(cls, elem): + return torch.Tensor._make_subclass(cls, elem) + + __torch_function__ = torch._C._disabled_torch_function_impl + + @classmethod + def __torch_dispatch__(cls, func, types, args=(), kwargs=None): + called.append(func) + return super().__torch_dispatch__(func, types, list(args), kwargs) + + x = torch.randn(2) + self.assertEqual(SubTensorWithListArg(x).neg(), x.neg()) + self.assertEqual(called, [torch.ops.aten.neg.default]) def test_dispatch_super_dont_autograd(self): called = [] @@ -685,7 +704,13 @@ def __torch_dispatch__(cls, func, types, args=(), kwargs=None): x = SubTensor(torch.randn(2, requires_grad=True)) x.neg() - self.assertEqual(called, [torch.ops.aten.neg]) + self.assertEqual(called, [torch.ops.aten.neg.default]) + + def test_construct_int_tensor(self): + class SubTensor(torch.Tensor): + pass + # should not fail + SubTensor(torch.zeros(2, dtype=torch.int)) def test_multiple_ops_subclass(self): # This is a Direct Subclass, don't do that! diff --git a/test/test_pytree.py b/test/test_pytree.py index 81631c45c3fdd6..c39f5cb3a0a01d 100644 --- a/test/test_pytree.py +++ b/test/test_pytree.py @@ -1,4 +1,4 @@ -# Owner(s): ["high priority"] +# Owner(s): ["module: pytree"] import torch from torch.testing._internal.common_utils import TestCase, run_tests diff --git a/test/test_reductions.py b/test/test_reductions.py index 0def4b9b25253f..52c0a8a1d25785 100644 --- a/test/test_reductions.py +++ b/test/test_reductions.py @@ -13,8 +13,8 @@ from torch._six import inf, nan from torch.testing import make_tensor from torch.testing._internal.common_dtype import ( - get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes, - integral_types_and, floating_and_complex_types_and + all_types_and_complex_and, get_all_math_dtypes, integral_types, complex_types, floating_types_and, + integral_types_and, floating_and_complex_types_and, all_types_and, ) from torch.testing._internal.common_utils import ( TestCase, run_tests, skipIfNoSciPy, slowTest, torch_to_numpy_dtype_dict, @@ -357,13 +357,13 @@ def _test_ref(self, op: ReductionOpInfo, t: torch.Tensor, **reduction_kwargs): self.assertEqual(result, expected, exact_dtype=False) @ops(filter(lambda op: op.ref is not None, reduction_ops), - allowed_dtypes=get_all_dtypes(include_bfloat16=False)) + allowed_dtypes=all_types_and_complex_and(torch.half, torch.bool)) def test_ref_scalar_input(self, device, dtype, op: ReductionOpInfo): """Compares op against reference for scalar input tensors""" self._test_ref(op, make_tensor([], dtype=dtype, device=device)) @ops(filter(lambda op: op.ref is not None, reduction_ops), - allowed_dtypes=get_all_dtypes(include_bfloat16=False)) + allowed_dtypes=all_types_and_complex_and(torch.half, torch.bool)) def test_ref_small_input(self, device, dtype, op: ReductionOpInfo): """Compares op against reference for small input tensors""" t = make_tensor((5, 3, 4, 2), dtype=dtype, 
device=device, low=-2, high=2, exclude_zero=True) @@ -391,7 +391,7 @@ def test_ref_large_input_64bit_indexing(self, device, dtype, op: ReductionOpInfo self._test_ref(op, make_tensor((275000000,), dtype=dtype, device=device, low=-1, high=1, exclude_zero=True)) @ops(filter(lambda op: op.ref is not None, reduction_ops), - allowed_dtypes=get_all_dtypes(include_bfloat16=False)) + allowed_dtypes=all_types_and_complex_and(torch.half, torch.bool)) def test_ref_duplicate_values(self, device, dtype, op: ReductionOpInfo): """Compares op against reference for input tensors with duplicate values""" t = make_tensor((4, 4), dtype=dtype, device=device, low=-2, high=2, exclude_zero=True) @@ -452,7 +452,7 @@ def test_dim_reduction_less_than_64(self, device): sizes = [1] * 65 x = torch.randn(sizes, device=device) ops = [torch.mean, torch.sum, torch.nansum, torch.std, torch.logsumexp, torch.std, torch.var, - torch.amin, torch.amax, torch.norm] + torch.norm] for op in ops: with self.assertRaisesRegex(RuntimeError, "only tensors with up to 64 dims are supported"): op(x, 64) @@ -1415,7 +1415,7 @@ def test_dtype_bfloat16(values_bf16=False, boundaries_bf16=False): test_dtype_bfloat16(False, True) test_dtype_bfloat16(True, True) - @dtypes(*get_all_dtypes(include_bool=False, include_complex=False)) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_nansum(self, device, dtype): args = product( (True, False), # noncontiguous @@ -1468,15 +1468,14 @@ def _test_reduction_function_with_numpy(self, torch_func, np_func, device, dtype self.compare_with_numpy(torch_func_partial, np_func_partial, x, device=None, dtype=None, atol=atol, rtol=rtol, exact_dtype=exact_dtype) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half)) def test_count_nonzero(self, device, dtype): self._test_reduction_function_with_numpy(torch.count_nonzero, np.count_nonzero, device, dtype) self._test_reduction_function_with_numpy(torch.count_nonzero, np.count_nonzero, device, dtype, True) def _test_sum_reduction_vs_numpy(self, torch_fn, np_fn, device, dtype, with_keepdim=False, with_extremal=False): def is_integral(dtype): - return dtype in get_all_int_dtypes() + return dtype in integral_types() # On Windows CI, the current version of `numpy` promotes all lower integers # dtypes to int32 while `torch` promotes them to int64. 
Hence we skip on checking @@ -1505,28 +1504,30 @@ def is_integral(dtype): with_keepdim=with_keepdim, with_extremal=with_extremal) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*all_types_and(torch.half)) def test_sum_vs_numpy(self, device, dtype): self._test_sum_reduction_vs_numpy(torch.sum, np.sum, device, dtype) self._test_sum_reduction_vs_numpy(torch.sum, np.sum, device, dtype, with_extremal=True) self._test_sum_reduction_vs_numpy(torch.sum, np.sum, device, dtype, with_keepdim=True) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*all_types_and(torch.half)) def test_nansum_vs_numpy(self, device, dtype): self._test_sum_reduction_vs_numpy(torch.nansum, np.nansum, device, dtype) self._test_sum_reduction_vs_numpy(torch.nansum, np.nansum, device, dtype, with_extremal=True) self._test_sum_reduction_vs_numpy(torch.nansum, np.nansum, device, dtype, with_keepdim=True) - @dtypes(*(get_all_complex_dtypes())) + @dtypes(*complex_types()) def test_nansum_complex(self, device, dtype): x = torch.randn((3, 3, 3), device=device, dtype=dtype) with self.assertRaisesRegex(RuntimeError, "nansum does not support complex inputs"): torch.nansum(x) - def test_nansum_out_dtype(self, device): - dtypes = list(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)) - for inp_dtype, out_dtype in combinations(dtypes, 2): + @dtypes(*all_types_and(torch.half)) + def test_nansum_out_dtype(self, device, dtype): + out_dtype = dtype + inp_dtypes = all_types_and(torch.half) if out_dtype.is_floating_point else integral_types() + for inp_dtype in inp_dtypes: shape = _rand_shape(random.randint(2, 5), min_size=5, max_size=10) x = _generate_input(shape, inp_dtype, device, with_extremal=False) torch_fn = partial(torch.nansum, dtype=out_dtype) @@ -1534,7 +1535,7 @@ def test_nansum_out_dtype(self, device): np_fn = partial(np.nansum, dtype=np_out_dtype) self.compare_with_numpy(torch_fn, np_fn, x, device=None, dtype=None) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*all_types_and(torch.half)) def test_argminmax_multiple(self, device, dtype): # Case: All Ones t = torch.ones(3, 3, device=device, dtype=dtype) @@ -1542,7 +1543,7 @@ def test_argminmax_multiple(self, device, dtype): self.compare_with_numpy(torch.argmin, np.argmin, t) # Case: With single `nan` present. - if dtype in get_all_fp_dtypes(): + if dtype in floating_types_and(torch.half, torch.bfloat16): t[2, 2] = float('nan') self.compare_with_numpy(torch.argmax, np.argmax, t) self.compare_with_numpy(torch.argmin, np.argmin, t) @@ -1619,8 +1620,7 @@ def verify_against_numpy(t): [0, 0]], device=device, dtype=dtype) verify_against_numpy(t) - @dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False, - include_bool=True, include_complex=True))) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_all_any_vs_numpy(self, device, dtype): # Note [all, any uint8 compatibility]: However for compatibility reason, # for `uint8`, they return Tensor of same dtype `uint8`. 
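Editor's note on the test_reductions.py hunks above (not part of the patch): they swap the flag-based get_all_dtypes()/get_all_int_dtypes()/get_all_fp_dtypes() helpers for the compositional ones in torch.testing._internal.common_dtype. A minimal sketch of the intended equivalence follows; the exact expansion is an assumption based on the helper names and on the substitutions made in the hunks themselves.

import torch
from torch.testing._internal.common_dtype import all_types_and_complex_and

# all_types_and_complex_and(extra...) is expected to cover the standard integral,
# floating and complex dtypes plus whatever extras are listed explicitly,
# here torch.half and torch.bool, matching the old
# get_all_dtypes(include_bfloat16=False) spelling used before this patch.
expected = {
    torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64,  # integral
    torch.float32, torch.float64,                                    # floating
    torch.complex64, torch.complex128,                               # complex
    torch.half, torch.bool,                                          # explicit extras
}
assert set(all_types_and_complex_and(torch.half, torch.bool)) == expected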
@@ -1735,7 +1735,7 @@ def _test_output_dtype(x): @onlyNativeDeviceTypes def test_repeated_dim(self, device): ops = [torch.mean, torch.sum, torch.nansum, torch.std, torch.logsumexp, torch.std, torch.var, - torch.amin, torch.amax, torch.norm] + torch.norm] x = torch.randn(3, 3, 3, 3, device=device) error_msg = r'appears multiple times in the list of dims' @@ -1835,10 +1835,6 @@ def test_minmax_illegal_dtype(self, device): torch.max(x, dim=0, out=(illegal_values, valid_indices)) with self.assertRaisesRegex(RuntimeError, rmsg): torch.min(x, dim=0, out=(illegal_values, valid_indices)) - with self.assertRaisesRegex(RuntimeError, rmsg): - torch.amax(x, dim=0, out=illegal_values) - with self.assertRaisesRegex(RuntimeError, rmsg): - torch.amin(x, dim=0, out=illegal_values) with self.assertRaisesRegex(RuntimeError, rmsg): torch.max(x, dim=0, out=(valid_values, illegal_indices)) with self.assertRaisesRegex(RuntimeError, rmsg): @@ -1848,7 +1844,7 @@ def test_minmax_illegal_dtype(self, device): with self.assertRaisesRegex(RuntimeError, rmsg): torch.min(x, dim=0, out=(illegal_values, illegal_indices)) - @dtypes(*get_all_dtypes(include_bool=False, include_complex=False)) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_dim_arg_reduction_scalar(self, device, dtype): example = 4.0 @@ -1866,7 +1862,7 @@ def test_dim_arg_reduction_scalar(self, device, dtype): @precisionOverride({torch.float16: 1e-2, torch.bfloat16: 1e-2}) - @dtypes(*(set(get_all_dtypes(include_bool=False, include_complex=False)) - {torch.uint8})) + @dtypes(*set(all_types_and(torch.half, torch.bfloat16)) - {torch.uint8}) def test_dim_reduction(self, device, dtype): example = [[-1, 2, 1], [5, 3, 6]] @@ -3241,8 +3237,7 @@ def test_reduction_empty_any_all(self, device): shape = (2, 0, 4) x = torch.randn(shape, device=device) - for dtype in get_all_dtypes(include_half=True, include_bfloat16=False, - include_bool=True, include_complex=True): + for dtype in all_types_and_complex_and(torch.half, torch.bool): # Refer: [all, any uint8 compatibility] if dtype == torch.uint8: out_dtype = torch.uint8 diff --git a/test/test_scatter_gather_ops.py b/test/test_scatter_gather_ops.py index cd944da7366718..9ef198f7d93225 100644 --- a/test/test_scatter_gather_ops.py +++ b/test/test_scatter_gather_ops.py @@ -10,7 +10,9 @@ (run_tests, TestCase,) from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, dtypes, dtypesIfCUDA, - toleranceOverride, tol) + toleranceOverride, tol,) +from torch.testing._internal.common_dtype import \ + (get_all_dtypes, get_all_fp_dtypes,) # Protects against includes accidentally setting the default dtype assert torch.get_default_dtype() is torch.float32 @@ -22,13 +24,16 @@ class TestScatterGather(TestCase): # Fills an index tensor with valid indices - def _fill_indices(self, idx, dim, dim_size, elems_per_row, m, n, o): + def _fill_indices(self, idx, dim, dim_size, elems_per_row, m, n, o, unique_indices=True): for i in range(1 if dim == 0 else m): for j in range(1 if dim == 1 else n): for k in range(1 if dim == 2 else o): ii = [i, j, k] ii[dim] = slice(0, idx.size(dim) + 1) - idx[tuple(ii)] = torch.randperm(dim_size)[0:elems_per_row] + if unique_indices: + idx[tuple(ii)] = torch.randperm(dim_size)[0:elems_per_row] + else: + idx[tuple(ii)] = torch.randint(dim_size, (elems_per_row,)) @dtypes(torch.float32, torch.complex64) def test_gather(self, device, dtype): @@ -67,7 +72,8 @@ def test_gather_bool(self, device, dtype): expected = torch.tensor(((False, False), (True, True)), device=device, 
dtype=dtype) self.assertEqual(actual, expected, atol=0, rtol=0) - def _test_scatter_base(self, fn, *, device, dtype, is_scalar, reduction): + def _test_scatter_base(self, fn, *, device, dtype, is_scalar, reduction, + unique_indices=True, include_self=True): m, n, o = random.randint(10, 20), random.randint(10, 20), random.randint(10, 20) elems_per_row = random.randint(1, 10) dim = random.randrange(3) @@ -75,7 +81,7 @@ def _test_scatter_base(self, fn, *, device, dtype, is_scalar, reduction): idx_size = [m, n, o] idx_size[dim] = elems_per_row idx = torch.empty(tuple(idx_size), device=device, dtype=torch.long) - self._fill_indices(idx, dim, ([m, n, o])[dim], elems_per_row, m, n, o) + self._fill_indices(idx, dim, ([m, n, o])[dim], elems_per_row, m, n, o, unique_indices) if is_scalar: src = random.random() @@ -85,11 +91,15 @@ def _test_scatter_base(self, fn, *, device, dtype, is_scalar, reduction): base = make_tensor((m, n, o), device=device, dtype=dtype) if reduction is not None: - actual = fn(base.clone(), dim, idx, src, reduce=reduction) + if fn is torch.Tensor.scatter_reduce_: + actual = fn(base.clone(), dim, idx, src, reduce=reduction, include_self=include_self) + else: + actual = fn(base.clone(), dim, idx, src, reduce=reduction) else: actual = fn(base.clone(), dim, idx, src) expected = base.clone() + counts = torch.zeros(base.shape, dtype=torch.long, device=device) + include_self for i in range(idx_size[0]): for j in range(idx_size[1]): for k in range(idx_size[2]): @@ -98,16 +108,35 @@ def _test_scatter_base(self, fn, *, device, dtype, is_scalar, reduction): if fn is torch.Tensor.scatter_add_: expected[tuple(ii)] += src[i, j, k] else: - # method may be 'scatter_' or 'scatter' - # both might have a reduction argument + # method may be 'scatter_', 'scatter', 'scatter_reduce' + # or 'scatter_reduce_', the former two might have a reduction argument + # while the latter two always do value = src if is_scalar else src[i, j, k] - if reduction == "add": - expected[tuple(ii)] += value - elif reduction == "multiply": - expected[tuple(ii)] *= value - else: + if ((not include_self) and counts[tuple(ii)] == 0): expected[tuple(ii)] = value + else: + if reduction == "add" or reduction == "sum": + expected[tuple(ii)] += value + elif reduction == "multiply" or reduction == "prod": + expected[tuple(ii)] *= value + elif reduction == "amax": + expected[tuple(ii)] = max(expected[tuple(ii)], value) + elif reduction == "amin": + expected[tuple(ii)] = min(expected[tuple(ii)], value) + elif reduction == "mean": + expected[tuple(ii)] += value + else: + expected[tuple(ii)] = value + + counts[tuple(ii)] += 1 + + if (reduction == "mean"): + counts.masked_fill_(counts == 0, 1) + if (dtype.is_floating_point or dtype.is_complex): + expected /= counts + else: + expected.div_(counts, rounding_mode="floor") self.assertEqual(actual, expected, atol=0, rtol=0) @@ -158,6 +187,46 @@ def test_scatter_add_mult_index_base(self, device, dtype): self.assertEqual(res0[0, :], m * torch.ones(n, device=device, dtype=dtype), atol=0, rtol=0) self.assertEqual(res1[:, 0], n * torch.ones(m, device=device, dtype=dtype), atol=0, rtol=0) + # FIXME: discrepancy between bool ReduceAdd on CUDA and CPU (a + b on CPU and buggy a && b on CUDA) + @dtypes(*get_all_dtypes(include_half=True, include_bfloat16=True, include_bool=False)) + def test_scatter_reduce_sum(self, device, dtype): + for include_self in (True, False): + self._test_scatter_base(torch.Tensor.scatter_reduce_, device=device, dtype=dtype, + is_scalar=False, reduction='sum', 
unique_indices=False, + include_self=include_self) + + @dtypes(*get_all_dtypes(include_half=True, include_bfloat16=True)) + @dtypesIfCUDA(*get_all_fp_dtypes(include_half=True, include_bfloat16=True)) + def test_scatter_reduce_prod(self, device, dtype): + for include_self in (True, False): + self._test_scatter_base(torch.Tensor.scatter_reduce_, device=device, dtype=dtype, + is_scalar=False, reduction='prod', unique_indices=False, + include_self=include_self) + + @dtypes(*get_all_dtypes(include_half=True, include_bfloat16=True, include_bool=False)) + @dtypesIfCUDA(*get_all_fp_dtypes(include_half=True, include_bfloat16=True)) + def test_scatter_reduce_mean(self, device, dtype): + for include_self in (True, False): + self._test_scatter_base(torch.Tensor.scatter_reduce_, device=device, dtype=dtype, + is_scalar=False, reduction='mean', unique_indices=False, + include_self=include_self) + + @dtypes(*get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False)) + @dtypesIfCUDA(*get_all_fp_dtypes(include_half=True, include_bfloat16=True)) + def test_scatter_reduce_amax(self, device, dtype): + for include_self in (True, False): + self._test_scatter_base(torch.Tensor.scatter_reduce_, device=device, dtype=dtype, + is_scalar=False, reduction='amax', unique_indices=False, + include_self=include_self) + + @dtypes(*get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False)) + @dtypesIfCUDA(*get_all_fp_dtypes(include_half=True, include_bfloat16=True)) + def test_scatter_reduce_amin(self, device, dtype): + for include_self in (True, False): + self._test_scatter_base(torch.Tensor.scatter_reduce_, device=device, dtype=dtype, + is_scalar=False, reduction='amin', unique_indices=False, + include_self=include_self) + # Generic Device Test Framework instantation, see # https://github.com/pytorch/pytorch/wiki/Running-and-writing-tests diff --git a/test/test_serialization.py b/test/test_serialization.py index 878c602d5d64da..9204392683b6b7 100644 --- a/test/test_serialization.py +++ b/test/test_serialization.py @@ -23,7 +23,7 @@ from torch.testing._internal.common_utils import TestCase, IS_WINDOWS, \ TEST_DILL, run_tests, download_file, BytesIOContext, TemporaryFileName from torch.testing._internal.common_device_type import instantiate_device_type_tests -from torch.testing._internal.common_dtype import get_all_dtypes +from torch.testing._internal.common_dtype import all_types_and_complex_and # These tests were all copied from `test/test_torch.py` at some point, so see # the actual blame, see this revision @@ -414,7 +414,7 @@ def test_serialization_save_warnings(self): with warnings.catch_warnings(record=True) as warns: with tempfile.NamedTemporaryFile() as checkpoint: x = torch.save(torch.nn.Linear(2, 3), checkpoint) - self.assertEquals(len(warns), 0) + self.assertEqual(len(warns), 0) def test_serialization_map_location(self): test_file_path = download_file('https://download.pytorch.org/test_data/gpu_tensors.pt') @@ -616,10 +616,11 @@ def save_load_check(a, b): self.assertEqual(a, a_loaded) self.assertEqual(b, b_loaded) - for device, dtype in product(devices, get_all_dtypes()): + for device, dtype in product(devices, all_types_and_complex_and(torch.half, + torch.bfloat16, torch.bool)): a = torch.tensor([], dtype=dtype, device=device) - for other_dtype in get_all_dtypes(): + for other_dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): s = torch._TypedStorage( wrap_storage=a.storage()._untyped(), dtype=other_dtype) @@ -726,7 +727,7 @@ def 
import_module(name, filename): loaded = torch.load(checkpoint) self.assertTrue(isinstance(loaded, module.Net)) if can_retrieve_source: - self.assertEquals(len(w), 0) + self.assertEqual(len(w), 0) # Replace the module with different source fname = get_file_path_2(os.path.dirname(os.path.dirname(torch.__file__)), 'torch', 'testing', @@ -737,7 +738,7 @@ def import_module(name, filename): loaded = torch.load(checkpoint) self.assertTrue(isinstance(loaded, module.Net)) if can_retrieve_source: - self.assertEquals(len(w), 1) + self.assertEqual(len(w), 1) self.assertTrue(w[0].category, 'SourceChangeWarning') def test_serialization_container(self): diff --git a/test/test_shape_ops.py b/test/test_shape_ops.py index 13c636d6563a4c..de709cc1ee627c 100644 --- a/test/test_shape_ops.py +++ b/test/test_shape_ops.py @@ -15,7 +15,7 @@ from torch.testing._internal.common_device_type import ( instantiate_device_type_tests, onlyCPU, onlyCUDA, dtypes, onlyNativeDeviceTypes, dtypesIfCUDA, largeTensorTest) -from torch.testing._internal.common_dtype import get_all_dtypes +from torch.testing._internal.common_dtype import all_types_and_complex_and, all_types, all_types_and # TODO: replace with make_tensor def _generate_input(shape, dtype, device, with_extremal): @@ -227,9 +227,8 @@ def test_diagonal_multidim(self, device, dtype): self.assertEqual(expected, result) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_complex=False, include_bool=False, include_half=False, - include_bfloat16=False)) - @dtypesIfCUDA(*get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False)) + @dtypes(*all_types()) + @dtypesIfCUDA(*all_types_and(torch.half)) def test_trace(self, device, dtype): def test(shape): tensor = make_tensor(shape, dtype=dtype, device=device, low=-9, high=9) @@ -341,7 +340,7 @@ def test_clamp_raises_arg_errors(self, device): with self.assertRaisesRegex(RuntimeError, error_msg): torch.clamp(X) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_flip(self, device, dtype): make_from_data = partial(torch.tensor, device=device, dtype=dtype) make_from_size = partial(make_tensor, device=device, dtype=dtype) @@ -440,7 +439,7 @@ def gen_data(): for dims in test_dims: self.assertEqual(size, list(data.flip(dims).size())) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_flip_errors(self, device, dtype): make_arg = partial(make_tensor, dtype=dtype, device=device) data = make_arg((2, 2, 2)) @@ -458,7 +457,7 @@ def test_flip_errors(self, device, dtype): def _rand_shape(self, dim, min_size, max_size): return tuple(torch.randint(min_size, max_size + 1, (dim,))) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_flip_numpy(self, device, dtype): make_arg = partial(make_tensor, dtype=dtype, device=device) @@ -476,6 +475,7 @@ def test_flip_numpy(self, device, dtype): @onlyCUDA # CPU is too slow @largeTensorTest('17GB') # 4 tensors of 4GB (in, out) x (torch, numpy) + 1GB + @largeTensorTest("81GB", "cpu") # even for CUDA test, sufficient system memory is required def test_flip_large_tensor(self, device): t_in = torch.empty(2**32 + 1, dtype=torch.uint8).random_() torch_fn = partial(torch.flip, dims=(0,)) @@ -567,7 +567,7 @@ def test_nonzero_no_warning(self, device): t.nonzero() self.assertEqual(len(w), 0) - @dtypes(*get_all_dtypes(include_complex=False)) + @dtypes(*all_types_and(torch.half, torch.bool, 
torch.bfloat16)) def test_nonzero(self, device, dtype): shapes = [ diff --git a/test/test_sort_and_select.py b/test/test_sort_and_select.py index ab6c72285ce8f9..ba99d3ed7a0ffe 100644 --- a/test/test_sort_and_select.py +++ b/test/test_sort_and_select.py @@ -8,11 +8,9 @@ from itertools import permutations, product from torch.testing import make_tensor -from torch.testing._internal.common_dtype import ( - all_types, all_types_and, floating_types_and, get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, -) +from torch.testing._internal.common_dtype import all_types, all_types_and, floating_types_and from torch.testing._internal.common_utils import \ - (TEST_WITH_ROCM, TestCase, run_tests, slowTest) + (TestCase, run_tests, slowTest) from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, dtypes, onlyNativeDeviceTypes, skipCUDAIfRocm, onlyCUDA, dtypesIfCUDA, dtypesIfCPU, onlyCPU, largeTensorTest) @@ -133,7 +131,7 @@ def test_sort(self, device): 'random with NaNs') # FIXME: remove torch.bool from unsupported types once support is added for cub sort - @dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128}) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_stable_sort(self, device, dtype): sizes = (100, 1000, 10000) for ncopies in sizes: @@ -226,7 +224,7 @@ def test_topk_1d_output_discontiguous(self, device, dtype): self.assertEqual(values, values_cont) # FIXME: remove torch.bool from unsupported types once support is added for cub sort - @dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128}) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_stable_sort_against_numpy(self, device, dtype): if dtype in floating_types_and(torch.float16, torch.bfloat16): inf = float('inf') @@ -289,7 +287,7 @@ def repeated_index_fill(t, dim, idxs, vals): idx_numpy = np.argsort(sample_numpy, axis=dim, kind='stable') self.assertEqual(idx_torch, idx_numpy) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_msort(self, device, dtype): def test(shape): tensor = make_tensor(shape, dtype=dtype, device=device, low=-9, high=9) @@ -678,7 +676,6 @@ def test_topk_integral(self, device, dtype): @onlyCUDA @dtypes(torch.bfloat16) - @skipCUDAIfRocm def test_topk_bfloat16(self, device, dtype): small = 10 @@ -687,12 +684,9 @@ def test_topk_bfloat16(self, device, dtype): for curr_size in (small, large, verylarge): self._test_topk_dtype(device, dtype, False, curr_size) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.float, torch.double, torch.bfloat16) def test_topk_nonfinite(self, device, dtype): - if TEST_WITH_ROCM and dtype == torch.bfloat16: - return - x = torch.tensor([float('nan'), float('inf'), 1e4, 0, -1e4, -float('inf')], device=device, dtype=dtype) val, idx = x.topk(4) expect = torch.tensor([float('nan'), float('inf'), 1e4, 0], device=device, dtype=dtype) @@ -721,15 +715,9 @@ def test_topk_4d(self, device): self.assertEqual(ind, expected_ind, atol=0, rtol=0) @onlyNativeDeviceTypes - @dtypesIfCUDA(*(get_all_dtypes(include_complex=False, - include_bool=False, - include_half=False, - include_bfloat16=True))) - @dtypes(*(get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False))) + @dtypesIfCUDA(*all_types_and(torch.bfloat16)) + @dtypes(*all_types()) def test_topk_zero(self, device, dtype): - if TEST_WITH_ROCM and dtype == torch.bfloat16: - 
return - # https://github.com/pytorch/pytorch/issues/49205 t = torch.rand(2, 2, device=device).to(dtype=dtype) val, idx = torch.topk(t, k=0, largest=False) @@ -782,12 +770,9 @@ def ensure_tuple(x): self.assertEqual(expected_inverse.view(additional_shape), y_inverse) self.assertEqual(expected_counts, y_counts) - @dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128}) - @dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128}) + @dtypesIfCPU(*all_types_and(torch.bool, torch.bfloat16)) + @dtypes(*all_types_and(torch.half, torch.bool)) def test_unique(self, device, dtype): - if dtype is torch.half and self.device_type == 'cpu': - return # CPU does not have half support - def ensure_tuple(x): if isinstance(x, torch.Tensor): return (x,) @@ -842,12 +827,9 @@ def ensure_tuple(x): count += 1 self.assertEqual(j, count) - @dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128}) - @dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128}) + @dtypesIfCPU(*all_types_and(torch.bool, torch.bfloat16)) + @dtypes(*all_types_and(torch.half, torch.bool)) def test_unique_consecutive(self, device, dtype): - if dtype is torch.half and self.device_type == 'cpu': - return # CPU does not have half support - if dtype is torch.bool: x = torch.tensor([True, False, False, False, True, True, False, False, False], dtype=torch.bool, device=device) expected_unique = torch.tensor([True, False, True, False], dtype=torch.bool, device=device) diff --git a/test/test_sparse.py b/test/test_sparse.py index a50d493cdac635..86fef22ef49aef 100644 --- a/test/test_sparse.py +++ b/test/test_sparse.py @@ -7,9 +7,6 @@ import random import unittest from torch.testing import make_tensor -from torch.testing._internal.common_dtype import ( - all_types_and_complex, -) from torch.testing._internal.common_utils import TestCase, run_tests, skipIfRocm, do_test_dtypes, \ do_test_empty_full, load_tests, TEST_NUMPY, IS_WINDOWS, gradcheck, coalescedonoff, \ DeterministicGuard, first_sample, IS_LINUX @@ -17,16 +14,16 @@ from numbers import Number from typing import Dict, Any from distutils.version import LooseVersion -from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes from torch.testing._internal.common_cuda import \ (SM53OrLater, SM80OrLater, CUDA11OrLater) from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, ops, dtypes, dtypesIfCUDA, onlyCPU, onlyCUDA, precisionOverride, deviceCountAtLeast, OpDTypes) from torch.testing._internal.common_methods_invocations import \ - (sparse_unary_ufuncs) + (sparse_unary_ufuncs, sparse_masked_reduction_ops) from torch.testing._internal.common_dtype import ( - floating_and_complex_types, floating_and_complex_types_and, get_all_dtypes, get_all_int_dtypes, + all_types, all_types_and_complex, all_types_and_complex_and, floating_and_complex_types, + floating_and_complex_types_and, integral_types, floating_types_and, ) # load_tests from torch.testing._internal.common_utils is used to automatically filter tests for @@ -315,6 +312,10 @@ def test_tensor(x, res): self.assertEqual(res, dense_x) self.assertEqual(res, safe_dense_x) + # Only run autograd test for float64 + if x.dtype != torch.float64: + return + def fn(x): return x.to_dense() x.requires_grad_(True) @@ -346,6 +347,7 @@ def fn(x): ], dtype=dtype, device=device) test_tensor(x, res) + test_tensor(res, res) i = self.index_tensor([ [0, 1, 2, 2], @@ -1954,7 +1956,7 @@ def test_narrow(self, device, dtype, coalesced): def 
_test_log1p_tensor(self, sparse_tensor, coalesced): def is_integral(dtype): - return dtype in get_all_int_dtypes() + return dtype in integral_types() dense_tensor = sparse_tensor.to_dense() expected_output = dense_tensor.log1p() @@ -1985,7 +1987,7 @@ def is_integral(dtype): sparse_tensor.requires_grad_() @coalescedonoff - @dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False)) + @dtypes(*all_types()) def test_log1p(self, device, dtype, coalesced): if coalesced: input_coalesced = torch.sparse_coo_tensor( @@ -2093,7 +2095,7 @@ def test_neg_negative(self, device, dtype, coalesced): def _test_asin_arcsin(self, sparse_tensor, coalesced): def is_integral(dtype): - return dtype in get_all_int_dtypes() + return dtype in integral_types() is_integral_dtype = is_integral(sparse_tensor.dtype) dense_tensor = sparse_tensor.to_dense() @@ -2128,7 +2130,7 @@ def is_integral(dtype): op(sparse_tensor) @coalescedonoff - @dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False)) + @dtypes(*all_types()) def test_asin_arcsin(self, device, dtype, coalesced): if coalesced: input_coalesced = torch.sparse_coo_tensor( @@ -2615,14 +2617,14 @@ def test_legacy_new(self, device): @onlyCPU # not really, but we only really want to run this once def test_dtypes(self, device): - all_sparse_dtypes = get_all_dtypes(include_complex=True) + all_sparse_dtypes = all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16) do_test_dtypes(self, all_sparse_dtypes, torch.sparse_coo, torch.device('cpu')) if torch.cuda.is_available(): do_test_dtypes(self, all_sparse_dtypes, torch.sparse_coo, torch.device('cuda:0')) @onlyCPU # not really, but we only really want to run this once def test_empty_full(self, device): - all_sparse_dtypes = get_all_dtypes(include_complex=True) + all_sparse_dtypes = all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16) do_test_empty_full(self, all_sparse_dtypes, torch.sparse_coo, torch.device('cpu')) if torch.cuda.device_count() > 0: do_test_empty_full(self, all_sparse_dtypes, torch.sparse_coo, None) @@ -3219,14 +3221,12 @@ def sparse_log(x): # TODO: Check after why ROCm's cusparseXcsrgemm2Nnz function doesn't return the same nnz value as CUDA @skipIfRocm @coalescedonoff - @dtypes(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_half=False, include_bfloat16=False)) - @dtypesIfCUDA(*((torch.complex64,) if CUDA11OrLater else ()), - *((torch.complex128,) if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else ()), - *get_all_fp_dtypes( - include_half=(CUDA11OrLater and SM53OrLater), - include_bfloat16=(CUDA11OrLater and SM80OrLater))) - @precisionOverride({torch.bfloat16: 2.5e-2, torch.float16: 2.5e-2, torch.complex64: 1e-2, torch.float32: 1e-2}) + @dtypes(*floating_and_complex_types()) + @dtypesIfCUDA(*floating_types_and(*[torch.half] if CUDA11OrLater and SM53OrLater else [], + *[torch.bfloat16] if CUDA11OrLater and SM80OrLater else [], + *[torch.complex64] if CUDA11OrLater else [], + *[torch.complex128] if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else [])) + @precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2, torch.complex64: 1e-2, torch.float32: 1e-2}) def test_sparse_matmul(self, device, dtype, coalesced): """ This function test `torch.sparse.mm` when both the mat1 and mat2 are sparse tensors. 
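Editor's note (not part of the patch): the rewritten @dtypesIfCUDA decorators above rely on conditional list unpacking rather than the old include_half=/include_bfloat16= flags. A small sketch of that idiom, with made-up stand-ins for the capability flags:

import torch

SM53OrLater = True   # hypothetical stand-in for the real capability flag
SM80OrLater = False  # hypothetical stand-in for the real capability flag

# Each bracketed list is unpacked only when its guard is true, so dtypes the
# device cannot handle simply drop out of the decorator's argument tuple.
cuda_dtypes = (
    torch.float32, torch.float64,
    *([torch.half] if SM53OrLater else []),
    *([torch.bfloat16] if SM80OrLater else []),
)
assert cuda_dtypes == (torch.float32, torch.float64, torch.half)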
@@ -3402,21 +3402,21 @@ class TestSparseOneOff(TestCase): def test_cuda_from_cpu(self): with self.assertRaisesRegex( RuntimeError, - "backend of indices \\(CUDA\\) must match backend of values \\(CPU\\)"): + "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"): torch.sparse.FloatTensor(torch.zeros(1, 4).long().cuda(), torch.randn(4, 4, 4), [3, 4, 4]) with self.assertRaisesRegex( RuntimeError, - "backend of indices \\(CUDA\\) must match backend of values \\(CPU\\)"): + "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"): torch.sparse.FloatTensor(torch.zeros(1, 4).long().cuda(), torch.randn(4, 4, 4, 0), [3, 4, 4, 0]) with self.assertRaisesRegex( RuntimeError, - "backend of indices \\(CUDA\\) must match backend of values \\(CPU\\)"): + "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"): torch.sparse.FloatTensor(torch.LongTensor(1, 0).cuda(), torch.randn(0, 4, 4, 0), [0, 4, 4, 0]) @@ -3547,9 +3547,48 @@ def fn(x): fast_mode=op.gradcheck_fast_mode)) +class TestSparseMaskedReductions(TestCase): + exact_dtype = True + + @ops(sparse_masked_reduction_ops) + def test_future_empty_dim(self, device, dtype, op): + """Currently, `dim=()` in reduction operations means "reduce over + all dimensions", while in the future it will mean "no reduce". See + https://github.com/pytorch/pytorch/issues/29137 + + For sparse masked reductions, we'll implement the current behavior. + + For testing, we'll use samples with `dim=0` and map it to + `dim=()` until + torch.testing._internal.common_methods_invocations._generate_reduction_kwargs + is made to generate samples with `dim=()` for non-scalar + inputs. With this and after gh-29137 is resolved, this test + can be deleted. See also the `torch._masked._canonical_dim` + implementation about changing the `dim=()` behavior.
+ """ + + samples = op.sample_inputs_func(op, device, dtype, requires_grad=False) + for sample_input in samples: + if sample_input.kwargs.get('dim') != 0: + continue + sample_input_kwargs = dict(sample_input.kwargs) + sample_input_kwargs['dim'] = () # reduce over all dimensions + + t = sample_input.input + mask = sample_input_kwargs.get('mask') + sparse_op_kwargs = dict(sample_input_kwargs) + actual = op(t.to_sparse(), *sample_input.args, **sample_input_kwargs) + self.assertEqual(actual.layout, torch.sparse_coo) + + expected = op(t, *sample_input.args, **sample_input_kwargs).to_sparse() + self.assertEqual(actual, expected) + + # e.g., TestSparseUnaryUfuncsCPU and TestSparseUnaryUfuncsCUDA instantiate_device_type_tests(TestSparseUnaryUfuncs, globals(), except_for='meta') +instantiate_device_type_tests(TestSparseMaskedReductions, globals(), except_for='meta') + # e.g., TestSparseCPU and TestSparseCUDA instantiate_device_type_tests(TestSparse, globals(), except_for='meta') diff --git a/test/test_sparse_csr.py b/test/test_sparse_csr.py index 8c120376b118f0..a546bc26b329a8 100644 --- a/test/test_sparse_csr.py +++ b/test/test_sparse_csr.py @@ -4,17 +4,20 @@ import random import itertools import unittest -from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes, floating_and_complex_types, make_tensor +from torch.testing import make_tensor from torch.testing._internal.common_cuda import SM53OrLater, SM80OrLater, TEST_CUSPARSE_GENERIC from torch.testing._internal.common_utils import \ - (TEST_WITH_ROCM, TEST_SCIPY, TEST_MKL, IS_WINDOWS, TestCase, run_tests, load_tests, coalescedonoff) + (TEST_WITH_ROCM, TEST_SCIPY, TEST_MKL, IS_WINDOWS, TestCase, run_tests, load_tests, coalescedonoff, parametrize) from torch.testing._internal.common_device_type import \ (ops, instantiate_device_type_tests, dtypes, OpDTypes, dtypesIfCUDA, onlyCPU, onlyCUDA, skipCUDAIfNoCusparseGeneric, precisionOverride, skipMeta, skipCUDAIf, skipCUDAIfRocm, skipCPUIfNoMklSparse) from torch.testing._internal.common_methods_invocations import \ - (op_db, sparse_csr_unary_ufuncs, ) + (op_db, sparse_csr_unary_ufuncs, ReductionOpInfo) from torch.testing._internal.common_cuda import _get_torch_cuda_version, CUDA11OrLater -from torch.testing._internal.common_dtype import floating_types, get_all_dtypes +from torch.testing._internal.common_dtype import ( + floating_types, all_types_and_complex_and, floating_and_complex_types, floating_types_and, + all_types_and_complex, floating_and_complex_types_and +) from test_sparse import CUSPARSE_SPMM_COMPLEX128_SUPPORTED if TEST_SCIPY: @@ -135,7 +138,28 @@ def test_csr_layout(self): self.assertEqual(str(torch.sparse_csr), 'torch.sparse_csr') self.assertEqual(type(torch.sparse_csr), torch.layout) - @dtypes(*get_all_dtypes()) + def test_csr_stride(self): + a = self.genSparseCSRTensor((3, 3), 3, dtype=torch.float, device=self.device_type, index_dtype=torch.int64) + + with self.assertRaisesRegex(RuntimeError, "Sparse CSR tensors do not have strides"): + a.stride() + + with self.assertRaisesRegex(RuntimeError, "Sparse CSR tensors do not have strides"): + a.stride(-1) + + def test_csr_storage(self): + a = self.genSparseCSRTensor((3, 3), 3, dtype=torch.float, device=self.device_type, index_dtype=torch.int64) + + with self.assertRaisesRegex(RuntimeError, "Cannot access storage of SparseCsrTensorImpl"): + a.storage() + + def test_csr_is_contiguous(self): + a = self.genSparseCSRTensor((3, 3), 3, dtype=torch.float, device=self.device_type, index_dtype=torch.int64) + + with 
self.assertRaisesRegex(RuntimeError, "Tensors of type SparseCsrTensorImpl do not have is_contiguous"): + a.is_contiguous() + + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_sparse_csr_constructor_shape_inference(self, device, dtype): crow_indices = [0, 2, 4] col_indices = [0, 1, 0, 1] @@ -148,7 +172,7 @@ def test_sparse_csr_constructor_shape_inference(self, device, dtype): self.assertEqual(dtype, sparse.dtype) self.assertEqual(torch.device(device), sparse.device) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_sparse_csr_constructor(self, device, dtype): crow_indices = [0, 2, 4] col_indices = [0, 1, 0, 1] @@ -165,7 +189,34 @@ def test_sparse_csr_constructor(self, device, dtype): self.assertEqual(torch.tensor(col_indices, dtype=index_dtype), sparse.col_indices()) self.assertEqual(torch.tensor(values, dtype=dtype), sparse.values()) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) + def test_sparse_csr_batch_constructor(self, device, dtype): + batch_shape = (2, 3) + crow_indices = torch.tensor([0, 2, 4], device=device).repeat(6, 1).reshape(*batch_shape, -1) + col_indices = torch.tensor([0, 1, 0, 1], device=device).repeat(6, 1).reshape(*batch_shape, -1) + values = torch.tensor([1, 2, 3, 4], device=device, dtype=dtype).repeat(6, 1).reshape(*batch_shape, -1) + for index_dtype in [torch.int32, torch.int64]: + sparse = torch.sparse_csr_tensor(crow_indices.to(index_dtype), + col_indices.to(index_dtype), + values, + size=(*batch_shape, 2, 10), + dtype=dtype, + device=device) + self.assertEqual((*batch_shape, 2, 10), sparse.shape) + self.assertEqual(crow_indices.to(index_dtype), sparse.crow_indices()) + self.assertEqual(col_indices.to(index_dtype), sparse.col_indices()) + self.assertEqual(values, sparse.values()) + + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) + def test_sparse_csr_batch_constructor_shape_inference(self, device, dtype): + batch_shape = (2, 3) + crow_indices = torch.tensor([0, 2, 4], device=device).repeat(6, 1).reshape(*batch_shape, -1) + col_indices = torch.tensor([0, 1, 0, 1], device=device).repeat(6, 1).reshape(*batch_shape, -1) + values = torch.tensor([1, 2, 3, 4], device=device, dtype=dtype).repeat(6, 1).reshape(*batch_shape, -1) + sparse = torch.sparse_csr_tensor(crow_indices, col_indices, values, dtype=dtype, device=device) + self.assertEqual((*batch_shape, crow_indices.shape[-1] - 1, col_indices.max() + 1), sparse.shape) + + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_sparse_csr_constructor_from_lists(self, device, dtype): # without size sparse = torch.sparse_csr_tensor([0, 2, 4], @@ -195,18 +246,20 @@ def test_sparse_csr_constructor_from_lists(self, device, dtype): self.assertEqual(torch.tensor([1, 2, 3, 4], dtype=dtype, device=device), sparse.values()) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.bool, torch.bfloat16, torch.half)) def test_empty(self, device, dtype): ns = [5, 2, 0] - for shape in itertools.product(ns, ns): + batch_shapes = [(), (2,), (2, 3)] + for m, n, b in itertools.product(ns, ns, batch_shapes): + shape = (*b, m, n) result = torch.empty(shape, dtype=dtype, device=device, layout=torch.sparse_csr) self.assertEqual(result.shape, shape) self.assertEqual(result.dtype, dtype) self.assertEqual(result.device, torch.device(device)) self.assertEqual(result.layout, torch.sparse_csr) - 
self.assertEqual(result.crow_indices().shape, (shape[0] + 1,)) - self.assertEqual(result.col_indices().shape, (0,)) - self.assertEqual(result.values().shape, (0,)) + self.assertEqual(result.crow_indices().shape, (*b, shape[-2] + 1,)) + self.assertEqual(result.col_indices().shape, (*b, 0,)) + self.assertEqual(result.values().shape, (*b, 0,)) self.assertEqual(result._nnz(), 0) self.assertEqual(result.crow_indices().device, torch.device(device)) self.assertEqual(result.col_indices().device, torch.device(device)) @@ -216,31 +269,27 @@ def test_empty(self, device, dtype): self.assertEqual(result.values().dtype, dtype) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16)) def test_empty_errors(self, device, dtype): - with self.assertRaisesRegex(RuntimeError, "torch.empty: Only 2D sparse CSR tensors are supported."): + with self.assertRaisesRegex(RuntimeError, "torch.empty: Only batched sparse CSR matrices are supported, but got size"): torch.empty((5,), dtype=dtype, device=device, layout=torch.sparse_csr) - with self.assertRaisesRegex(RuntimeError, "torch.empty: Only 2D sparse CSR tensors are supported."): - torch.empty((2, 3, 4), dtype=dtype, device=device, layout=torch.sparse_csr) - @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16)) def test_clone(self, device, dtype): - x = torch.sparse_csr_tensor([0, 2, 4], - [0, 1, 0, 1], - [1, 2, 3, 4], - dtype=dtype, - device=device) - y = x.clone() - - self.assertEqual(x.shape, y.shape) - self.assertEqual(x.crow_indices(), y.crow_indices()) - self.assertEqual(x.col_indices(), y.col_indices()) - self.assertEqual(x.values(), y.values()) + from operator import mul + from functools import reduce + for batch_shape in ((), (2,), (2, 3)): + prod = reduce(mul, batch_shape, 1) + crow_indices = torch.tensor([0, 2, 4], device=device).repeat(prod, 1).reshape(*batch_shape, -1) + col_indices = torch.tensor([0, 1, 0, 1], device=device).repeat(prod, 1).reshape(*batch_shape, -1) + values = torch.tensor([1, 2, 3, 4], device=device, dtype=dtype).repeat(prod, 1).reshape(*batch_shape, -1) + sparse = torch.sparse_csr_tensor(crow_indices, col_indices, values, dtype=dtype, device=device) + cloned_sparse = sparse.clone() + self.assertEqual(sparse, cloned_sparse) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_copy(self, device, dtype): def run_test(shape, nnz, index_type): @@ -249,17 +298,16 @@ def run_test(shape, nnz, index_type): a.copy_(b) - self.assertEqual(a.crow_indices(), b.crow_indices()) - self.assertEqual(a.col_indices(), b.col_indices()) - self.assertEqual(a.values(), b.values()) + self.assertEqual(a, b) ns = [5, 2, 0] - for shape, index_dtype in zip(itertools.product(ns, ns), [torch.int32, torch.int64]): - run_test(shape, 0, index_dtype) - run_test(shape, shape[0] * shape[1], index_dtype) + batch_shapes = [(), (2,), (2, 3)] + for (m, n, b), index_dtype in zip(itertools.product(ns, ns, batch_shapes), [torch.int32, torch.int64]): + run_test((*b, m, n), 0, index_dtype) + run_test((*b, m, n), m * n, index_dtype) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_copy_errors(self, device, dtype): for index_dtype in [torch.int32, torch.int64]: shape1 = (2, 3) @@ -278,36 +326,42 @@ def test_copy_errors(self, device, dtype): a.copy_(b) @skipMeta - @dtypes(*get_all_dtypes()) + 
@dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_resize(self, device, dtype): - for index_dtype in [torch.int32, torch.int64]: - shape = (2, 3) + batch_shapes = [(), (2,), (2, 3)] + for index_dtype, b in zip([torch.int32, torch.int64], batch_shapes): + shape = (*b, 2, 3) nnz = 6 a = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=index_dtype) - new_shape = (4, 5) + new_shape = (*b, 4, 5) a.resize_(new_shape) self.assertEqual(a.shape, new_shape) # resize to larger shape doesn't add specified elements self.assertEqual(a._nnz(), nnz) - new_shape = (1, 5) + new_shape = (*b, 1, 5) a.resize_(new_shape) self.assertEqual(a.shape, new_shape) # resize to smaller shape trims specified elements self.assertEqual(a._nnz(), 5) + # trim batched dimensions + a.resize_(new_shape[-2], new_shape[-1]) + self.assertEqual(a.shape, (new_shape[-2], new_shape[-1])) + self.assertEqual(a._nnz(), 5) + @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_resize_errors(self, device, dtype): for index_dtype in [torch.int32, torch.int64]: shape = (2, 3) nnz = 6 a = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=index_dtype) - with self.assertRaisesRegex(RuntimeError, "torch.resize_: Only 2D sparse CSR tensors are supported."): + with self.assertRaisesRegex(RuntimeError, "torch.resize_: Only batched sparse CSR matrices are supported"): new_shape = (4,) a.resize_(new_shape) @@ -352,49 +406,62 @@ def test_factory_layout_invariants_check(self, device): torch.tensor([1, 2, 3, 4])) def test_factory_shape_invariants_check(self, device): - crow_indices = [0, 2, 4] - col_indices = [0, 1, 0, 1] - values = [1, 2, 3, 4] + crow_indices = torch.tensor([0, 2, 4], device=device) + col_indices = torch.tensor([0, 1, 0, 1], device=device) + values = torch.tensor([1, 2, 3, 4], device=device) size = (2, 10) - torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor(col_indices), torch.tensor(values), size, - device=device) + torch.sparse_csr_tensor(crow_indices, col_indices, values, size, device=device) - with self.assertRaisesRegex(RuntimeError, r"size of a CSR tensor must be of length 2, but got: 3"): - torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor(col_indices), torch.tensor(values), - size=(2, 10, 2), + with self.assertRaisesRegex(RuntimeError, r"size of a batched CSR tensor must have length >= 2, but got: 1"): + torch.sparse_csr_tensor(crow_indices, col_indices, values, + size=(2,), device=device) - with self.assertRaisesRegex(RuntimeError, r"crow_indices must have dim\=1 but got crow_indices\.dim\(\)\=2"): - torch.sparse_csr_tensor(torch.tensor(crow_indices).repeat(2, 1), - torch.tensor(col_indices), - torch.tensor(values), + with self.assertRaisesRegex(RuntimeError, r"crow_indices must have dim >= 1 but got crow_indices\.dim\(\)\ = 0"): + torch.sparse_csr_tensor(torch.zeros((), device=device, dtype=torch.int64), + col_indices, + values, size, device=device) - with self.assertRaisesRegex(RuntimeError, r"col_indices must have dim\=1 but got col_indices\.dim\(\)\=2"): - torch.sparse_csr_tensor(torch.tensor(crow_indices), - torch.tensor(col_indices).repeat(2, 1), - torch.tensor(values), + with self.assertRaisesRegex(RuntimeError, r"col_indices must have dim >= 1 but got col_indices\.dim\(\)\ = 0"): + torch.sparse_csr_tensor(crow_indices, + torch.zeros((), device=device, dtype=torch.int64), + values, size, device=device) - with 
self.assertRaisesRegex(RuntimeError, r"values must have dim\=1 but got values\.dim\(\)\=2"): - torch.sparse_csr_tensor(torch.tensor(crow_indices), - torch.tensor(col_indices), - torch.tensor(values).repeat(2, 1), + with self.assertRaisesRegex(RuntimeError, r"values must have dim >= 1 but got values\.dim\(\)\ = 0"): + torch.sparse_csr_tensor(crow_indices, + col_indices, + torch.zeros((), device=device, dtype=torch.int64), size, device=device) with self.assertRaisesRegex(RuntimeError, - r"crow_indices\.numel\(\) must be size\(0\) \+ 1, but got: 3"): - torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor(col_indices), torch.tensor(values), (1, 1), + r"crow_indices\.size\(-1\) must be equal to size\[-2\] \+ 1 \(that is 2\), but got: 3"): + torch.sparse_csr_tensor(crow_indices, col_indices, values, (1, 1), + device=device) + + + with self.assertRaisesRegex(RuntimeError, + r"Number of dimensions of crow_indices and col_indices must be the same"): + torch.sparse_csr_tensor(crow_indices, col_indices.repeat(2, 1), values, size, + device=device) + + with self.assertRaisesRegex(RuntimeError, + r"Number of dimensions of indices and values must be the same"): + torch.sparse_csr_tensor(crow_indices, col_indices, values.repeat(2, 1), size, device=device) + with self.assertRaisesRegex(RuntimeError, + r"Number of dimensions of indices must be one less"): + torch.sparse_csr_tensor(crow_indices.repeat(2, 1), col_indices.repeat(2, 1), values.repeat(2, 1), size, + device=device) with self.assertRaisesRegex(RuntimeError, - r"col_indices and values must have equal sizes, " + - r"but got col_indices\.numel\(\): 3, values\.numel\(\): 4"): - torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor([0, 1, 0]), torch.tensor(values), size, + r"All batch dimensions of the provided size, indices, and values must be the same"): + torch.sparse_csr_tensor(crow_indices.repeat(2, 1), col_indices.repeat(3, 1), values.repeat(4, 1), (2, 2, 10), device=device) def test_factory_indices_invariants_check(self, device): @@ -413,7 +480,7 @@ def test_factory_indices_invariants_check(self, device): with self.assertRaisesRegex(RuntimeError, r"at position i \= 2," + - r" this condition crow_indices\[i - 1\] <\= crow_indices\[i\] fails"): + r" the condition crow_indices\[i - 1\] <\= crow_indices\[i\] fails"): torch.sparse_csr_tensor(torch.tensor([0, 5, 4]), torch.tensor(col_indices), torch.tensor(values), size, device=device) @@ -421,12 +488,12 @@ def test_factory_indices_invariants_check(self, device): torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor([0, -1, 0, 1]), torch.tensor(values), size, device=device) - with self.assertRaisesRegex(RuntimeError, r"size\(1\) should be greater than col_indices\.max\(\)"): + with self.assertRaisesRegex(RuntimeError, r"size\[-1\] should be greater than col_indices\.max\(\)"): torch.sparse_csr_tensor(torch.tensor(crow_indices), torch.tensor([0, 11, 0, 1]), torch.tensor(values), size, device=device) @onlyCUDA - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_factory_device_type_inference(self, device, dtype): cpu_cuda = ('cpu', 'cuda') cpu_cuda_none = cpu_cuda + (None,) @@ -497,7 +564,7 @@ def test_sparse_csr_print(self, device): self.assertExpected('\n'.join(printed)) self.maxDiff = orig_maxDiff - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_sparse_csr_from_dense(self, device, dtype): dense = torch.tensor([[4, 5, 0], [0, 0, 0], [1, 0, 
0]], dtype=dtype, device=device) sparse = dense.to_sparse_csr() @@ -517,7 +584,7 @@ def test_sparse_csr_from_dense(self, device, dtype): self.assertEqual(torch.tensor([0, 1, 2] * 3, dtype=torch.int64), sparse.col_indices()) self.assertEqual(torch.tensor([2] * 9, dtype=dtype), sparse.values()) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_sparse_csr_to_dense(self, device, dtype): mn = [5, 2, 0] for (m, n) in itertools.product(mn, mn): @@ -526,12 +593,12 @@ def test_sparse_csr_to_dense(self, device, dtype): sparse = dense.to_sparse_csr() self.assertEqual(sparse.to_dense(), dense) - crow_indices = torch.tensor([0, 3, 5]) - col_indices = torch.tensor([0, 1, 2, 0, 1]) - values = torch.tensor([1, 2, 1, 3, 4], dtype=dtype) - csr = torch.sparse_csr_tensor(crow_indices, col_indices, - values, dtype=dtype, device=device) - dense = torch.tensor([[1, 2, 1], [3, 4, 0]], dtype=dtype, device=device) + batch_shape = (2, 3) + crow_indices = torch.tensor([0, 3, 5], device=device).repeat(6, 1).reshape(*batch_shape, -1) + col_indices = torch.tensor([0, 1, 2, 0, 1], device=device).repeat(6, 1).reshape(*batch_shape, -1) + values = torch.tensor([1, 2, 1, 3, 4], device=device, dtype=dtype).repeat(6, 1).reshape(*batch_shape, -1) + csr = torch.sparse_csr_tensor(crow_indices, col_indices, values, dtype=dtype, device=device) + dense = torch.tensor([[1, 2, 1], [3, 4, 0]], dtype=dtype, device=device).repeat(6, 1).reshape(csr.shape) self.assertEqual(csr.to_dense(), dense) @skipCPUIfNoMklSparse @@ -577,7 +644,39 @@ def test_coo_to_csr_convert(self, device, dtype, coalesced): values = torch.tensor([2, 1, 6, 4, 10, 3, 5, 9, 8, 7], dtype=dtype, device=device) self.assertEqual(csr.values(), values) - @dtypes(*get_all_dtypes()) + @parametrize("blocksize", [2, 4]) + @parametrize("shape", [(24, 24), (12, 24)]) + @dtypes((torch.double, torch.int32), (torch.double, torch.int64)) + @unittest.skipIf(not TEST_SCIPY, "SciPy not found") + @skipMeta + def test_csr_to_block_csr(self, device, dtypes, shape, blocksize): + dtype, index_dtype = dtypes + m, k = shape + nnz = random.randint(0, m * k) + t = self.genSparseCSRTensor((m * blocksize, k * blocksize), nnz, dtype=dtype, + device=device, index_dtype=index_dtype) + st = sp.csr_matrix((t.values().cpu(), t.col_indices().cpu(), t.crow_indices().cpu()), shape=tuple(t.size())) + block_t = torch.sparse._csr_to_block_csr(t, (blocksize, blocksize)) + self.assertEqual(block_t.values().dim(), 3) + block_st = st.tobsr(blocksize=(blocksize, blocksize)) + self.assertEqual(block_t.values().cpu(), block_st.data) + self.assertEqual(block_t.col_indices().cpu(), torch.tensor(block_st.indices).to(index_dtype)) + self.assertEqual(block_t.crow_indices().cpu(), torch.tensor(block_st.indptr).to(index_dtype)) + + @dtypes(torch.double) + @unittest.skipIf(not TEST_SCIPY, "SciPy not found") + def test_csr_to_block_csr_errors(self, device, dtype): + for index_dtype in [torch.int32, torch.int64]: + nnz = 15 + t = self.genSparseCSRTensor((16, 16), nnz, dtype=dtype, + device=device, index_dtype=index_dtype) + with self.assertRaisesRegex(RuntimeError, "must be square."): + block_t = torch.sparse._csr_to_block_csr(t, (2, 3)) + + with self.assertRaisesRegex(RuntimeError, r"size \(16, 16\) with block size \(5, 5\)"): + block_t = torch.sparse._csr_to_block_csr(t, (5, 5)) + + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_sparse_csr_from_dense_convert_error(self, device, dtype): size = (4, 2, 4) dense = 
make_tensor(size, dtype=dtype, device=device) @@ -603,8 +702,9 @@ def test_matmul_device_mismatch(self, device, dtype): @skipCPUIfNoMklSparse @skipCUDAIfNoCusparseGeneric @dtypes(*floating_and_complex_types()) - @dtypesIfCUDA(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater)) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.half] if SM53OrLater else [], + *[torch.bfloat16] if SM80OrLater else [])) def test_csr_matvec(self, device, dtype): side = 100 for index_dtype in [torch.int32, torch.int64]: @@ -709,45 +809,61 @@ def run_test_block_addmm_addmv(self, addmv_addmm, c, a, b, op_b=False, op_out=Fa self.assertEqual(actual, out) self.assertEqual(actual, expected) + @parametrize("block_size", [1, 2, 3]) + @parametrize("index_dtype", [torch.int32, torch.int64]) @skipCPUIfNoMklSparse @unittest.skipIf(not TEST_SCIPY, "SciPy not found") @dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128) - def test_block_addmm(self, device, dtype): - for index_dtype in [torch.int32, torch.int64]: - for (m, n, k), block_size, noncontiguous in zip(itertools.product([1, 5], repeat=3), [1, 2, 3], [True, False]): - nnz = random.randint(0, m * k) + def test_block_addmm(self, device, dtype, index_dtype, block_size): + for (m, n, k), noncontiguous in zip(itertools.product([1, 5], repeat=3), [True, False]): + nnz = random.randint(0, m * k) + if not noncontiguous: + a = self.genSparseCSRTensor((m * block_size, k * block_size), nnz, + dtype=dtype, device=device, index_dtype=index_dtype) + a = torch.sparse._csr_to_block_csr(a, (block_size, block_size)) + else: a = self.genSparseCSRTensor((m, k), nnz, dtype=dtype, device=device, index_dtype=index_dtype) a_data = make_tensor((nnz, block_size, block_size), dtype=dtype, device=device) a_data = a_data.mT if noncontiguous else a_data # Test column-major blocks - a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), a_data, (m * block_size, k * block_size)) - b = make_tensor((k * block_size, n * block_size), dtype=dtype, device=device, noncontiguous=noncontiguous) - c = make_tensor((m * block_size, n * block_size), dtype=dtype, device=device, noncontiguous=noncontiguous) - for op_b, op_out in itertools.product([True, False], repeat=2): - self.run_test_block_addmm_addmv(torch.addmm, c, a, b, op_b, op_out, dtype=dtype, device=device) - + a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), + a_data, (m * block_size, k * block_size)) + b = make_tensor((k * block_size, n * block_size), dtype=dtype, device=device, noncontiguous=noncontiguous) + c = make_tensor((m * block_size, n * block_size), dtype=dtype, device=device, noncontiguous=noncontiguous) + for op_b, op_out in itertools.product([True, False], repeat=2): + self.run_test_block_addmm_addmv(torch.addmm, c, a, b, op_b, op_out, dtype=dtype, device=device) + + @parametrize("block_size", [2, 3]) + @parametrize("index_dtype", [torch.int32, torch.int64]) @skipCPUIfNoMklSparse @unittest.skipIf(not TEST_SCIPY, "SciPy not found") @dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128) - def test_block_addmv(self, device, dtype): - for index_dtype in [torch.int32, torch.int64]: - block_sizes = [1, 2, 3] - if TEST_WITH_ROCM or not TEST_CUSPARSE_GENERIC: - block_sizes = [2, 3] - for (m, k), block_size, noncontiguous in zip(itertools.product([1, 5], repeat=2), block_sizes, [True, False]): - nnz = random.randint(0, m * k) + def test_block_addmv(self, device, dtype, index_dtype, block_size): + # TODO: 
Explicitly disable block size 1 support + # if (TEST_WITH_ROCM or not TEST_CUSPARSE_GENERIC) and block_size == 1: + # return + for (m, k), noncontiguous in zip(itertools.product([1, 5], repeat=2), [True, False]): + nnz = random.randint(0, m * k) + if not noncontiguous: + a = self.genSparseCSRTensor((m * block_size, k * block_size), nnz, + dtype=dtype, device=device, index_dtype=index_dtype) + a = torch.sparse._csr_to_block_csr(a, (block_size, block_size)) + else: a = self.genSparseCSRTensor((m, k), nnz, dtype=dtype, device=device, index_dtype=index_dtype) a_data = make_tensor((nnz, block_size, block_size), dtype=dtype, device=device) - a_data = a_data.mT if noncontiguous else a_data # Test column-major blocks - a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), a_data, (m * block_size, k * block_size)) - b = make_tensor((k * block_size,), dtype=dtype, device=device, noncontiguous=noncontiguous) - c = make_tensor((m * block_size,), dtype=dtype, device=device, noncontiguous=noncontiguous) - self.run_test_block_addmm_addmv(torch.addmv, c, a, b, dtype=dtype, device=device) - + a_data = a_data.mT if noncontiguous else a_data # Test column-major blocks + a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), + a_data, (m * block_size, k * block_size)) + b = make_tensor((k * block_size,), dtype=dtype, device=device, noncontiguous=noncontiguous) + c = make_tensor((m * block_size,), dtype=dtype, device=device, noncontiguous=noncontiguous) + self.run_test_block_addmm_addmv(torch.addmv, c, a, b, dtype=dtype, device=device) + + @parametrize("block_size", [2, 3]) + @parametrize("index_dtype", [torch.int32, torch.int64]) @skipCPUIfNoMklSparse @skipCUDAIfRocm @unittest.skipIf(not TEST_SCIPY, "SciPy not found") @dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128) - def test_block_triangular_solve(self, device, dtype): + def test_block_triangular_solve(self, device, dtype, index_dtype, block_size): def run_test(a, b, upper, transpose, unitriangular, op_out): actual = torch.triangular_solve(b, a, upper=upper, unitriangular=unitriangular, transpose=transpose) actual_X = actual.solution @@ -782,53 +898,70 @@ def run_test(a, b, upper, transpose, unitriangular, op_out): self.assertEqual(out, actual_X) self.assertEqual(out, expected_X) - for index_dtype in [torch.int32, torch.int64]: - for (m, k), block_size, noncontiguous in zip(itertools.product([1, 5], repeat=2), [2, 3], [True, False]): - nnz = random.randint(0, m * m) + for (m, k), noncontiguous in zip(itertools.product([1, 5], repeat=2), [True, False]): + nnz = random.randint(0, m * m) + if not noncontiguous: + a = self.genSparseCSRTensor((m * block_size, m * block_size), nnz, + dtype=dtype, device=device, index_dtype=index_dtype) + a = torch.sparse._csr_to_block_csr(a, (block_size, block_size)) + else: a = self.genSparseCSRTensor((m, m), nnz, dtype=dtype, device=device, index_dtype=index_dtype) a_data = make_tensor((nnz, block_size, block_size), dtype=dtype, device=device) a_data = a_data.mT if noncontiguous else a_data # Test column-major blocks - a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), a_data, (m * block_size, m * block_size)) - b = make_tensor((m * block_size, k), dtype=dtype, device=device, noncontiguous=noncontiguous) + a = torch._sparse_csr_tensor_unsafe(a.crow_indices(), a.col_indices(), + a_data, (m * block_size, m * block_size)) + b = make_tensor((m * block_size, k), dtype=dtype, device=device, noncontiguous=noncontiguous) - for (upper, unitriangular, transpose, 
op_out) in itertools.product([True, False], repeat=4): - run_test(a, b, upper, unitriangular, transpose, op_out) + for (upper, unitriangular, transpose, op_out) in itertools.product([True, False], repeat=4): + run_test(a, b, upper, unitriangular, transpose, op_out) @skipCPUIfNoMklSparse @dtypes(torch.double) def test_mm(self, device, dtype): - def test_shape(di, dj, dk, nnz): + def test_shape(di, dj, dk, nnz0=None, nnz1=None): for index_dtype in [torch.int32, torch.int64]: - x = self.genSparseCSRTensor((di, dj), nnz, device=device, dtype=dtype, index_dtype=index_dtype) - t = torch.randn(di, dk, dtype=dtype, device=device) - y = torch.randn(dj, dk, dtype=dtype, device=device) alpha = random.random() beta = random.random() - # res = beta * t + alpha * (x @ y) - res = torch.addmm(t, x, y, beta=beta, alpha=alpha) - expected = torch.addmm(t, x.to_dense(), y, beta=beta, alpha=alpha) - self.assertEqual(res, expected) - - res = torch.addmm(t, x, y) - expected = torch.addmm(t, x.to_dense(), y) - self.assertEqual(res, expected) - - res = torch.mm(x, y) - expected = torch.mm(x.to_dense(), y) - self.assertEqual(res, expected) + def _test(t, x, y): + # res = beta * t + alpha * (x @ y) + res = torch.addmm(t, x, y, beta=beta, alpha=alpha) + expected = torch.addmm(t, x.to_dense(), y.to_dense(), beta=beta, alpha=alpha) + self.assertEqual(res, expected) + + res = torch.addmm(t, x, y) + expected = torch.addmm(t, x.to_dense(), y.to_dense()) + self.assertEqual(res, expected) + + res = torch.mm(x, y) + expected = torch.mm(x.to_dense(), y.to_dense()) + self.assertEqual(res, expected) + + if nnz0 is None: + nnz0 = random.randint(di * dk // 2, di * dk) + t = torch.randn(di, dj, dtype=dtype, device=device) + x = self.genSparseCSRTensor((di, dk), nnz0, device=device, dtype=dtype, index_dtype=index_dtype) + y = torch.randn(dk, dj, dtype=dtype, device=device) + _test(t, x, y) + + if nnz1 is None: + nnz1 = random.randint(dk * dj // 2, dk * dj) + t = torch.randn(di, dj, dtype=dtype, device=device) + x = torch.randn(di, dk, dtype=dtype, device=device) + y = self.genSparseCSRTensor((dk, dj), nnz1, device=device, dtype=dtype, index_dtype=index_dtype) + _test(t, x, y) for i in range(2, 5): for j in range(2, 8): for k in range(2, 8): - test_shape(i, j, k, i * j // 2) - test_shape(4, 4, 4, 0) + test_shape(i, j, k) + test_shape(4, 4, 4, 0, 0) @skipCPUIfNoMklSparse @dtypes(*floating_and_complex_types()) - @dtypesIfCUDA(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC, - include_bfloat16=SM80OrLater and TEST_CUSPARSE_GENERIC)) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.half] if SM53OrLater and TEST_CUSPARSE_GENERIC else [], + *[torch.bfloat16] if SM80OrLater and TEST_CUSPARSE_GENERIC else [])) @precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2}) def test_sparse_mm(self, device, dtype): def test_shape(d1, d2, d3, nnz, transposed, index_dtype): @@ -845,9 +978,9 @@ def test_shape(d1, d2, d3, nnz, transposed, index_dtype): test_shape(7, 8, 9, 20, True, index_dtype) @dtypes(*floating_and_complex_types()) - @dtypesIfCUDA(*get_all_complex_dtypes(), - *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC, - include_bfloat16=SM80OrLater and TEST_CUSPARSE_GENERIC)) + @dtypesIfCUDA(*floating_and_complex_types_and( + *[torch.half] if SM53OrLater and TEST_CUSPARSE_GENERIC else [], + *[torch.bfloat16] if SM80OrLater and TEST_CUSPARSE_GENERIC else [])) @precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2}) def test_sparse_addmm(self, device, 
dtype): def test_shape(m, n, p, nnz, broadcast, index_dtype, alpha_beta=None): @@ -879,10 +1012,10 @@ def test_shape(m, n, p, nnz, broadcast, index_dtype, alpha_beta=None): @dtypes(*floating_and_complex_types()) @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6, torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) - @dtypesIfCUDA(torch.complex64, - *((torch.complex128,) if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else ()), - *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater, - include_half=SM53OrLater)) + @dtypesIfCUDA(*floating_types_and(torch.complex64, + *[torch.bfloat16] if SM80OrLater else [], + *[torch.half] if SM53OrLater else [], + *[torch.complex128] if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else [])) @skipCUDAIf( not _check_cusparse_spgemm_available(), "cuSparse Generic API SpGEMM is not available" @@ -950,32 +1083,32 @@ def maybe_transpose(cond, m): m2 = maybe_transpose(t3, torch.randn(50, 25, device=device).to(dtype)) _test_addmm_addmv(self, torch.addmm, M, m1, m2, transpose_out=t4, layout=torch.sparse_csr, mode="dense_result") + @parametrize("k", [0, 1, 8]) + @parametrize("n", [0, 1, 10]) + @parametrize("m", [0, 1, 25]) @skipCPUIfNoMklSparse @dtypes(*floating_and_complex_types()) - @dtypesIfCUDA(torch.complex64, - *((torch.complex128,) if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else ()), - *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater, - include_half=SM53OrLater)) + @dtypesIfCUDA(*floating_types_and(torch.complex64, + *[torch.bfloat16] if SM80OrLater else [], + *[torch.half] if SM53OrLater else [], + *[torch.complex128] if CUSPARSE_SPMM_COMPLEX128_SUPPORTED else [])) @skipCUDAIf( not _check_cusparse_spgemm_available(), "cuSparse Generic API SpGEMM is not available" ) @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6, torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8}) - def test_addmm_sizes_all_sparse_csr(self, device, dtype): - for m in [0, 1, 25]: - for n in [0, 1, 10]: - for k in [0, 1, 8]: - M = torch.randn(n, m, device=device).to(dtype) - m1 = torch.randn(n, k, device=device).to(dtype) - m2 = torch.randn(k, m, device=device).to(dtype) - _test_addmm_addmv(self, torch.addmm, M, m1, m2, layout=torch.sparse_csr, mode="all_sparse") - - M = torch.randn(n, m, device=device).to(dtype).to_sparse_csr() - m1 = torch.randn(n, k + 1, device=device).to(dtype).to_sparse_csr() - m2 = torch.randn(k, m, device=device).to(dtype).to_sparse_csr() - self.assertRaisesRegex(RuntimeError, f"{n}x{k + 1}.*{k}x{m}", lambda: torch.addmm(M, m1, m2)) - self.assertRaisesRegex(RuntimeError, f"{n}x{k + 1}.*{k}x{m}", lambda: torch.mm(m1, m2)) + def test_addmm_sizes_all_sparse_csr(self, device, dtype, m, n, k): + M = torch.randn(n, m, device=device).to(dtype) + m1 = torch.randn(n, k, device=device).to(dtype) + m2 = torch.randn(k, m, device=device).to(dtype) + _test_addmm_addmv(self, torch.addmm, M, m1, m2, layout=torch.sparse_csr, mode="all_sparse") + + M = torch.randn(n, m, device=device).to(dtype).to_sparse_csr() + m1 = torch.randn(n, k + 1, device=device).to(dtype).to_sparse_csr() + m2 = torch.randn(k, m, device=device).to(dtype).to_sparse_csr() + self.assertRaisesRegex(RuntimeError, f"{n}x{k + 1}.*{k}x{m}", lambda: torch.addmm(M, m1, m2)) + self.assertRaisesRegex(RuntimeError, f"{n}x{k + 1}.*{k}x{m}", lambda: torch.mm(m1, m2)) @skipCPUIfNoMklSparse @dtypes(torch.float) @@ -1051,6 +1184,9 @@ def test2(*, is_sparse): @dtypes(torch.float, torch.double) def test_add(self, device, dtype): def _test_spadd_shape(nnz, shape): 
+ # sparse.to_dense() uses torch.add internally so if torch.add is wrong, + # the dense tensor will be wrong but this test would still pass + # there's a separate test that checks for the correctness of the .to_dense() call x = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=torch.int32) y = torch.randn(*shape, dtype=dtype, device=device) r = random.random() @@ -1072,10 +1208,42 @@ def _test_spadd_shape(nnz, shape): self.assertEqual(res, expected) - _test_spadd_shape(10, [100, 100]) - _test_spadd_shape(0, [100, 100]) - _test_spadd_shape(10, [100, 1]) - _test_spadd_shape(10, [1, 100]) + ns = [2, 5] + batch_shapes = [(), (2,), (2, 3)] + for b, m, n in itertools.product(batch_shapes, ns, ns): + _test_spadd_shape(0, (*b, m, n)) + _test_spadd_shape(m * n // 2, (*b, m, n)) + _test_spadd_shape(m * n, (*b, m, n)) + + @dtypes(torch.float, torch.double) + def test_mul(self, device, dtype): + def _test_spadd_shape(fn, nnz, shape): + x = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=torch.int32) + y = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=torch.int32) + + res = fn(y, x) + expected = fn(y.to_dense(), x.to_dense()).to_sparse_csr() + self.assertEqual(res, expected) + + _test_spadd_shape(torch.mul, 100, [100, 100]) + _test_spadd_shape(torch.mul, 0, [100, 100]) + _test_spadd_shape(torch.mul, 100, [100, 1]) + _test_spadd_shape(torch.mul, 100, [1, 100]) + + s = torch.sparse_coo_tensor([[0], [1]], [5.0], (2, 3), device=device) + s = s.to_sparse_csr() + t23 = s.to_dense() + + if device == 'cpu': + with self.assertRaisesRegex(RuntimeError, r"mul\(sparse_csr, dense\) is not supported"): + s * t23 + with self.assertRaisesRegex(RuntimeError, r"mul\(dense, sparse_csr\) is not supported"): + t23 * s + elif device == 'cuda': + with self.assertRaisesRegex(NotImplementedError, "CUDA"): + s * t23 + with self.assertRaisesRegex(NotImplementedError, "CUDA"): + t23 * s @skipCPUIfNoMklSparse @dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128) @@ -1297,7 +1465,7 @@ def test_sampled_addmm_errors(self, device, dtype): torch.sparse.sampled_addmm(a_sparse, a, a_sparse) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_coo_csr_conversion(self, device, dtype): for m, n in itertools.product([5, 2, 0], [5, 2, 0]): size = (m, n) @@ -1308,7 +1476,7 @@ def test_coo_csr_conversion(self, device, dtype): self.assertEqual(csr_sparse.to_dense(), dense) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_csr_coo_conversion(self, device, dtype): for m, n in itertools.product([5, 2, 0], [5, 2, 0]): size = (m, n) @@ -1332,7 +1500,9 @@ def test_sparse_csr_consistency(self, device, dtype, op): # Sparse CSR only supports 2D tensors as inputs if sample.input.ndim != 2: continue - + # Reductions on sparse CSR require keepdim=True + if isinstance(op, ReductionOpInfo): + continue expected = op(sample.input) assert torch.is_tensor(expected) output = op(sample.input.to_sparse_csr()) @@ -1389,10 +1559,7 @@ def test_sparse_csr_unary_out(self, device, dtype, op): index_dtype=sample.input.crow_indices().dtype) op(sample.input, *sample.args, **sample.kwargs, out=out) - self.assertEqual(out.values(), expect.values()) - self.assertEqual(out.crow_indices(), expect.crow_indices()) - self.assertEqual(out.col_indices(), expect.col_indices()) - self.assertEqual(out._nnz(), expect._nnz()) + 
self.assertEqual(out, expect) @ops(sparse_csr_unary_ufuncs) def test_sparse_csr_unary_inplace(self, device, dtype, op): @@ -1424,10 +1591,7 @@ def test_sparse_csr_unary_inplace(self, device, dtype, op): actual = op.inplace_variant(sample.input, *sample.args, **sample.kwargs) self.assertIs(actual, sample.input) - self.assertEqual(actual.values(), expect.values()) - self.assertEqual(actual.crow_indices(), expect.crow_indices()) - self.assertEqual(actual.col_indices(), expect.col_indices()) - self.assertEqual(actual._nnz(), expect._nnz()) + self.assertEqual(actual, expect) @unittest.expectedFailure @ops(sparse_csr_unary_ufuncs, dtypes=OpDTypes.supported, allowed_dtypes=[torch.double, torch.cdouble]) @@ -1469,7 +1633,8 @@ def test_autograd_dense_output_addmm(self, device, dtype): raise ValueError("Expected at least one 2D tensor in samples to convert to sparse.") for sample in samples: - a = sample.args[0].to_sparse_csr() + # TODO: Remove detach once we have autograd support for CSR input + a = sample.args[0].to_sparse_csr().detach() for addmm in [torch.addmm, torch.sparse.addmm]: @@ -1500,7 +1665,8 @@ def test_autograd_dense_output_addmv(self, device, dtype): raise ValueError("Expected at least one 2D tensor in samples to convert to sparse.") for sample in samples: - a = sample.args[0].to_sparse_csr() + # TODO: Remove detach once we have autograd support for CSR input + a = sample.args[0].to_sparse_csr().detach() def fn(c, b): output = torch.addmv(c, a, b, **sample.kwargs) @@ -1532,7 +1698,8 @@ def test_autograd_dense_output(self, device, dtype, op): # Here we assume that the signature is op(sparse_input, dense_input) -> dense_output for sample in samples: - sparse_input = sample.input.to_sparse_csr() + # TODO: Remove detach once we have autograd support for CSR input + sparse_input = sample.input.to_sparse_csr().detach() def fn(*args): output = op.gradcheck_wrapper(op.get_op(), sparse_input, *args, **sample.kwargs) @@ -1546,7 +1713,7 @@ def fn(*args): args = [make_tensor(a.shape, device=device, dtype=dtype, noncontiguous=True, requires_grad=True) for a in sample.args] self.assertTrue(torch.autograd.gradcheck(fn, args, fast_mode=True)) - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex()) def test_direct_coo_csr_conversion(self, device, dtype): for m, n in itertools.product([5, 2, 0], [5, 2, 0]): size = (m, n) @@ -1556,7 +1723,27 @@ def test_direct_coo_csr_conversion(self, device, dtype): self.assertEqual(coo_sparse.to_sparse_csr().to_sparse_coo(), coo_sparse) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) + def test_sum(self, device, dtype): + def run_test(shape, nnz, index_type): + a = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=index_dtype) + self.assertEqual(a.sum(), a.values().sum()) + if dtype in floating_types(): + a.requires_grad_(True) + with self.assertRaisesRegex(RuntimeError, + ("Function SumBackward0 returned an invalid gradient at " + + "index 0 - expected layout SparseCsr but got Strided")): + a.sum().backward() + for shape, index_dtype in itertools.product( + [(10, 5), (10, 10)], + [torch.int32, torch.int64]): + run_test(shape, 0, index_dtype) + run_test(shape, max(shape), index_dtype) + run_test(shape, shape[0] * shape[1], index_dtype) + + + @skipMeta + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_transpose(self, device, dtype): def run_test(shape, nnz, index_type, dim0, dim1): @@ -1577,16 +1764,14 @@ def 
run_test(shape, nnz, index_type, dim0, dim1): # TODO: This is a stopgap for a rigorous extension of our autograd tests # to test the functionality of detach @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_exercise_detach(self, device, dtype): shape = (3, 3) nnz = 4 for index_dtype in [torch.int32, torch.int64]: inp = self.genSparseCSRTensor(shape, nnz, dtype=dtype, device=device, index_dtype=index_dtype) detached_inp = inp.detach() - self.assertEqual(inp.values(), detached_inp.values()) - self.assertEqual(inp.crow_indices(), detached_inp.crow_indices()) - self.assertEqual(inp.col_indices(), detached_inp.col_indices()) + self.assertEqual(inp, detached_inp) diff --git a/test/test_spectral_ops.py b/test/test_spectral_ops.py index c11b87b507aec1..344c810bd4bd44 100644 --- a/test/test_spectral_ops.py +++ b/test/test_spectral_ops.py @@ -13,7 +13,7 @@ (TestCase, run_tests, TEST_NUMPY, TEST_LIBROSA, TEST_MKL) from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, ops, dtypes, onlyNativeDeviceTypes, - skipCPUIfNoFFT, deviceCountAtLeast, onlyCUDA, OpDTypes, skipIf) + skipCPUIfNoFFT, skipCUDAIfRocm, deviceCountAtLeast, onlyCUDA, OpDTypes, skipIf) from torch.testing._internal.common_methods_invocations import ( spectral_funcs, SpectralFuncInfo, SpectralFuncType) @@ -204,6 +204,7 @@ def get_op_name(op): else: return (input, s, dim, norm) + @skipCUDAIfRocm @onlyNativeDeviceTypes @ops([op for op in spectral_funcs if op.ndimensional == SpectralFuncType.OneD]) def test_reference_1d(self, device, dtype, op): @@ -367,6 +368,7 @@ def test_fft_half_and_bfloat16_errors(self, device, dtype, op): op(x) # nd-fft tests + @skipCUDAIfRocm @onlyNativeDeviceTypes @unittest.skipIf(not TEST_NUMPY, 'NumPy not found') @ops([op for op in spectral_funcs if op.ndimensional == SpectralFuncType.ND]) diff --git a/test/test_tensor_creation_ops.py b/test/test_tensor_creation_ops.py index abb9710363cfe5..27a91c398b2679 100644 --- a/test/test_tensor_creation_ops.py +++ b/test/test_tensor_creation_ops.py @@ -20,8 +20,10 @@ onlyCPU, largeTensorTest, precisionOverride, dtypes, onlyCUDA, skipCPUIf, dtypesIfCUDA, skipMeta, get_all_device_types) from torch.testing._internal.common_dtype import ( - get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes + all_types_and_complex_and, get_all_math_dtypes, all_types_and, floating_and_complex_types, + floating_types, floating_and_complex_types_and, integral_types_and ) +from torch.testing._creation import float_to_corresponding_complex_type_map from torch.utils.dlpack import to_dlpack @@ -147,7 +149,7 @@ def test_vander_types(self, device, dtype): exact_dtype=False) def test_cat_all_dtypes_and_devices(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): x = torch.tensor([[1, 2], [3, 4]], dtype=dt, device=device) expected1 = torch.tensor([[1, 2], [3, 4], [1, 2], [3, 4]], dtype=dt, device=device) @@ -157,7 +159,7 @@ def test_cat_all_dtypes_and_devices(self, device): self.assertEqual(torch.cat((x, x), 1), expected2) def test_fill_all_dtypes_and_devices(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): for x in [torch.tensor((10, 10), dtype=dt, device=device), torch.empty(10000, dtype=dt, device=device)]: # large tensor numel = x.numel() @@ -311,7 +313,7 @@ def run_test(shape, device, diagonal, dtype): 
(3, 1), (5, 3, 1), (7, 5, 3, 1), # very fat matrices (1, 3), (5, 1, 3), (7, 5, 1, 3), # very thin matrices (1, 3, 3, 3), (3, 1, 3, 3, 3)] # unsqueezed batch dimensions - dtypes = [dtype for dtype in get_all_dtypes() if dtype != torch.bfloat16] + dtypes = all_types_and_complex_and(torch.half, torch.bool) for s, d, dtype in product(shapes, diagonals, dtypes): run_test(s, device, d, dtype) @@ -508,12 +510,12 @@ def test_block_diag_scipy(self, device): self.assertEqual(torch_result, scipy_result) @onlyNativeDeviceTypes - @dtypes(torch.float32, torch.float64) + @dtypes(torch.half, torch.float32, torch.float64) def test_torch_complex(self, device, dtype): real = torch.tensor([1, 2], device=device, dtype=dtype) imag = torch.tensor([3, 4], device=device, dtype=dtype) z = torch.complex(real, imag) - complex_dtype = torch.complex64 if dtype == torch.float32 else torch.complex128 + complex_dtype = float_to_corresponding_complex_type_map[dtype] self.assertEqual(torch.tensor([1.0 + 3.0j, 2.0 + 4.0j], dtype=complex_dtype), z) @onlyNativeDeviceTypes @@ -531,12 +533,12 @@ def test_torch_polar(self, device, dtype): @onlyNativeDeviceTypes @dtypes(torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, - torch.float16, torch.complex64, torch.complex128, torch.bool) + torch.complex64, torch.complex128, torch.bool) def test_torch_complex_floating_dtype_error(self, device, dtype): for op in (torch.complex, torch.polar): a = torch.tensor([1, 2], device=device, dtype=dtype) b = torch.tensor([3, 4], device=device, dtype=dtype) - error = r"Expected both inputs to be Float or Double tensors but " \ + error = r"Expected both inputs to be Half, Float or Double tensors but " \ r"got [A-Za-z]+ and [A-Za-z]+" with self.assertRaisesRegex(RuntimeError, error): op(a, b) @@ -1009,8 +1011,7 @@ def _test_special_stacks(self, dim, at_least_dim, torch_fn, np_fn, device, dtype np_fn(np_input) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half)) def test_hstack_column_stack(self, device, dtype): ops = ((torch.hstack, np.hstack), (torch.column_stack, np.column_stack)) for torch_op, np_op in ops: @@ -1029,8 +1030,7 @@ def test_hstack_column_stack(self, device, dtype): torch_result) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half)) def test_vstack_row_stack(self, device, dtype): ops = ((torch.vstack, np.vstack), (torch.row_stack, np.row_stack)) for torch_op, np_op in ops: @@ -1047,8 +1047,7 @@ def test_vstack_row_stack(self, device, dtype): self.assertEqual(actual, expected) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half)) def test_dstack(self, device, dtype): self._test_special_stacks(2, 3, torch.dstack, np.dstack, device, dtype) for i in range(5): @@ -1600,6 +1599,10 @@ def test_cartesian_prod(self, device): def test_combinations(self, device): a = torch.tensor([1, 2, 3], device=device) + c = torch.combinations(a, r=0) + expected = torch.empty(0, dtype=a.dtype, device=device) + self.assertEqual(c, expected) + c = torch.combinations(a, r=1) expected = torch.tensor(list(combinations(a, r=1)), device=device) self.assertEqual(c, expected) @@ -1752,7 +1755,7 @@ def test_random_from_to_bool(self, device): lambda: t.random_(from_, to_) ) - 
@dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.bfloat16, torch.half)) def test_random_full_range(self, device, dtype): size = 2000 alpha = 0.1 @@ -1786,7 +1789,7 @@ def test_random_full_range(self, device, dtype): self.assertTrue(from_ <= t.to(torch.double).min() < (from_ + delta)) self.assertTrue((to_inc_ - delta) < t.to(torch.double).max() <= to_inc_) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.bfloat16, torch.half)) def test_random_from_to(self, device, dtype): size = 2000 alpha = 0.1 @@ -1875,7 +1878,7 @@ def test_random_from_to(self, device, dtype): lambda: t.random_(from_, to_) ) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.bfloat16, torch.half)) def test_random_to(self, device, dtype): size = 2000 alpha = 0.1 @@ -1933,7 +1936,7 @@ def test_random_to(self, device, dtype): lambda: t.random_(from_, to_) ) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.bfloat16, torch.half)) def test_random_default(self, device, dtype): size = 2000 alpha = 0.1 @@ -2124,13 +2127,7 @@ def test_constructor_dtypes(self, device): self.assertRaises(TypeError, lambda: torch.set_default_tensor_type(torch.float32)) # don't allow passing dtype to set_default_dtype - for t in get_all_dtypes( - include_half=True, - include_bfloat16=True, - include_bool=True, - include_complex=True, - include_complex32=True, - include_qint=True): + for t in all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16, torch.qint8): # only floating-point types are supported as the default type if t in ( torch.half, @@ -2668,8 +2665,17 @@ def test_empty_tensor_props(self, device): y = torch.empty(tuple(size_ones_instead_of_zeros), device=device) self.assertEqual(x.stride(), y.stride()) + @onlyNativeDeviceTypes + def test_empty_overflow(self, device): + with self.assertRaisesRegex(RuntimeError, 'Storage size calculation overflowed'): + torch.empty([2, 4, 2**29, 2**29], dtype=torch.float64) + with self.assertRaisesRegex(RuntimeError, 'Storage size calculation overflowed'): + torch.empty([8, 8, 2**29, 2**29], dtype=torch.float64) + with self.assertRaisesRegex(RuntimeError, 'Storage size calculation overflowed'): + torch.empty_strided([8, 8], [2**61, 1], dtype=torch.float64) + def test_eye(self, device): - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): if dtype == torch.bfloat16: continue # Test the RuntimeError is raised when either m or n is a negative number @@ -2702,8 +2708,7 @@ def test_eye(self, device): self.assertEqual(res1, res2) @precisionOverride({torch.float: 1e-8, torch.double: 1e-10}) - @dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*floating_and_complex_types()) def test_linspace_vs_numpy(self, device, dtype): start = -0.0316082797944545745849609375 + (0.8888888888j if dtype.is_complex else 0) end = .0315315723419189453125 + (0.444444444444j if dtype.is_complex else 0) @@ -2740,7 +2745,7 @@ def test_logspace_vs_numpy_complex(self, device, dtype): device, dtype) @precisionOverride({torch.float: 1e-6, torch.double: 1e-10}) - @dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False)) + @dtypes(*floating_types()) def test_logspace_vs_numpy(self, device, dtype): start = -0.0316082797944545745849609375 end = .0315315723419189453125 @@ -2832,8 +2837,6 @@ def test_signal_window_functions(self, device, dtype, window): 
self._test_signal_window_functions(window, dtype, device) @onlyNativeDeviceTypes - # See https://github.com/pytorch/pytorch/issues/72630 - @skipMeta @precisionOverride({torch.bfloat16: 5e-2, torch.half: 1e-3}) @unittest.skipIf(not TEST_SCIPY, "Scipy not found") @dtypesIfCUDA(torch.float, torch.double, torch.bfloat16, torch.half, torch.long) @@ -2847,7 +2850,7 @@ def test_tensor_factories_empty(self, device): shapes = [(5, 0, 1), (0,), (0, 0, 1, 0, 2, 0, 0)] for shape in shapes: - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): self.assertEqual(shape, torch.zeros(shape, device=device, dtype=dt).shape) self.assertEqual(shape, torch.zeros_like(torch.zeros(shape, device=device, dtype=dt)).shape) @@ -2933,8 +2936,8 @@ def test_arange_bfloat16(self, device): bfloat16_tensor = torch.arange(0, 6, step=2, dtype=torch.bfloat16, device=device) self.assertEqual(ref_tensor, bfloat16_tensor) - @dtypes(*get_all_dtypes(include_bool=False, include_half=False)) - @dtypesIfCUDA(*get_all_dtypes(include_bool=False, include_half=True)) + @dtypes(*all_types_and_complex_and(torch.bfloat16)) + @dtypesIfCUDA(*all_types_and_complex_and(torch.bfloat16)) def test_linspace(self, device, dtype): _from = random.random() to = _from + random.random() @@ -3051,12 +3054,12 @@ def _test_linspace(self, device, dtype, steps): # See NOTE [Linspace+Logspace precision override] @skipCPUIf(True, "compares with CPU") @precisionOverride({torch.half: 0.0039 + LINSPACE_LOGSPACE_EXTRA_EPS}) - @dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes())) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) def test_linspace_device_vs_cpu(self, device, dtype): self._test_linspace(device, dtype, steps=10) @skipCPUIf(True, "compares with CPU") - @dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes())) + @dtypes(*floating_and_complex_types_and(torch.half, torch.bfloat16)) def test_linspace_special_steps(self, device, dtype): for steps in self.LINSPACE_LOGSPACE_SPECIAL_STEPS: self._test_linspace(device, dtype, steps=steps) @@ -3097,10 +3100,9 @@ def test_logspace_special_steps(self, device, dtype): self._test_logspace(device, dtype, steps=steps) self._test_logspace_base2(device, dtype, steps=steps) - @dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False)) - @dtypesIfCUDA(*((get_all_int_dtypes() + [torch.float32, torch.float16, torch.bfloat16]) - if TEST_WITH_ROCM - else get_all_dtypes(include_bool=False, include_half=True, include_complex=False))) + @dtypes(*all_types_and(torch.bfloat16)) + @dtypesIfCUDA(*integral_types_and(torch.half, torch.bfloat16, torch.float32) if TEST_WITH_ROCM else + all_types_and(torch.half, torch.bfloat16)) def test_logspace(self, device, dtype): _from = random.random() to = _from + random.random() @@ -3898,7 +3900,7 @@ def check(**kwargs): # data pointer (which is basically the point here), since they all # return 0. @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_alias_from_tensor(self, device, dtype): self._test_alias_with_cvt(identity, device, dtype) @@ -3909,7 +3911,7 @@ def test_alias_from_numpy(self, device, dtype): # Skipping 'meta', since 'to_dlpack' does not work for them. 
@skipMeta - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_alias_from_dlpack(self, device, dtype): self._test_alias_with_cvt(to_dlpack, device, dtype) @@ -3941,13 +3943,13 @@ def check(**kwargs): # Copy is forced because of different dtype if not only_with_dtype: - for other in get_all_dtypes(): + for other in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): if dtype != other: check(same_dtype=False, dtype=other) check(same_dtype=False, dtype=other, copy=True) @skipMeta - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_copy_tensor(self, device, dtype): self._test_copy_with_cvt(identity, device, dtype) @@ -3957,7 +3959,7 @@ def test_copy_from_numpy(self, device, dtype): self._test_copy_with_cvt(to_numpy, device, dtype) @skipMeta - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_copy_from_dlpack(self, device, dtype): self._test_copy_with_cvt(to_dlpack, device, dtype) @@ -3980,17 +3982,17 @@ def check(**kwargs): @onlyCUDA @deviceCountAtLeast(2) - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_copy_from_tensor_mult_devices(self, devices, dtype): self._test_copy_mult_devices(devices, dtype, identity) @onlyCUDA @deviceCountAtLeast(2) - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_copy_from_dlpack_mult_devices(self, devices, dtype): self._test_copy_mult_devices(devices, dtype, to_dlpack) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_copy_list(self, device, dtype): original = make_tensor((5, 5), dtype=dtype, device=torch.device("cpu")) @@ -4071,6 +4073,8 @@ def test_astensor_consistency(self, device): [0.0, True, False, 42], # With Complex [0.0, True, False, 42, 5j], + # With Range + range(5), ] for e in examples: diff --git a/test/test_tensorboard.py b/test/test_tensorboard.py index 4300e9a71006bf..7f34fd90dd5c14 100644 --- a/test/test_tensorboard.py +++ b/test/test_tensorboard.py @@ -562,15 +562,15 @@ def forward(self, x): expected_proto = GraphDef() text_format.Parse(expected_str, expected_proto) - self.assertEquals(len(expected_proto.node), len(actual_proto.node)) + self.assertEqual(len(expected_proto.node), len(actual_proto.node)) for i in range(len(expected_proto.node)): expected_node = expected_proto.node[i] actual_node = actual_proto.node[i] - self.assertEquals(expected_node.name, actual_node.name) - self.assertEquals(expected_node.op, actual_node.op) - self.assertEquals(expected_node.input, actual_node.input) - self.assertEquals(expected_node.device, actual_node.device) - self.assertEquals( + self.assertEqual(expected_node.name, actual_node.name) + self.assertEqual(expected_node.op, actual_node.op) + self.assertEqual(expected_node.input, actual_node.input) + self.assertEqual(expected_node.device, actual_node.device) + self.assertEqual( sorted(expected_node.attr.keys()), sorted(actual_node.attr.keys())) def test_nested_nn_squential(self): diff --git a/test/test_tensorexpr.py b/test/test_tensorexpr.py index 42ca49dc347574..8a5e918eda4b97 100644 --- a/test/test_tensorexpr.py +++ b/test/test_tensorexpr.py @@ -13,11 +13,13 @@ class BaseTestClass(JitTestCase): def setUp(self): + super(BaseTestClass, self).setUp() self.tensorexpr_options = 
TensorExprTestOptions()
         self.devices = ['cpu'] if not torch.cuda.is_available() else ['cpu', 'cuda']
 
     def tearDown(self):
         self.tensorexpr_options.restore()
+        super(BaseTestClass, self).tearDown()
 
     def assertLastGraphAllFused(self):
         self.assertAllFused(torch.jit.last_executed_optimized_graph())
diff --git a/test/test_testing.py b/test/test_testing.py
index 948890f87f5c6e..2ccb6ff3628237 100644
--- a/test/test_testing.py
+++ b/test/test_testing.py
@@ -22,14 +22,13 @@
     deviceCountAtLeast, ops, expectedFailureMeta)
 from torch.testing._internal.common_methods_invocations import op_db
 import torch.testing._internal.opinfo_helper as opinfo_helper
-from torch.testing._internal.common_dtype import get_all_dtypes
+from torch.testing._internal.common_dtype import all_types_and_complex_and
 from torch.testing._internal.common_modules import modules, module_db
 
 # For testing TestCase methods and torch.testing functions
 class TestTesting(TestCase):
     # Ensure that assertEqual handles numpy arrays properly
-    @dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
-                             include_bool=True, include_complex=True)))
+    @dtypes(*all_types_and_complex_and(torch.bool, torch.half))
     def test_assertEqual_numpy(self, device, dtype):
         S = 10
         test_sizes = [
@@ -279,6 +278,11 @@ def check(size, low, high, requires_grad, noncontiguous):
             check(size, None, None, False, False)
             check(size, 2, 4, True, True)
 
+    def test_make_tensor_complex32(self, device):
+        # verify that we can generate torch.complex32 tensor
+        t = make_tensor((1, 2, 3), dtype=torch.complex32, device=device)
+        self.assertEqual(t.dtype, torch.complex32)
+
     # The following tests (test_cuda_assert_*) are added to ensure test suite terminates early
     # when CUDA assert was thrown. Because all subsequent test will fail if that happens.
     # These tests are slow because it spawn another process to run test suite.
@@ -403,7 +407,7 @@ def test_get_supported_dtypes(self, device): ops_to_test = list(filter(lambda op: op.name in ['atan2', 'topk', 'xlogy'], op_db)) for op in ops_to_test: - dynamic_dtypes = opinfo_helper.get_supported_dtypes(op.op, op.sample_inputs_func, self.device_type) + dynamic_dtypes = opinfo_helper.get_supported_dtypes(op, op.sample_inputs_func, self.device_type) dynamic_dispatch = opinfo_helper.dtypes_dispatch_hint(dynamic_dtypes) if self.device_type == 'cpu': dtypes = op.dtypesIfCPU diff --git a/test/test_torch.py b/test/test_torch.py index 67f820457b7496..4f4e53f3e7487c 100644 --- a/test/test_torch.py +++ b/test/test_torch.py @@ -52,9 +52,11 @@ from typing import Tuple import torch.backends.quantized import torch.testing._internal.data -from torch.testing._internal.common_cuda import tf32_on_and_off, tf32_is_not_fp32 +from torch.testing._internal.common_cuda import ( + tf32_on_and_off, tf32_is_not_fp32, TEST_CUDNN) from torch.testing._internal.common_dtype import ( - get_all_fp_dtypes, get_all_int_dtypes, get_all_math_dtypes, get_all_dtypes, get_all_complex_dtypes + floating_types_and, get_all_math_dtypes, all_types_and_complex_and, complex_types, + all_types_and, floating_types, floating_and_complex_types, integral_types, ) # Protects against includes accidentally setting the default dtype @@ -116,19 +118,6 @@ def test_cuda_vitals_gpu_only(self, device): class TestTorchDeviceType(TestCase): exact_dtype = True - # FIXME: Port this to ErrorInputs on where - @onlyCUDA - @dtypes(torch.float32) - def test_where_invalid_device(self, device, dtype): - for devices in [('cpu', device, device), (device, 'cpu', 'cpu'), - (device, 'cpu', device), ('cpu', device, 'cpu')]: - condition = make_tensor(16, device=devices[0], dtype=torch.float32) - x = make_tensor(16, device=devices[1], dtype=torch.float32) - y = make_tensor(16, device=devices[2], dtype=torch.float32) - with self.assertRaisesRegex(RuntimeError, - "Expected condition, x and y to be on the same device"): - torch.where(condition, x, y) - # TODO: move all tensor creation to common ops def _rand_shape(self, dim, min_size, max_size): shape = [] @@ -233,7 +222,17 @@ def test_storage_setitem(self, device, dtype): self.assertEqual(s, storage_type(l)) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) + def test_tensor_storage_type(self, device, dtype): + a = make_tensor((10,), dtype=dtype, device=device, low=-9, high=9) + + module = torch.cuda if (torch.device(device).type == 'cuda') else torch + expected_storage_type = getattr(module, torch.storage._dtype_to_storage_type_map()[dtype]) + + self.assertEqual(a.storage_type(), expected_storage_type) + + @onlyNativeDeviceTypes + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_tensor_from_storage(self, device, dtype): a = make_tensor((4, 5, 3), dtype=dtype, device=device, low=-9, high=9) a_s = a.storage() @@ -242,7 +241,7 @@ def test_tensor_from_storage(self, device, dtype): c = torch.tensor(a_s._untyped(), device=device, dtype=dtype).reshape(a.size()) self.assertEqual(a, c) - for error_dtype in get_all_dtypes(): + for error_dtype in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): if error_dtype == dtype: continue with self.assertRaisesRegex(RuntimeError, r'Expected a Storage of type'): @@ -250,7 +249,7 @@ def test_tensor_from_storage(self, device, dtype): torch.tensor(error_storage, device=device, dtype=dtype) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) 
+ @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_set_storage(self, device, dtype): a = make_tensor((4, 5, 3), dtype=dtype, device=device, low=-9, high=9) a_s = a.storage() @@ -259,7 +258,7 @@ def test_set_storage(self, device, dtype): c = torch.tensor([], device=device, dtype=dtype).set_(a_s._untyped()).reshape(a.size()) self.assertEqual(a, c) - for error_dtype in get_all_dtypes(): + for error_dtype in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): if error_dtype == dtype: continue with self.assertRaisesRegex(RuntimeError, r'Expected a Storage of type'): @@ -460,26 +459,12 @@ def test_scalar_check(self, device): self.assertEqual((), torch.cummax(zero_d, 0)[0].shape) self.assertEqual((), torch.cummin(zero_d, 0)[0].shape) - # renorm - self.assertRaises(RuntimeError, lambda: torch.renorm(zero_d, 0.5, 0, 1.0)) - # sort, topk self.assertEqual([(), ()], [x.shape for x in torch.sort(zero_d, 0, False)]) self.assertEqual([(), ()], [x.shape for x in torch.sort(zero_d, 0, True)]) self.assertEqual([(), ()], [x.shape for x in torch.topk(zero_d, 1, 0, False)]) self.assertEqual([(), ()], [x.shape for x in torch.topk(zero_d, 1, 0, True)]) - # lstsq (gels) - self.assertRaises(RuntimeError, lambda: torch.lstsq(zero_d, zero_d)) - - # eig - self.assertRaises(RuntimeError, lambda: torch.eig(zero_d, False)) - self.assertRaises(RuntimeError, lambda: torch.eig(zero_d, True)) - - # this is only implemented on cpu - if (torch.device(device).type == 'cpu'): - self.assertRaises(RuntimeError, lambda: torch.ormqr(zero_d, zero_d, zero_d)) - # max, min self.assertEqual((), torch.max(zero_d, zero_d).shape) self.assertEqual((1,), torch.max(one_d, zero_d).shape) @@ -488,9 +473,6 @@ def test_scalar_check(self, device): self.assertEqual((1,), torch.min(one_d, zero_d).shape) self.assertEqual((1,), torch.min(zero_d, one_d).shape) - # diag - self.assertRaises(RuntimeError, lambda: torch.diag(zero_d)) - zero_d_int = torch.tensor(1, device=device) one_d_int = torch.tensor([1], device=device) @@ -1415,15 +1397,55 @@ def backward_func(slf, device): backward_func(self, device) - def test_embedding_scalar_weight_error(self, device): - indices = torch.rand(2, 2, device=device).long() - weights = [ - torch.tensor(1.0, device=device), - torch.tensor(1.0, device=device).reshape(1, 1, 1), - ] - for weight in weights: - with self.assertRaisesRegex(RuntimeError, "'weight' must be 2-D"): - torch.embedding(weight, indices) + def test_invalid_shapes_grid_sampler(self, device): + make_arg = partial( + make_tensor, device=device, dtype=torch.float64, requires_grad=True) + + inputs = ( + # input, grid + ((5, 5, 5, 5, 5,), (1, 1, 1, 4, 4,)), # 3d + ((5, 5, 5, 5,), (1, 1, 4, 4,)), # 2d + ) + + interpolation_mode = 0 + padding_mode = 0 + align_corners = True + + err = "expected grid and input to have same batch size" + + for input, grid in inputs: + input = make_arg(input) + grid = make_arg(grid, low=-1, high=1) + + # Wrapper for the 2d, 3d, and cuDNN functions listed below. + with self.assertRaisesRegex(RuntimeError, err): + torch.grid_sampler( + input, grid, interpolation_mode, padding_mode, + align_corners) + + # Expects 2d input. + with self.assertRaisesRegex(RuntimeError, err): + torch.grid_sampler_2d( + input, grid, interpolation_mode, padding_mode, + align_corners) + + # Expects 3d input. + with self.assertRaisesRegex(RuntimeError, err): + torch.grid_sampler_3d( + input, grid, interpolation_mode, padding_mode, + align_corners) + + # Expects 2d input. 
+ with self.assertRaisesRegex(RuntimeError, err): + torch._grid_sampler_2d_cpu_fallback( + input, grid, interpolation_mode, padding_mode, + align_corners) + + # Expects 2d input, on CUDA. + # Doesn't work on CPU and ROCm. + if device != 'cpu' and TEST_CUDNN and not TEST_WITH_ROCM: + with self.assertRaisesRegex(RuntimeError, err): + torch.cudnn_grid_sampler(input, grid) def test_dist(self, device): def run_test(x, y): @@ -1592,13 +1614,13 @@ def _cond_fn(x): _sync_raises_helper(f, level) - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_log_normal(self, device, dtype): a = torch.tensor([10], dtype=dtype, device=device).log_normal_() self.assertEqual(a.dtype, dtype) self.assertEqual(a.size(), torch.Size([1])) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_geometric(self, device, dtype): a = torch.tensor([10], dtype=dtype, device=device).geometric_(0.5) self.assertEqual(a.dtype, dtype) @@ -1630,9 +1652,9 @@ def test_repeat_interleave(self, device): self.assertEqual(a_with_output.dtype, y.dtype) self.assertEqual(a_with_output.size(), torch.Size([3, 2])) - @dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False)) - @dtypesIfCPU(*(get_all_fp_dtypes(include_half=False, include_bfloat16=True))) - @dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*floating_types()) + @dtypesIfCPU(*floating_types_and(torch.bfloat16)) + @dtypesIfCUDA(*floating_types_and(torch.half)) def test_bernoulli_p(self, device, dtype): for trivial_p in ([0, 1], [1, 0, 1, 1, 0, 1]): x = torch.tensor(trivial_p, dtype=dtype, device=device) @@ -1652,9 +1674,9 @@ def isBinary(t): self.assertTrue(isBinary(p)) # RngUniform not implemented for Integral type in XLA test - @dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False))) - @dtypesIfCPU(*(get_all_dtypes(include_half=False, include_bfloat16=False, include_complex=False))) - @dtypesIfCUDA(*(get_all_dtypes(include_bfloat16=False, include_complex=False))) + @dtypes(*floating_types()) + @dtypesIfCPU(*all_types_and(torch.bool)) + @dtypesIfCUDA(*all_types_and(torch.bool, torch.half)) def test_bernoulli_self(self, device, dtype): def isBinary(t): @@ -1666,7 +1688,7 @@ def isBinary(t): t.bernoulli_(0.5) self.assertTrue(isBinary(t)) - for p_dtype in get_all_fp_dtypes(include_half=device.startswith('cuda'), include_bfloat16=False): + for p_dtype in floating_types_and(*[torch.half] if device.startswith('cuda') else []): p = torch.rand(10, dtype=p_dtype, device=device).expand(10, 10) t.fill_(2) t.bernoulli_(p) @@ -1681,8 +1703,8 @@ def isBinary(t): self.assertTrue(isBinary(t)) @slowTest - @dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False))) - @dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*floating_types()) + @dtypesIfCUDA(*floating_types_and(torch.half)) def test_bernoulli_edge_cases(self, device, dtype): # Need to draw a lot of samples to cover every random floating point number. 
a = torch.zeros(10000, 10000, dtype=dtype, device=device) # probability of drawing "1" is 0 @@ -1693,7 +1715,7 @@ def test_bernoulli_edge_cases(self, device, dtype): num_zeros = (torch.bernoulli(b) == 0).sum() self.assertEqual(num_zeros, 0) - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_exponential(self, device, dtype): a = torch.tensor([10], dtype=dtype, device=device).exponential_(0.5) self.assertEqual(a.dtype, dtype) @@ -1759,25 +1781,8 @@ def check(t, correction=1, fweights=None, aweights=None): for correction, fw, aw in product([0, 1, 2], [None, fweights], [None, aweights]): check(x, correction, fweights, aweights) - # FIXME: port to ErrorInputs - def test_cov_error(self, device): - def check(msg, *args, **kwargs): - with self.assertRaisesRegex(RuntimeError, r'cov\(\):.*' + msg + r'.*'): - torch.cov(*args, **kwargs) - - a = torch.rand(2) - check(r'expected input to have two or fewer dimensions', torch.rand(2, 2, 2)) - check(r'expected fweights to have one or fewer dimensions', a, fweights=torch.rand(2, 2)) - check(r'expected aweights to have one or fewer dimensions', a, aweights=torch.rand(2, 2)) - check(r'expected fweights to have integral dtype', a, fweights=torch.rand(2)) - check(r'expected aweights to have floating point dtype', a, aweights=torch.tensor([1, 1])) - check(r'expected fweights to have the same numel', a, fweights=torch.tensor([1])) - check(r'expected aweights to have the same numel', a, aweights=torch.rand(1)) - check(r'fweights cannot be negative', a, fweights=torch.tensor([-1, -2])) - check(r'aweights cannot be negative', a, aweights=torch.tensor([-1., -2.])) - @skipIfNoSciPy - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_uniform_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -1789,8 +1794,8 @@ def test_uniform_kstest(self, device, dtype): self.assertTrue(res.statistic < 0.1) @skipIfNoSciPy - @dtypes(*get_all_fp_dtypes(include_bfloat16=False)) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half)) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) def test_normal_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -1801,7 +1806,7 @@ def test_normal_kstest(self, device, dtype): self.assertTrue(res.statistic < 0.1) @skipIfNoSciPy - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_lognormal_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -1815,7 +1820,7 @@ def test_lognormal_kstest(self, device, dtype): self.assertTrue(res.statistic < 0.1) @skipIfNoSciPy - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_exponential_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -1825,7 +1830,7 @@ def test_exponential_kstest(self, device, dtype): self.assertTrue(res.statistic < 0.1) @skipIfNoSciPy - @dtypes(*get_all_fp_dtypes()) + @dtypes(*floating_types_and(torch.half, torch.bfloat16)) def test_cauchy_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -1846,7 +1851,7 @@ def test_cauchy_no_inf(self, device, dtype): self.assertFalse(x.isinf().sum()) @skipIfNoSciPy - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_geometric_kstest(self, device, dtype): from scipy import stats size = 1000 @@ -2087,37 +2092,6 @@ def test_cdist_same_inputs(self, device): # values such as nan or inf assert 
torch.isfinite(x.grad).all() - def test_multinomial_constraints(self, device): - x = torch.empty(1, 2, 3, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "prob_dist must be 1 or 2 dim", - lambda: torch.multinomial(x, 2)) - x = torch.empty(1, 2, dtype=torch.long, device=device) - self.assertRaisesRegex( - RuntimeError, "multinomial only supports floating-point dtypes for input", - lambda: torch.multinomial(x, 2)) - x = torch.empty(1, 2, dtype=torch.double, device=device) - y = torch.empty(1, 2, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "multinomial expects Long tensor out", - lambda: torch.multinomial(x, 2, out=y)) - x = torch.empty(2, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "cannot sample n_sample <= 0 samples", - lambda: torch.multinomial(x, 0)) - x = torch.empty(2, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "cannot sample n_sample <= 0 samples", - lambda: torch.multinomial(x, -1)) - x = torch.empty(2, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "cannot sample n_sample > prob_dist", - lambda: torch.multinomial(x, 3, False)) - x = torch.empty(16777217, dtype=torch.double, device=device) - self.assertRaisesRegex( - RuntimeError, "number of categories cannot exceed", - lambda: torch.multinomial(x, 3)) - def test_cumsum(self, device): x = torch.rand(100, 100, device=device) res1 = torch.cumsum(x, 1) @@ -2357,7 +2331,7 @@ def to_np(t): # All tensors appear contiguous on XLA @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_diff_noncontig(self, device, dtype): shapes = ( (1,), @@ -2377,9 +2351,9 @@ def test_diff_noncontig(self, device, dtype): self._test_diff_numpy(non_contig) # RngNormal not implemented for type f16 for XLA - @dtypes(*get_all_dtypes(include_half=False, include_bfloat16=False)) - @dtypesIfCPU(*get_all_dtypes(include_bfloat16=False)) - @dtypesIfCUDA(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.bool)) + @dtypesIfCPU(*all_types_and_complex_and(torch.half, torch.bool)) + @dtypesIfCUDA(*all_types_and_complex_and(torch.half, torch.bool)) def test_diff(self, device, dtype): shapes = ( (1,), @@ -2551,38 +2525,6 @@ def test_gradient_type_promotion(self, device): actual, expected = self._inf_nan_preprocess(list(actual), expected) self.assertEqual(actual, expected, equal_nan=True, exact_dtype=False) - # FIXME: port this to ErrorInputs - @onlyNativeDeviceTypes - @dtypes(torch.long, torch.float32, torch.complex64) - def test_error_gradient(self, device, dtype): - t = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], device=device, dtype=dtype) - with self.assertRaisesRegex(RuntimeError, 'torch.gradient expected spacing to be unspecified, a scalar '): - dim = (1, 0) - spacing = [0.1] - torch.gradient(t, spacing=spacing, dim=dim, edge_order=1) - - with self.assertRaisesRegex(RuntimeError, 'torch.gradient only supports edge_order=1 and edge_order=2.'): - torch.gradient(t, edge_order=3) - - with self.assertRaisesRegex(RuntimeError, 'dim 1 appears multiple times in the list of dims'): - dim = (1, 1) - spacing = 0.1 - torch.gradient(t, spacing=spacing, dim=dim, edge_order=1) - - with self.assertRaisesRegex(RuntimeError, 'torch.gradient expected each tensor to be on the same device,'): - dim = (0, 1) - coordinates = [torch.tensor([1, 2, 4], device='cpu'), torch.tensor([1, 2, 4], device='meta')] - 
torch.gradient(t, spacing=coordinates, dim=dim, edge_order=1) - - with self.assertRaises(IndexError): - torch.gradient(t, dim=3) - - with self.assertRaisesRegex(RuntimeError, 'torch.gradient expected each dimension size to be at least'): - torch.gradient(torch.tensor([[1], [2], [3]]), edge_order=1) - - with self.assertRaisesRegex(RuntimeError, 'torch.gradient expected each dimension size to be at least'): - torch.gradient(torch.tensor([[1, 2], [3, 4]]), edge_order=2) - def _test_large_cum_fn_helper(self, x, fn): x_cpu = x.cpu().float() expected = fn(x_cpu) @@ -2602,6 +2544,7 @@ def test_large_cumsum(self, device, dtype): @onlyCUDA @dtypes(torch.half) # only small dtype not to get oom + @largeTensorTest("48GB", "cpu") def test_large_cumprod(self, device, dtype): # initialization to avoid overflow and half caveats x = torch.empty(2**30 + 200, device=device, dtype=dtype) @@ -2650,7 +2593,7 @@ def test_bool_tensor_value_change(self, device): # FIXME: move to shape ops test suite def test_unfold_all_devices_and_dtypes(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): if dt == torch.bool: x = torch.empty((0, 1, 3, 0), dtype=dt, device=device) @@ -2672,7 +2615,7 @@ def test_unfold_scalars(self, device): # FIXME: move to data movement test suite def test_copy_all_dtypes_and_devices(self, device): from copy import copy - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): x = torch.tensor([1, 2, 3, 4], dtype=dt, device=device) x_clone = x.clone() y = copy(x) @@ -2741,7 +2684,7 @@ def test_copy_transpose_math_view(self, device, dtype): self.assertEqual(dst, src.conj_physical()) def test_clone_all_dtypes_and_devices(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16): x = torch.tensor((1, 1), dtype=dt, device=device) y = x.clone() self.assertEqual(x, y) @@ -2812,7 +2755,7 @@ def test_narrow_empty(self, device): self.assertEqual(sz, y.size()) # FIXME: move to test indexing - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_index_copy(self, device, dtype): # We just test for num_copy <= num_dest, as otherwise there are repeated indices # and the behavior is undefined @@ -2847,7 +2790,7 @@ def ref_index_copy(tgt, dim, idx, src): # onlyNativeDeviceTypes due to an XLA error: # https://github.com/pytorch/pytorch/issues/53256 @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_index_copy_scalars(self, device, dtype): # Create the 8 possible combinations of scalar sizes for target / index / source scalars = ((make_tensor(size_t, dtype=dtype, device=device, low=None, high=None), @@ -2957,7 +2900,7 @@ def test_index_put_non_accumulate_deterministic(self, device) -> None: self.assertEqual(output, input_list) # FIXME: move to test indexing - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_index_fill(self, device, dtype): x = torch.tensor([[1, 2], [4, 5]], dtype=dtype, device=device) index = torch.tensor([0], device=device) @@ -2975,7 +2918,7 @@ def test_index_fill(self, device, dtype): # FIXME: move to test indexing # The test fails for zero-dimensional tensors on XLA @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def 
test_index_select(self, device, dtype): num_src, num_out = 3, 5 @@ -3021,7 +2964,7 @@ def ref_index_select(src, dim, idx): self.assertEqual(out.item(), source.item()) # FIXME: find a test suite for the take operator - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_take(self, device, dtype): idx_size = (4,) @@ -3056,7 +2999,7 @@ def ref_take(src, idx): # FIXME: find a test suite for the put operator # The bool instance does not work on GPU. See # https://github.com/pytorch/pytorch/issues/54317 - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_put(self, device, dtype): src_size = (4,) @@ -3127,7 +3070,7 @@ def ref_put(dst, idx, src, accumulate): # FIXME: find a test suite for the put operator # The bool instance does not work on GPU. See # https://github.com/pytorch/pytorch/issues/54317 - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_put_accumulate(self, device, dtype): # Test for parallel adds with accumulate == True low_precision = dtype == torch.half or dtype == torch.bfloat16 @@ -3171,13 +3114,9 @@ def scatter_allow_reduce(self, device, dtype, reduceop): device_type = torch.device(device).type return device_type != 'cuda' or (reduceop == 'multiply' and dtype.is_floating_point) - # FIXME: port to test_scatter_gather_ops.py - # torch.{zeros, ones} do not support ComplexHalf (torch.complex32) - # So, we are skipping it here. - @dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) + - get_all_complex_dtypes())) - @dtypesIfCPU(*get_all_dtypes()) - @dtypesIfCUDA(*get_all_dtypes()) + @dtypes(*floating_and_complex_types()) + @dtypesIfCPU(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) + @dtypesIfCUDA(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_scatter_reduce_operations_to_large_input(self, device, dtype): index = torch.tensor([[1], [2]], device=device, dtype=torch.long) test_data = [ @@ -3202,13 +3141,9 @@ def test_scatter_reduce_operations_to_large_input(self, device, dtype): input.scatter_(0, index, src, reduce=operation) self.assertEqual(input, result) - # FIXME: port to test_scatter_gather_ops.py - # torch.{zeros, ones} do not support ComplexHalf (torch.complex32) - # So, we are skipping it here. - @dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) + - get_all_complex_dtypes())) - @dtypesIfCPU(*get_all_dtypes()) - @dtypesIfCUDA(*get_all_dtypes()) + @dtypes(*floating_and_complex_types()) + @dtypesIfCPU(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) + @dtypesIfCUDA(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_scatter_reduce_scalar(self, device, dtype): index = torch.tensor([[1], [2]], device=device, dtype=torch.long) test_data = [ @@ -3245,13 +3180,9 @@ def test_scatter_add_non_unique_index(self, device): torch.tensor([[3], [1]], device=device, dtype=torch.float32).repeat(1, width)) - # FIXME: port to test_scatter_gather_ops.py - # torch.{zeros, ones} do not support ComplexHalf (torch.complex32) - # So, we are skipping it here. 
- @dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) + - get_all_complex_dtypes())) - @dtypesIfCPU(*get_all_dtypes()) - @dtypesIfCUDA(*get_all_dtypes()) + @dtypes(*floating_and_complex_types()) + @dtypesIfCPU(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) + @dtypesIfCUDA(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_scatter_reduce_non_unique_index(self, device, dtype): height = 2 width = 2 @@ -3272,12 +3203,8 @@ def test_scatter_reduce_non_unique_index(self, device, dtype): input.scatter_(0, index, src, reduce=operation) self.assertEqual(input, result, msg=f"result: {result} input: {input} method: {str(operation)}") - # FIXME: port to test_scatter_gather_ops.py - # torch.{zeros, ones} do not support ComplexHalf (torch.complex32) - # So, we are skipping it here. @onlyCUDA - @dtypes(*(get_all_complex_dtypes() + - get_all_int_dtypes())) + @dtypes(*integral_types(), *complex_types()) def test_scatter_reduce_multiply_unsupported_dtypes(self, device, dtype): height = 2 width = 2 @@ -3329,7 +3256,7 @@ def test_scatter_add_bool(self, device): # FIXME: find a test suite for the masked scatter operator @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_masked_scatter(self, device, dtype): dt = dtype with warnings.catch_warnings(record=True) as w: @@ -3406,8 +3333,6 @@ def test_masked_scatter_bool_tensor(self, device): # FIXME: find a test suite for the masked scatter operator # test_scatter_gather_ops or test_masked_ops? - # refer https://github.com/pytorch/pytorch/issues/60190 - @skipIfRocm @onlyCUDA @largeTensorTest('30GB') def test_masked_scatter_large_tensor(self, device): @@ -3418,7 +3343,7 @@ def test_masked_scatter_large_tensor(self, device): self.assertEqual(result, result_cpu) # FIXME: find a test suite for the masked select operator - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16)) def test_masked_select(self, device, dtype): if device == 'cpu': warn = 'masked_select received a mask with dtype torch.uint8,' @@ -3486,7 +3411,7 @@ def test_masked_select_discontiguous(self, device): self.assertEqual(out_dc, expected, atol=0, rtol=0) # FIXME: find a test suite for the masked fill operator - @dtypes(*product(get_all_dtypes(), (torch.uint8, torch.bool))) + @dtypes(*product(all_types_and_complex_and(torch.half, torch.bool, torch.bfloat16), (torch.uint8, torch.bool))) def test_masked_fill(self, device, dtypes): dtype = dtypes[0] mask_dtype = dtypes[1] @@ -3791,15 +3716,18 @@ def test_pdist_norm_backward(self, device): # FIXME: find a test suite for the pdist operator @unittest.skipIf(IS_FBCODE and IS_REMOTE_GPU, "sandcastle OOM with current tpx gpu/re configuration") @skipIfRocm + @onlyCUDA + @largeTensorTest('10GB', device='cpu') + @largeTensorTest('5GB', device='cuda') def test_pdist_norm_large(self, device): # use dim0>=46342 for forward, see: # https://github.com/pytorch/pytorch/issues/30583 # Compare output using GPU with the CPU implementation, as brute_pdist uses too much memory - if 'cuda' in device: - x = torch.randn(50000, 1, dtype=torch.float32) - expected_cpu = torch.pdist(x, p=2) - actual_gpu = torch.pdist(x.to(device), p=2) - self.assertEqual(expected_cpu, actual_gpu.cpu()) + x = torch.randn(50000, 1, dtype=torch.float32) # 50k * 4 bytes = 200 KB + # Will require 1249975000 float32s + expected_cpu = torch.pdist(x, p=2) # ~1250M * 4 bytes = 5 GB on CPU + actual_gpu = 
torch.pdist(x.to(device), p=2) # 5 GB on GPU + self.assertEqual(expected_cpu, actual_gpu.cpu()) # Another 5 GB on CPU # FIXME: move to elementwise ternary test suite @onlyNativeDeviceTypes @@ -4033,19 +3961,6 @@ def test_masked_fill_mem_overlap(self, device): with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): mask[1:].masked_fill_(mask[:-1], False) - # FIXME: convert to ErrorInputs - @onlyNativeDeviceTypes - def test_masked_select_mem_overlap(self, device): - x = torch.rand((1,), device=device).expand((3,)) - y = torch.rand((6,), device=device) - mask = torch.tensor([True, False, True, True, False, False], device=device) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.masked_select(y, mask, out=x) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.masked_select(y, mask, out=y) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.masked_select(mask.clone(), mask, out=mask) - # FIXME: convert to ErrorInputs @expectedFailureMeta # RuntimeError not raised @onlyNativeDeviceTypes @@ -4057,15 +3972,6 @@ def test_masked_scatter_mem_overlap(self, device): with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): x.masked_scatter_(mask, src) - # FIXME: convert to ErrorInputs - @onlyNativeDeviceTypes - def test_index_select_mem_overlap(self, device): - x = torch.rand((1, 6), device=device).expand((2, 6)) - y = torch.rand((3, 6), device=device) - ind = torch.tensor([0, 1], dtype=torch.int64, device=device) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.index_select(y, 1, ind, out=x) - # FIXME: convert to ErrorInputs @onlyNativeDeviceTypes def test_scatter_mem_overlap(self, device): @@ -4080,32 +3986,6 @@ def test_scatter_mem_overlap(self, device): with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): ind.scatter_(0, ind, ind.clone()) - # FIXME: convert to ErrorInputs - @onlyNativeDeviceTypes - def test_gather_mem_overlap(self, device): - x = torch.rand((1,), device=device).expand((3,)) - src = torch.rand((6,), device=device) - ind = torch.tensor([2, 1, 0], device=device, dtype=torch.int64) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.gather(src, 0, ind, out=x) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.gather(src, 0, ind, out=src) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.gather(ind.clone(), 0, ind[1:], out=ind[:1]) - - # FIXME: convert to ErrorInputs - @onlyNativeDeviceTypes - def test_take_mem_overlap(self, device): - x = torch.rand((1,), device=device).expand((3,)) - src = torch.rand((6,), device=device) - ind = torch.tensor([2, 1, 0], device=device, dtype=torch.int64) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.take(src, ind, out=x) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.take(src, ind, out=src) - with self.assertRaisesRegex(RuntimeError, 'unsupported operation'): - torch.take(ind.clone(), ind[1:], out=ind[:-1]) - # FIXME: move to test distributions @onlyCUDA def test_multinomial_device_constrain(self, device): @@ -4564,7 +4444,7 @@ def compare_strides(s1, s2, div): # FIXME: move dlpack tests to their own test class/suite @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_dlpack_capsule_conversion(self, device, dtype): # DLpack does not explicitly support bool (xref 
dmlc/dlpack#75) x = make_tensor((5,), dtype=dtype, device=device) @@ -4573,7 +4453,7 @@ def test_dlpack_capsule_conversion(self, device, dtype): @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_dlpack_protocol_conversion(self, device, dtype): x = make_tensor((5,), dtype=dtype, device=device) z = from_dlpack(x) @@ -4589,7 +4469,7 @@ def test_dlpack_shared_storage(self, device): @skipMeta @onlyCUDA - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_dlpack_conversion_with_streams(self, device, dtype): # Create a stream where the tensor will reside stream = torch.cuda.Stream() @@ -4608,7 +4488,7 @@ def test_dlpack_conversion_with_streams(self, device, dtype): @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_from_dlpack(self, device, dtype): x = make_tensor((5,), dtype=dtype, device=device) y = torch.from_dlpack(x) @@ -4616,7 +4496,7 @@ def test_from_dlpack(self, device, dtype): @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_from_dlpack_noncontinguous(self, device, dtype): x = make_tensor((25,), dtype=dtype, device=device).reshape(5, 5) @@ -4642,7 +4522,7 @@ def test_from_dlpack_noncontinguous(self, device, dtype): @skipMeta @onlyCUDA - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_dlpack_conversion_with_diff_streams(self, device, dtype): stream_a = torch.cuda.Stream() stream_b = torch.cuda.Stream() @@ -4659,7 +4539,7 @@ def test_dlpack_conversion_with_diff_streams(self, device, dtype): @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_from_dlpack_dtype(self, device, dtype): x = make_tensor((5,), dtype=dtype, device=device) y = torch.from_dlpack(x) @@ -4691,7 +4571,7 @@ def __dlpack__(self, stream=None): @skipMeta @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bool=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_dlpack_tensor_invalid_stream(self, device, dtype): with self.assertRaises(TypeError): x = make_tensor((5,), dtype=dtype, device=device) @@ -5201,8 +5081,7 @@ def _where_valid_scalar_tensor_combination(self, scalar_type, dtype): # FIXME: move to elementwise ternary test suite @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_where_scalar_invalid_combination_raises(self, device, dtype): def checkRaises(scalar_type, dtype, condition, x, scalar_1): @@ -5215,8 +5094,7 @@ def checkRaises(scalar_type, dtype, condition, x, scalar_1): # FIXME: move to elementwise ternary test suite @skipCUDAVersionIn([(11, 2)]) # test fails for 11.2, see https://github.com/pytorch/pytorch/issues/51980 - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_where_scalar_valid_combination(self, device, dtype): def checkResult(scalar_type, dtype, condition, x, scalar_1): @@ -5329,6 +5207,48 @@ def test_assertRaisesRegex_ignore_msg_non_native_device(self, device): with 
self.assertRaisesRegex(RuntimeError, msg): torch.nn.functional.nll_loss(x, t, weight=invalid_weight) + @dtypes(*all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16, torch.complex32)) + def test_copy_(self, device, dtype): + def can_cast(src_dtype, dst_dtype): + # torch.can_cast(torch.int16, torch.uint8) returns True + # which isn't actually safe-cast. + # This function returns False in this case. + def is_unsigned_int(dtype): + return dtype is torch.uint8 + + if is_unsigned_int(dst_dtype): + return is_unsigned_int(src_dtype) + return torch.can_cast(src_dtype, dst_dtype) + + def make_tensor_wrapper(shape, dtype): + if dtype is not torch.complex32: + # Make tensor does not support generating + # complex32 tensor + return make_tensor(shape, device=device, dtype=dtype) + return torch.randn(shape, device=device, dtype=dtype) + + t = make_tensor_wrapper((50,), dtype) + src_dtypes = all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16, torch.complex32) + for src_dtype in src_dtypes: + src = make_tensor_wrapper((50,), dtype=src_dtype) + t.copy_(src) + dst = make_tensor_wrapper((50, ), dtype=src_dtype) + if can_cast(src_dtype, dtype): + rtol = None + atol = None + if dtype in (torch.half, torch.complex32): + rtol = 1e-3 + atol = 1e-3 + if dtype in (torch.bfloat16,): + rtol = 1e-2 + atol = 1e-2 + self.assertEqual(src, dst.copy_(t), rtol=rtol, atol=atol) + + @dtypes(*all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16, torch.complex32)) + def test_item(self, device, dtype): + t = torch.ones((), device=device, dtype=dtype) + self.assertEqual(1, t.item()) + # Tests that compare a device's computation with the (gold-standard) CPU's. class TestDevicePrecision(TestCase): @@ -5757,69 +5677,6 @@ def test_unflatten(self): r"the unspecified dimension size -1 can be any value and is ambiguous"): torch.randn(2, 0).unflatten(1, (2, -1, 0)) - # FIXME: move to test_scatter_gather_ops.py - def test_scatter_reduce(self): - dtype = device = None - output_size = 10 - shape = [5, 10, 20] - reduces = ["sum", "prod", "mean", "amax", "amin"] - fills = {"sum": 0, "prod": 1, "mean": 0, "amax": -(2 ** 31), "amin": 2 ** 31 - 1} - fns = {"sum": lambda t, v: t.add_(v), - "prod": lambda t, v: t.mul_(v), - "mean": lambda t, v, n: t.mul_(n).add_(v).div_(n + 1), - "amax": lambda t, v: torch.max(t, v, out=t), - "amin": lambda t, v: torch.min(t, v, out=t)} - - index = torch.randint(0, output_size, shape, dtype=torch.long, device=device) - input = torch.randn(shape, dtype=dtype, device=device) - - for reduce in reduces: - for dim in range(len(shape)): - output = input.scatter_reduce(dim, index, reduce, output_size=output_size) - - # Check that output is of the correct size - output_shape = copy.copy(shape) - output_shape[dim] = output_size - self.assertEqual(output.shape, output_shape) - - expected = torch.zeros(output_shape, dtype=dtype, device=device) - expected.fill_(fills[reduce]) - counts = torch.zeros(output_shape, dtype=dtype, device=device) - for i, j, k in itertools.product(range(shape[0]), range(shape[1]), range(shape[2])): - v = input[i, j, k] - m = index[i, j, k] - - if dim == 0: - i = m - elif dim == 1: - j = m - else: - k = m - - op = fns[reduce] - if (reduce == "mean"): - op(expected[i, j, k], v, counts[i, j, k]) - else: - op(expected[i, j, k], v) - counts[i, j, k] += 1 - - if (reduce == "amin" or reduce == "amax"): - expected.masked_fill_(counts == 0, 0) - - self.assertTrue(torch.allclose(output, expected)) - - with self.assertRaisesRegex(RuntimeError, "Expected `dim` to be in 
range -3 to 2"): - torch.scatter_reduce(input, 4, index, "sum") - - with self.assertRaisesRegex(RuntimeError, "Shape mismatch"): - index2 = torch.randint(0, output_size, (10, ), dtype=torch.long, device=device) - torch.scatter_reduce(input, 0, index2, "sum") - - with self.assertRaisesRegex(RuntimeError, "Expected `index` values to be in range 0 to 2"): - input2 = torch.randn(10, dtype=dtype, device=device) - index2 = torch.tensor([0, 1, 0, 1, 2, 3, 3, 4, 4, 3]) - torch.scatter_reduce(input2, 0, index2, "sum", output_size=2) - def test_structseq_repr(self): a = torch.arange(250).reshape(5, 5, 10) expected = """ @@ -6339,6 +6196,7 @@ def test_from_buffer(self): self.assertEqual(bools.size(), 8) self.assertEqual(bools.tolist(), [False, True, True, True, True, True, True, True]) self.assertEqual(bools.type(), 'torch.BoolStorage') + self.assertTrue(isinstance(bools, torch.BoolStorage)) f = bytearray(b'\x80\x02\x8a\nl\xfc\x9cF\xf9 j\xa8P\x19.\x80\x02M\xe9') bools = torch.BoolStorage.from_buffer(f, 'big') @@ -6351,6 +6209,122 @@ def test_from_buffer(self): bytes = torch.ByteStorage.from_buffer(a) self.assertEqual(bytes.nbytes(), 4) self.assertEqual(bytes.tolist(), [1, 2, 3, 4]) + self.assertTrue(isinstance(bytes, torch.ByteStorage)) + + def test_storage_error(self): + quantized_storages = [ + torch.QInt32Storage, + torch.QInt8Storage, + torch.QUInt2x4Storage, + torch.QUInt4x2Storage, + torch.QUInt8Storage, + ] + + with self.assertRaisesRegex(RuntimeError, r"Only child classes of _LegacyStorage can be instantiated"): + torch.storage._LegacyStorage() + + for storage_class in torch._storage_classes: + if storage_class in [torch._UntypedStorage, torch.cuda._UntypedStorage, torch._TypedStorage]: + continue + + device = 'cuda' if storage_class.__module__ == 'torch.cuda' else 'cpu' + dtype = storage_class.dtype + + if device == 'cuda' and not torch.cuda.is_available(): + continue + + # Legacy Storage constructor errors + with self.assertRaisesRegex(RuntimeError, r"'device' cannot be specified"): + storage_class(device='cpu') + + with self.assertRaisesRegex(RuntimeError, r"'dtype' cannot be specified"): + storage_class(dtype=torch.float) + + with self.assertRaisesRegex(TypeError, r"got an unexpected keyword"): + storage_class(sdlkjf=torch.float) + + with self.assertRaisesRegex(RuntimeError, r"Too many positional arguments"): + storage_class(0, 0) + + with self.assertRaisesRegex(TypeError, r"invalid data type"): + storage_class('string') + + with self.assertRaisesRegex(TypeError, r"Argument type not recognized"): + storage_class(torch.tensor([])) + + s = storage_class() + + with self.assertRaisesRegex(RuntimeError, r"No positional arguments"): + storage_class(0, wrap_storage=s._untyped()) + + with self.assertRaisesRegex(TypeError, r"must be _UntypedStorage"): + storage_class(wrap_storage=s) + + if torch.cuda.is_available(): + if storage_class in quantized_storages: + with self.assertRaisesRegex(RuntimeError, r"Cannot create CUDA storage with quantized dtype"): + s.cuda() + + else: + + if s.is_cuda: + s_other_device = s.cpu() + else: + s_other_device = s.cuda() + + with self.assertRaisesRegex(RuntimeError, r"Device of 'wrap_storage' must be"): + storage_class(wrap_storage=s_other_device._untyped()) + + # _TypedStorage constructor errors + with self.assertRaisesRegex(RuntimeError, r"No positional arguments"): + torch._TypedStorage(0, wrap_storage=s._untyped(), dtype=dtype) + + with self.assertRaisesRegex(RuntimeError, r"Argument 'dtype' must be specified"): + torch._TypedStorage(wrap_storage=s._untyped()) + 
+ with self.assertRaisesRegex(TypeError, r"Argument 'dtype' must be torch.dtype"): + torch._TypedStorage(wrap_storage=s._untyped(), dtype=0) + + with self.assertRaisesRegex(RuntimeError, r"Argument 'device' should not be specified"): + torch._TypedStorage(wrap_storage=s._untyped(), dtype=dtype, device=device) + + with self.assertRaisesRegex(TypeError, r"Argument 'wrap_storage' must be _UntypedStorage"): + torch._TypedStorage(wrap_storage=s, dtype=dtype) + + with self.assertRaisesRegex(RuntimeError, r"Storage device not recognized"): + torch._TypedStorage(dtype=dtype, device='xla') + + if torch.cuda.is_available(): + if storage_class in quantized_storages: + with self.assertRaisesRegex(RuntimeError, r"Cannot create CUDA storage with quantized dtype"): + torch._TypedStorage(dtype=dtype, device='cuda') + + with self.assertRaisesRegex(TypeError, r"Argument type not recognized"): + torch._TypedStorage(torch.tensor([]), dtype=dtype, device=device) + + with self.assertRaisesRegex(RuntimeError, r"Too many positional arguments"): + torch._TypedStorage(0, 0, dtype=dtype, device=device) + + def test_storage_error_no_attribute(self): + storage_classes = [ + torch.cuda.ByteStorage, + torch.cuda.FloatStorage, + torch.cuda._UntypedStorage, + ] + for storage_class in storage_classes: + with self.assertRaisesRegex(RuntimeError, r'Not available for CUDA storage'): + storage_class.from_buffer() + + if storage_class == torch.cuda._UntypedStorage: + with self.assertRaisesRegex(RuntimeError, r'Not available for CUDA storage'): + storage_class._new_with_weak_ptr() + + else: + with self.assertRaisesRegex(AttributeError, r'has no attribute'): + storage_class._new_with_weak_ptr() + + with self.assertRaisesRegex(RuntimeError, r'Not available for CUDA storage'): + storage_class._new_shared_filename(0, 0, 0) def test_storage_casts(self): storage = torch.IntStorage([-1, 0, 1, 2, 3, 4]) @@ -7109,6 +7083,14 @@ def test_fill_diagonal(self): e1.fill_diagonal_(v, wrap=True) self.assertEqual(e1, e2) + def test_setting_real_imag_to_a_number(self): + x = torch.randn(4, dtype=torch.cfloat) + x.real = 0 + x.imag = 0 + zeros = torch.zeros(4) + self.assertEqual(x.real, zeros) + self.assertEqual(x.imag, zeros) + def test_batch_norm_cpu_inference(self): # input nchw in (2,1,1,1), (2,2,2,2) inputs = [ @@ -7165,6 +7147,11 @@ def test_empty_meta(self): self.assertEqual(z.size(), (2 ** 20, 2 ** 20)) self.assertRaises(RuntimeError, lambda: z[0][0].item()) + @noarchTest + def test_format_scalar_meta(self): + x = torch.empty((), device='meta') + self.assertEqual(format(x), repr(x)) + @noarchTest def test_upsample_nearest1d_meta(self): # TODO: this test should be triggered by test_nn.py but right @@ -7408,12 +7395,12 @@ def test_numel(self): # Verifies that (deep)copies of dtypes are the same objects def test_copy_dtypes(self): - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): copied_dtype = copy.deepcopy(dtype) self.assertIs(dtype, copied_dtype) def test_dtype_is_signed(self): - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): self.assertEqual(dtype.is_signed, torch.is_signed(torch.tensor(0, dtype=dtype))) self.assertRaisesRegex(RuntimeError, 'not supported for quantized', lambda: torch.quint8.is_signed) @@ -7528,6 +7515,12 @@ def test_copy_transpose(self): self.assertEqual(y[:, 0], range(100)) self.assertEqual(y[:, 40], range(4000, 4100)) + x = torch.arange(100 * 100).reshape(100,
100).to(dtype=torch.complex32).t() + y = torch.empty(100, 100, dtype=torch.complex32) + y.copy_(x) + self.assertEqual(y[:, 0], range(100)) + self.assertEqual(y[:, 40], range(4000, 4100)) + # FIXME: Port to a more appropriate test suite def test_copy_broadcast(self): torch.zeros(5, 6).copy_(torch.zeros(6)) diff --git a/test/test_type_promotion.py b/test/test_type_promotion.py index f32a89933f0880..a157f49962d5c5 100644 --- a/test/test_type_promotion.py +++ b/test/test_type_promotion.py @@ -11,7 +11,7 @@ from torch.testing._internal.common_device_type import (instantiate_device_type_tests, onlyNativeDeviceTypes, dtypes, dtypesIfCUDA, onlyCPU, expectedFailureMeta, skipMeta) from torch.testing._internal.common_dtype import ( - get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes + all_types_and_complex_and, all_types_and, get_all_math_dtypes, integral_types_and, floating_types_and ) if TEST_NUMPY: @@ -184,7 +184,7 @@ def test_bfloat16(self, device): self.assertEqual(bf + scalar, scalar + bf) # with tensor - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): t = torch.tensor(1, dtype=dtype, device=device) self.assertEqual(bf + t, t + bf) if dtype in (torch.float16, torch.float32, torch.float64, torch.cfloat, torch.cdouble): @@ -340,7 +340,8 @@ def test_create_bool_tensors(self, device): # this seems like odd behavior but ints also create float tensors, numpy doesn't have this function. self.assertEqual(torch.scalar_tensor(False, device=device), torch.tensor(0., device=device)) - @dtypes(*itertools.product(get_all_dtypes(), get_all_dtypes())) + @dtypes(*itertools.product(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool), + all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) def test_result_type(self, device, dtypes): "Test result_type for tensor vs tensor and scalar vs scalar." 
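# Illustrative sketch (not part of the patch; uses only public torch APIs): the
# promotion rules that test_result_type and test_promote_self exercise. The dtype
# pairs below are examples chosen for this note, not values taken from the tests.
import torch
# A Python float scalar combined with an integral tensor promotes to the default
# floating-point dtype (torch.float32 unless changed via torch.set_default_dtype).
assert torch.result_type(torch.tensor([1], dtype=torch.int32), 1.0) == torch.float32
# Mixing a floating dtype with any integral dtype keeps the floating dtype.
assert torch.promote_types(torch.float16, torch.int64) == torch.float16
# Promoting a dtype with itself is the identity, which is what test_promote_self checks.
assert torch.promote_types(torch.bfloat16, torch.bfloat16) == torch.bfloat16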
@@ -562,7 +563,7 @@ def test_promote_types(self, device): @float_double_default_dtype def test_promote_self(self, device): - for dtype in get_all_dtypes(): + for dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): self.assertEqual(torch.promote_types(dtype, dtype), dtype) @expectedFailureMeta @@ -880,7 +881,7 @@ def test_numpy_array_binary_ufunc_promotion(self, device, dtypes): @onlyNativeDeviceTypes def test_cat_different_dtypes(self, device): - dtypes = get_all_dtypes(include_bfloat16=False) + dtypes = all_types_and_complex_and(torch.half, torch.bool) for x_dtype, y_dtype in itertools.product(dtypes, dtypes): x_vals, y_vals = [1, 2, 3], [4, 5, 6] @@ -899,7 +900,7 @@ def test_cat_different_dtypes(self, device): @onlyNativeDeviceTypes def test_cat_out_different_dtypes(self, device): - dtypes = get_all_dtypes(include_bfloat16=False, include_bool=False) + dtypes = all_types_and_complex_and(torch.half) for x_dtype, y_dtype, out_dtype in itertools.product(dtypes, dtypes, dtypes): out = torch.zeros(6, device=device, dtype=out_dtype) x = torch.tensor([1, 2, 3], device=device, dtype=x_dtype) @@ -971,21 +972,19 @@ def test_computation_ignores_out(self, device): self.assertEqual(result, a - b, exact_dtype=False) self.assertNotEqual(result, a.double() - b, exact_dtype=False) - @dtypesIfCUDA(*itertools.product(get_all_dtypes(include_bfloat16=False, include_complex=False), - get_all_dtypes(include_bfloat16=False, include_complex=False))) - @dtypes(*itertools.product(get_all_dtypes(include_half=False, include_bfloat16=False, - include_complex=False), - get_all_dtypes(include_half=False, include_bfloat16=False, - include_complex=False))) + @dtypesIfCUDA(*itertools.product(all_types_and(torch.half, torch.bool), + all_types_and(torch.half, torch.bool))) + @dtypes(*itertools.product(all_types_and(torch.bool), + all_types_and(torch.bool))) def test_atan2_type_promotion(self, device, dtypes): dtype1, dtype2 = dtypes default_float = torch.get_default_dtype() def is_int(dtype): - return dtype in get_all_int_dtypes() + [torch.bool] + return dtype in integral_types_and(torch.bool) def is_float(dtype): - return dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False) + return dtype in floating_types_and(torch.half) def get_binary_float_result_type(x, y): dtype1 = x.dtype diff --git a/test/test_unary_ufuncs.py b/test/test_unary_ufuncs.py index 3cfcd4fa2e813e..c6ca4ffc81c8d6 100644 --- a/test/test_unary_ufuncs.py +++ b/test/test_unary_ufuncs.py @@ -21,8 +21,8 @@ OpDTypes) from torch.testing import make_tensor from torch.testing._internal.common_dtype import ( - floating_types_and, all_types_and_complex_and, floating_and_complex_types_and, get_all_dtypes, get_all_math_dtypes, - get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes + floating_types_and, all_types_and_complex_and, integral_types_and, get_all_math_dtypes, + complex_types, all_types_and, floating_and_complex_types_and ) if TEST_SCIPY: @@ -517,8 +517,7 @@ def test_out_arg_all_dtypes(self, device, dtype, op): out = torch.empty_like(input, dtype=out_dtype) self._test_out_arg(op, input, out, expected, **torch_kwargs) - @dtypes(*(get_all_int_dtypes() + [torch.bool] + - get_all_fp_dtypes(include_bfloat16=False))) + @dtypes(*all_types_and(torch.bool, torch.half)) def test_nan_to_num(self, device, dtype): for contiguous in [False, True]: x = make_tensor((64, 64), low=0., high=100., dtype=dtype, device=device) @@ -596,7 +595,7 @@ def test_digamma(self, device, dtype): self.compare_with_numpy(torch.digamma, 
scipy.special.digamma, tensor) @skipCUDAIfRocm - @dtypes(*get_all_fp_dtypes(include_half=True, include_bfloat16=False)) + @dtypes(*floating_types_and(torch.half)) def test_frexp(self, device, dtype): input = make_tensor((50, 50), dtype=dtype, device=device) mantissa, exponent = torch.frexp(input) @@ -611,15 +610,13 @@ def test_frexp(self, device, dtype): @skipCUDAIfRocm def test_frexp_assert_raises(self, device): - invalid_input_dtypes = get_all_int_dtypes() + \ - get_all_complex_dtypes() + \ - [torch.bool] + invalid_input_dtypes = integral_types_and(torch.bool) + complex_types() for dtype in invalid_input_dtypes: input = make_tensor((50, 50), dtype=dtype, device=device) with self.assertRaisesRegex(RuntimeError, r"torch\.frexp\(\) only supports floating-point dtypes"): torch.frexp(input) - for dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False): + for dtype in floating_types_and(torch.half): input = make_tensor((50, 50), dtype=dtype, device=device) dtypes = list(all_types_and_complex_and(torch.bool, torch.half, torch.bfloat16)) @@ -872,7 +869,7 @@ def test_unary_out_op_mem_overlap(self, device, dtype): # TODO: opinfo hardshrink @onlyCPU - @dtypes(torch.float, torch.double) + @dtypes(torch.float, torch.double, torch.bfloat16) def test_hardshrink(self, device, dtype): data = torch.tensor([1, 0.5, 0.3, 0.6], dtype=dtype, device=device).view(2, 2) self.assertEqual(torch.tensor([1, 0.5, 0, 0.6], dtype=dtype, device=device).view(2, 2), @@ -888,7 +885,7 @@ def test_hardshrink(self, device, dtype): data.t().hardshrink(0.3)) @onlyCPU - @dtypes(torch.float, torch.double) + @dtypes(torch.float, torch.double, torch.bfloat16) def test_hardshrink_edge_cases(self, device, dtype) -> None: def h(values, l_expected): for l, expected in l_expected.items(): @@ -913,6 +910,7 @@ def test_helper(min, max): @onlyCPU @slowTest @dtypes(torch.float) + @unittest.skipIf(True, "Insufficient memory on linux.(2|4)xlarge") def test_exp_slow(self, device, dtype): # Test for https://github.com/pytorch/pytorch/issues/17271 # This is pretty slow on my Macbook but it only takes a few @@ -922,8 +920,7 @@ def test_exp_slow(self, device, dtype): self.assertEqual(a, b.expand(2 ** 31)) @precisionOverride({torch.bfloat16: 1e-2, torch.float: 0.0002, torch.double: 0.0002}) - @dtypesIfCUDA(torch.float, torch.double, torch.bfloat16) - @dtypes(torch.float, torch.double) + @dtypes(torch.float, torch.double, torch.bfloat16) def test_hardswish(self, device, dtype): inputValues = [-1000, -4, -3, -2, 0, 2, 3, 4, 1000] expectedOutput = np.multiply( @@ -944,8 +941,7 @@ def test_hardswish(self, device, dtype): self.assertEqual(inputTensorCpy, expectedOutputTensor) @precisionOverride({torch.bfloat16: 1e-2, torch.float: 0.0002, torch.double: 0.0002}) - @dtypesIfCUDA(torch.float, torch.double, torch.bfloat16) - @dtypes(torch.float, torch.double) + @dtypes(torch.float, torch.double, torch.bfloat16) def test_hardsigmoid(self, device, dtype): inputValues = [-1000, -4, -3, -2, 0, 2, 3, 4, 1000] expectedOutput = np.minimum(np.maximum((np.add(inputValues, 3)), 0), 6) / 6.0 @@ -962,8 +958,7 @@ def test_hardsigmoid(self, device, dtype): torch.tensor(expectedOutput, dtype=dtype, device=device)) @precisionOverride({torch.bfloat16: 1e-2, torch.float: 0.0002, torch.double: 0.0002}) - @dtypesIfCUDA(torch.float, torch.double, torch.bfloat16) - @dtypes(torch.float, torch.double) + @dtypes(torch.float, torch.double, torch.bfloat16) def test_hardsigmoid_backward(self, device, dtype): inputValues = [-3.0, 3.0, -2.0, 2.0, -6.0, 6.0] expectedValues = 
[0.0, 0.0, 1.0 / 6.0, 1.0 / 6.0, 0.0, 0.0] @@ -1182,7 +1177,7 @@ def _i0_range_helper(self, range, device, dtype): t = torch.rand(1000, device=device).to(dtype) * r self._i0_helper(t) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.bfloat16, torch.float32, torch.float64) @unittest.skipIf(not TEST_SCIPY, "SciPy not found") def test_i0_range1(self, device, dtype): @@ -1190,7 +1185,7 @@ def test_i0_range1(self, device, dtype): # The domain is (-13.25, 13.25) self._i0_range_helper(13.25, device, dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.bfloat16, torch.float32, torch.float64) @unittest.skipIf(not TEST_SCIPY, "SciPy not found") def test_i0_range2(self, device, dtype): @@ -1205,7 +1200,7 @@ def test_i0_range3(self, device, dtype): # The domain is (-709.75, 709.75) self._i0_range_helper(709.75, device, dtype) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.bfloat16, torch.float32, torch.float64) @unittest.skipIf(not TEST_SCIPY, "SciPy not found") def test_i0_special(self, device, dtype): @@ -1215,7 +1210,7 @@ def test_i0_special(self, device, dtype): t = torch.tensor([inf, -inf, nan], device=device, dtype=dtype) self.assertTrue(torch.i0(t).isnan().all()) - @dtypesIfCUDA(*get_all_fp_dtypes()) + @dtypesIfCUDA(*floating_types_and(torch.half, torch.bfloat16)) @dtypes(torch.bfloat16, torch.float32, torch.float64) @unittest.skipIf(not TEST_SCIPY, "SciPy not found") def test_special_i0_i1_vs_scipy(self, device, dtype): @@ -1269,11 +1264,25 @@ def check_equal(t): self.assertEqual(actual, expected) range = (-10, 10) + t = torch.linspace(*range, 1, device=device, dtype=dtype) + check_equal(t) - t = torch.linspace(*range, int(1e4), device=device, dtype=dtype) + # Skip testing NaN, inf, -inf since they are tested in reference_numerics tests. + info = torch.finfo(dtype) + min, max, eps, tiny = info.min, info.max, info.eps, info.tiny + t = torch.tensor([min, max, eps, tiny], dtype=dtype, device=device) check_equal(t) - # NaN, inf, -inf are tested in reference_numerics tests. + @dtypes(torch.float32, torch.float64) + @unittest.skipIf(not TEST_SCIPY, "SciPy not found") + def test_special_log_ndtr_vs_scipy(self, device, dtype): + def check_equal(t): + # Test by comparing with scipy + actual = torch.special.log_ndtr(t) + expected = scipy.special.log_ndtr(t.cpu().numpy()) + self.assertEqual(actual, expected) + + # Skip testing NaN, inf, -inf since they are tested in reference_numerics tests. 
info = torch.finfo(dtype) min, max, eps, tiny = info.min, info.max, info.eps, info.tiny t = torch.tensor([min, max, eps, tiny], dtype=dtype, device=device) @@ -1307,7 +1316,7 @@ def test_abs_zero(self, device, dtype): for num in abs_zeros: self.assertGreater(math.copysign(1.0, num), 0.0) - @dtypes(*(get_all_dtypes(include_bool=False))) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16)) def test_isposinf_isneginf_non_boolean_output(self, device, dtype): # test non-boolean tensors as the `out=` parameters # boolean outputs are tested in the above testcases @@ -1349,10 +1358,8 @@ def assert_tuple_empty(tup, dim): self.assertEqual(torch.empty(0, dtype=torch.long), z[0]) # TODO: rationalize with exp OpInfo - @dtypes(*(get_all_fp_dtypes(include_half=False) + - get_all_complex_dtypes())) - @dtypesIfCUDA(*(get_all_fp_dtypes(include_half=True) + - get_all_complex_dtypes())) + @dtypes(*floating_and_complex_types_and(torch.bfloat16)) + @dtypesIfCUDA(*floating_and_complex_types_and(torch.half, torch.bfloat16)) def test_exp(self, device, dtype): for v in (2, -2) + ((1j, 1 + 1j) if dtype.is_complex else ()): a = torch.tensor(v, dtype=dtype, device=device) * torch.arange(18, device=device) / 3 * math.pi diff --git a/test/test_utils.py b/test/test_utils.py index c8f4e3aa9453b7..6338b8d5d810a5 100644 --- a/test/test_utils.py +++ b/test/test_utils.py @@ -1,4 +1,4 @@ -# Owner(s): ["high priority"] +# Owner(s): ["module: unknown"] import sys import os @@ -18,10 +18,9 @@ import torch.cuda from torch.utils.checkpoint import checkpoint, checkpoint_sequential import torch.utils.cpp_extension -import torch.hub as hub from torch.autograd._functions.utils import check_onnx_broadcast from torch.onnx.symbolic_opset9 import _prepare_onnx_paddings -from torch.testing._internal.common_utils import has_breakpad, load_tests, retry, IS_SANDCASTLE, IS_WINDOWS, TEST_WITH_ASAN +from torch.testing._internal.common_utils import has_breakpad, load_tests, IS_SANDCASTLE, IS_WINDOWS, TEST_WITH_ASAN # load_tests from torch.testing._internal.common_utils is used to automatically filter tests for # sharding on sandcastle. This line silences flake warnings @@ -411,12 +410,6 @@ def test_multi_drop(self): test_dir = os.path.abspath(os.path.dirname(str(__file__))) -class TestFFI(TestCase): - def test_deprecated(self): - with self.assertRaisesRegex(ImportError, "torch.utils.ffi is deprecated. 
Please use cpp extensions instead."): - from torch.utils.ffi import create_extension # type: ignore[attr-defined] # noqa: F401 - - @unittest.skipIf('SKIP_TEST_BOTTLENECK' in os.environ.keys(), 'SKIP_TEST_BOTTLENECK is set') class TestBottleneck(TestCase): def _run(self, command, timeout=30): @@ -584,146 +577,6 @@ def try_check_onnx_broadcast(dims1, dims2, expect_broadcast, expect_fail): try_check_onnx_broadcast(dims1, dims2, True, False) -def sum_of_state_dict(state_dict): - s = 0 - for _, v in state_dict.items(): - s += v.sum() - return s - -SUM_OF_HUB_EXAMPLE = 431080 -TORCHHUB_EXAMPLE_RELEASE_URL = 'https://github.com/ailzhang/torchhub_example/releases/download/0.1/mnist_init_ones' - -@unittest.skipIf(IS_SANDCASTLE, 'Sandcastle cannot ping external') -class TestHub(TestCase): - @retry(Exception, tries=3) - def test_load_from_github(self): - hub_model = hub.load( - 'ailzhang/torchhub_example', - 'mnist', - source='github', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_load_from_local_dir(self): - local_dir = hub._get_cache_or_reload( - 'ailzhang/torchhub_example', force_reload=False) - hub_model = hub.load( - local_dir, - 'mnist', - source='local', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_load_from_branch(self): - hub_model = hub.load( - 'ailzhang/torchhub_example:ci/test_slash', - 'mnist', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_set_dir(self): - temp_dir = tempfile.gettempdir() - hub.set_dir(temp_dir) - hub_model = hub.load( - 'ailzhang/torchhub_example', - 'mnist', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - assert os.path.exists(temp_dir + '/ailzhang_torchhub_example_master') - shutil.rmtree(temp_dir + '/ailzhang_torchhub_example_master') - - @retry(Exception, tries=3) - def test_list_entrypoints(self): - entry_lists = hub.list('ailzhang/torchhub_example', force_reload=True) - self.assertObjectIn('mnist', entry_lists) - - @retry(Exception, tries=3) - def test_download_url_to_file(self): - temp_file = os.path.join(tempfile.gettempdir(), 'temp') - hub.download_url_to_file(TORCHHUB_EXAMPLE_RELEASE_URL, temp_file, progress=False) - loaded_state = torch.load(temp_file) - self.assertEqual(sum_of_state_dict(loaded_state), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_load_state_dict_from_url(self): - loaded_state = hub.load_state_dict_from_url(TORCHHUB_EXAMPLE_RELEASE_URL) - self.assertEqual(sum_of_state_dict(loaded_state), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_load_zip_checkpoint(self): - hub_model = hub.load( - 'ailzhang/torchhub_example', - 'mnist_zip', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - - # Test the default zipfile serialization format produced by >=1.6 release. 
- @retry(Exception, tries=3) - def test_load_zip_1_6_checkpoint(self): - hub_model = hub.load( - 'ailzhang/torchhub_example', - 'mnist_zip_1_6', - pretrained=True, - verbose=False) - self.assertEqual(sum_of_state_dict(hub_model.state_dict()), - SUM_OF_HUB_EXAMPLE) - - - def test_hub_dir(self): - with tempfile.TemporaryDirectory('hub_dir') as dirname: - torch.hub.set_dir(dirname) - self.assertEqual(torch.hub.get_dir(), dirname) - - @retry(Exception, tries=3) - def test_hub_parse_repo_info(self): - # If the branch is specified we just parse the input and return - self.assertEqual( - torch.hub._parse_repo_info('a/b:c'), - ('a', 'b', 'c') - ) - # For torchvision, the default branch is main - self.assertEqual( - torch.hub._parse_repo_info('pytorch/vision'), - ('pytorch', 'vision', 'main') - ) - # For the torchhub_example repo, the default branch is still master - self.assertEqual( - torch.hub._parse_repo_info('ailzhang/torchhub_example'), - ('ailzhang', 'torchhub_example', 'master') - ) - - @retry(Exception, tries=3) - def test_load_state_dict_from_url_with_name(self): - with tempfile.TemporaryDirectory('hub_dir') as dirname: - torch.hub.set_dir(dirname) - file_name = 'test_file' - loaded_state = hub.load_state_dict_from_url(TORCHHUB_EXAMPLE_RELEASE_URL, file_name=file_name) - self.assertTrue(os.path.exists(os.path.join(dirname, 'checkpoints', file_name))) - self.assertEqual(sum_of_state_dict(loaded_state), - SUM_OF_HUB_EXAMPLE) - - @retry(Exception, tries=3) - def test_load_commit_from_forked_repo(self): - with self.assertRaisesRegex( - ValueError, - 'If it\'s a commit from a forked repo'): - model = torch.hub.load('pytorch/vision:4e2c216', 'resnet18', force_reload=True) - class TestHipify(TestCase): def test_import_hipify(self): from torch.utils.hipify import hipify_python # noqa: F401 diff --git a/test/test_view_ops.py b/test/test_view_ops.py index d85d53e6991510..064d001727ab70 100644 --- a/test/test_view_ops.py +++ b/test/test_view_ops.py @@ -16,7 +16,7 @@ from torch.testing._internal.common_device_type import \ (instantiate_device_type_tests, onlyCPU, dtypes, onlyNativeDeviceTypes, skipMeta) from torch.testing._internal.common_dtype import ( - get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes + all_types_and_complex_and, complex_types, all_types_and, floating_and_complex_types_and, ) # TODO: replace this with make_tensor() in common_utils.py @@ -121,14 +121,14 @@ def _do_transpose(self, x, contiguous=False, dim0=0, dim1=1): else: return x.transpose(dim0, dim1) - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_conj_self(self, device, dtype): t = torch.ones(5, 5, device=device) s = t.conj() self.assertTrue(s is t) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_view_dtype_new(self, device, dtype): dtypes = torch_to_numpy_dtype_dict.copy() del dtypes[torch.bool] @@ -210,18 +210,18 @@ def calc_expected_size_and_stride(a, view_dtype): # because view(dtype) does not support backward yet # TODO: Remove this when autograd support is added if dtype.is_floating_point or dtype.is_complex: - for view_dtype in [*get_all_fp_dtypes(), *get_all_complex_dtypes()]: + for view_dtype in floating_and_complex_types_and(torch.half, torch.bfloat16): t = make_tensor((5, 5, 64), dtype=dtype, device=device, low=-5, high=5, requires_grad=True) self.assertFalse(t.view(view_dtype).requires_grad) # Test the extra error checks 
that happen when the view dtype # has a greater element size than the original dtype @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_view_dtype_upsize_errors(self, device, dtype): dtype_size = torch._utils._element_size(dtype) - for view_dtype in get_all_dtypes(): + for view_dtype in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): view_dtype_size = torch._utils._element_size(view_dtype) if view_dtype_size <= dtype_size: continue @@ -302,7 +302,7 @@ def fn(contiguous_input=True, dim0=0, dim1=1): self.assertEqual(res.shape, torch.Size([0])) @onlyNativeDeviceTypes - @dtypes(*get_all_complex_dtypes(include_complex32=True)) + @dtypes(*complex_types(), torch.complex32) def test_view_as_real(self, device, dtype): def fn(contiguous_input=True): t = torch.randn(3, 4, dtype=dtype, device=device) @@ -340,7 +340,7 @@ def fn(contiguous_input=True): self.assertEqual(res.shape, torch.Size([2])) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_view_tensor_split(self, device, dtype): a = make_tensor((40, 30), dtype=dtype, device=device, low=-9, high=9) a_split_dim0 = a.tensor_split(7, 0) @@ -351,7 +351,7 @@ def test_view_tensor_split(self, device, dtype): self.assertTrue(self.is_view_of(a, a_split_dim1_tensor)) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_view_tensor_hsplit(self, device, dtype): t = make_tensor((4, 4, 4), dtype=dtype, device=device, low=-9, high=9) t_hsplit = torch.hsplit(t, 2) @@ -361,7 +361,7 @@ def test_view_tensor_hsplit(self, device, dtype): self.assertEqual(t_hsplit[1][2, 0, 2], t[2, 2, 2]) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_view_tensor_vsplit(self, device, dtype): t = make_tensor((4, 4, 4), dtype=dtype, device=device, low=-9, high=9) t_vsplit = torch.vsplit(t, 2) @@ -371,7 +371,7 @@ def test_view_tensor_vsplit(self, device, dtype): self.assertEqual(t_vsplit[1][0, 2, 2], t[2, 2, 2]) @onlyNativeDeviceTypes - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_view_tensor_dsplit(self, device, dtype): t = make_tensor((4, 4, 4), dtype=dtype, device=device, low=-9, high=9) t_dsplit = torch.dsplit(t, 2) @@ -381,7 +381,7 @@ def test_view_tensor_dsplit(self, device, dtype): self.assertEqual(t_dsplit[1][2, 2, 0], t[2, 2, 2]) @onlyNativeDeviceTypes - @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes())) + @dtypes(*all_types_and(torch.half, torch.bfloat16)) def test_imag_noncomplex(self, device, dtype): t = torch.ones((5, 5), dtype=dtype, device=device) @@ -389,7 +389,7 @@ def test_imag_noncomplex(self, device, dtype): torch.imag(t) @onlyNativeDeviceTypes - @dtypes(*get_all_complex_dtypes()) + @dtypes(*complex_types()) def test_real_imag_view(self, device, dtype): def compare_with_numpy(contiguous_input=True): t = torch.randn(3, 3, dtype=dtype, device=device) @@ -420,7 +420,7 @@ def compare_with_numpy(contiguous_input=True): self.assertEqual(a[5:].imag, a.imag[5:]) @onlyNativeDeviceTypes - @dtypes(*get_all_complex_dtypes()) + @dtypes(*complex_types()) def test_conj_imag_view(self, device, dtype) -> None: t = _make_tensor((4, 5,), dtype, device) t_numpy_conj = torch.from_numpy(t.cpu().numpy().conj()).to(device=device) @@ -445,7 +445,7 @@ def 
test_conj_view_with_shared_memory(self, device) -> None: self.assertEqual(torch.add(b, c), b.add_(c)) @onlyNativeDeviceTypes - @dtypes(*product(get_all_complex_dtypes(), get_all_dtypes())) + @dtypes(*product(complex_types(), all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))) @suppress_warnings def test_set_real_imag(self, device, dtypes): x = torch.randn(10, dtype=dtypes[0], device=device) @@ -917,29 +917,38 @@ def _test_ravel(tensors, size, nc=False): flat = src.ravel() self.assertEqual(flat.shape, torch.Size([size])) self.assertEqual(src.view(-1), flat) - self.assertEqual(flat._base, src) + self.assertIs(flat._base, src) + self.assertTrue(flat.is_contiguous()) # Non-continuous Tensor -> Copy if nc: nc_src = src.t() nc_flat = nc_src.ravel() self.assertEqual(nc_flat.shape, torch.Size([size])) - self.assertEqual(nc_src.reshape(-1), nc_flat) - self.assertTrue(nc_flat._base != nc_src) + self.assertEqual(nc_src.contiguous().view(-1), nc_flat) + self.assertIsNot(nc_flat._base, src) + self.assertTrue(nc_flat.is_contiguous()) # Test that flatten returns 1-dim tensor when given a 0-dim tensor zero_dim_tensor = torch.tensor(123, device=device) flat0 = zero_dim_tensor.ravel() one_dim_tensor = torch.tensor([123], device=device) flat1 = zero_dim_tensor.ravel() + nc_ones_tensor = torch.ones(10, device=device)[::2] + flat2 = nc_ones_tensor.ravel() self.assertEqual(zero_dim_tensor.shape, torch.Size([])) self.assertEqual(flat0.shape, torch.Size([1])) self.assertEqual(one_dim_tensor.shape, torch.Size([1])) self.assertEqual(flat1.shape, torch.Size([1])) + self.assertEqual(nc_ones_tensor.shape, torch.Size([5])) + self.assertEqual(flat2.shape, torch.Size([5])) self.assertEqual(flat0, one_dim_tensor) self.assertEqual(flat0, flat1) self.assertEqual(flat0.shape, flat1.shape) + self.assertTrue(flat0.is_contiguous()) + self.assertTrue(flat1.is_contiguous()) + self.assertTrue(flat2.is_contiguous()) # Test both float tensor and quantized tensor tensors = [torch.randn(5, 5, 5, 5, device=device), @@ -1255,7 +1264,7 @@ def test_T(self, device): scalar = torch.tensor(5, device=device) self.assertEqual(scalar, scalar.T) - @dtypes(*(torch.testing.get_all_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_transposes(self, device, dtype): for op in ("T", "H", "mT", "mH", "adjoint"): shapes = ((), (2, 3), (2, 3, 4)) if op[0] == "m" or op == "adjoint" else ((), (2, 3),) @@ -1271,7 +1280,7 @@ def test_transposes(self, device, dtype): t2 = t2.conj() self.assertEqual(t2, t1) - @dtypes(*(torch.testing.get_all_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_transposes_errors(self, device, dtype): for op in ("H", "mT", "mH", "adjoint"): shapes = ((2,), (2, 3, 4)) if op == "H" else ((2,),) @@ -1397,8 +1406,7 @@ def _test_atleast_dim(self, torch_fn, np_fn, device, dtype): self.assertEqual(np_res, torch_res) # TODO: are these view ops? 
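# Illustrative sketch (not part of the patch; uses only public torch APIs): the
# view-vs-copy behavior asserted by the ravel checks earlier in this file's changes.
import torch
src = torch.randn(2, 3)
flat = src.ravel()
# Contiguous input: ravel returns a view that shares storage with the source.
assert flat._base is src and flat.is_contiguous()
nc_flat = src.t().ravel()
# Non-contiguous (transposed) input: ravel has to copy, so no storage is shared.
assert nc_flat._base is not src
assert torch.equal(nc_flat, src.t().contiguous().view(-1))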
- @dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) + - get_all_complex_dtypes())) + @dtypes(*all_types_and_complex_and(torch.half)) def test_atleast(self, device, dtype): self._test_atleast_dim(torch.atleast_1d, np.atleast_1d, device, dtype) self._test_atleast_dim(torch.atleast_2d, np.atleast_2d, device, dtype) @@ -1535,7 +1543,7 @@ def test_broadcast_shapes_numpy_ref(self, device): self.assertEqual(res1, res2_numpy) # Skip BFloat16 since numpy does not support it - @dtypes(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_broadcast_to(self, device, dtype): def can_broadcast(s0, s1): # s0.dim() <= s1.dim(), reverse s0 and s1 to compare trailing dimension @@ -1638,7 +1646,7 @@ def test_view(self, device): self.assertEqual(tensor.view(6, 2, 1), contig_tensor.view(6, 2, 1)) self.assertEqual(tensor.view(1, 6, 2, 1), contig_tensor.view(1, 6, 2, 1)) - @dtypes(*get_all_dtypes()) + @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)) def test_reshape_view_semantics(self, device, dtype): tensor = make_tensor((15, 4), dtype=dtype, device=device) target = (20, 3) @@ -1665,7 +1673,7 @@ def test_contiguous(self, device): @onlyNativeDeviceTypes # Skip BFloat16 since numpy does not support it - @dtypes(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_tensor_split_sections(self, device, dtype): input_sizes = [ (0,), @@ -1696,7 +1704,7 @@ def test_tensor_split_sections(self, device, dtype): @onlyNativeDeviceTypes # Skip BFloat16 since numpy does not support it - @dtypes(*get_all_dtypes(include_bfloat16=False)) + @dtypes(*all_types_and_complex_and(torch.half, torch.bool)) def test_tensor_split_indices(self, device, dtype): input_sizes = [ (0,), @@ -1775,20 +1783,28 @@ def test_tensor_split_errors(self, device): def test_resize_all_dtypes_and_devices(self, device): shape = (2, 2) - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): x = torch.tensor([[1, 2], [3, 4], [5, 6]], dtype=dt, device=device) x.resize_(shape) self.assertEqual(shape, x.shape) def test_resize_as_all_dtypes_and_devices(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): x = torch.tensor([[1, 2], [3, 4], [5, 6]], dtype=dt, device=device) y = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=dt, device=device) x.resize_as_(y) self.assertEqual(y.shape, x.shape) + @onlyNativeDeviceTypes + def test_resize_overflow(self, device): + x = torch.empty((), dtype=torch.float64) + with self.assertRaisesRegex(RuntimeError, 'Storage size calculation overflowed'): + x.resize_([2, 4, 2**29, 2**29]) + with self.assertRaisesRegex(RuntimeError, 'overflow'): + x.resize_([8, 8, 2**29, 2**29]) + def test_view_all_dtypes_and_devices(self, device): - for dt in get_all_dtypes(): + for dt in all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool): x = torch.tensor([[1, 2], [3, 4], [5, 6]], dtype=dt, device=device) self.assertEqual(x.view(6).shape, [6]) diff --git a/test/typing/reveal/namedtuple.py b/test/typing/reveal/namedtuple.py index 8a0508b325c5a9..2e130338f0b976 100644 --- a/test/typing/reveal/namedtuple.py +++ b/test/typing/reveal/namedtuple.py @@ -7,9 +7,9 @@ t_sort[0][0, 0] == 1.5 # noqa: B015 t_sort.indices[0, 0] == 1 # noqa: B015 t_sort.values[0, 0] == 1.5 # noqa: B015 -reveal_type(t_sort) # E: Tuple[{Tensor}, {Tensor}, 
fallback=torch._C.namedtuple_values_indices] +reveal_type(t_sort) # E: Tuple[{Tensor}, {Tensor}, fallback=torch.return_types.sort] t_qr = torch.linalg.qr(t) t_qr[0].shape == [2, 2] # noqa: B015 t_qr.Q.shape == [2, 2] # noqa: B015 -reveal_type(t_qr) # E: Tuple[{Tensor}, {Tensor}, fallback=torch._C._VariableFunctions.namedtuple_Q_R] +reveal_type(t_qr) # E: Tuple[{Tensor}, {Tensor}, fallback=torch.return_types.qr] diff --git a/third_party/eigen b/third_party/eigen index d41dc4dd74acce..3147391d946bb4 160000 --- a/third_party/eigen +++ b/third_party/eigen @@ -1 +1 @@ -Subproject commit d41dc4dd74acce21fb210e7625d5d135751fa9e5 +Subproject commit 3147391d946bb4b6c68edd901f2add6ac1f31f8c diff --git a/third_party/fbgemm b/third_party/fbgemm index d399aee88df3ec..9cf1a9ffefbb43 160000 --- a/third_party/fbgemm +++ b/third_party/fbgemm @@ -1 +1 @@ -Subproject commit d399aee88df3ece31d2615a2938837b1e745f446 +Subproject commit 9cf1a9ffefbb439e823dd3340ab4967e0cfe23a6 diff --git a/third_party/kineto b/third_party/kineto index b5bb62d25be75c..b2b48c00c6e5bd 160000 --- a/third_party/kineto +++ b/third_party/kineto @@ -1 +1 @@ -Subproject commit b5bb62d25be75c381dbbd975276602f021982ef2 +Subproject commit b2b48c00c6e5bd8e807e2231adb229db6a1d1c22 diff --git a/tools/amd_build/build_amd.py b/tools/amd_build/build_amd.py index 38698631c03cf0..785f63085c2ea4 100755 --- a/tools/amd_build/build_amd.py +++ b/tools/amd_build/build_amd.py @@ -89,6 +89,8 @@ "tools/autograd/templates/python_variable_methods.cpp", ] +includes = [os.path.join(proj_dir, include) for include in includes] + for new_dir in args.extra_include_dir: abs_new_dir = os.path.join(proj_dir, new_dir) if os.path.exists(abs_new_dir): @@ -112,6 +114,8 @@ "torch/include/*", ] +ignores = [os.path.join(proj_dir, ignore) for ignore in ignores] + # Check if the compiler is hip-clang. def is_hip_clang() -> bool: try: diff --git a/tools/autograd/BUILD.bazel b/tools/autograd/BUILD.bazel new file mode 100644 index 00000000000000..2fd1043f2d408f --- /dev/null +++ b/tools/autograd/BUILD.bazel @@ -0,0 +1,10 @@ +py_library( + name = "autograd", + srcs = glob(["*.py"]), + data = glob([ + "*.yaml", + "templates/*", + ]), + visibility = ["//:__subpackages__"], + deps = ["//tools/codegen"], +) diff --git a/tools/autograd/derivatives.yaml b/tools/autograd/derivatives.yaml index c21e7222a854e7..0bbb57a4c49264 100644 --- a/tools/autograd/derivatives.yaml +++ b/tools/autograd/derivatives.yaml @@ -315,6 +315,7 @@ - name: atan2(Tensor self, Tensor other) -> Tensor self, other: atan2_backward(grad, self, other, grad_input_mask) + result: (-self_p * other_t + other_p * self_t) / (self_p.pow(2) + other_p.pow(2)) - name: baddbmm(Tensor self, Tensor batch1, Tensor batch2, *, Scalar beta=1, Scalar alpha=1) -> Tensor self: maybe_multiply(grad, beta.conj()) @@ -365,12 +366,14 @@ - name: cholesky_inverse(Tensor self, bool upper=False) -> Tensor self: cholesky_inverse_backward(grad, self, upper, result) + result: cholesky_inverse_jvp(self_p, self_t, result, upper) # For clamp, gradient is not defined at the boundaries. But empirically it's helpful # to be able to get gradient on min and max, so we return the subgradient 1 for these cases. - name: clamp.Tensor(Tensor self, Tensor? min=None, Tensor? max=None) -> Tensor self: clamp_backward(grad, self, min, max) min, max: clamp_backward_min_max(grad, self, min, max, grad_input_mask) + result: clamp_jvp(self_p, self_t, min_p, min_t, max_p, max_t) - name: clamp(Tensor self, Scalar? min=None, Scalar? 
max=None) -> Tensor self: clamp_backward(grad, self, min, max) @@ -383,7 +386,7 @@ - name: clamp_min.Tensor(Tensor self, Tensor min) -> Tensor self: where(self >= min, grad, at::scalar_tensor(0., grad.options())) min: where(self < min, grad, at::scalar_tensor(0., grad.options())) - result: where(self_p >= min_p, self_t, at::scalar_tensor(0., self_p.options())) + where(self_p < min_p, min_t, at::scalar_tensor(0., self_p.options())) + result: where(self_p >= min_p, self_t, min_t) - name: clamp_max(Tensor self, Scalar max) -> Tensor self: where(self <= max, grad, at::scalar_tensor(0., grad.options())) @@ -392,7 +395,7 @@ - name: clamp_max.Tensor(Tensor self, Tensor max) -> Tensor self: where(self <= max, grad, at::scalar_tensor(0., grad.options())) max: where(self > max, grad, at::scalar_tensor(0., grad.options())) - result: where(self_p <= max_p, self_t, at::scalar_tensor(0., self_p.options())) + where(self_p > max_p, max_t, at::scalar_tensor(0., self_p.options())) + result: where(self_p <= max_p, self_t, max_t) - name: clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor self: grad @@ -415,6 +418,7 @@ - name: polar(Tensor abs, Tensor angle) -> Tensor abs, angle: polar_backward(grad, result) + result: at::complex(abs_t*angle_p.cos() - angle_t*abs_p*angle_p.sin(), abs_t*angle_p.sin() + angle_t*abs_p*angle_p.cos()) - name: _conj(Tensor(a) self) -> Tensor(a) self: grad.conj() @@ -549,6 +553,7 @@ - name: native_dropout(Tensor input, float p, bool? train) -> (Tensor, Tensor) input: "GradMode::is_enabled() ? infinitely_differentiable_native_dropout_backward(grad, result1, (!train.has_value() || !train.value() ? 1 : (p == 1 ? 0.0 : 1.0 / (1.0 - p)))) : native_dropout_backward(grad, result1, (!train.has_value() || !train.value() ? 1 : (p == 1 ? 0.0 : 1.0 / (1.0 - p))))" + result0: "(!train.has_value() || train.value()) ? (p == 1 ? 
0.0 : 1.0 / (1.0 - p)) * input_t * result1 : input_t" - name: native_dropout_backward(Tensor grad_output, Tensor mask, float scale) -> Tensor grad_output: "native_dropout_double_backward(grad, grad_output, mask, scale)" @@ -910,6 +915,7 @@ - name: logsumexp(Tensor self, int[1] dim, bool keepdim=False) -> Tensor self: logsumexp_backward(grad, self, result, dim, keepdim) + result: logsumexp_jvp(self_p, self_t, dim, keepdim) - name: lstsq(Tensor self, Tensor A) -> (Tensor solution, Tensor QR) self: not_implemented("lstsq") @@ -979,7 +985,7 @@ - name: maximum(Tensor self, Tensor other) -> Tensor self: at::where(self == other, grad / 2, grad).masked_fill_(self < other, 0) other: at::where(self == other, grad / 2, grad).masked_fill_(self > other, 0) - result: other_t + at::where(self_p == other_p, 0.5, (self_p > other_p).to(result.scalar_type())) * (self_t - other_t) + result: other_t + at::where(self_p == other_p, at::scalar_tensor(0.5, result.options()), (self_p > other_p).to(result.scalar_type())) * (self_t - other_t) - name: fmax(Tensor self, Tensor other) -> Tensor self: grad.masked_fill((self >= other).logical_or_(other.isnan()).logical_not_(), 0) @@ -1035,7 +1041,7 @@ - name: minimum(Tensor self, Tensor other) -> Tensor self: at::where(self == other, grad / 2, grad).masked_fill_(self > other, 0) other: at::where(self == other, grad / 2, grad).masked_fill_(self < other, 0) - result: other_t + at::where(self_p == other_p, 0.5, (self_p < other_p).to(result.scalar_type())) * (self_t - other_t) + result: other_t + at::where(self_p == other_p, at::scalar_tensor(0.5, result.options()), (self_p < other_p).to(result.scalar_type())) * (self_t - other_t) - name: fmin(Tensor self, Tensor other) -> Tensor self: grad.masked_fill((self <= other).logical_or_(other.isnan()).logical_not_(), 0) @@ -1266,6 +1272,15 @@ self: grad * std::sqrt(2 * M_PI) * (result.square() / 2).exp() result: auto_element_wise +- name: special_log_ndtr(Tensor self) -> Tensor + self: grad / std::sqrt(2 * M_PI) * (result + self.pow(2) / 2).neg().exp() + result: auto_element_wise + +# [Note: Sometimes view derivatives] +# The following situation applies to other operations as well. +# TODO: This note is only referenced once by to_dense. Make this +# more generic if it's been referenced more than once. +# # DO NOT define a backward for reshape! # reshape is special in that it sometimes returns a view, and sometimes not. # Defining a backward will make codegen spit out the forward call as @@ -1447,9 +1462,11 @@ - name: rsub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor self: handle_r_to_c(self.scalar_type(), maybe_multiply(-grad, alpha.conj())) other: handle_r_to_c(other.scalar_type(), grad) + result: -maybe_multiply(self_t, alpha) + other_t - name: rsub.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> Tensor self: handle_r_to_c(self.scalar_type(), maybe_multiply(-grad, alpha.conj())) + result: auto_element_wise - name: sum(Tensor self, *, ScalarType? dtype=None) -> Tensor self: grad.expand(self.sizes()) @@ -1564,7 +1581,11 @@ self: zeros_like(grad) result: auto_element_wise -- name: to_dense(Tensor self, ScalarType? dtype=None) -> Tensor +# DO NOT define a backward for to_dense +# See [Note: Sometimes view derivatives] +# - name: to_dense(Tensor self, ScalarType? dtype=None) -> Tensor +# +- name: _to_dense(Tensor self, ScalarType? 
dtype=None) -> Tensor self: to_dense_backward(grad, self) - name: to_sparse(Tensor self) -> Tensor @@ -1642,7 +1663,7 @@ self: at::view_as_real(grad.contiguous().resolve_conj()) # [gx, gy] result: at::view_as_complex(self_t) -- name: _s_where(Tensor condition, Tensor self, Tensor other) -> Tensor +- name: where.self(Tensor condition, Tensor self, Tensor other) -> Tensor condition: non_differentiable self: where(condition, grad, zeros_like(grad)) other: where(condition, zeros_like(grad), grad) @@ -1754,10 +1775,12 @@ - name: nll_loss_forward(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index) -> (Tensor output, Tensor total_weight) self: nll_loss_backward(grad, self, target, weight, reduction, ignore_index, total_weight) target: non_differentiable + output: std::get<0>(nll_loss_forward(self_t, target, weight, reduction, ignore_index)) - name: nll_loss2d_forward(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index) -> (Tensor output, Tensor total_weight) self: nll_loss2d_backward(grad, self, target, weight, reduction, ignore_index, total_weight) target: non_differentiable + output: std::get<0>(nll_loss2d_forward(self_t, target, weight, reduction, ignore_index)) - name: smooth_l1_loss(Tensor self, Tensor target, int reduction=Mean, float beta=1.0) -> Tensor self: smooth_l1_loss_backward(grad, self, target, reduction, beta) @@ -1837,6 +1860,7 @@ - name: _log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor self: _log_softmax_backward_data(grad, result, dim, self.scalar_type()) + result: self_t - logsumexp_jvp(self_p, self_t, {dim}, true) - name: _sparse_log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor self: _sparse_log_softmax_backward_data(grad, result, dim, self) @@ -1855,6 +1879,7 @@ - name: _softmax(Tensor self, int dim, bool half_to_float) -> Tensor self: _softmax_backward_data(grad, result, dim, self.scalar_type()) + result: result * (self_t - logsumexp_jvp(self_p, self_t, {dim}, true)) - name: _sparse_softmax(Tensor self, int dim, bool half_to_float) -> Tensor self: _sparse_softmax_backward_data(grad, result, dim, self) @@ -1903,43 +1928,52 @@ self: replication_pad3d_backward(grad, self, padding) result: auto_linear - # NOTE: Not implementing forward AD formulas for non-vec upsample overloads because they are - # only kept for backward compatability - name: upsample_linear1d(Tensor self, int[1] output_size, bool align_corners, float? scales=None) -> Tensor self: upsample_linear1d_backward(grad, output_size, self.sizes(), align_corners, scales) + result: auto_linear - name: upsample_bilinear2d(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor self: upsample_bilinear2d_backward(grad, output_size, self.sizes(), align_corners, scales_h, scales_w) + result: auto_linear - name: _upsample_bilinear2d_aa(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor self: _upsample_bilinear2d_aa_backward(grad, output_size, self.sizes(), align_corners, scales_h, scales_w) + result: auto_linear - name: upsample_bicubic2d(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor self: upsample_bicubic2d_backward(grad, output_size, self.sizes(), align_corners, scales_h, scales_w) + result: auto_linear - name: _upsample_bicubic2d_aa(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? 
scales_w=None) -> Tensor self: _upsample_bicubic2d_aa_backward(grad, output_size, self.sizes(), align_corners, scales_h, scales_w) - name: upsample_trilinear3d(Tensor self, int[3] output_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor self: upsample_trilinear3d_backward(grad, output_size, self.sizes(), align_corners, scales_d, scales_h, scales_w) + result: auto_linear - name: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor self: upsample_nearest1d_backward(grad, output_size, self.sizes(), scales) + result: auto_linear - name: _upsample_nearest_exact1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor self: _upsample_nearest_exact1d_backward(grad, output_size, self.sizes(), scales) + result: auto_linear - name: upsample_nearest2d(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None) -> Tensor self: upsample_nearest2d_backward(grad, output_size, self.sizes(), scales_h, scales_w) + result: auto_linear - name: _upsample_nearest_exact2d(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None) -> Tensor self: _upsample_nearest_exact2d_backward(grad, output_size, self.sizes(), scales_h, scales_w) + result: auto_linear - name: upsample_nearest3d(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor self: upsample_nearest3d_backward(grad, output_size, self.sizes(), scales_d, scales_h, scales_w) + result: auto_linear - name: _upsample_nearest_exact3d(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor self: _upsample_nearest_exact3d_backward(grad, output_size, self.sizes(), scales_d, scales_h, scales_w) + result: auto_linear - name: upsample_linear1d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor input: upsample_linear1d_backward(grad, output_size, input.sizes(), align_corners, scale_factors) @@ -2144,6 +2178,7 @@ - name: elu_backward(Tensor grad_output, Scalar alpha, Scalar scale, Scalar input_scale, bool is_result, Tensor self_or_result) -> Tensor grad_output: elu_backward(grad, alpha, scale, input_scale, is_result, self_or_result) self_or_result: elu_double_backward(grad, grad_output, alpha, scale, input_scale, is_result, self_or_result) + result: elu_backward(grad_output_t, alpha, scale, input_scale, is_result, self_or_result_p) + elu_double_backward(self_or_result_t, grad_output_p, alpha, scale, input_scale, is_result, self_or_result_p) - name: fractional_max_pool2d_backward(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] output_size, Tensor indices) -> Tensor grad_output: max_pool_double_backward(grad, indices, 2) @@ -2186,6 +2221,24 @@ # self_is_result is always false here since double backward call is an out-of-place call, self is input itself grad_output: leaky_relu_backward(grad, self, negative_slope, false) self: zeros_like(grad) + # leaky_relu_backward(grad_output, self, negative_slope, false) + # computes grad_output * at::where(self_p > 0, 1, negative_slope) + # so the jvp formula is the following: + # grad_output_t * at::where(self_p > 0, self_p.new_ones([]), negative_slope); + # + # leaky_relu_backward(grad_output, result, negative_slope, true) + # computes grad_output * at::where(result > 0, 1, negative_slope) + # under the assumption that `negative_slope` is positive (otherwise, + # it is not possible to compute the gradient). 
+ # + # so the jvp formula is the following: + # grad_output_t * at::where(result_p > 0, result_p.new_ones([]), negative_slope); + # with the assumption that negative_slope is positive. + # + # Combined together that results in the following optimized kernel which + # also checks the assumption that negative_slope is positive when self_is_result + # is True: + result: leaky_relu_backward(grad_output_t, self_p, negative_slope, self_is_result) - name: max_pool2d_with_indices_backward(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride, int[2] padding, int[2] dilation, bool ceil_mode, Tensor indices) -> Tensor grad_output: max_pool_double_backward(grad, indices, 2) @@ -2286,43 +2339,52 @@ self: zeros_like(grad) result: zeros_like(self_t) + threshold_backward(grad_output_t, self_p, threshold) - # NOTE: Not implementing forward AD formulas for backwards of non-vec upsample overloads - # because they are only kept for backward compatability - name: upsample_linear1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, bool align_corners, float? scales=None) -> Tensor grad_output: upsample_linear1d(grad, output_size, align_corners, scales) + result: auto_linear - name: upsample_bilinear2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: upsample_bilinear2d(grad, output_size, align_corners, scales_h, scales_w) + result: auto_linear - name: _upsample_bilinear2d_aa_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: _upsample_bilinear2d_aa(grad, output_size, align_corners, scales_h, scales_w) + result: auto_linear - name: upsample_bicubic2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: upsample_bicubic2d(grad, output_size, align_corners, scales_h, scales_w) + result: auto_linear - name: _upsample_bicubic2d_aa_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: _upsample_bicubic2d_aa(grad, output_size, align_corners, scales_h, scales_w) - name: upsample_trilinear3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: upsample_trilinear3d(grad, output_size, align_corners, scales_d, scales_h, scales_w) + result: auto_linear - name: upsample_nearest1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None) -> Tensor grad_output: upsample_nearest1d(grad, output_size, scales) + result: auto_linear - name: _upsample_nearest_exact1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None) -> Tensor grad_output: _upsample_nearest_exact1d(grad, output_size, scales) + result: auto_linear - name: upsample_nearest2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: upsample_nearest2d(grad, output_size, scales_h, scales_w) + result: auto_linear - name: _upsample_nearest_exact2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? 
scales_w=None) -> Tensor grad_output: _upsample_nearest_exact2d(grad, output_size, scales_h, scales_w) + result: auto_linear - name: upsample_nearest3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: upsample_nearest3d(grad, output_size, scales_d, scales_h, scales_w) + result: auto_linear - name: _upsample_nearest_exact3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor grad_output: _upsample_nearest_exact3d(grad, output_size, scales_d, scales_h, scales_w) + result: auto_linear - name: upsample_linear1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor grad_output: upsample_linear1d(grad, output_size, align_corners, scale_factors) @@ -2490,12 +2552,15 @@ # fft - name: _fft_r2c(Tensor self, int[] dim, int normalization, bool onesided) -> Tensor self: fft_r2c_backward(grad, dim, normalization, onesided, self.size(dim.back())) + result: auto_linear - name: _fft_c2r(Tensor self, int[] dim, int normalization, int last_dim_size) -> Tensor self: fft_c2r_backward(grad, dim, normalization) + result: auto_linear - name: _fft_c2c(Tensor self, int[] dim, int normalization, bool forward) -> Tensor self: _fft_c2c(grad, dim, normalization, !forward) + result: auto_linear - name: unbind.int(Tensor(a -> *) self, int dim=0) -> Tensor(a)[] self: unbind_backward(grads, dim) @@ -2595,6 +2660,6 @@ - name: _efficientzerotensor(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor output_differentiability: [False] -- name: scatter_reduce.two(Tensor self, int dim, Tensor index, str reduce, *, int? 
output_size=None) -> Tensor - self: scatter_reduce_backward(grad, self, dim, index, reduce, result) +- name: scatter_reduce.two(Tensor self, int dim, Tensor index, Tensor src, str reduce, *, bool include_self=True) -> Tensor + self, src: scatter_reduce_backward(grad, self, dim, index, src, reduce, include_self, result) index: non_differentiable diff --git a/tools/autograd/gen_autograd_functions.py b/tools/autograd/gen_autograd_functions.py index be7c7212db8dc6..fd9f50e8eb80b9 100644 --- a/tools/autograd/gen_autograd_functions.py +++ b/tools/autograd/gen_autograd_functions.py @@ -13,7 +13,8 @@ uses_single_grad) from tools.codegen.api.types import (Binding, BaseCType, OptionalCType, tensorT, longT, doubleT, scalarT, stringT, boolT, intArrayRefT, - tensorListT, MutRefCType, ListCType, ArrayRefCType) + tensorListT, MutRefCType, ListCType, ArrayRefCType, + optionalIntArrayRefT) from tools.codegen.code_template import CodeTemplate from tools.codegen.utils import FileManager from tools.codegen.model import Argument @@ -204,7 +205,7 @@ GETTER_BODY_VEC_SAVEDVAR = """\ PyObject* tup = PyTuple_New((Py_ssize_t) prop.size()); -for (int i = 0; i < prop.size(); i++) { +for (auto i: c10::irange(prop.size())) { PyTuple_SetItem(tup, (Py_ssize_t) i, THPVariable_Wrap(prop[i].unpack(self->cdata))); } return tup; @@ -212,7 +213,7 @@ GETTER_BODY_RAW_VEC_SAVEDVAR = """\ PyObject* tup = PyTuple_New((Py_ssize_t) prop.size()); -for (int i = 0; i < prop.size(); i++) { +for (auto i : c10::irange(prop.size())) { pybind11::object obj = pybind11::cast(prop[i], pybind11::return_value_policy::reference); PyTuple_SetItem(tup, (Py_ssize_t) i, obj.release().ptr()); } @@ -221,7 +222,7 @@ GETTER_BODY_ARRAYREF_LONG = """\ PyObject* tup = PyTuple_New((Py_ssize_t) prop.size()); -for (int i = 0; i < prop.size(); i++) { +for (auto i : c10::irange(prop.size())) { PyTuple_SetItem(tup, (Py_ssize_t) i, PyLong_FromUnsignedLong((uint64_t) prop[i])); } return tup; @@ -229,7 +230,7 @@ GETTER_BODY_ARRAYREF_DOUBLE = """\ PyObject* tup = PyTuple_New((Py_ssize_t) prop.size()); -for (int i = 0; i < prop.size(); i++) { +for (auto i : c10::irange(prop.size())) { PyTuple_SetItem(tup, (Py_ssize_t) i, PyFloat_FromDouble((double) prop[i])); } return tup; @@ -422,6 +423,10 @@ def save_var(var: SavedAttribute, is_output: bool) -> None: saved_variables.append(f'std::vector {name};') getter_definitions.append(GETTER_DEFINITION.substitute( op=info.op, name=name, body=GETTER_BODY_ARRAYREF_LONG)) + elif type == BaseCType(optionalIntArrayRefT): + saved_variables.append(f'c10::OptionalArray {name};') + getter_definitions.append(GETTER_DEFINITION_OPT_ARRAYREF.substitute( + op=info.op, name=name, body=GETTER_BODY_ARRAYREF_LONG)) elif type == OptionalCType(BaseCType(intArrayRefT)): saved_variables.append(f'c10::OptionalArray {name};') getter_definitions.append(GETTER_DEFINITION_OPT_ARRAYREF.substitute( diff --git a/tools/autograd/gen_variable_type.py b/tools/autograd/gen_variable_type.py index 4b634146dfedcd..62def9cc627371 100644 --- a/tools/autograd/gen_variable_type.py +++ b/tools/autograd/gen_variable_type.py @@ -91,7 +91,7 @@ 'triu', 'chunk', 'zero_', 'eq_', 'ne_', 'add', '__radd__', 'sum', '_conj', 'sin', 'cos', 'mul', 'sinc', 'sinh', 'cosh', '__rmul__', 'sgn', 'asin', 'acos', 'sub', 'div', 'cat', 'view_as_complex', 'index_put', - 'neg', 'complex', 'select', '_s_where', 'as_strided', 'slice', 'constant_pad_nd', + 'neg', 'complex', 'select', 'where', 'as_strided', 'slice', 'constant_pad_nd', 'unbind', 'split', 'split_with_sizes', 'unsafe_split', 
'split_with_sizes_backward', 'dot', 'vdot', 'cholesky', 'triangular_solve', 'mm', '_unsafe_view', 'mv', 'outer', 'bmm', 'diagonal', 'alias', 'atan', 'log', 'log10', 'log1p', 'log2', 'reciprocal', @@ -111,10 +111,11 @@ 'scatter', 'scatter_add', 'sigmoid', 'sigmoid_backward', 'trapezoid', 'cumulative_trapezoid', 'conj_physical_', '_neg_view', '_reshape_alias', '_det_lu_based_helper', 'lu_solve', 'linalg_solve_triangular', 'linalg_pinv', 'linalg_lstsq', 'col2im', 'col2im_backward', 'im2col', 'im2col_backward', + 'cholesky_inverse', } GRADIENT_IMPLEMENTED_FOR_SPARSE_COMPLEX = { - 'to_dense', '_coalesce', 'coalesce', 'values', '_sparse_coo_tensor_with_dims_and_tensors', + '_to_dense', '_coalesce', 'coalesce', 'values', '_sparse_coo_tensor_with_dims_and_tensors', 'sparse_mask_helper_cuda', '_sparse_addmm', } @@ -359,12 +360,12 @@ """) FW_DERIVATIVE_FORBID_TEMPLATE = CodeTemplate("""\ -TORCH_CHECK_NOT_IMPLEMENTED(!(${cond}), "Trying to use forward AD with ${msg} that does not support it."); +TORCH_CHECK_NOT_IMPLEMENTED(!(${cond}), "Trying to use forward AD with ${name} that does not support it ${msg}"); """) FW_DERIVATIVE_FORBID_LIST_TEMPLATE = CodeTemplate("""\ for (const auto& _t: ${arg}) { - TORCH_CHECK_NOT_IMPLEMENTED(!(${cond}), "Trying to use forward AD with ${msg} that does not support it."); + TORCH_CHECK_NOT_IMPLEMENTED(!(${cond}), "Trying to use forward AD with ${name} that does not support it ${msg}"); } """) @@ -952,9 +953,11 @@ def emit_fw_derivatives() -> List[str]: def emit_forbid_fw_derivatives(is_out_fn: bool = False) -> str: def get_msg() -> str: if is_out_fn: - msg = name + " (because it is an out= function)" + msg = "because it is an out= function" else: - msg = name + msg = ("because it has not been implemented yet.\\nPlease file an issue " + "to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml " + "so that we can prioritize its implementation.") return msg res = "" to_check: List[str] = [] @@ -964,13 +967,13 @@ def get_msg() -> str: to_check.append(FW_DERIVATIVE_CHECK_TEMPLATE.substitute(req_inp=inp.name)) elif is_tensor_list_type(inp.type): cond = FW_DERIVATIVE_CHECK_TEMPLATE.substitute(req_inp="_t") - res += FW_DERIVATIVE_FORBID_LIST_TEMPLATE.substitute(arg=inp.name, cond=cond, msg=get_msg()) + res += FW_DERIVATIVE_FORBID_LIST_TEMPLATE.substitute(arg=inp.name, cond=cond, name=name, msg=get_msg()) else: raise RuntimeError(f'Unsupported input type for "{name}" when forbidding forward AD usage.') if len(to_check) > 0: cond = " || ".join(to_check) - res += FW_DERIVATIVE_FORBID_TEMPLATE.substitute(cond=cond, msg=get_msg()) + res += FW_DERIVATIVE_FORBID_TEMPLATE.substitute(cond=cond, name=name, msg=get_msg()) return res body: List[str] = [] diff --git a/tools/autograd/templates/python_variable_methods.cpp b/tools/autograd/templates/python_variable_methods.cpp index c2e3c41746219c..95f8d3fafc119d 100644 --- a/tools/autograd/templates/python_variable_methods.cpp +++ b/tools/autograd/templates/python_variable_methods.cpp @@ -541,6 +541,28 @@ static PyObject * THPVariable_xpu(PyObject* self, PyObject* args, PyObject* kwar END_HANDLE_TH_ERRORS } +static PyObject * THPVariable_ipu(PyObject* self, PyObject* args, PyObject* kwargs) +{ + HANDLE_TH_ERRORS + static PythonArgParser parser({ + "ipu(Device? device=None, bool non_blocking=False, *, MemoryFormat? memory_format=None)", + "ipu(Device? device=None, bool async=False, *, MemoryFormat? 
memory_format=None)|deprecated" + }); + auto& self_ = THPVariable_Unpack(self); + ParsedArgs<3> parsed_args; + auto r = parser.parse(self, args, kwargs, parsed_args); + + if (r.has_torch_function()) { + return handle_torch_function(r, self, args, kwargs, THPVariableClass, "torch.Tensor"); + } + + auto device = r.isNone(0) ? at::Device(at::DeviceType::IPU) : r.device(0); + auto opt_memory_format = r.memoryformatOptional(2); + TORCH_CHECK(device.is_ipu(), "Invalid device, must be ipu device"); + return THPVariable_Wrap(dispatch_to(self_, device, r.toBool(1), false, opt_memory_format)); + END_HANDLE_TH_ERRORS +} + static PyObject * THPVariable_to_type(PyObject* self, ScalarType scalarType, c10::optional optional_memory_format) { HANDLE_TH_ERRORS auto& self_ = THPVariable_Unpack(self); @@ -1205,6 +1227,7 @@ PyMethodDef variable_methods[] = { {"cpu", castPyCFunctionWithKeywords(THPVariable_cpu), METH_VARARGS | METH_KEYWORDS, NULL}, {"cuda", castPyCFunctionWithKeywords(THPVariable_cuda), METH_VARARGS | METH_KEYWORDS, NULL}, {"xpu", castPyCFunctionWithKeywords(THPVariable_xpu), METH_VARARGS | METH_KEYWORDS, NULL}, + {"ipu", castPyCFunctionWithKeywords(THPVariable_ipu), METH_VARARGS | METH_KEYWORDS, NULL}, {"data_ptr", THPVariable_data_ptr, METH_NOARGS, NULL}, {"dim", THPVariable_dim, METH_NOARGS, NULL}, {"has_names", THPVariable_has_names, METH_NOARGS, NULL}, diff --git a/tools/bazel.bzl b/tools/bazel.bzl index 3589d09df314d3..edb99f898d267b 100644 --- a/tools/bazel.bzl +++ b/tools/bazel.bzl @@ -3,6 +3,13 @@ load("@rules_cuda//cuda:defs.bzl", "requires_cuda_enabled") load("//c10/macros:cmake_configure_file.bzl", "cmake_configure_file") load("//tools/config:defs.bzl", "if_cuda") +def _py_library(name, **kwds): + deps = [dep for dep in kwds.pop("deps", []) if dep != None] + native.py_library(name = name, deps = deps, **kwds) + +def _requirement(_pypi_project): + return None + # Rules implementation for the Bazel build system. Since the common # build structure aims to replicate Bazel as much as possible, most of # the rules simply forward to the Bazel definitions. 
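The `tools/bazel.bzl` hunk above adds `_py_library` and `_requirement` shims so that shared `build.bzl` files can declare pip dependencies which simply disappear in the open-source Bazel build. Below is a minimal sketch, in plain Python standing in for Starlark and with a hypothetical `pyyaml` requirement, of how the two shims compose; the real shim forwards to `native.py_library` instead of returning a dict.

```python
# Sketch only: plain Python mimicking the Starlark shims added in tools/bazel.bzl.

def _requirement(_pypi_project):
    # Pip requirements are not resolved in the OSS Bazel build, so they map to None.
    return None

def _py_library(name, **kwds):
    # Drop the None placeholders produced by _requirement before forwarding.
    deps = [dep for dep in kwds.pop("deps", []) if dep is not None]
    return {"name": name, "deps": deps, **kwds}

# A shared build.bzl target can then mix pip and in-tree deps freely:
target = _py_library(
    "codegen",
    deps=[_requirement("pyyaml"), "//tools/autograd"],  # hypothetical deps
)
assert target["deps"] == ["//tools/autograd"]  # the pip dep was filtered out
```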
@@ -14,6 +21,9 @@ rules = struct( filegroup = native.filegroup, glob = native.glob, if_cuda = if_cuda, + py_binary = native.py_binary, + py_library = _py_library, + requirement = _requirement, requires_cuda_enabled = requires_cuda_enabled, select = select, test_suite = native.test_suite, diff --git a/tools/build_variables.bzl b/tools/build_variables.bzl index c957ec6cb17e51..81c09f23a9dd7d 100644 --- a/tools/build_variables.bzl +++ b/tools/build_variables.bzl @@ -42,21 +42,33 @@ GENERATED_CPP = [ "autograd/generated/python_variable_methods.cpp", ] +# This is duplicated in caffe2/CMakeLists.txt for now and not yet used in buck +GENERATED_LAZY_TS_CPP = [ + "lazy/generated/LazyNativeFunctions.cpp", + "lazy/generated/RegisterAutogradLazy.cpp", + "lazy/generated/RegisterLazy.cpp", +] + # NVFuser runtime library libtorch_nvfuser_runtime_sources = [ + "torch/csrc/jit/codegen/cuda/runtime/array.cu", "torch/csrc/jit/codegen/cuda/runtime/bf16_support.cu", "torch/csrc/jit/codegen/cuda/runtime/block_reduction.cu", "torch/csrc/jit/codegen/cuda/runtime/block_sync_atomic.cu", "torch/csrc/jit/codegen/cuda/runtime/block_sync_default.cu", "torch/csrc/jit/codegen/cuda/runtime/broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu", + "torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu", "torch/csrc/jit/codegen/cuda/runtime/helpers.cu", "torch/csrc/jit/codegen/cuda/runtime/index_utils.cu", + "torch/csrc/jit/codegen/cuda/runtime/tensorcore.cu", "torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu", "torch/csrc/jit/codegen/cuda/runtime/tensor.cu", + "torch/csrc/jit/codegen/cuda/runtime/tuple.cu", + "torch/csrc/jit/codegen/cuda/runtime/type_traits.cu", "torch/csrc/jit/codegen/cuda/runtime/welford.cu", "torch/csrc/jit/codegen/cuda/runtime/warp.cu", "aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh", @@ -148,6 +160,7 @@ libtorch_profiler_sources = [ "torch/csrc/autograd/profiler_legacy.cpp", "torch/csrc/autograd/profiler_kineto.cpp", "torch/csrc/profiler/api.cpp", + "torch/csrc/profiler/collection.cpp", "torch/csrc/profiler/kineto_shim.cpp", "torch/csrc/profiler/nvtx_observer.cpp", "torch/csrc/monitor/counters.cpp", @@ -239,6 +252,7 @@ core_sources_full_mobile_no_backend_interface = [ "torch/csrc/jit/passes/constant_propagation.cpp", "torch/csrc/jit/passes/restore_mutation.cpp", "torch/csrc/jit/passes/create_autodiff_subgraphs.cpp", + "torch/csrc/jit/passes/cuda_graph_fuser.cpp", "torch/csrc/jit/passes/dead_code_elimination.cpp", "torch/csrc/jit/passes/eliminate_no_ops.cpp", "torch/csrc/jit/passes/remove_redundant_profiles.cpp", @@ -320,11 +334,14 @@ core_sources_full_mobile_no_backend_interface = [ "torch/csrc/jit/runtime/interpreter/preprocess_graph.cpp", "torch/csrc/jit/runtime/interpreter.cpp", "torch/csrc/jit/runtime/logging.cpp", + "torch/csrc/jit/runtime/simple_graph_executor_impl.cpp", "torch/csrc/jit/runtime/profiling_graph_executor_impl.cpp", "torch/csrc/jit/runtime/profiling_record.cpp", "torch/csrc/jit/runtime/script_profile.cpp", "torch/csrc/jit/runtime/symbolic_script.cpp", "torch/csrc/jit/runtime/symbolic_shape_registry.cpp", + "torch/csrc/jit/runtime/decomposition_registry.cpp", + "torch/csrc/jit/runtime/decomposition_registry_util.cpp", "torch/csrc/jit/runtime/symbolic_shape_registry_util.cpp", "torch/csrc/jit/runtime/jit_trace.cpp", "torch/csrc/jit/serialization/callstack_debug_info_serialization.cpp", @@ -341,6 
+358,7 @@ core_sources_full_mobile_no_backend_interface = [ "torch/csrc/jit/tensorexpr/cpp_codegen.cpp", "torch/csrc/jit/tensorexpr/eval.cpp", "torch/csrc/jit/tensorexpr/expr.cpp", + "torch/csrc/jit/tensorexpr/external_functions_core.cpp", "torch/csrc/jit/tensorexpr/external_functions_registry.cpp", "torch/csrc/jit/tensorexpr/graph_opt.cpp", "torch/csrc/jit/tensorexpr/hash_provider.cpp", @@ -402,6 +420,7 @@ lazy_tensor_core_sources = [ "torch/csrc/lazy/backend/lowering_context.cpp", "torch/csrc/lazy/core/config.cpp", "torch/csrc/lazy/core/debug_util.cpp", + "torch/csrc/lazy/core/dynamic_ir.cpp", "torch/csrc/lazy/core/hash.cpp", "torch/csrc/lazy/core/helpers.cpp", "torch/csrc/lazy/core/ir.cpp", @@ -432,6 +451,9 @@ lazy_tensor_core_sources = [ "torch/csrc/lazy/core/view_ops/unsqueeze.cpp", "torch/csrc/lazy/core/view_ops/select_view_update.cpp", "torch/csrc/lazy/core/view_ops/view.cpp", + # We should better segment the sources, but for now there are actually dependencies + # from some core files on some of these ts_backend files + # so we continue to build these parts of ts_backend in all build configs "torch/csrc/lazy/ts_backend/config.cpp", "torch/csrc/lazy/ts_backend/ops/arithmetic_ir_ops.cpp", "torch/csrc/lazy/ts_backend/ops/cast.cpp", @@ -442,6 +464,20 @@ lazy_tensor_core_sources = [ "torch/csrc/lazy/ts_backend/ts_node.cpp", ] +# We can't build all of the ts backend under certain build configurations, e.g. mobile, +# since it depends on things like autograd, meta functions, which may be disabled +lazy_tensor_ts_sources = [ + "torch/csrc/lazy/ts_backend/ops/batch_norm_ops.cpp", + "torch/csrc/lazy/ts_backend/ops/random_ops.cpp", + "torch/csrc/lazy/ts_backend/ts_autograd_functions.cpp", + "torch/csrc/lazy/ts_backend/ts_backend_impl.cpp", + "torch/csrc/lazy/ts_backend/ts_lowering_context.cpp", + "torch/csrc/lazy/ts_backend/ts_native_functions.cpp", + "torch/csrc/lazy/ts_backend/ts_node_lowering.cpp", + "torch/csrc/lazy/ts_backend/tensor_aten_ops.cpp", + "torch/csrc/lazy/ts_backend/ts_eager_fallback.cpp", +] + lazy_tensor_core_python_sources = [ "torch/csrc/lazy/python/init.cpp", "torch/csrc/lazy/python/python_util.cpp", @@ -639,6 +675,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/compute_at.cpp", "torch/csrc/jit/codegen/cuda/compute_at_map.cpp", "torch/csrc/jit/codegen/cuda/codegen.cpp", + "torch/csrc/jit/codegen/cuda/contiguity.cpp", "torch/csrc/jit/codegen/cuda/dispatch.cpp", "torch/csrc/jit/codegen/cuda/expr_evaluator.cpp", "torch/csrc/jit/codegen/cuda/executor.cpp", @@ -669,8 +706,10 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/lower_allocation.cpp", "torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp", "torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp", + "torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp", "torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp", "torch/csrc/jit/codegen/cuda/lower_index.cpp", + "torch/csrc/jit/codegen/cuda/lower_index_hoist.cpp", "torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp", "torch/csrc/jit/codegen/cuda/lower_loops.cpp", "torch/csrc/jit/codegen/cuda/lower_magic_zero.cpp", @@ -678,6 +717,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/lower_predicate.cpp", "torch/csrc/jit/codegen/cuda/lower_replace_size.cpp", "torch/csrc/jit/codegen/cuda/lower_shift.cpp", + "torch/csrc/jit/codegen/cuda/lower_sync_information.cpp", "torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp", "torch/csrc/jit/codegen/cuda/lower_trivial_broadcast.cpp", "torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp", 
@@ -716,6 +756,8 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/transform_view.cpp", "torch/csrc/jit/codegen/cuda/type.cpp", "torch/csrc/jit/codegen/cuda/utils.cpp", + "torch/csrc/jit/codegen/cuda/mma_type.cpp", + "torch/csrc/jit/codegen/cuda/scheduler/mma_utils.cpp", "torch/csrc/jit/passes/frozen_conv_add_relu_fusion_cuda.cpp", "torch/csrc/jit/tensorexpr/cuda_codegen.cpp", "torch/csrc/jit/runtime/register_cuda_ops.cpp", @@ -873,6 +915,7 @@ libtorch_python_core_sources = [ "torch/csrc/jit/passes/onnx/remove_inplace_ops_for_onnx.cpp", "torch/csrc/jit/passes/onnx/shape_type_inference.cpp", "torch/csrc/jit/passes/onnx/function_extraction.cpp", + "torch/csrc/jit/passes/onnx/onnx_log.cpp", "torch/csrc/jit/python/pybind_utils.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/common.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/pattern_encapsulation.cpp", @@ -981,6 +1024,7 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/ParallelNativeTBB.cpp", "aten/src/ATen/ParallelOpenMP.cpp", "aten/src/ATen/ParallelThreadPoolNative.cpp", + "aten/src/ATen/PythonTorchFunctionTLS.cpp", "aten/src/ATen/ScalarOps.cpp", "aten/src/ATen/SequenceNumber.cpp", "aten/src/ATen/SparseTensorImpl.cpp", @@ -1159,7 +1203,7 @@ aten_native_source_non_codegen_list = [ "aten/src/ATen/native/quantized/cpu/qconcat.cpp", "aten/src/ATen/native/quantized/cpu/qconv.cpp", "aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp", - "aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp", + "aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp", "aten/src/ATen/native/quantized/cpu/qelu.cpp", "aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp", "aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp", @@ -1171,7 +1215,7 @@ aten_native_source_non_codegen_list = [ "aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp", "aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp", "aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp", - "aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp", + "aten/src/ATen/native/quantized/cpu/qlinear_unpack_impl.cpp", "aten/src/ATen/native/quantized/cpu/qmatmul.cpp", "aten/src/ATen/native/quantized/cpu/qmul.cpp", "aten/src/ATen/native/quantized/cpu/qnormalization.cpp", @@ -1195,6 +1239,9 @@ aten_native_source_non_codegen_list = [ "aten/src/ATen/native/quantized/fake_quant_per_channel_affine.cpp", "aten/src/ATen/native/quantized/fake_quant_per_tensor_affine.cpp", "aten/src/ATen/native/quantized/library.cpp", + "aten/src/ATen/native/quantized/cpu/ruy_utils.cpp", + "aten/src/ATen/native/quantized/cpu/xnnpack_utils.cpp", + "aten/src/ATen/native/quantized/qlinear_unpack.cpp", "aten/src/ATen/quantized/QTensorImpl.cpp", "aten/src/ATen/quantized/Quantizer.cpp", "aten/src/ATen/native/Activation.cpp", @@ -1214,7 +1261,7 @@ aten_native_source_non_codegen_list = [ "aten/src/ATen/native/CPUBlas.cpp", "aten/src/ATen/native/ChanelShuffle.cpp", "aten/src/ATen/native/Col2Im.cpp", - "aten/src/ATen/native/ConstantPadNd.cpp", + "aten/src/ATen/native/PadNd.cpp", "aten/src/ATen/native/Convolution.cpp", "aten/src/ATen/native/ConvolutionMM2d.cpp", "aten/src/ATen/native/ConvolutionMM3d.cpp", diff --git a/tools/code_coverage/README.md b/tools/code_coverage/README.md index 6e83dc593ed155..67adb445d053d7 100644 --- a/tools/code_coverage/README.md +++ b/tools/code_coverage/README.md @@ -3,7 +3,7 @@ ## Overview This tool is designed for calculating code coverage for Pytorch project. -It’s an integrated tool. 
You can use this tool to run and generate both file-level and line-level report for C++ and Python tests. It will also be the tool we use in *CircleCI* to generate report for each master commit. +It’s an integrated tool. You can use this tool to run and generate both file-level and line-level report for C++ and Python tests. It will also be the tool we use in *CircleCI* to generate report for each main commit. ### Simple * *Simple command to run:* @@ -30,11 +30,11 @@ This part will introduce about the arguments you can use when run this tool. The We have two different compilers, `gcc` and `clang`, and this tool supports both. But it is recommended to use `gcc` because it's much faster and use less disk place. The examples will also be divided to two parts, for `gcc` and `clang`. ## Preparation -The first step is to [build *Pytorch* from source](https://github.com/pytorch/pytorch#from-source) with `CODE_COVERAGE` option `ON`. You may also want to set `BUILD_TEST` option `ON` to get the test binaries. Besides, if you are under `gcc` compiler, to get accurate result, it is recommended to also select `CMAKE_BUILD_CONFIG=Debug`. +The first step is to [build *Pytorch* from source](https://github.com/pytorch/pytorch#from-source) with `USE_CPP_CODE_COVERAGE` option `ON`. You may also want to set `BUILD_TEST` option `ON` to get the test binaries. Besides, if you are under `gcc` compiler, to get accurate result, it is recommended to also select `CMAKE_BUILD_TYPE=Debug`. See: [how to adjust build options](https://github.com/pytorch/pytorch#adjust-build-options-optional) for reference. Following is one way to adjust build option: ``` # in build/ folder (all build artifacts must in `build/` folder) -cmake .. -DCODE_COVERAGE=ON -DBUILD_TEST=ON -DCMAKE_BUILD_CONFIG=Debug +cmake .. -DUSE_CPP_CODE_COVERAGE=ON -DBUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug ``` @@ -53,7 +53,7 @@ python oss_coverage.py --run-only=atest ``` This command will run `atest` binary in `build/bin/` folder and generate reoports over the entire *Pytorch* folder. You can find the reports in `profile/summary`. But you may only be interested in the `aten` folder, in this case, try: ``` -python oss_coverage.py --run-only=atest --interested-only=aten +python oss_coverage.py --run-only=atest --interest-only=aten ``` In *Pytorch*, `c++` tests located in `build/bin/` and `python` tests located in `test/`. If you want to run `python` test, try: ``` @@ -62,7 +62,7 @@ python oss_coverage.py --run-only=test_complex.py You may also want to specify more than one test or interested folder, in this case, try: ``` -python oss_coverage.py --run-only=atest c10_logging_test --interested-only aten/src/Aten c10/core +python oss_coverage.py --run-only=atest c10_logging_test --interest-only aten/src/Aten c10/core ``` That it is! With these two simple options, you can customize many different functionality according to your need. By default, the tool will run all tests in `build/bin` folder (by running all executable binaries in it) and `test/` folder (by running `run_test.py`), and then collect coverage over the entire *Pytorch* folder. If this is what you want, try: @@ -84,9 +84,9 @@ By default all steps will be run, but you can specify only run one of them. Foll `—summary` is useful when you have different interested folder. 
For example, ```bash # after run this command -python oss_coverage.py --run-only=atest --interested-folder=aten +python oss_coverage.py --run-only=atest --interest-only=aten # you may then want to learn atest's coverage over c10, instead of running the test again, you can: -python oss_coverage.py --run-only=atest --interested-folder=c10 --summary +python oss_coverage.py --run-only=atest --interest-only=c10 --summary ``` diff --git a/tools/codegen/BUILD.bazel b/tools/codegen/BUILD.bazel new file mode 100644 index 00000000000000..d1a0db360d230f --- /dev/null +++ b/tools/codegen/BUILD.bazel @@ -0,0 +1,4 @@ +load("//:tools/bazel.bzl", "rules") +load(":build.bzl", "define_targets") + +define_targets(rules = rules) diff --git a/tools/codegen/api/autograd.py b/tools/codegen/api/autograd.py index 64b7547e78f0d3..635ad927e8a221 100644 --- a/tools/codegen/api/autograd.py +++ b/tools/codegen/api/autograd.py @@ -335,9 +335,44 @@ def repl(m: Match[str]) -> str: required_primals = required_primals + ("self",) if required_primals else ("self",) if not is_exact_match: - # Make sure that the forward grad is modified inplace when the original formula - # is out of place - formula = f"self_t_raw.defined() ? self_t_raw.copy_({formula}) : {formula}" + # NOTE [In-place forward AD formula Optimization] + # + # This optimization transforms the formula to directly do inplace, i.e. + # instead of self_t.copy_(self_t.op()) we do self_t.op_() when the following are met: + # + # 1) the formula satisfies the pattern: "self_t.op(*args)" + # 2) "op" in (1) needs to be the same as the op the derivative is for + # + # (2) may seem too strict, but currently the only ops that satisfy (1) also satisfy (2) + # If there is a need, we can relax (2) to allow any op that has an in-place variant + is_single_method_on_self_t = False + match = re.fullmatch(r'self_t.([\w]*)\((.*)\)', formula) + if match: + op_name, between_parens = match.group(1), match.group(2) + + # We want to... + # Match: self_t.op1(other_p.op2(arg)) + # Avoid: self_t.op1(args) + self_t.op2(args) + # Avoid: self_t.op1(other_p.op2(arg)) + self_t.op2(args) + def check_parens_nest_level_gt_zero(s: str) -> bool: + level = 1 + for ch in s: + if ch == ")": + level -= 1 + if level == 0: + return False + if ch == "(": + level += 1 + return True + is_single_method_on_self_t = check_parens_nest_level_gt_zero(between_parens) + directly_do_inplace = is_single_method_on_self_t and op_name == info.name + + if directly_do_inplace: + formula = f"self_t_raw.defined() ? self_t_raw.{op_name}_({between_parens}) : {formula}" + else: + # Make sure that the forward grad is modified inplace when the original formula + # is out of place + formula = f"self_t_raw.defined() ? 
self_t_raw.copy_({formula}) : {formula}" required_original_self_value = bool(re.search(IDENT_REGEX.format("original_self_p"), formula)) diff --git a/tools/codegen/api/cpp.py b/tools/codegen/api/cpp.py index a485fc17acf601..904ab1c486940c 100644 --- a/tools/codegen/api/cpp.py +++ b/tools/codegen/api/cpp.py @@ -6,7 +6,8 @@ MutRefCType, ArrayCType, ListCType, VectorCType, ArrayRefCType, OptionalCType, TupleCType, SpecialArgName, boolT, scalarT, tensorListT, dimnameListT, tensorT, voidT, longT, - BaseTypeToCppMapping, intArrayRefT, tensorOptionsT) + BaseTypeToCppMapping, intArrayRefT, optionalIntArrayRefT, + tensorOptionsT) from tools.codegen import local from tools.codegen.utils import assert_never from typing import Optional, Sequence, Union, List, Set @@ -92,6 +93,8 @@ def argumenttype_type(t: Type, *, mutable: bool, binds: ArgName, remove_non_owni return NamedCType(binds, ConstRefCType(OptionalCType(BaseCType(tensorT)))) elif str(t.elem) == 'Scalar': return NamedCType(binds, ConstRefCType(OptionalCType(BaseCType(scalarT)))) + elif isinstance(t.elem, ListType) and str(t.elem.elem) == 'int': + return NamedCType(binds, BaseCType(optionalIntArrayRefT)) elem = argumenttype_type(t.elem, mutable=mutable, binds=binds) return NamedCType(binds, OptionalCType(elem.type)) elif isinstance(t, ListType): diff --git a/tools/codegen/api/lazy.py b/tools/codegen/api/lazy.py index ebbc72eb1fc000..6c927e62aa9014 100644 --- a/tools/codegen/api/lazy.py +++ b/tools/codegen/api/lazy.py @@ -1,29 +1,29 @@ -from typing import List, Union, Tuple +from typing import List, Union, Tuple, Optional from tools.codegen.model import (Type, BaseTy, BaseType, OptionalType, ListType, OperatorName, FunctionSchema, - Return, TensorOptionsArguments) + Return, TensorOptionsArguments, Argument) from tools.codegen.api.types import (CType, BaseCppType, BaseCType, OptionalCType, NamedCType, deviceT, layoutT, VectorCType, boolT, longT, doubleT, ListCType, stringT, - scalarT, scalarTypeT) + scalarT, scalarTypeT, memoryFormatT) valueT = BaseCppType('torch::lazy', 'Value') - +# this is a bad hack. I need to refactor the data model to represent each arg in the schema as an object, +# making it easier to represent special properties of an arg. +tensorListValueT = BaseCppType('torch::lazy', 'Value') def process_ir_type(typ: Type) -> Union[BaseCType, VectorCType, OptionalCType, ListCType]: """ This function takes a type from NativeFunctions and converts it for use with - lazy tensor codegen. Currently its output is used in several places, and so far - it has been possible for them to all use the same conversions, but that may not be - optimal or possible in the finished system. + lazy tensor codegen. Type conversion for lazy currently consists of - (1) changing Tensor-like things into Value-like things + (1) changing at::Tensors into lazy::Values (2) wrapping everything in a BaseCType - (3) making reference types into values (e.g. vector instead of IntArrayRef) + (3) making cpp-reference types into cpp-value types (e.g. vector instead of IntArrayRef) - (1) converts Tensors to Values since Values are how Lazy IR represents tensors. There - is special handling for Optional[Tensor] or List[Tensor], etc- hence 'tensor-like' + (1) converts at::Tensors to lazy::Values (which wrap lazy::Nodes, with which Lazy IR represents tensors.) 
+ There is special handling for Optional[Tensor] or List[Tensor], etc- hence 'tensor-like' This is incomplete- there are assertions in places that it's expected to need to add more types as the codegen is used with more operators. @@ -33,7 +33,7 @@ def process_ir_type(typ: Type) -> Union[BaseCType, VectorCType, OptionalCType, L return BaseCType(valueT) elif typ.name == BaseTy.Scalar: # at::scalar has special handling, - # and is wrapped in an IR value just like at::tensor + # and is wrapped in an lazy::Value just like at::tensor return BaseCType(valueT) elif typ.name == BaseTy.ScalarType: return BaseCType(scalarTypeT) @@ -49,6 +49,8 @@ def process_ir_type(typ: Type) -> Union[BaseCType, VectorCType, OptionalCType, L return BaseCType(deviceT) elif typ.name == BaseTy.Layout: return BaseCType(layoutT) + elif typ.name == BaseTy.MemoryFormat: + return BaseCType(memoryFormatT) else: raise AssertionError(f"TODO add support for type {repr(typ)}") elif isinstance(typ, OptionalType): @@ -57,6 +59,9 @@ def process_ir_type(typ: Type) -> Union[BaseCType, VectorCType, OptionalCType, L if str(typ.elem) == 'Tensor?': # TODO(whc) is this actually correct? or should it use a Vector like above return ListCType(OptionalCType(BaseCType(valueT))) + elif str(typ.elem) == 'Tensor': + # this is a TensorList which comes in from GetTensorList as a Value + return BaseCType(tensorListValueT) else: return VectorCType(process_ir_type(typ.elem)) else: @@ -74,8 +79,7 @@ def isValueType(typ: CType) -> bool: return typ.type == valueT or typ.type == scalarT elif isinstance(typ, (OptionalCType, ListCType, VectorCType)): return isValueType(typ.elem) - else: - return False + return False def isWrappedScalarType(typ: Type) -> bool: """ @@ -89,45 +93,79 @@ def isWrappedScalarType(typ: Type) -> bool: return typ.name == BaseTy.Scalar elif isinstance(typ, (OptionalType, ListType)): return isWrappedScalarType(typ.elem) - else: - return False + return False + +def isGeneratorType(typ: Type) -> bool: + if isinstance(typ, BaseType): + return typ.name == BaseTy.Generator + elif isinstance(typ, (OptionalType)): + return isGeneratorType(typ.elem) + return False + +class LazyArgument: + name: str + orig_type: Type + lazy_type_: Optional[CType] + is_wrapped_scalar: bool + is_generator: bool + + # true if this argument is or contains a lazy IR value + is_lazy_value: bool + + def __init__(self, arg: Argument): + self.name = arg.name + self.orig_type = arg.type + self.is_generator = isGeneratorType(arg.type) + if self.is_generator: + assert isinstance(arg.type, OptionalType), "We expect all generators are optional since currently they are" + # there is no handling for generators in TorchScript IR (or XLA) + # so we fall back to eager if the (optional)generator has value, and otherwise + # its null and safe to exclude from lazy IR + self.lazy_type_ = None + else: + self.lazy_type_ = process_ir_type(arg.type) + self.is_wrapped_scalar = isWrappedScalarType(arg.type) + self.is_lazy_value = not self.is_generator and isValueType(self.lazy_type) + + @property + def lazy_type(self) -> CType: + assert self.lazy_type_ is not None, f"Attempted to access lazy_type for invalid argument {self.name}" + return self.lazy_type_ # Inspired by a FunctionSchema object, a LazyIrSchema holds the schema of a Lazy IR node. # Unlike a FunctionSchema, it has no round-trippable string form (relating to the YAML), # but carries type information from a native FunctionSchema modified for use with IR nodes, # and preserving original argument names. 
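The new `LazyArgument` wrapper above decides, per schema argument, whether it becomes a lazy IR value, stays a scalar-like C++ value, or is excluded entirely (generators), and `LazyIrSchema` (defined next) stores these wrappers instead of bare `NamedCType`s. The following is a rough, self-contained restatement of that decision table for a few representative argument types; it does not import the codegen modules, and the C++ type strings are illustrative only.

```python
# Illustrative restatement of LazyArgument's classification logic, not the real code.
def classify(schema_type: str) -> dict:
    if schema_type == "Generator?":
        # No lazy IR representation; kept only for fallback/shape inference.
        return {"is_generator": True, "is_lazy_value": False, "lazy_type": None}
    mapping = {
        "Tensor": "torch::lazy::Value",
        "Scalar": "torch::lazy::Value",   # wrapped scalar
        "Tensor[]": "torch::lazy::Value",  # the tensorListValueT hack noted above
        "Tensor?[]": "c10::List<c10::optional<torch::lazy::Value>>",
        "int[]": "std::vector<int64_t>",   # cpp-reference type made owning
        "MemoryFormat?": "c10::optional<at::MemoryFormat>",
    }
    lazy_type = mapping[schema_type]
    return {
        "is_generator": False,
        "is_lazy_value": "lazy::Value" in lazy_type,
        "lazy_type": lazy_type,
    }

assert classify("Generator?")["lazy_type"] is None
assert classify("Tensor")["is_lazy_value"]
assert not classify("int[]")["is_lazy_value"]
```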
- - class LazyIrSchema: # The name of the operator this function schema describes. name: 'OperatorName' - positional_arg_types: Tuple[NamedCType, ...] - keyword_arg_types: Tuple[NamedCType, ...] + positional_args: Tuple[LazyArgument, ...] + keyword_args: Tuple[LazyArgument, ...] # TODO: Need to handle collisions with argument names at some point returns: Tuple['Return', ...] - wrapped_scalar_names: List[str] + # if this schema has a Generator arg, list its orig ctype/name but don't + # build a LazyArgument since lazy IR doesn't support it + generator_arg: Optional[NamedCType] = None def __init__(self, func: FunctionSchema): - positional_arg_types = [] + positional_args = [] for arg_field in ["pre_self_positional", "self_arg", "post_self_positional"]: if arg_field == "self_arg" and func.arguments.self_arg is not None: arg = getattr(func.arguments, "self_arg").argument - positional_arg_types.append(NamedCType(arg.name, process_ir_type(arg.type))) + positional_args.append(LazyArgument(arg)) elif getattr(func.arguments, arg_field) is not None: - positional_arg_types.extend([ - NamedCType( - arg.name, - process_ir_type(arg.type)) for arg in getattr(func.arguments, arg_field)]) - self.positional_arg_types = tuple(positional_arg_types) + positional_args.extend([ + LazyArgument(arg) for arg in getattr(func.arguments, arg_field)]) + self.positional_args = tuple(positional_args) - keyword_arg_types = [] + keyword_args = [] for arg_field in ["pre_tensor_options_kwarg_only", "tensor_options", "post_tensor_options_kwarg_only", @@ -136,11 +174,14 @@ def __init__(self, func: FunctionSchema): if curr_args is not None: if isinstance(curr_args, TensorOptionsArguments): curr_args = curr_args.all() - keyword_arg_types.extend([NamedCType(arg.name, process_ir_type(arg.type)) for arg in curr_args]) - self.keyword_arg_types = tuple(keyword_arg_types) + for arg in curr_args: + if isGeneratorType(arg.type): + assert self.generator_arg is None, "We expect there is only one generator arg" + self.generator_arg = NamedCType(arg.name, arg.type) + keyword_args.extend([LazyArgument(arg) for arg in curr_args]) + self.keyword_args = tuple(keyword_args) self.name = func.name self.returns = func.returns - self.wrapped_scalar_names = [arg.name for arg in func.schema_order_arguments() if isWrappedScalarType(arg.type)] @property def node_name(self) -> str: @@ -162,36 +203,42 @@ def aten_name(self) -> str: def base_name(self) -> str: return f"{self.name.name.base}" - def filtered_types(self, positional: bool = True, keyword: bool = True, - values: bool = True, scalars: bool = True) -> List[NamedCType]: - types: List[NamedCType] = [] + def filtered_args(self, positional: bool = True, keyword: bool = True, + values: bool = True, scalars: bool = True, generator: bool = False) -> List[LazyArgument]: + # This function maintains the sorted order of arguments but provides different filtered views. + # Some parts of the code care about kwargs vs args (TS lowerings), + # other parts care about whether they need to wrap the arg in a lazy value or leave it alone. + # Generators are special cased, as they are needed for fallback/shape-inference but not supported + # in TS lowerings and therefore also omitted from lazy IR. 
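`filtered_args`, whose implementation follows below, replaces the old `filtered_types` and provides the different views the TS lowerings and IR codegen need over one ordered argument list. As a concrete illustration, for a hypothetical schema like `normal(Tensor mean, float std, *, Generator? generator=None)` the convenience properties would split the arguments roughly as in this toy sketch (not the real codegen objects):

```python
# Toy illustration of the views filtered_args provides.
from dataclasses import dataclass

@dataclass
class Arg:
    name: str
    keyword: bool
    is_lazy_value: bool
    is_generator: bool = False

args = [
    Arg("mean", keyword=False, is_lazy_value=True),   # Tensor -> lazy::Value
    Arg("std", keyword=False, is_lazy_value=False),   # float stays a scalar
    Arg("generator", keyword=True, is_lazy_value=False, is_generator=True),
]

positional_values  = [a.name for a in args if not a.keyword and a.is_lazy_value]
positional_scalars = [a.name for a in args if not a.keyword and not a.is_lazy_value and not a.is_generator]
keyword_scalars    = [a.name for a in args if a.keyword and not a.is_lazy_value and not a.is_generator]

assert positional_values == ["mean"]
assert positional_scalars == ["std"]
assert keyword_scalars == []  # generators are omitted unless generator=True is requested
```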
+ args: List[LazyArgument] = [] if positional: - types.extend(self.positional_arg_types) + args.extend(self.positional_args) if keyword: - types.extend(self.keyword_arg_types) - - if values and scalars: - return types - - if values: - return [t for t in types if isValueType(t.type)] + args.extend(self.keyword_args) + + if values and scalars and generator: + return args + elif values and scalars: + return [a for a in args if not a.is_generator] + elif values: + return [a for a in args if a.is_lazy_value] elif scalars: - return [t for t in types if not isValueType(t.type)] + return [a for a in args if not a.is_lazy_value and (generator or not a.is_generator)] return [] @property - def positional_values(self) -> List[NamedCType]: - return self.filtered_types(positional=True, keyword=False, values=True, scalars=False) + def positional_values(self) -> List[LazyArgument]: + return self.filtered_args(positional=True, keyword=False, values=True, scalars=False) @property - def positional_scalars(self) -> List[NamedCType]: - return self.filtered_types(positional=True, keyword=False, values=False, scalars=True) + def positional_scalars(self) -> List[LazyArgument]: + return self.filtered_args(positional=True, keyword=False, values=False, scalars=True) @property - def keyword_values(self) -> List[NamedCType]: - return self.filtered_types(positional=False, keyword=True, values=True, scalars=False) + def keyword_values(self) -> List[LazyArgument]: + return self.filtered_args(positional=False, keyword=True, values=True, scalars=False) @property - def keyword_scalars(self) -> List[NamedCType]: - return self.filtered_types(positional=False, keyword=True, values=False, scalars=True) + def keyword_scalars(self) -> List[LazyArgument]: + return self.filtered_args(positional=False, keyword=True, values=False, scalars=True) diff --git a/tools/codegen/api/python.py b/tools/codegen/api/python.py index 6c362cb87387b3..759f7e504aab3f 100644 --- a/tools/codegen/api/python.py +++ b/tools/codegen/api/python.py @@ -188,29 +188,6 @@ class PythonReturns: returns: Tuple[Return, ...] 
- def named_tuple_pyi(self) -> Optional[Tuple[str, str]]: - python_returns = [argument_type_str_pyi(r.type) for r in self.returns] - field_names = namedtuple_fieldnames(self.returns) - if field_names: - namedtuple_name = '_'.join(['namedtuple'] + field_names) - tuple_args = [f'("{name}", {typ})' for name, typ in zip(field_names, python_returns)] - namedtuple_def = f'NamedTuple("{namedtuple_name}", [{", ".join(tuple_args)}])' - return namedtuple_name, namedtuple_def - return None - - def returns_str_pyi(self) -> str: - named_tuple = self.named_tuple_pyi() - if named_tuple is not None: - namedtuple_name, _ = named_tuple - return namedtuple_name - - python_returns = [argument_type_str_pyi(r.type) for r in self.returns] - if len(python_returns) > 1: - return 'Tuple[' + ', '.join(python_returns) + ']' - if len(python_returns) == 1: - return python_returns[0] - return 'None' - @dataclass(frozen=True) class PythonArgument: @@ -399,7 +376,7 @@ def signature_str_pyi(self, *, skip_outputs: bool = False) -> str: schema_formals.insert(positional_argc, '*') # only pyi signatures include returns - returns_str = self.returns.returns_str_pyi() + returns_str = returns_str_pyi(self) # pyi also includes self (with no typing/defaults) for methods if self.method: schema_formals.insert(0, "self") @@ -425,7 +402,7 @@ def signature_str_pyi_vararg(self, *, skip_outputs: bool = False) -> Optional[st # vararg signatures also omit the asterix schema_formals[0] = '*' + args[0].name + ': _int' - returns_str = self.returns.returns_str_pyi() + returns_str = returns_str_pyi(self) # pyi also includes self (with no typing/defaults) for methods if self.method: schema_formals.insert(0, "self") @@ -465,7 +442,7 @@ def signature_str_pyi(self, *, skip_outputs: bool = False) -> str: if len(schema_formals) > positional_argc: schema_formals.insert(positional_argc, '*') - returns_str = self.returns.returns_str_pyi() + returns_str = returns_str_pyi(self) return f'def {self.name}({", ".join(schema_formals)}) -> {returns_str}: ...' 
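The `python.py` change above (continued below) moves return-type rendering for the `.pyi` stubs out of `PythonReturns`; with the new `returns_str_pyi`, named-tuple returns are annotated with the structseq types under `torch.return_types` rather than ad-hoc `namedtuple_*` names, matching the updated `reveal_type` expectations at the top of this section. A hedged sketch of the before/after stub output, plus a quick runtime check assuming a PyTorch build that exposes `torch.return_types`:

```python
# Illustrative only: the shape of the generated .pyi annotation, before vs. after.
#
#   before:  def sort(self, ...) -> namedtuple_values_indices: ...
#   after:   def sort(self, ...) -> torch.return_types.sort: ...
#
# Quick runtime sanity check:
import torch

out = torch.randn(4).sort()
print(type(out))                          # <class 'torch.return_types.sort'>
print(out.values.shape, out.indices.shape)
```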
def signature_str_pyi_vararg(self, *, skip_outputs: bool = False) -> Optional[str]: @@ -594,7 +571,7 @@ def argument_type_str(t: Type, *, simple_type: bool = False) -> str: elif t.name in [BaseTy.bool, BaseTy.QScheme, BaseTy.Scalar, BaseTy.ScalarType, BaseTy.Generator, BaseTy.Storage, BaseTy.Layout, BaseTy.Device, BaseTy.MemoryFormat, - BaseTy.Dimname, BaseTy.Stream, BaseTy.ConstQuantizerPtr]: + BaseTy.Dimname, BaseTy.Stream, BaseTy.ConstQuantizerPtr, BaseTy.SymInt]: # These python schema type names line up with their function schema names return t.name.name @@ -777,6 +754,8 @@ def argument_type_str_pyi(t: Type) -> str: if isinstance(t, BaseType): if t.name == BaseTy.int: ret = '_int' + if t.name == BaseTy.SymInt: + ret = 'SymInt' elif t.name == BaseTy.float: ret = '_float' elif t.name == BaseTy.str: @@ -826,6 +805,51 @@ def argument_type_str_pyi(t: Type) -> str: raise RuntimeError(f'unrecognized type {repr(t)}') +def return_type_str_pyi(t: Type) -> str: + # Where arguments are open to accepting Union, return types should return + # concrete types + + if isinstance(t, OptionalType): + inner = return_type_str_pyi(t.elem) + return f"Optional[{inner}]" + + if isinstance(t, BaseType): + if t.name == BaseTy.Device: + return '_device' + elif t.name == BaseTy.Dimname: + ret = 'Optional[str]' + else: + return argument_type_str_pyi(t) + + if isinstance(t, ListType): + inner = return_type_str_pyi(t.elem) + return f"List[{inner}]" + + return argument_type_str_pyi(t) + +def returns_named_tuple_pyi(signature: PythonSignature) -> Optional[Tuple[str, str]]: + python_returns = [return_type_str_pyi(r.type) for r in signature.returns.returns] + namedtuple_name = signature.name + field_names = namedtuple_fieldnames(signature.returns.returns) + if field_names: + tuple_args = [f'("{name}", {typ})' for name, typ in zip(field_names, python_returns)] + namedtuple_def = f'NamedTuple("{namedtuple_name}", [{", ".join(tuple_args)}])' + return namedtuple_name, namedtuple_def + return None + +def returns_str_pyi(signature: PythonSignature) -> str: + field_names = namedtuple_fieldnames(signature.returns.returns) + if field_names: + return f"torch.return_types.{signature.name}" + + python_returns = [return_type_str_pyi(r.type) for r in signature.returns.returns] + if len(python_returns) > 1: + return 'Tuple[' + ', '.join(python_returns) + ']' + if len(python_returns) == 1: + return python_returns[0] + return 'None' + + # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # # # C++ Function Dispatch @@ -919,6 +943,7 @@ def dispatch_lambda_arg(cpp_arg: Binding) -> DispatchLambdaArgument: '::std::tuple', '::std::tuple', '::std::tuple', + '::std::tuple>', '::std::vector', 'at::Scalar', 'bool', 'int64_t', 'void*', 'void', 'at::QScheme', 'double', @@ -1011,6 +1036,8 @@ def arg_parser_unpack_method(t: Type, has_default: bool) -> str: return 'deviceWithDefault' if has_default else 'device' elif t.name == BaseTy.int: return 'toInt64' + elif t.name == BaseTy.SymInt: + return 'toSymInt' elif t.name == BaseTy.bool: return 'toBool' elif t.name == BaseTy.float: diff --git a/tools/codegen/api/structured.py b/tools/codegen/api/structured.py index a8c714a293f44f..b12a092a49f0a5 100644 --- a/tools/codegen/api/structured.py +++ b/tools/codegen/api/structured.py @@ -5,7 +5,8 @@ from tools.codegen.api.types import (ArgName, BaseCType, Binding, ArrayRefCType, ConstRefCType, OptionalCType, NamedCType, tensorT, scalarT, intArrayRefT, dimnameListT, - optionalTensorRefT, optionalScalarRefT) + optionalTensorRefT, 
optionalScalarRefT, + optionalIntArrayRefT, iTensorListRefT) from tools.codegen.api import cpp from tools.codegen.utils import assert_never @@ -37,15 +38,13 @@ def argumenttype_type(t: Type, *, mutable: bool, binds: ArgName) -> NamedCType: return NamedCType(binds, BaseCType(optionalTensorRefT)) elif t.elem == BaseType(BaseTy.Scalar): return NamedCType(binds, BaseCType(optionalScalarRefT)) + elif isinstance(t.elem, ListType) and str(t.elem.elem) == 'int': + return NamedCType(binds, BaseCType(optionalIntArrayRefT)) elem = argumenttype_type(t.elem, mutable=mutable, binds=binds) return NamedCType(binds, OptionalCType(elem.type)) elif isinstance(t, ListType): if t.elem == BaseType(BaseTy.Tensor): - raise AssertionError( - "list of tensor not supported by structured yet; to implement this " - "resolve torch::List issue, see " - "https://fb.workplace.com/groups/894363187646754/permalink/1149276442155426" - ) + return NamedCType(binds, BaseCType(iTensorListRefT)) # TODO: delete these special cases; see tools.codegen.api.cpp--these # must be changed in tandem, but there are problems; see # https://github.com/pytorch/pytorch/pull/51485 diff --git a/tools/codegen/api/translate.py b/tools/codegen/api/translate.py index 53919136ba6bdc..aea12852ea22e7 100644 --- a/tools/codegen/api/translate.py +++ b/tools/codegen/api/translate.py @@ -1,12 +1,12 @@ from typing import Dict, Sequence, List, NoReturn, Union -from tools.codegen.api.types import (BaseCType, Binding, ConstRefCType, +from tools.codegen.api.types import (tensorListT, BaseCType, Binding, ConstRefCType, Expr, MutRefCType, OptionalCType, NamedCType, SpecialArgName, tensorT, memoryFormatT, tensorOptionsT, scalarTypeT, boolT, deviceT, layoutT, optionalTensorRefT, - scalarT, optionalScalarRefT, + iTensorListRefT, scalarT, optionalScalarRefT, VectorCType, longT, intArrayRefT, - scalar_t, opmath_t) + scalar_t, opmath_t, optionalIntArrayRefT) # This file implements a small program synthesis engine that implements # conversions between one API to another. 
@@ -39,6 +39,7 @@ options_ctype = NamedCType("options", ConstRefCType(BaseCType(tensorOptionsT))) longVec_ctype = VectorCType(BaseCType(longT)) +optionalLongVec_ctype = OptionalCType(VectorCType(BaseCType(longT))) optionalScalar_ctype = OptionalCType(BaseCType(scalarT)) optionalTensor_ctype = OptionalCType(BaseCType(tensorT)) @@ -141,6 +142,10 @@ def translate( if t.type == BaseCType(scalar_t): ctx[NamedCType(t.name, BaseCType(opmath_t))] = f'static_cast({b.expr})' + # [Note: ITensorListRef] + if t.type == BaseCType(tensorListT): + ctx[NamedCType(t.name, BaseCType(iTensorListRefT))] = f"at::ITensorListRef({b.expr})" + # Add implicit bindings if the generated code is inside a Tensor method if method: ctx[NamedCType("self", MutRefCType(BaseCType(tensorT)))] = "const_cast(*this)" @@ -235,6 +240,8 @@ def direct_solve(goal: NamedCType) -> str: # We can always do translations from value types to reference types, like vector -> IntArrayRef elif goal.type == BaseCType(intArrayRefT): return direct_solve(NamedCType(goal.name, longVec_ctype)) + elif goal.type == BaseCType(optionalIntArrayRefT): + return direct_solve(NamedCType(goal.name, optionalLongVec_ctype)) elif goal.type == BaseCType(optionalScalarRefT): return direct_solve(NamedCType(goal.name, optionalScalar_ctype)) elif goal.type == BaseCType(optionalTensorRefT): @@ -254,6 +261,10 @@ def direct_solve(goal: NamedCType) -> str: intArrayRef_ctype = NamedCType(goal.name, BaseCType(intArrayRefT)) argname = direct_solve(intArrayRef_ctype) return f'{argname}.vec()' + elif goal.type == OptionalCType(VectorCType(BaseCType(longT))): + optionalIntArrayRef_ctype = NamedCType(goal.name, BaseCType(optionalIntArrayRefT)) + argname = direct_solve(optionalIntArrayRef_ctype) + return f'{argname}.has_value() ? c10::make_optional({argname}->vec()) : c10::nullopt' elif goal.type == OptionalCType(BaseCType(scalarT)): optionalScalarRef_ctype = NamedCType(goal.name, BaseCType(optionalScalarRefT)) argname = direct_solve(optionalScalarRef_ctype) diff --git a/tools/codegen/api/types.py b/tools/codegen/api/types.py index 8a01b49bfb42fc..81a198a79e5240 100644 --- a/tools/codegen/api/types.py +++ b/tools/codegen/api/types.py @@ -53,6 +53,7 @@ def __str__(self) -> str: tensorT = BaseCppType('at', 'Tensor') optionalTensorRefT = BaseCppType('at', 'OptionalTensorRef') tensorListT = BaseCppType('at', 'TensorList') +iTensorListRefT = BaseCppType('at', 'ITensorListRef') dimnameT = BaseCppType('at', 'Dimname') dimnameListT = BaseCppType('at', 'DimnameList') layoutT = BaseCppType('at', 'Layout') @@ -64,9 +65,11 @@ def __str__(self) -> str: storageT = BaseCppType('at', 'Storage') streamT = BaseCppType('at', 'Stream') intArrayRefT = BaseCppType('at', 'IntArrayRef') +optionalIntArrayRefT = BaseCppType('at', 'OptionalIntArrayRef') tensorOptionsT = BaseCppType('at', 'TensorOptions') typeAndSizeT = BaseCppType('torch::autograd::generated', 'TypeAndSize') tensorGeometryT = BaseCppType('at', 'TensorGeometry') +SymIntT = BaseCppType('c10', 'SymInt') # Types representing template parameters. Technically, we probably shouldn't # represent them this way in codegen, but it was pretty convenient. @@ -105,6 +108,7 @@ def __str__(self) -> str: BaseTy.QScheme: qschemeT, BaseTy.Storage: storageT, BaseTy.Stream: streamT, + BaseTy.SymInt: SymIntT, } # CTypes encode C++ type structure as needed for translation. 
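The `translate.py` changes above teach the solver to convert between the new `at::OptionalIntArrayRef` binding and the owning `c10::optional<std::vector<int64_t>>` form in both directions. A rough sketch of the C++ expression it emits when it has to materialize the owning vector; `output_size` is a hypothetical argument name used only for illustration:

```python
# Sketch of the reference -> value direction added to translate(): given an
# at::OptionalIntArrayRef binding, build an optional<vector<int64_t>> rvalue.
def optional_int_array_ref_to_vec(argname: str) -> str:
    return (f"{argname}.has_value() "
            f"? c10::make_optional({argname}->vec()) "
            f": c10::nullopt")

print(optional_int_array_ref_to_vec("output_size"))
# output_size.has_value() ? c10::make_optional(output_size->vec()) : c10::nullopt
```

The value -> reference direction is free (the ref type can bind directly to the vector), which is why the diff only needs an extra `direct_solve` case plus this materializing expression.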
diff --git a/tools/codegen/build.bzl b/tools/codegen/build.bzl new file mode 100644 index 00000000000000..ed04e35a439133 --- /dev/null +++ b/tools/codegen/build.bzl @@ -0,0 +1,16 @@ +def define_targets(rules): + rules.py_library( + name = "codegen", + srcs = rules.glob(["**/*.py"]), + deps = [ + rules.requirement("PyYAML"), + rules.requirement("typing-extensions"), + ], + visibility = ["//visibility:public"], + ) + + rules.py_binary( + name = "gen", + srcs = [":codegen"], + visibility = ["//visibility:public"], + ) diff --git a/tools/codegen/decompositions/gen_jit_decompositions.py b/tools/codegen/decompositions/gen_jit_decompositions.py new file mode 100644 index 00000000000000..a934bb3ecc4d80 --- /dev/null +++ b/tools/codegen/decompositions/gen_jit_decompositions.py @@ -0,0 +1,84 @@ +#!/usr/bin/env python3 +import os +from pathlib import Path + +from torch.jit._decompositions import decomposition_table +# from tools.codegen.code_template import CodeTemplate + +DECOMP_HEADER = r""" +/** + * @generated + * This is an auto-generated file. Please do not modify it by hand. + * To re-generate, please run: + * cd ~/pytorch && python tools/codegen/decompositions/gen_jit_decompositions.py + */ +#include +#include +#include +#include + +namespace torch { +namespace jit { + + +const std::string decomp_funcs = +R"(""" + + + +DECOMP_CENTER = r""" +)"; + +const std::string& GetSerializedDecompositions() { + return decomp_funcs; +} + +const OperatorMap& GetDecompositionMapping() { + // clang-format off + static const OperatorMap decomposition_mapping { +""" + +DECOMP_END = r""" + }; + // clang-format on + + return decomposition_mapping; +} + +} // namespace jit +} // namespace torch +""" + + +DECOMPOSITION_UTIL_FILE_NAME = "decomposition_registry_util.cpp" + +def gen_serialized_decompisitions() -> str: + return "\n".join([scripted_func.code for scripted_func in decomposition_table.values()]) + +def gen_decomposition_mappings() -> str: + decomposition_mappings = [] + for schema, scripted_func in decomposition_table.items(): + decomposition_mappings.append( + ' {"' + schema + '", "' + scripted_func.name + '"},' + ) + return "\n".join(decomposition_mappings) + +def write_decomposition_util_file(path: str) -> None: + decomposition_str = gen_serialized_decompisitions() + decomposition_mappings = gen_decomposition_mappings() + file_components = [DECOMP_HEADER, decomposition_str, DECOMP_CENTER, decomposition_mappings, DECOMP_END] + print("writing file to : ", path + "/" + DECOMPOSITION_UTIL_FILE_NAME) + with open( + os.path.join(path, DECOMPOSITION_UTIL_FILE_NAME), "wb" + ) as out_file: + final_output = "".join(file_components) + out_file.write(final_output.encode("utf-8")) + +def main() -> None: + pytorch_dir = Path(__file__).resolve().parents[3] + upgrader_path = pytorch_dir / "torch" / "csrc" / "jit" / "runtime" + write_decomposition_util_file(str(upgrader_path)) + + +if __name__ == '__main__': + main() diff --git a/tools/codegen/dest/lazy_ir.py b/tools/codegen/dest/lazy_ir.py index c744c2b91d0087..acc86169d4dbfd 100644 --- a/tools/codegen/dest/lazy_ir.py +++ b/tools/codegen/dest/lazy_ir.py @@ -1,29 +1,31 @@ -from abc import ABC, abstractmethod +from abc import ABC from typing import List, Union from dataclasses import dataclass from tools.codegen.context import method_with_native_function from tools.codegen.model import (BackendIndex, NativeFunction, NativeFunctionsGroup) -from tools.codegen.api.types import (BaseCType, OptionalCType, NamedCType, +from tools.codegen.api.types import (BaseCType, 
OptionalCType, VectorCType, kernel_signature) import tools.codegen.api.dispatcher as dispatcher -from tools.codegen.api.lazy import LazyIrSchema, isValueType +from tools.codegen.api.lazy import LazyIrSchema, LazyArgument, isValueType, tensorListValueT from tools.codegen.dest.lazy_ts_lowering import ts_lowering_body -def node_ctor_arg_rvalue_string(arg: NamedCType, schema: LazyIrSchema) -> str: +def node_ctor_arg_rvalue_string(arg: LazyArgument) -> str: """ - Given a NamedCType from a lazy IR schema, + Given a LazyArgument, generate a c++ string for materializing an rvalue of that arg for passing into a lazy Node constructor. """ - if isValueType(arg.type): - if isinstance(arg.type, BaseCType): - if arg.name in schema.wrapped_scalar_names: + if isValueType(arg.lazy_type): + if isinstance(arg.lazy_type, BaseCType): + if arg.is_wrapped_scalar: return f"torch::lazy::LazyGraphExecutor::Get()->GetIrValueForScalarFromCodegen({arg.name})" + elif arg.lazy_type.type is tensorListValueT: + return f"lazy_{arg.name}_tensorlist" return f"lazy_{arg.name}->GetIrValue()" - elif isinstance(arg.type, OptionalCType): - if arg.name in schema.wrapped_scalar_names: + elif isinstance(arg.lazy_type, OptionalCType): + if arg.is_wrapped_scalar: return f"{arg.name} ? " \ f"c10::make_optional(torch::lazy::LazyGraphExecutor::Get()->GetIrValueForScalarFromCodegen(*{arg.name})) : " \ "c10::nullopt" @@ -31,14 +33,14 @@ def node_ctor_arg_rvalue_string(arg: NamedCType, schema: LazyIrSchema) -> str: f"c10::make_optional(lazy_{arg.name}->GetIrValue()) : " \ "c10::nullopt" else: - raise AssertionError("TODO not sure if there are other valid types to handle here") + raise AssertionError(f"TODO not sure if there are other valid types to handle here ({arg.lazy_type})") else: - if isinstance(arg.type, VectorCType) and isinstance(arg.type.elem, BaseCType): - return f"std::vector<{arg.type.elem.type}>({arg.name}.begin(), {arg.name}.end())" - elif (isinstance(arg.type, OptionalCType) and - isinstance(arg.type.elem, VectorCType) and - isinstance(arg.type.elem.elem, BaseCType)): - return f"torch::lazy::ToOptionalVector<{arg.type.elem.elem.type}>({arg.name})" + if isinstance(arg.lazy_type, VectorCType) and isinstance(arg.lazy_type.elem, BaseCType): + return f"std::vector<{arg.lazy_type.elem.type}>({arg.name}.begin(), {arg.name}.end())" + elif (isinstance(arg.lazy_type, OptionalCType) and + isinstance(arg.lazy_type.elem, VectorCType) and + isinstance(arg.lazy_type.elem.elem, BaseCType)): + return f"torch::lazy::ToOptionalVector<{arg.lazy_type.elem.elem.type}>({arg.name})" else: return f"{arg.name}" @@ -46,20 +48,24 @@ def node_ctor_inputs(schema: LazyIrSchema) -> str: """ Produce a formatted string with the arguments as passed into the constructor of a node class. 
""" - node_ctor_values = [node_ctor_arg_rvalue_string(arg, schema) for arg in schema.filtered_types()] + node_ctor_values = [node_ctor_arg_rvalue_string(arg) for arg in schema.filtered_args()] return ",\n ".join(node_ctor_values) def gen_fallback_code(schema: LazyIrSchema, overload_name: str) -> str: """ Generate code that falls back to eager conditioned on a predicate """ - fallback_args = ",\n ".join([str(arg.name) for arg in schema.filtered_types()]) + fallback_args = ",\n ".join([str(arg.name) for arg in schema.filtered_args(generator=True)]) if len(overload_name): aten_op_str = f"ATEN_OP2({schema.aten_name}, {overload_name})" else: aten_op_str = f"ATEN_OP({schema.aten_name})" + or_has_generator = "" + if schema.generator_arg: + # generators are always optional and there is never more than one, at least currently + or_has_generator = f" || ({schema.generator_arg.name}.has_value() && {schema.generator_arg.name}->defined())" return f""" - if (force_eager_fallback({aten_symbol(schema)})) {{ + if (force_eager_fallback({aten_symbol(schema)}){or_has_generator}) {{ return at::native::call_fallback_fn<<c_eager_fallback, {aten_op_str}>::call( {fallback_args} ); @@ -78,56 +84,56 @@ def aten_symbol(schema: LazyIrSchema) -> str: class LazyIR(ABC): backend_index: BackendIndex node_base: str - lowering_function_type: str = "" - lowering_context_type: str = "" - lowering_return_type: str = "" @method_with_native_function def __call__(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> List[str]: func = f.functional.func if isinstance(f, NativeFunctionsGroup) else f.func return self.gen(f) - @abstractmethod - def lowering_body(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> str: - pass + # there is no lowering functionality generated unless this IR base class is subclassed and + # implemented as a backend-specific node + def lowering_function(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> str: + return "" def gen(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> List[str]: # for now, we just want one IR class decl and soon after also the method defs # and we use the functional version not out/inplace. 
func = f.functional.func if isinstance(f, NativeFunctionsGroup) else f.func schema = LazyIrSchema(func) - all_types = schema.filtered_types() - value_types = schema.filtered_types(values=True, scalars=False) - scalar_types = schema.filtered_types(values=False, scalars=True) + all_args = schema.filtered_args() + value_args = schema.filtered_args(values=True, scalars=False) + scalar_args = schema.filtered_args(values=False, scalars=True) - node_ctor_args = ", ".join([f"const {i.cpp_type()}& {i.name}" for i in all_types]) - scalar_initializers = ",\n ".join([f"{t.name}({t.name})" for t in scalar_types]) + node_ctor_args = ", ".join([f"const {i.lazy_type.cpp_type()}& {i.name}" for i in all_args]) + scalar_initializers = ",\n ".join([f"{a.name}({a.name})" for a in scalar_args]) comma_if_scalar_initializers = ",\n" if len(scalar_initializers) else "" - scalar_decls = "\n ".join([f"{t.cpp_type()} {t.name};" for t in scalar_types]) - scalar_hashes = ", ".join([f"{f.name}" for f in scalar_types]) + scalar_decls = "\n ".join([f"std::string {a.name};" if a.lazy_type.cpp_type() == "c10::string_view" + else f"{a.lazy_type.cpp_type()} {a.name};" + for a in scalar_args]) + scalar_hashes = ", ".join([f"{a.name}" for a in scalar_args]) base_ctor_value_args_list = [] optional_values = [] - for t in value_types: - if isinstance(t.type, BaseCType): - base_ctor_value_args_list.append(f"{t.name}") - elif isinstance(t.type, OptionalCType): - base_ctor_value_args_list.append(f"{t.name}.value_or(kNullValue)") - optional_values.append(t.name) + for arg in value_args: + if isinstance(arg.lazy_type, BaseCType) or isinstance(arg.lazy_type, VectorCType): + base_ctor_value_args_list.append(f"{arg.name}") + elif isinstance(arg.lazy_type, OptionalCType): + base_ctor_value_args_list.append(f"{arg.name}.value_or(kNullValue)") + optional_values.append(arg.name) else: - raise AssertionError("TODO not sure if there are other valid types to handle here") + raise AssertionError(f"TODO not sure if there are other valid types to handle here ({arg.lazy_type})") base_ctor_value_args = ", ".join(base_ctor_value_args_list) has_optional_decls = "\n ".join([f"bool has_{value}: 1;" for value in optional_values]) has_optional_defs = "\n ".join([f"has_{value} = !!{value};" for value in optional_values]) members_to_string = [] - for t in scalar_types: - if isinstance(t.type, OptionalCType): - members_to_string.append(f"""if ({t.name}.has_value()) {{ - ss << ", {t.name}=" << {t.name}.value(); + for arg in scalar_args: + if isinstance(arg.lazy_type, OptionalCType): + members_to_string.append(f"""if ({arg.name}.has_value()) {{ + ss << ", {arg.name}=" << {arg.name}.value(); }} else {{ - ss << ", {t.name}=null"; + ss << ", {arg.name}=null"; }}""") else: - members_to_string.append(f'ss << ", {t.name}=" << {t.name};') + members_to_string.append(f'ss << ", {arg.name}=" << {arg.name};') members_to_string_str = "\n ".join(members_to_string) return [f"""\ @@ -151,10 +157,7 @@ class {schema.node_name} : public {self.node_base} {{ return ss.str(); }} - {self.lowering_return_type} Lower({self.lowering_function_type} function, - {self.lowering_context_type} loctx) const override {{ - {self.lowering_body(f)} - }} + {self.lowering_function(f)} {scalar_decls} {has_optional_decls} @@ -166,31 +169,34 @@ class {schema.node_name} : public {self.node_base} {{ @dataclass(frozen=True) class TSLazyIR(LazyIR): - lowering_function_type: str = "std::shared_ptr" - lowering_context_type: str = "torch::lazy::TSLoweringContext*" - lowering_return_type: str = 
"torch::lazy::TSOpVector" - - def lowering_body(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> str: - return ts_lowering_body(f) + def lowering_function(self, f: Union[NativeFunctionsGroup, NativeFunction]) -> str: + return f"""torch::lazy::TSOpVector Lower(std::shared_ptr function, + torch::lazy::TSLoweringContext* loctx) const override {{ + {ts_lowering_body(f)} + }}""" -def lazy_tensor_decls(value_types: List[NamedCType], tensor_class: str, schema: LazyIrSchema) -> str: +def lazy_tensor_decls(value_args: List[LazyArgument], tensor_class: str) -> str: lazy_tensor_decls: List[str] = [] - for t in value_types: - if t.name in schema.wrapped_scalar_names: + for arg in value_args: + if arg.is_wrapped_scalar: # no lazy tensor wrapper for scalars that are promoted to IR values continue - if isinstance(t.type, BaseCType): - lazy_tensor_decls.append( - f"{tensor_class}Ptr lazy_{t.name} = " - f"torch::lazy::GetLtcTensorOrCreateForWrappedNumber({t.name}, *common_device);") - elif isinstance(t.type, OptionalCType): + elif isinstance(arg.lazy_type, BaseCType): + if arg.lazy_type.type is tensorListValueT: + lazy_tensor_decls.append( + f"auto lazy_{arg.name}_tensorlist = torch::lazy::GetTensorList({arg.name});") + else: + lazy_tensor_decls.append( + f"{tensor_class}Ptr lazy_{arg.name} = " + f"torch::lazy::GetLtcTensorOrCreateForWrappedNumber({arg.name}, *common_device);") + elif isinstance(arg.lazy_type, OptionalCType): # TODO(alanwaketan): Maybe we want to apply GetLtcTensorOrCreateForWrappedNumber here, but hold it # until we encounter a real world example. lazy_tensor_decls.append( - f" {tensor_class}Ptr lazy_{t.name} = torch::lazy::TryGetLtcTensor({t.name}.value_or(at::Tensor()));") + f" {tensor_class}Ptr lazy_{arg.name} = torch::lazy::TryGetLtcTensor({arg.name}.value_or(at::Tensor()));") else: - raise AssertionError("TODO not sure if there are other valid types to handle here") + raise AssertionError(f"TODO not sure if there are other valid types to handle here ({arg.lazy_type})") return ("\n ").join(lazy_tensor_decls) @dataclass(frozen=True) @@ -198,38 +204,27 @@ class GenLazyNativeFuncDefinition: class_method_name: str backend_index: BackendIndex tensor_class: str + gen_forced_fallback_code: bool - @method_with_native_function - def __call__(self, func: NativeFunction) -> List[str]: - sig = kernel_signature(func, self.backend_index) + def gen_shape_call(self, func: NativeFunction) -> str: metadata = self.backend_index.get_kernel(func) assert metadata is not None schema = LazyIrSchema(func.func) - all_types = schema.filtered_types() - value_types = schema.filtered_types(values=True, scalars=False) - scalar_types = schema.filtered_types(values=False, scalars=True) + all_args = schema.filtered_args() returns_length = len(schema.returns) - fallback_str = gen_fallback_code(schema, overload_name=func.func.name.overload_name) - value_types_names = [f"{t.name}" for t in value_types if t.name not in schema.wrapped_scalar_names] - assert len(value_types_names) > 0, "Code below assumes there is at least one tensor arg" - get_device_str = f"""auto common_device = torch::lazy::GetBackendDevice({', '.join(value_types_names)}); - TORCH_INTERNAL_ASSERT(common_device); - """ - - lazy_tensor_decls_str = lazy_tensor_decls(value_types, self.tensor_class, schema) - node_ctor_input_str = node_ctor_inputs(schema) - # call the meta kernel if it exists, to compute output shape/dtype for our IR if func.structured or func.structured_delegate is not None: meta_out = """std::vector 
shapes{Shape(out_meta.scalar_type(), out_meta.sizes().vec())};""" if returns_length > 1: + def this_shape(i: int) -> str: return f"Shape(std::get<{i}>(out_meta).scalar_type(), std::get<{i}>(out_meta).sizes().vec())" - shapes_str = ','.join([this_shape(i) for i in range(returns_length)]) + + shapes_str = ",".join([this_shape(i) for i in range(returns_length)]) meta_out = "std::vector shapes{" + shapes_str + "};" - meta_str = f"""auto out_meta = at::meta::{schema.aten_name}({', '.join(str(t.name) for t in all_types)}); + meta_str = f"""auto out_meta = at::meta::{schema.aten_name}({', '.join(str(a.name) for a in all_args)}); {meta_out}""" else: shape_sig = ComputeShapeSignature(metadata.kernel, func) @@ -239,7 +234,43 @@ def this_shape(i: int) -> str: meta_str += f""" TORCH_INTERNAL_ASSERT(shapes.size() == {returns_length});""" - node_str = f"""auto node = torch::lazy::MakeNode({node_ctor_input_str}, + # Calculating which dimensions are symbolic + func_schema_str = "aten::" + str(func.func) + meta_str += f""" + if(symbolicShapeEnabled()){{ + std::vector inputs = {{ {', '.join(str(a.name) for a in all_args)} }}; + char* schema_str = "{func_schema_str}"; + applySymbolicShapesOnLT(schema_str, inputs, shapes); + }} + """ + return meta_str + + @method_with_native_function + def __call__(self, func: NativeFunction) -> List[str]: + sig = kernel_signature(func, self.backend_index) + metadata = self.backend_index.get_kernel(func) + assert metadata is not None + schema = LazyIrSchema(func.func) + value_args = schema.filtered_args(values=True, scalars=False) + returns_length = len(schema.returns) + + fallback_str = "" + if self.gen_forced_fallback_code: + fallback_str = gen_fallback_code(schema, overload_name=func.func.name.overload_name) + + value_types_names = [f"{a.name}" for a in value_args if not a.is_wrapped_scalar] + assert ( + len(value_types_names) > 0 + ), "Code below assumes there is at least one tensor arg" + get_device_str = f"""auto common_device = torch::lazy::GetBackendDevice({', '.join(value_types_names)}); + TORCH_INTERNAL_ASSERT(common_device); + """ + + lazy_tensor_decls_str = lazy_tensor_decls(value_args, self.tensor_class) + node_ctor_input_str = node_ctor_inputs(schema) + shape_str = self.gen_shape_call(func) + + node_str = f"""auto node = torch::lazy::MakeNode<{schema.node_name}>({node_ctor_input_str}, std::move(shapes));""" first_tensor_name = value_types_names[0] bridge_str = """auto result = torch::lazy::CreateAtenFromLtcTensor( @@ -253,24 +284,28 @@ def this_shape(i: int) -> str: auto result = torch::lazy::TupleAtenFromLtcTensors<{returns_length}>(lazy_tensors);""" if schema.name.name.inplace or func.func.is_out_fn(): - assert returns_length == 1, "We assumed there was no such case where an op is an in-place variant " \ - "and has tuple outputs." + assert returns_length == 1, ( + "We assumed there was no such case where an op is an in-place variant " + f"and has tuple outputs, but got tuple of len {returns_length}." 
+ ) bridge_str = f"""lazy_{first_tensor_name}->SetInPlaceIrValue(node); auto& result = {first_tensor_name};""" - - return [f"""\ + return [ + f"""\ {sig.decl(name=f"{self.class_method_name}::{metadata.kernel}")} {{ {fallback_str} TORCH_LAZY_FN_COUNTER("lazy::"); {get_device_str} {lazy_tensor_decls_str} - {meta_str} + {shape_str} {node_str} {bridge_str} return result; }};\n - """] + """ + ] + class ComputeShapeSignature: """ @@ -279,7 +314,7 @@ class ComputeShapeSignature: def __init__(self, kernel_name: str, f: NativeFunction): self.__schema = LazyIrSchema(f.func) self.__dispatch_args = ', '.join([a.decl() for a in dispatcher.arguments(f.func)]) - self.__call_args = ", ".join([f"{t.name}" for t in self.__schema.filtered_types()]) + self.__call_args = ", ".join([f"{arg.name}" for arg in self.__schema.filtered_args(generator=True)]) self.__kernel_name = kernel_name def __decl_suffix(self) -> str: @@ -309,8 +344,8 @@ def __call__(self, f: NativeFunction) -> List[str]: metadata = self.backend_index.get_kernel(f) assert metadata is not None schema = LazyIrSchema(f.func) - value_types = schema.filtered_types(values=True, scalars=False) - lazy_tensor_decls_str = lazy_tensor_decls(value_types, self.tensor_class, schema) + value_args = schema.filtered_args(values=True, scalars=False) + lazy_tensor_decls_str = lazy_tensor_decls(value_args, self.tensor_class) node_ctor_input_str = node_ctor_inputs(schema) # Only generate shape/dtype fn for non-structured kernels, diff --git a/tools/codegen/dest/lazy_ts_lowering.py b/tools/codegen/dest/lazy_ts_lowering.py index 3f7701d5587a9a..25d594aa459ff6 100644 --- a/tools/codegen/dest/lazy_ts_lowering.py +++ b/tools/codegen/dest/lazy_ts_lowering.py @@ -1,6 +1,6 @@ from typing import Union from tools.codegen.model import (NativeFunction, NativeFunctionsGroup) -from tools.codegen.api.lazy import LazyIrSchema, isValueType +from tools.codegen.api.lazy import LazyIrSchema from tools.codegen.api.types import OptionalCType @@ -11,19 +11,19 @@ def ts_lowering_body(f: Union[NativeFunctionsGroup, NativeFunction]) -> str: schema = LazyIrSchema(func) emplace_arguments = [] - for value in schema.positional_arg_types: - if isValueType(value.type): - if isinstance(value.type, OptionalCType): - emplace_arguments.append(f"has_{value.name} ? loctx->GetOutputOp(operand(i++)) : nullptr") + for arg in schema.positional_args: + if arg.is_lazy_value: + if isinstance(arg.lazy_type, OptionalCType): + emplace_arguments.append(f"has_{arg.name} ? 
loctx->GetOutputOp(operand(i++)) : nullptr") continue emplace_arguments.append('loctx->GetOutputOp(operand(i++))') continue - emplace_arguments.append(f'"{value.name}", {value.name}') + emplace_arguments.append(f'"{arg.name}", {arg.name}') emplace_arguments_str = "\n ".join( [f"arguments.emplace_back({a});" for a in emplace_arguments]) - emplace_kwarg_values = [f'"{t.name}", loctx->GetOutputOp(operand(i++))' for t in schema.keyword_values] - emplace_kwarg_scalars = [f'"{t.name}", {t.name}' for t in schema.keyword_scalars] + emplace_kwarg_values = [f'"{arg.name}", loctx->GetOutputOp(operand(i++))' for arg in schema.keyword_values] + emplace_kwarg_scalars = [f'"{arg.name}", {arg.name}' for arg in schema.keyword_scalars] emplace_kwarguments = "\n ".join( [f"kwarguments.emplace_back({a});" for a in emplace_kwarg_values + emplace_kwarg_scalars]) return f"""\ diff --git a/tools/codegen/dest/register_dispatch_key.py b/tools/codegen/dest/register_dispatch_key.py index c555768d08ce38..dee32075f0376a 100644 --- a/tools/codegen/dest/register_dispatch_key.py +++ b/tools/codegen/dest/register_dispatch_key.py @@ -43,7 +43,9 @@ def gen_registration_headers( elif per_operator_headers: headers += [ "#include ", - "#include "] + "#include ", + "#include ", + "#include "] else: headers.append("#include ") @@ -60,30 +62,15 @@ def gen_create_out_helper(backend_index: BackendIndex) -> List[str]: dispatch = str(backend_index.dispatch_key).lower() empty_impl = f"at::detail::empty_{dispatch}" empty_strided_impl = f"at::detail::empty_strided_{dispatch}" - runtime_empty_supported_check = "" - elif backend_index.dispatch_key == DispatchKey.CompositeExplicitAutograd: + elif backend_index.dispatch_key in ( + DispatchKey.CompositeExplicitAutograd, DispatchKey.QuantizedCPU, DispatchKey.QuantizedCUDA): empty_impl = "at::empty" empty_strided_impl = "at::empty_strided" - runtime_empty_supported_check = """\ - if (!c10::detail::backend_supports_empty_operator(options)) {{ - // The main purpose of this CompositeExplicitAutograd kernel is to provide - // a "free" implementation of out-of-place operators. - // If a backend hasn't implemented an out-of-place op but has implemented - // the out= variant, then this kernel will call their out= variant. - // It does that by using at::empty() to create the tensor to pass to the out= variant though, - // so this "default" kernel doesn't actually handle backends that don't support at::empty - // (e.g. quantized backends). - // Returning an undefined tensor here allows us to reach the out= kernel and give a better error. - // Longer term, this could be better fixed by https://github.com/pytorch/pytorch/issues/52680 - return at::Tensor(); - }} -""" else: return [] return [f""" Tensor create_out(IntArrayRef sizes, IntArrayRef strides, const TensorOptions &options) {{ - {runtime_empty_supported_check} if (strides.empty()) {{ return {empty_impl}(sizes, {empty_options}); }} else {{ @@ -191,6 +178,10 @@ class RegisterDispatchKey: # all of the existing kernel signatures scattered across aten/src/ATen/native. class_method_name: Optional[str] + # Only set to true in lightweight dispatch. If lightweight dispatch is enabled we are registering + # operators into JIT op registry, thus we need to avoid generating code to register into the dispatcher. 
+ skip_dispatcher_op_registration: bool + @staticmethod def gen_device_check(type: DeviceCheckType, args: List[Argument], method_name: str) -> str: if type == DeviceCheckType.NoCheck: @@ -282,6 +273,7 @@ def gen_structured(self, g: NativeFunctionsGroup) -> List[str]: self.rocm, self.cpp_namespace, self.class_method_name, + self.skip_dispatcher_op_registration, g ) return list(mapMaybe(structured_gen.gen_one, g.functions())) @@ -376,7 +368,7 @@ def generate_defn(cpp_sig: CppSignature) -> str: device_guard = "// DeviceGuard omitted" # default if f.device_guard and self.backend_index.device_guard: - has_tensor_options = any(isinstance(a.argument, TensorOptionsArguments) for a in args) + has_tensor_options = any(isinstance(a, TensorOptionsArguments) for a in f.func.arguments.non_out) if has_tensor_options: # kernel is creating a tensor device_guard = """ @@ -416,7 +408,7 @@ def generate_defn(cpp_sig: CppSignature) -> str: """ elif self.target is Target.REGISTRATION: - if f.manual_kernel_registration: + if f.manual_kernel_registration or self.skip_dispatcher_op_registration: return None else: payload = f"TORCH_FN({name})" diff --git a/tools/codegen/gen.py b/tools/codegen/gen.py index 4b35ee81f343ca..101f1fbe96aed6 100644 --- a/tools/codegen/gen.py +++ b/tools/codegen/gen.py @@ -26,6 +26,7 @@ import tools.codegen.api.meta as meta import tools.codegen.api.structured as structured from tools.codegen.api.translate import translate +from tools.codegen.code_template import CodeTemplate from tools.codegen.selective_build.selector import SelectiveBuilder from tools.codegen.utils import ( Target, concatMap, context, mapMaybe, YamlDumper, YamlLoader, FileManager, assert_never, make_file_manager @@ -1090,7 +1091,8 @@ def gen_aggregated_headers( selector, rocm=rocm, cpp_namespace='at::native', - class_method_name=None), + class_method_name=None, + skip_dispatcher_op_registration=False), grouped_native_functions )), }) @@ -1198,7 +1200,8 @@ def gen_per_operator_headers( selector, rocm=rocm, cpp_namespace='at::native', - class_method_name=None), + class_method_name=None, + skip_dispatcher_op_registration=False), grouped_functions )) @@ -1418,6 +1421,25 @@ def operator_headers() -> List[str]: return headers backend_index = backend_indices[dispatch_key] + dispatch_registrations_body = "" if skip_dispatcher_op_registration else "\n".join(list(concatMap( + dest.RegisterDispatchKey( + backend_index, + Target.REGISTRATION, + selector, + rocm=rocm, + cpp_namespace='at::native', + class_method_name=None, + skip_dispatcher_op_registration=skip_dispatcher_op_registration), + grouped_native_functions + ))) + static_template = CodeTemplate("""\ +TORCH_LIBRARY_IMPL(aten, $dispatch_key, m) { + $dispatch_registrations_body +};""") + static_init_dispatch_registrations = static_template.substitute( + dispatch_key=dispatch_key, + dispatch_registrations_body=dispatch_registrations_body + ) dispatch_namespace = str(dispatch_key).lower() fm.write_with_template(f'Register{dispatch_key}.cpp', 'RegisterDispatchKey.cpp', lambda: { 'extra_cuda_headers': extra_cuda_headers if is_cuda_dispatch_key(dispatch_key) else '', @@ -1434,7 +1456,8 @@ def operator_headers() -> List[str]: selector, rocm=rocm, cpp_namespace='at::native', - class_method_name=None), + class_method_name=None, + skip_dispatcher_op_registration=skip_dispatcher_op_registration), grouped_native_functions )), 'dispatch_anonymous_definitions': list(concatMap( @@ -1444,19 +1467,12 @@ def operator_headers() -> List[str]: selector, rocm=rocm, cpp_namespace='at::native', - 
class_method_name=None), - grouped_native_functions - )), - 'dispatch_registrations': [] if skip_dispatcher_op_registration else list(concatMap( - dest.RegisterDispatchKey( - backend_index, - Target.REGISTRATION, - selector, - rocm=rocm, - cpp_namespace='at::native', - class_method_name=None), + class_method_name=None, + skip_dispatcher_op_registration=skip_dispatcher_op_registration), grouped_native_functions )), + 'static_init_dispatch_registrations': static_init_dispatch_registrations, + 'deferred_dispatch_registrations': "", }) for g in structured_native_functions: diff --git a/tools/codegen/gen_backend_stubs.py b/tools/codegen/gen_backend_stubs.py index 5b703889ab85ef..587eea4d48c799 100644 --- a/tools/codegen/gen_backend_stubs.py +++ b/tools/codegen/gen_backend_stubs.py @@ -11,6 +11,7 @@ from tools.codegen.selective_build.selector import SelectiveBuilder from tools.codegen.utils import Target, concatMap, context, YamlLoader, FileManager from tools.codegen.context import native_function_manager +from tools.codegen.code_template import CodeTemplate import tools.codegen.dest as dest import tools.codegen.api.dispatcher as dispatcher from tools.codegen.api.types import DispatcherSignature @@ -19,7 +20,7 @@ # Parses the external backend's yaml, and adds a new BackendIndex for the backend's dispatch key. # Returns a Tuple of (backend_key, autograd_key, cpp_namespace, updated BackendIndex mapping) ParsedExternalYaml = namedtuple('ParsedExternalYaml', [ - 'backend_key', 'autograd_key', 'cpp_namespace', 'backend_indices']) + 'backend_key', 'autograd_key', 'class_name', 'cpp_namespace', 'backend_indices']) def parse_backend_yaml( backend_yaml_path: str, grouped_native_functions: Sequence[Union[NativeFunction, NativeFunctionsGroup]], @@ -35,11 +36,13 @@ def parse_backend_yaml( yaml_values = yaml.load(f, Loader=YamlLoader) assert isinstance(yaml_values, dict) - valid_keys = ['backend', 'cpp_namespace', 'extra_headers', 'supported', 'autograd', 'full_codegen'] + valid_keys = ['backend', 'class_name', 'cpp_namespace', 'extra_headers', 'supported', 'autograd', 'full_codegen'] backend = yaml_values.pop('backend', None) assert backend is not None, 'You must provide a value for "backend"' + class_name = yaml_values.pop('class_name', None) + cpp_namespace = yaml_values.pop('cpp_namespace', None) assert cpp_namespace is not None, 'You must provide a value for "cpp_namespace"' @@ -133,13 +136,14 @@ def create_backend_index( autograd key. They cannot be mix and matched. If this is something you need, feel free to create an issue! \ {forward_kernels[0].kernel} is listed under "supported", but {backward_kernels[0].kernel} is listed under "autograd".' 
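With the change above, `parse_backend_yaml` accepts an optional `class_name` key alongside the existing ones. A hedged example of a backend yaml using the keys listed in `valid_keys`; the backend name, namespace, and operator entries are invented for illustration only:

```python
# Illustrative only: key names match valid_keys in parse_backend_yaml, but the
# values (backend name, namespace, ops) are made up.
import yaml  # PyYAML, already a codegen dependency

example_backend_yaml = """
backend: MyBackend
class_name: MyBackendNativeFunctions
cpp_namespace: my_backend
supported:
  - add.Tensor
  - mul.Tensor
autograd:
  - max_pool2d
full_codegen:
  - abs
"""

parsed = yaml.safe_load(example_backend_yaml)
assert parsed.get("backend") is not None, 'You must provide a value for "backend"'
print(sorted(parsed.keys()))
# ['autograd', 'backend', 'class_name', 'cpp_namespace', 'full_codegen', 'supported']
```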
- return ParsedExternalYaml(backend_key, autograd_key, cpp_namespace, backend_indices) + return ParsedExternalYaml(backend_key, autograd_key, class_name, cpp_namespace, backend_indices) def error_on_missing_kernels( native_functions: Sequence[NativeFunction], backend_indices: Dict[DispatchKey, BackendIndex], backend_key: DispatchKey, autograd_key: Optional[DispatchKey], + class_name: str, kernel_defn_file_path: str, full_codegen: Optional[List[OperatorName]] = None, ) -> None: @@ -152,9 +156,6 @@ def error_on_missing_kernels( if full_codegen is None: full_codegen = [] - class_name: Optional[str] = backend_indices[backend_key].native_function_class_name() - assert class_name is not None - expected_backend_op_names: List[OperatorName] = \ list(backend_indices[backend_key].index.keys()) + \ [] if autograd_key is None else list(backend_indices[autograd_key].index.keys()) @@ -208,7 +209,8 @@ def gen_dispatchkey_nativefunc_headers( backend_indices: Dict[DispatchKey, BackendIndex], grouped_native_functions: Sequence[Union[NativeFunction, NativeFunctionsGroup]], backend_dispatch_key: DispatchKey, - autograd_dispatch_key: Optional[DispatchKey]) -> None: + autograd_dispatch_key: Optional[DispatchKey], + backend_name: str = "") -> None: assert class_name is not None generated_comment = 'Autogenerated file by gen_backend_stubs.py. Do not edit directly!' @@ -230,26 +232,81 @@ def gen_dispatchkey_nativefunc_headers( 'class_name': class_name, 'namespace_epilogue': ns_helper.epilogue, 'dispatch_declarations': backend_declarations + autograd_declarations, + 'BackendName': backend_name, + 'DispatchKey': backend_dispatch_key, + }) def gen_dispatcher_registrations( fm: FileManager, output_dir: str, + class_name: str, cpp_namespace: str, backend_indices: Dict[DispatchKey, BackendIndex], grouped_native_functions: Sequence[Union[NativeFunction, NativeFunctionsGroup]], backend_dispatch_key: DispatchKey, dispatch_key: DispatchKey, - selector: 'SelectiveBuilder') -> None: + selector: 'SelectiveBuilder', + # build_in_tree is true for lazy TS backend and affects include paths, not used for external backends + build_in_tree: bool = False, + per_operator_headers: bool = False, + backend_name: str = "", + eager_registration: bool = True) -> None: + headers = [ + f"{output_dir}/{backend_dispatch_key}NativeFunctions.h", + ] + if build_in_tree: + external_backend_headers_str = "\n".join(f'#include <{h}>' for h in headers) + else: + external_backend_headers_str = "\n".join(f'#include "{h}"' for h in headers) + + assert class_name is not None backend_index = backend_indices[dispatch_key] + + dispatch_registrations_body = list(concatMap( + dest.RegisterDispatchKey( + backend_index, + Target.REGISTRATION, + selector, + rocm=False, + cpp_namespace=cpp_namespace, + class_method_name=f'{class_name}', + skip_dispatcher_op_registration=False), + grouped_native_functions + )) + deferred_dispatch_registrations = "" + static_init_dispatch_registrations = "" + if eager_registration: + static_template = CodeTemplate("""\ +TORCH_LIBRARY_IMPL(aten, $dispatch_key, m) { + $dispatch_registrations_body +};""") + static_init_dispatch_registrations = static_template.substitute( + dispatch_key=dispatch_key, + dispatch_registrations_body=dispatch_registrations_body + ) + else: + deferred_template = CodeTemplate("""\ +TORCH_API void Register${backend_name}${dispatch_key}NativeFunctions() { + static auto m = MAKE_TORCH_LIBRARY_IMPL(aten, $dispatch_key); + $dispatch_registrations_body +}""") + deferred_dispatch_registrations = 
deferred_template.substitute( + backend_name=backend_name, + dispatch_key=dispatch_key, + dispatch_registrations_body=dispatch_registrations_body + ) + fm.write_with_template(f'Register{dispatch_key}.cpp', 'RegisterDispatchKey.cpp', lambda: { + 'static_init_dispatch_registrations': static_init_dispatch_registrations, + 'deferred_dispatch_registrations': deferred_dispatch_registrations, 'extra_cuda_headers': '', - 'external_backend_headers': f'#include "{output_dir}/{backend_dispatch_key}NativeFunctions.h"', - 'ops_headers': '#include ', + 'external_backend_headers': external_backend_headers_str, + 'ops_headers': '#include ' if not per_operator_headers else '', 'DispatchKey': dispatch_key, 'dispatch_namespace': dispatch_key.lower(), - 'dispatch_headers': dest.gen_registration_headers(backend_index, per_operator_headers=False, rocm=False), + 'dispatch_headers': dest.gen_registration_headers(backend_index, per_operator_headers=per_operator_headers, rocm=False), 'dispatch_helpers': dest.gen_registration_helpers(backend_index), 'dispatch_namespaced_definitions': '', 'dispatch_anonymous_definitions': list(concatMap( @@ -259,17 +316,8 @@ def gen_dispatcher_registrations( selector, rocm=False, cpp_namespace=cpp_namespace, - class_method_name=f'{backend_dispatch_key}NativeFunctions'), - grouped_native_functions - )), - 'dispatch_registrations': list(concatMap( - dest.RegisterDispatchKey( - backend_index, - Target.REGISTRATION, - selector, - rocm=False, - cpp_namespace=cpp_namespace, - class_method_name=f'{dispatch_key}NativeFunctions'), + class_method_name=f'{class_name}', + skip_dispatcher_op_registration=False), grouped_native_functions )), }) @@ -293,6 +341,7 @@ def make_file_manager(install_dir: str) -> FileManager: backend_key = parsed_backend_yaml.backend_key autograd_key = parsed_backend_yaml.autograd_key cpp_namespace = parsed_backend_yaml.cpp_namespace + class_name = parsed_backend_yaml.class_name backend_indices = parsed_backend_yaml.backend_indices selector = SelectiveBuilder.get_nop_selector() @@ -302,17 +351,24 @@ def make_file_manager(install_dir: str) -> FileManager: # This could be useful if a backend wants to quickly set up a noop yaml file but doesn't have any kernels ready yet. return - class_name = backend_indices[backend_key].native_function_class_name() + if class_name is None: + # class_name is an optional argument to backend yaml file. + # if specified it allows an external backend to override + # the name of the class that all generated kernel definitions live under. + # if not specified, its value is given as native_function_class_name. 
+ class_name = backend_indices[backend_key].native_function_class_name() + assert class_name is not None if impl_path is not None: - error_on_missing_kernels(native_functions, backend_indices, backend_key, autograd_key, impl_path) + error_on_missing_kernels(native_functions, backend_indices, backend_key, autograd_key, class_name, impl_path) + + gen_dispatchkey_nativefunc_headers(fm, class_name, cpp_namespace, backend_indices, + grouped_native_functions, backend_key, autograd_key) - gen_dispatchkey_nativefunc_headers(fm, class_name, cpp_namespace, backend_indices, - grouped_native_functions, backend_key, autograd_key) + for dispatch_key in [backend_key] if autograd_key is None else [backend_key, autograd_key]: + gen_dispatcher_registrations(fm, output_dir, class_name, cpp_namespace, backend_indices, + grouped_native_functions, backend_key, dispatch_key, selector) - for dispatch_key in [backend_key] if autograd_key is None else [backend_key, autograd_key]: - gen_dispatcher_registrations(fm, output_dir, cpp_namespace, backend_indices, grouped_native_functions, - backend_key, dispatch_key, selector) if __name__ == '__main__': main() diff --git a/tools/codegen/gen_functionalization_type.py b/tools/codegen/gen_functionalization_type.py index 6666a493be7423..06521836d733fc 100644 --- a/tools/codegen/gen_functionalization_type.py +++ b/tools/codegen/gen_functionalization_type.py @@ -10,7 +10,6 @@ ) from tools.codegen.selective_build.selector import SelectiveBuilder from typing import List, Optional, Union, Tuple -from tools.codegen.utils import mapMaybe def modifies_arguments(f: NativeFunction) -> bool: return f.func.kind() in [SchemaKind.inplace, SchemaKind.out] @@ -40,15 +39,26 @@ def is_tensor_like(a: Union[Argument, TensorOptionsArguments, SelfArgument]) -> # unwraps all tensor-like arguments, returning: # (1) a string containing all of the logic that does the unwrapping # (2) a context, to be used by translate(), with all of the relevant bindings. -def unwrap_tensor_args(sig: DispatcherSignature) -> Tuple[str, List[Binding]]: +def unwrap_tensor_args(sig: DispatcherSignature, *, is_view_op: bool) -> Tuple[str, List[Binding]]: context: List[Binding] = [] unwrapped_tensor_args: List[str] = [] for arg in sig.arguments(): if is_tensor_like(arg.argument): # for tensor inputs, we want to unwrap them before passing them into the redispatch calls. unwrapped_name = f'{arg.name}_' - unwrapped_tensor_args.append( - f'auto {unwrapped_name} = at::functionalization::impl::from_functional_tensor({arg.name});') + # For most ops, the functionalization needs to sync any pending updates on the input tensors + # before calling the operator, since otherwise the operator will act on stale data. + # For view ops though, we can continue to defer syncing until the tensor is used by + # a non-view operator. + maybe_sync_input = '' if is_view_op else f'at::functionalization::impl::sync({arg.name});' + unwrapped_tensor_args.append(f""" + {arg.nctype.remove_const_ref().cpp_type()} {unwrapped_name}; + if (at::functionalization::impl::isFunctionalTensor({arg.name})) {{ + {maybe_sync_input} + {unwrapped_name} = at::functionalization::impl::from_functional_tensor({arg.name}); + }} else {{ + {unwrapped_name} = {arg.name}; + }}""") context.append(arg.with_name(unwrapped_name)) else: # for non-tensor inputs, we want to pass them directly into the redispatch calls. 
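The new `unwrap_tensor_args` above only syncs pending updates for non-view ops, and passes the argument through untouched when it is not a `FunctionalTensorWrapper`. A sketch of the C++ block that template renders for a hypothetical Tensor argument named `self` (whitespace simplified relative to the real template):

```python
# Illustration of the generated unwrap block; "self" / at::Tensor are example
# values, and the indentation differs from the real codegen output.
def render_unwrap(name: str, cpp_type: str, is_view_op: bool) -> str:
    # view ops defer syncing until a non-view op consumes the tensor
    maybe_sync_input = "" if is_view_op else f"at::functionalization::impl::sync({name});"
    return f"""
  {cpp_type} {name}_;
  if (at::functionalization::impl::isFunctionalTensor({name})) {{
    {maybe_sync_input}
    {name}_ = at::functionalization::impl::from_functional_tensor({name});
  }} else {{
    {name}_ = {name};
  }}"""

print(render_unwrap("self", "at::Tensor", is_view_op=False))
```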
@@ -129,11 +139,10 @@ def emit_view_functionalization_body( assert_view_op_properties(f.func) view_tensor_name = dispatcher_sig.arguments()[0].name - keyset = 'dispatchKeySet & c10::after_func_keyset' return_type = dispatcher_sig.returns_type().remove_const_ref().cpp_type() - unwrap_tensor_args_str, unwrapped_args_ctx = unwrap_tensor_args(dispatcher_sig) - view_redispatch_args = [keyset] + [e.expr for e in translate(unwrapped_args_ctx, call_sig.arguments(), method=False)] + unwrap_tensor_args_str, unwrapped_args_ctx = unwrap_tensor_args(dispatcher_sig, is_view_op=True) + view_redispatch_args = [e.expr for e in translate(unwrapped_args_ctx, call_sig.arguments(), method=False)] forward_lambda = FunctionalizationLambda.from_func(f, functional_op=functional_op, is_reverse=False) reverse_lambda = FunctionalizationLambda.from_func(f, functional_op=functional_op, is_reverse=True) @@ -145,6 +154,12 @@ def emit_view_functionalization_body( if f.tag is Tag.inplace_view: # See Note [Functionalization Pass - Inplace View Ops] for more details return f""" + if (!at::functionalization::impl::isFunctionalTensor({view_tensor_name})) {{ + // functionalization is re-entrant, but will no-op if it wasn't passed a FunctionalTensorWrapper. + {unwrap_tensor_args_str} + at::AutoDispatchSkipFunctionalize guard; + return at::_ops::{f.func.name.unambiguous_name()}::call({', '.join(view_redispatch_args)}); + }} at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta( {forward_lambda.decl()} {{ return {forward_lambda.inner_call()} @@ -154,7 +169,6 @@ def emit_view_functionalization_body( }} ); at::functionalization::impl::mutate_view_meta({view_tensor_name}, view_meta); - {unwrap_tensor_args_str} {return_type} reference_tensor_output; {{ at::AutoDispatchSkipFunctionalize guard; @@ -169,13 +183,18 @@ def emit_view_functionalization_body( else: return f""" {unwrap_tensor_args_str} + if (!at::functionalization::impl::isFunctionalTensor({view_tensor_name})) {{ + // functionalization is re-entrant, but will no-op if it wasn't passed a FunctionalTensorWrapper. + at::AutoDispatchSkipFunctionalize guard; + return at::_ops::{api_name}::call({', '.join(view_redispatch_args)}); + }} {return_type} tmp_output; {return_type} reference_tensor_output; {{ at::AutoDispatchSkipFunctionalize guard; {meta_conversion_str} reference_tensor_output = at::_ops::{api_name}::call({', '.join(meta_call_args)}); - tmp_output = at::_ops::{api_name}::redispatch({', '.join(view_redispatch_args)}); + tmp_output = at::_ops::{api_name}::call({', '.join(view_redispatch_args)}); // I'm fusing the [alias removal], [mutation removal], [add views back] passes together. // Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal). 
}} @@ -203,16 +222,23 @@ def emit_inplace_functionalization_body( dispatcher_sig = DispatcherSignature.from_schema(f.func) - keyset = 'dispatchKeySet & c10::after_func_keyset' return_type = dispatcher_sig.returns_type().remove_const_ref().cpp_type() - unwrap_tensor_args_str, unwrapped_args_ctx = unwrap_tensor_args(dispatcher_sig) + unwrap_tensor_args_str, unwrapped_args_ctx = unwrap_tensor_args(dispatcher_sig, is_view_op=False) maybe_return = '' if len(f.func.returns) == 0 else 'return ' - sync_tensor_args = '\n '.join(mapMaybe( - lambda arg: f'at::functionalization::impl::sync({arg.name});' - if arg.type.is_tensor_like() else None, - f.func.arguments.flat_all)) + + mutated_names = [a.name for a in f.func.arguments.flat_all if a.type.is_tensor_like() and a.annotation is not None] + non_mutated_names = [a.name for a in f.func.arguments.flat_all if a.type.is_tensor_like() and a.annotation is None] + # all mutable inputs must be functional tensors in order to participate in functionalization + check_all_mutated_args_are_functional = ' && '.join( + ['true'] + [f'at::functionalization::impl::isFunctionalTensor({a})' for a in mutated_names]) + check_any_non_mutated_args_are_functional = ' || '.join( + ['false'] + [f'at::functionalization::impl::isFunctionalTensor({a})' for a in non_mutated_names]) + # These are used in the cases where we don't functionalize and redispatch to the inplace op + # case 1: we hit an inplace op that doesn't have an out-of-place equivalent + # case 2: we hit an inplace ops but our inputs are not functional tensors (in which case our kernel just no-ops) + inplace_exprs = [e.expr for e in translate(unwrapped_args_ctx, dispatcher_sig.arguments(), method=False)] # Note [functionalizating copy_() and not preserving strides] # copy_() can't be functionalized, since there doesn't exist an out-of-place variant. @@ -225,34 +251,31 @@ def emit_inplace_functionalization_body( # - There are actually a few other places where the functionalization pass currently doesn't support strides: # calls to slice/diagonal_scatter don't currently preserve the strides of their inputs (but maybe we should fix this). if str(f.func.name) == 'copy_': - exprs = [keyset] + [a.name for a in unwrapped_args_ctx] - functional_call_str = f"""\ - auto tmp_intermediate = at::_ops::to_other::redispatch({keyset}, src_, self_, non_blocking, false, c10::nullopt); - tmp_output = at::_ops::expand_as::redispatch({keyset}, tmp_intermediate, self_);""" + functional_call_str = """\ + auto tmp_intermediate = at::_ops::to_other::call(src_, self_, non_blocking, false, c10::nullopt); + tmp_output = at::_ops::expand_as::call(tmp_intermediate, self_);""" elif functional_op is None: # We can't functionalize this inplace op, since we don't know what the corresponding functional op is. - inplace_exprs = [keyset] + [e.expr for e in translate(unwrapped_args_ctx, dispatcher_sig.arguments(), method=False)] - warn_str = "Note: the functionalization pass encountered an operator ({}) that it could not functionalize, \ + warn_str = "Note: the functionalization pass encountered an operator ({str(f.func.name)}) that it could not functionalize, \ because it couldn't find an out-of-place equivalent of the operator to call. \ Instead, it's calling the inplace/view operator directly. \ -If this causes problems in your program, consider upstreaming the out-of-place op to PyTorch.".format(str(f.func.name)) +If this causes problems in your program, consider upstreaming the out-of-place op to PyTorch." 
return f""" if (c10::impl::tls_local_dispatch_key_set().included_.has(c10::DispatchKey::Functionalize)) {{ TORCH_WARN("{warn_str}"); }} - {sync_tensor_args} {unwrap_tensor_args_str} at::AutoDispatchSkipFunctionalize guard; // Redispatch as normally otherwise, since XLA has its own lowerings for special inplace ops. - {maybe_return}at::_ops::{f.func.name.unambiguous_name()}::redispatch({', '.join(inplace_exprs)}); + {maybe_return}at::_ops::{f.func.name.unambiguous_name()}::call({', '.join(inplace_exprs)}); """ else: # call the out-of-place variant of the op functional_sig = DispatcherSignature.from_schema(functional_op.func) - functional_exprs = [keyset] + [e.expr for e in translate(unwrapped_args_ctx, functional_sig.arguments(), method=False)] + functional_exprs = [e.expr for e in translate(unwrapped_args_ctx, functional_sig.arguments(), method=False)] functional_call_str = \ - f"tmp_output = at::_ops::{functional_op.func.name.unambiguous_name()}::redispatch({', '.join(functional_exprs)});" + f"tmp_output = at::_ops::{functional_op.func.name.unambiguous_name()}::call({', '.join(functional_exprs)});" mutable_input_post_processing = '\n'.join([ f""" @@ -263,16 +286,29 @@ def emit_inplace_functionalization_body( if a.annotation and a.annotation.is_write and a.type.is_tensor_like()]) return f""" - {sync_tensor_args} {unwrap_tensor_args_str} - {return_type} tmp_output; - {{ + if (!({check_all_mutated_args_are_functional})) {{ + if (({check_any_non_mutated_args_are_functional})) {{ + // case 1: trying to mutate a non functional tensor with a functional tensor is an error + TORCH_INTERNAL_ASSERT(false, + "mutating a non-functional tensor with a functional tensor is not allowed.", + " Please ensure that all of your inputs are wrapped inside of a functionalize() call."); + }} else {{ + // case 2: arguments are not functional tensors, so we no-op and redispatch. 
+ at::AutoDispatchSkipFunctionalize guard; + at::_ops::{f.func.name.unambiguous_name()}::call({', '.join(inplace_exprs)}); + {return_str(f)}; + }} + }} else {{ + {return_type} tmp_output; + {{ at::AutoDispatchSkipFunctionalize guard; // The functionalization pass explicitly doesn't pass out= parameters to the redispatch {functional_call_str} - }} - {mutable_input_post_processing} - {return_str(f)};""" + }} + {mutable_input_post_processing} + {return_str(f)}; + }}""" def emit_declaration_for_noncomposite_views(f: NativeFunction) -> str: diff --git a/tools/codegen/gen_lazy_tensor.py b/tools/codegen/gen_lazy_tensor.py index 12a0dec9170e00..591abf3a479239 100644 --- a/tools/codegen/gen_lazy_tensor.py +++ b/tools/codegen/gen_lazy_tensor.py @@ -6,7 +6,7 @@ from collections import namedtuple, Counter from typing import List, Dict, Union, Sequence, Optional, Callable, Iterable, Iterator, Tuple, Type from tools.codegen.dest.lazy_ir import LazyIR, TSLazyIR -from tools.codegen.gen import get_grouped_native_functions, parse_native_yaml +from tools.codegen.gen import get_grouped_native_functions, parse_native_yaml, NamespaceHelper from tools.codegen.model import (FunctionSchema, NativeFunction, NativeFunctionsGroup, OperatorName) from tools.codegen.selective_build.selector import SelectiveBuilder @@ -16,6 +16,64 @@ gen_dispatchkey_nativefunc_headers, gen_dispatcher_registrations) +# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # +# +# Lazy Tensor Codegen +# +# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # +# Overview +# ~~~~~~~~ +# +# This codegen script builds on existing data models and helpers used +# by all ATen backends, and adds new functionality specific to lazy +# tensor backends. +# +# Inputs: +# - _native_functions.yaml: controls which operators are +# supported by the backend. +# +# Outputs: +# (for all backends) +# Ir.h defines Lazy IR classes to be constructed during tracing +# - opt-in: also generate 'lowering' methods for the TorchScript backend only +# NativeFunctions.cpp defines implementations of native functions which perform lazy tracing +# - opt-in: 'full_codegen' section of backend yaml; 'supported' section omits these implementations +# NativeFunctions.h declares implementations of native functions for both 'supported' and 'full_codegen' +# ops +# +# Register.cpp registers all op implementations with the dispatcher +# RegisterAutograd.cpp registers all autograd implementations with the dispatcher +# +# Validation Helpers: +# - Shape Inference: errs if any ops in backend yaml require shape inference not provided by meta kernels or +# implementations in torch/csrc/lazy/core/shape_inference.* +# - native function impls: errs if any 'supported' ops do not have an implementation defined in the backend +# (non-codegen) implementation file +# +# +# About the Data Model +# ~~~~~~~~~~~~~~~~~~~~ +# +# Modeled after ATen codegen, the first step is to parse yaml and build a data model for the operators +# we care about. In this case, the _native_functions yaml defines a subset of the core operators +# (defined in more detail in the main native_functions.yaml), which will be supported by your backend. +# Backends can list ops in two categories: +# - `supported` ops require hand-implementations but still get codegenned declarations and registrations +# - `full_codegen` ops get implementations (and IR classes) generated too +# +# Each native function is modeled as an object with a schema, and each schema has objects representing their +# arguments. 
Much of the codegen is manipulation of the arguments and their types. For example, lazy tensor +# backends need to transform 'at::Tensor' arguments into 'lazy::Value' objects, as well as replacing reference +# types (stringref) with actual string objects, and this is done by manipulating the data model objects. +# - see api/lazy.py for the lazy data model +# +# Once the data model is set up, the rest of this script processes a number of templates for output CPP file +# and fills in the template values using helpers in `dest/lazy_ir.py` and `dest/lazy_ts_lowering.py`. These +# helpers mostly iterate over functions and their arguments, outputting different c++ snippets. +# +# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # + + # Parses the external backend's yaml, and adds a new BackendIndex for the backend's dispatch key. # Returns a Tuple of (backend_key, autograd_key, cpp_namespace, updated BackendIndex mapping, full_codegen) ParsedExternalYaml = namedtuple('ParsedExternalYaml', [ @@ -62,6 +120,15 @@ def validate_shape_inference_header(shape_inference_hdr: str, expected_shape_inf and implement it in the the corresponding shape_inference.cpp file.\n {decl}""" +class default_args: + node_base: str = "Node" + node_base_hdr: Optional[str] = None + shape_inference_hdr: str = "torch/csrc/lazy/core/shape_inference.h" + tensor_class: str = "torch::lazy::LazyTensor" + tensor_class_hdr: str = "torch/csrc/lazy/core/tensor.h" + lazy_ir_cls: Type[LazyIR] = LazyIR + backend_name: str = "TorchScript" + def main() -> None: parser = argparse.ArgumentParser(description='Generate Lazy Tensor backend files') parser.add_argument( @@ -78,41 +145,62 @@ def main() -> None: '--gen_ts_lowerings', action="store_true", help='Generate TorchScript lowerings in addition to Lazy IR and NativeFunctions') parser.add_argument( - '--node_base', type=str, default="Node", help='Name of backend specific custom Lazy IR Node base class') + '--node_base', type=str, default=default_args.node_base, + help='Name of backend specific custom Lazy IR Node base class') parser.add_argument( - '--node_base_hdr', type=str, default=None, help='Path to header file defining custom Lazy IR Node base class') + '--node_base_hdr', type=str, default=default_args.node_base_hdr, + help='Path to header file defining custom Lazy IR Node base class') parser.add_argument( - '--shape_inference_hdr', type=str, default=None, + '--shape_inference_hdr', type=str, default=default_args.shape_inference_hdr, help='Path to header file defining custom Lazy shape inference functions') parser.add_argument( - '--tensor_class', type=str, default="torch::lazy::LazyTensor", + '--tensor_class', type=str, default=default_args.tensor_class, help='Name of backend specific custom Lazy Tensor class') parser.add_argument( - '--tensor_class_hdr', type=str, default="torch/csrc/lazy/core/tensor.h", + '--tensor_class_hdr', type=str, default=default_args.tensor_class_hdr, help='Path to header file defining custom Lazy Tensor class') + parser.add_argument( + '--backend_name', type=str, default=default_args.backend_name, + help='Name of the backend to generate') options = parser.parse_args() - run(options.source_yaml, options.output_dir, options.dry_run, options.impl_path, - options.gen_ts_lowerings, options.node_base, options.node_base_hdr, - options.tensor_class, options.tensor_class_hdr, options.shape_inference_hdr, - TSLazyIR) - - -def run(source_yaml: str, output_dir: str, dry_run: bool, impl_path: Optional[str], - gen_ts_lowerings: bool, node_base: str, 
node_base_hdr: Optional[str], - tensor_class: str, tensor_class_hdr: str, shape_inference_hdr: str, - lazy_ir_cls: Type[LazyIR]) -> None: - # Assumes that this file lives at PYTORCH_ROOT/tools/codegen/gen_backend_stubs.py - pytorch_root = pathlib.Path(__file__).parent.parent.parent.absolute() - template_dir = os.path.join(pytorch_root, "aten/src/ATen/templates") + torch_root = pathlib.Path(__file__).parent.parent.parent.absolute() + aten_path = str(torch_root / "aten" / "src" / "ATen") + ir_gen_class: Type[LazyIR] = default_args.lazy_ir_cls + if options.gen_ts_lowerings: + ir_gen_class = TSLazyIR + + run_gen_lazy_tensor(aten_path, options.source_yaml, options.output_dir, options.dry_run, options.impl_path, + options.node_base, options.node_base_hdr, + options.tensor_class, options.tensor_class_hdr, options.shape_inference_hdr, + ir_gen_class, options.backend_name) + + +def run_gen_lazy_tensor(aten_path: str, source_yaml: str, output_dir: str, + dry_run: bool, impl_path: Optional[str], + node_base: str = default_args.node_base, + node_base_hdr: Optional[str] = default_args.node_base_hdr, + tensor_class: str = default_args.tensor_class, + tensor_class_hdr: str = default_args.tensor_class_hdr, + shape_inference_hdr: str = default_args.shape_inference_hdr, + lazy_ir_cls: Type[LazyIR] = default_args.lazy_ir_cls, + # build_in_tree is true for TS backend and affects include paths + build_in_tree: bool = False, + # per_operator_headers changes whether ATen/Functions.h or individual operator headers are used + # it must match how ATen was built + per_operator_headers: bool = False, + backend_name: str = default_args.backend_name, + gen_forced_fallback_code: bool = False) -> None: + + template_dir = os.path.join(aten_path, "templates") def make_file_manager(install_dir: str) -> FileManager: return FileManager(install_dir=install_dir, template_dir=template_dir, dry_run=dry_run) fm = make_file_manager(output_dir) - native_yaml_path = os.path.join(pytorch_root, 'aten/src/ATen/native/native_functions.yaml') + native_yaml_path = os.path.join(aten_path, 'native/native_functions.yaml') parsed_yaml = parse_native_yaml(native_yaml_path) native_functions, backend_indices = parsed_yaml.native_functions, parsed_yaml.backend_indices grouped_native_functions = get_grouped_native_functions(native_functions) @@ -171,7 +259,7 @@ def gen_key(func: FunctionSchema) -> Tuple[str, str]: if impl_path is not None: error_on_missing_kernels(native_functions, backend_indices, backend_key, - autograd_key, impl_path, full_codegen) + autograd_key, class_name, impl_path, full_codegen) """ Validate Shape Inference Definitions @@ -196,19 +284,29 @@ def gen_key(func: FunctionSchema) -> Tuple[str, str]: codegenInplaceVariant=True ) ) + validate_shape_inference_header(shape_inference_hdr, expected_shape_infr_decls) assert class_name is not None # Generate nativefunction declarations + # Note, eager registrations is set to False for the lazy TS backend as another LTC backend + # may want to register their own lazy kernels instead of registering the TS ones. + # The registration will lazily happen when init_ts_backend is called. 
gen_dispatchkey_nativefunc_headers(fm, class_name, cpp_namespace, backend_indices, - grouped_native_functions, backend_key, autograd_key) + grouped_native_functions, backend_key, autograd_key, + backend_name) # Generate Dispatcher registrations which hook up the nativefunctions for dispatch_key in [backend_key] if autograd_key is None else [backend_key, autograd_key]: - gen_dispatcher_registrations(fm, output_dir, cpp_namespace, backend_indices, grouped_native_functions, - backend_key, dispatch_key, selector) + gen_dispatcher_registrations(fm, output_dir, class_name, cpp_namespace, backend_indices, grouped_native_functions, + backend_key, dispatch_key, selector, + build_in_tree=build_in_tree, + per_operator_headers=per_operator_headers, + backend_name=backend_name, + eager_registration=False) # Generate native function impls that build IR nodes + ns_helper = NamespaceHelper(cpp_namespace) fm.write_with_template(f'{backend_key}NativeFunctions.cpp', 'DispatchKeyNativeFunctions.cpp', lambda: { 'includes': [f'#include <{path}>' for path in [ tensor_class_hdr, @@ -216,46 +314,46 @@ def gen_key(func: FunctionSchema) -> Tuple[str, str]: "ATen/Functions.h", "ATen/MetaFunctions.h", "ATen/Operators.h", + "ATen/native/CPUFallback.h", "torch/csrc/lazy/core/lazy_graph_executor.h", "torch/csrc/lazy/core/metrics.h", "torch/csrc/lazy/core/shape.h", - "lazy_tensor_core/csrc/ts_backend/aten_eager_fallback.h", f"{output_dir}/{backend_key}NativeFunctions.h", - f"{output_dir}/{backend_key}LazyIr.h", - ]], + f"{output_dir}/LazyIr.h", + ] + (["torch/csrc/lazy/ts_backend/ts_eager_fallback.h"] if gen_forced_fallback_code else [])], 'native_functions_include': '', - 'backend_namespace': 'torch_lazy_tensors', # this is wrong + 'namespace_prologue': ns_helper.prologue, + 'namespace_epilogue': ns_helper.epilogue, 'native_function_definitions': list(concat_map_codegen( dest.GenLazyNativeFuncDefinition(f'{backend_key}NativeFunctions', backend_indices[backend_key], - tensor_class), + tensor_class, + gen_forced_fallback_code), grouped_native_functions, codegenInplaceVariant=True )), }) - # Generate IR node classes - fm.write_with_template(f'{backend_key}LazyIr.h', 'LazyIr.h', lambda: { + fm.write_with_template('LazyIr.h', 'LazyIr.h', lambda: { 'lazy_ir_sysinc': [f'#include <{path}>' for path in [ "ATen/core/Formatting.h", "c10/core/ScalarType.h", "c10/util/Optional.h", "torch/csrc/lazy/core/hash.h", "torch/csrc/lazy/core/ir.h", + "torch/csrc/lazy/core/shape.h", "vector", ]], 'lazy_ir_inc': [f'#include "{path}"' for path in [ node_base_hdr if node_base_hdr is not None else None ] if path is not None], - 'external_backend_headers': f'#include "{output_dir}/{backend_key}NativeFunctions.h"', - 'namespaced_headers': '', - 'DispatchKey': backend_key, - 'dispatch_namespace': backend_key.lower(), 'ir_declarations': list(concat_map_codegen( lazy_ir_cls(backend_indices[backend_key], node_base), grouped_native_functions )), + 'namespace_prologue': ns_helper.prologue, + 'namespace_epilogue': ns_helper.epilogue, }) diff --git a/tools/codegen/model.py b/tools/codegen/model.py index 1c61517a3e52b2..8536588848c4f1 100644 --- a/tools/codegen/model.py +++ b/tools/codegen/model.py @@ -48,58 +48,68 @@ class DispatchKey(Enum): Undefined = 0 CatchAll = Undefined - CPU = auto() - CUDA = auto() - HIP = auto() + Dense = auto() FPGA = auto() ORT = auto() - XLA = auto() - Lazy = auto() Vulkan = auto() Metal = auto() - XPU = auto() MKLDNN = auto() OpenGL = auto() OpenCL = auto() IDEEP = auto() - QuantizedCPU = auto() - QuantizedCUDA = auto() - 
QuantizedXPU = auto() + Quantized = auto() CustomRNGKeyId = auto() MkldnnCPU = auto() - SparseCPU = auto() - SparseCUDA = auto() + Sparse = auto() SparseCsrCPU = auto() SparseCsrCUDA = auto() - SparseHIP = auto() - SparseXPU = auto() - NestedTensor = auto() - PrivateUse1 = auto() - PrivateUse2 = auto() - PrivateUse3 = auto() - EndOfBackendKeys = PrivateUse3 ZeroTensor = auto() Meta = auto() BackendSelect = auto() Named = auto() AutogradOther = auto() + AutogradFunctionality = auto() + AutogradNestedTensor = auto() + Tracer = auto() + Autocast = auto() + Batched = auto() + VmapMode = auto() + TESTING_ONLY_GenericWrapper = auto() + TESTING_ONLY_GenericMode = auto() + EndOfFunctionalityKeys = TESTING_ONLY_GenericMode + + CPU = auto() + CUDA = auto() + HIP = auto() + XLA = auto() + Lazy = auto() + IPU = auto() + XPU = auto() + NestedTensor = auto() + PrivateUse1 = auto() + PrivateUse2 = auto() + PrivateUse3 = auto() + + QuantizedCPU = auto() + QuantizedCUDA = auto() + QuantizedXPU = auto() + + SparseCPU = auto() + SparseCUDA = auto() + SparseHIP = auto() + SparseXPU = auto() + AutogradCPU = auto() AutogradCUDA = auto() AutogradXLA = auto() AutogradLazy = auto() - AutogradNestedTensor = auto() + AutogradIPU = auto() AutogradXPU = auto() AutogradPrivateUse1 = auto() AutogradPrivateUse2 = auto() AutogradPrivateUse3 = auto() - Tracer = auto() - Autocast = auto() - Batched = auto() - VmapMode = auto() - TESTING_ONLY_GenericWrapper = auto() - TESTING_ONLY_GenericMode = auto() - NumDispatchKeys = auto() + Autograd = auto() CompositeImplicitAutograd = auto() CompositeExplicitAutograd = auto() @@ -454,6 +464,7 @@ def from_yaml( python_module = e.pop('python_module', None) assert python_module is None or isinstance(python_module, str), f'not a str: {python_module}' + assert python_module is None or Variant.method not in variants, 'functions in modules cannot be methods' category_override = e.pop('category_override', None) assert category_override is None or isinstance(category_override, str), f'not a str: {category_override}' @@ -1181,6 +1192,7 @@ def is_list_like(self) -> Optional['ListType']: 'QScheme', 'Storage', 'Stream', + 'SymInt', 'ConstQuantizerPtr', # TODO: rename )) diff --git a/tools/codegen/operator_versions/gen_mobile_upgraders.py b/tools/codegen/operator_versions/gen_mobile_upgraders.py index 5721d7086f81a7..fbfb1b39c1d2cb 100644 --- a/tools/codegen/operator_versions/gen_mobile_upgraders.py +++ b/tools/codegen/operator_versions/gen_mobile_upgraders.py @@ -119,8 +119,7 @@ class ByteCode(Enum): upgrader_function.function.append_operator( op.name, op.overload_name, - op.num_specified_args, - caffe2::serialize::kMaxSupportedFileFormatVersion); + op.num_specified_args); } } return upgrader_function_list; diff --git a/tools/codegen/operator_versions/gen_mobile_upgraders_constant.py b/tools/codegen/operator_versions/gen_mobile_upgraders_constant.py index 2adf6e793eebef..f83e5d1f4c943b 100644 --- a/tools/codegen/operator_versions/gen_mobile_upgraders_constant.py +++ b/tools/codegen/operator_versions/gen_mobile_upgraders_constant.py @@ -2,6 +2,6 @@ * @generated * This is an auto-generated file. Please do not modify it by hand. 
* To re-generate, please run: - * cd ~/pytorch && python torch/csrc/jit/mobile/upgrader_mobile.cpp + * cd ~/pytorch && python tools/codegen/operator_versions/gen_mobile_upgraders.py */ """ diff --git a/tools/extract_scripts.py b/tools/extract_scripts.py index fd90b1b9f0e5eb..5312ed00da111a 100755 --- a/tools/extract_scripts.py +++ b/tools/extract_scripts.py @@ -63,6 +63,8 @@ def main() -> None: for job_name, job in workflow['jobs'].items(): job_dir = out / p / job_name + if "steps" not in job: + continue steps = job['steps'] index_chars = len(str(len(steps) - 1)) for i, step in enumerate(steps, start=1): diff --git a/tools/git-pre-commit b/tools/git-pre-commit index 1c4340c6b43486..a7b9e4562cd6f8 100755 --- a/tools/git-pre-commit +++ b/tools/git-pre-commit @@ -1,9 +1,6 @@ #!/bin/bash set -e -echo "Running pre-commit flake8" -python3 tools/linter/flake8_hook.py - echo "Running pre-commit clang-tidy" git diff HEAD > pr.diff python3 -m tools.linter.clang_tidy --diff-file "pr.diff" diff --git a/tools/jit/gen_unboxing.py b/tools/jit/gen_unboxing.py index 9171c56a2f5584..976cf3e676ab71 100644 --- a/tools/jit/gen_unboxing.py +++ b/tools/jit/gen_unboxing.py @@ -10,6 +10,7 @@ from tools.codegen.context import method_with_native_function from tools.codegen.gen import parse_native_yaml, cpp_string from tools.codegen.model import NativeFunction, NativeFunctionsGroup, Variant +from tools.codegen.selective_build.selector import SelectiveBuilder from tools.codegen.utils import Target, FileManager, mapMaybe, make_file_manager from typing import Union, Sequence from typing_extensions import Literal @@ -19,9 +20,12 @@ @dataclass(frozen=True) class ComputeUnboxingFunctions: target: Union[Literal[Target.DECLARATION], Literal[Target.DEFINITION]] + selector: SelectiveBuilder @method_with_native_function def __call__(self, f: NativeFunction) -> str: + if not self.selector.is_root_operator(f"aten::{f.func.name}"): + return "" if self.target is Target.DECLARATION: # Note [The ATen Codegen Unboxing API] @@ -78,11 +82,15 @@ def __call__(self, f: NativeFunction) -> str: # Generates RegisterCodegenUnboxedKernels.cpp. 
@dataclass(frozen=True) class ComputeCodegenUnboxedKernels: + selector: SelectiveBuilder + @method_with_native_function def __call__(self, f: NativeFunction) -> str: + if not self.selector.is_root_operator(f"aten::{f.func.name}"): + return "" # We unconditionally generate function wrappers, sig_group = CppSignatureGroup.from_native_function( - f, method=(Variant.method in f.variants) + f, method=False ) sig = sig_group.most_faithful_signature() @@ -90,9 +98,34 @@ def __call__(self, f: NativeFunction) -> str: # escape double quote in schema, get rid of extra double quotes schema = cpp_string(str(sig.func))[1:-1] + # arguments + args = sig.arguments() + connector = ",\n\t\t" + args_code = [] + for arg in args: + if not arg.default: + arg_cpp = "c10::IValue(c10::nullopt)" + elif arg.default.startswith('{'): + arg_cpp = f"c10::IntArrayRef({arg.default})" + else: + arg_cpp = f"c10::IValue({arg.default})" + args_code.append(f"""c10::Argument("{arg.name}", nullptr, c10::nullopt, {arg_cpp})""") + + returns = f.func.returns + returns_code = [] + for ret in returns: + returns_code.append(f"""c10::Argument("{ret.name if ret.name else ""}")""") return f""" +// aten::{schema} OperatorGenerator( - TORCH_SELECTIVE_SCHEMA("aten::{schema}"), + "aten::{f.func.name.name}", + "{f.func.name.overload_name}", + {{ + {connector.join(args_code)} + }}, + {{ + {connector.join(returns_code)} + }}, [](Stack & stack) {{ RECORD_FUNCTION("{sig.name()}", std::vector()); at::unboxing::{unboxing.name(f)}(stack); @@ -106,6 +139,7 @@ def gen_unboxing( *, native_functions: Sequence[NativeFunction], cpu_fm: FileManager, + selector: SelectiveBuilder, ) -> None: def key_func(fn: Union[NativeFunction, NativeFunctionsGroup]) -> str: return fn.root_name @@ -115,7 +149,7 @@ def key_func(fn: Union[NativeFunction, NativeFunctionsGroup]) -> str: native_functions, key_fn=key_func, env_callable=lambda fn: { - "definitions": [ComputeUnboxingFunctions(Target.DEFINITION)(fn)] + "definitions": [ComputeUnboxingFunctions(Target.DEFINITION, selector)(fn)] }, num_shards=5, sharded_keys={"definitions"}, @@ -124,7 +158,7 @@ def key_func(fn: Union[NativeFunction, NativeFunctionsGroup]) -> str: "UnboxingFunctions.h", lambda: { "declarations": list( - mapMaybe(ComputeUnboxingFunctions(Target.DECLARATION), native_functions) + mapMaybe(ComputeUnboxingFunctions(Target.DECLARATION, selector), native_functions) ), }, ) @@ -132,8 +166,8 @@ def key_func(fn: Union[NativeFunction, NativeFunctionsGroup]) -> str: "RegisterCodegenUnboxedKernels.cpp", native_functions, key_fn=key_func, - env_callable=lambda fn: {"unboxed_ops": [ComputeCodegenUnboxedKernels()(fn)]}, - num_shards=5, + env_callable=lambda fn: {"unboxed_ops": [ComputeCodegenUnboxedKernels(selector)(fn)]}, + num_shards=10, sharded_keys={"unboxed_ops"}, ) @@ -156,9 +190,21 @@ def main() -> None: parser.add_argument( '--dry-run', action='store_true', help='run without writing any files (still updates outputs)') + parser.add_argument( + '--op_selection_yaml_path', + help='Provide a path to the operator selection (for custom build) YAML ' + 'that contains the information about the set of selected operators ' + 'and their categories (training, ...). Each operator is either a ' + 'full operator name with overload or just a bare operator name. ' + 'The operator names also contain the namespace prefix (e.g. 
aten::)') options = parser.parse_args() + if options.op_selection_yaml_path is not None: + selector = SelectiveBuilder.from_yaml_path(options.op_selection_yaml_path) + else: + selector = SelectiveBuilder.get_nop_selector() + native_yaml_path = os.path.join(options.source_path, "native/native_functions.yaml") parsed_yaml = parse_native_yaml(native_yaml_path) native_functions, backend_indices = ( @@ -167,7 +213,7 @@ def main() -> None: ) cpu_fm = make_file_manager(options=options) - gen_unboxing(native_functions=native_functions, cpu_fm=cpu_fm) + gen_unboxing(native_functions=native_functions, cpu_fm=cpu_fm, selector=selector) if options.output_dependencies: depfile_path = pathlib.Path(options.output_dependencies).resolve() diff --git a/tools/linter/clang_format_all.py b/tools/linter/clang_format_all.py index 7792f15a77d126..2a5f9370e922f6 100755 --- a/tools/linter/clang_format_all.py +++ b/tools/linter/clang_format_all.py @@ -21,13 +21,21 @@ # If you edit this, please edit the allowlist in clang_format_ci.sh as well. CLANG_FORMAT_ALLOWLIST = [ "c10/", + "ios/", "torch/csrc/jit/", + "torch/csrc/deploy/", "test/cpp/jit/", "test/cpp/tensorexpr/" ] +CLANG_FORMAT_BLOCK_LIST = { + "torch/csrc/jit/serialization/mobile_bytecode_generated.h", +} + + # Only files with names matching this regex will be formatted. -CPP_FILE_REGEX = re.compile(".*\\.(h|cpp|cc|c|hpp)$") +CPP_FILE_REGEX = re.compile(".*\\.(h|cpp|cc|c|hpp|m|mm)$") + def get_allowlisted_files() -> Set[str]: @@ -39,6 +47,9 @@ def get_allowlisted_files() -> Set[str]: for dir in CLANG_FORMAT_ALLOWLIST: for root, dirnames, filenames in os.walk(dir): for filename in filenames: + fullpath = os.path.join(root, filename) + if fullpath in CLANG_FORMAT_BLOCK_LIST: + continue if CPP_FILE_REGEX.match(filename): matches.append(os.path.join(root, filename)) return set(matches) diff --git a/tools/linter/clang_format_ci.sh b/tools/linter/clang_format_ci.sh index 6f5220e516d19f..15c8d235fe91c8 100755 --- a/tools/linter/clang_format_ci.sh +++ b/tools/linter/clang_format_ci.sh @@ -7,7 +7,9 @@ set -eux # If you edit this allowlist, please edit the one in clang_format_all.py as well find . 
-type f \ -path './c10/*' -or \ - -path './torch/csrc/jit/*' -or \ + -path './ios/*' -or \ + -path './torch/csrc/jit/!(serialization/mobile_bytecode_generated.h)' -or \ + -path './torch/csrc/deploy/*' -or \ -path './test/cpp/jit/*' -or \ -path './test/cpp/tensorexpr/*' \ | xargs tools/linter/git-clang-format --verbose "$1" -- diff --git a/tools/linter/clang_tidy/__main__.py b/tools/linter/clang_tidy/__main__.py index fa6403a64bb664..18f2da24337fc6 100644 --- a/tools/linter/clang_tidy/__main__.py +++ b/tools/linter/clang_tidy/__main__.py @@ -5,6 +5,7 @@ import subprocess import re import sys +from sysconfig import get_paths as gp from typing import List @@ -13,6 +14,9 @@ from tools.linter.install.clang_tidy import INSTALLATION_PATH from tools.linter.install.download_bin import PYTORCH_ROOT +# Returns '/usr/local/include/python' +def get_python_include_dir() -> str: + return gp()['include'] def clang_search_dirs() -> List[str]: # Compilers are ordered based on fallback preference @@ -76,6 +80,9 @@ def clang_search_dirs() -> List[str]: "-torch/csrc/jit/serialization/export.cpp", "-torch/csrc/jit/serialization/import.cpp", "-torch/csrc/jit/serialization/import_legacy.cpp", + "-torch/csrc/jit/serialization/mobile_bytecode_generated.cpp", + "-torch/csrc/init_flatbuffer_module.cpp", + "-torch/csrc/stub_with_flatbuffer.c", "-torch/csrc/onnx/init.cpp", "-torch/csrc/cuda/nccl.*", "-torch/csrc/cuda/python_nccl.cpp", @@ -90,7 +97,11 @@ def clang_search_dirs() -> List[str]: "-torch/csrc/deploy/test_deploy_python_ext.cpp", ], "paths": ["torch/csrc/"], - "include-dir": ["/usr/lib/llvm-11/include/openmp"] + clang_search_dirs(), + "include-dir": [ + "/usr/lib/llvm-11/include/openmp", + get_python_include_dir(), + os.path.join(PYTORCH_ROOT, "third_party/pybind11/include") + ] + clang_search_dirs(), "clang-tidy-exe": INSTALLATION_PATH, "compile-commands-dir": "build", "config-file": ".clang-tidy", diff --git a/tools/linter/clang_tidy/generate_build_files.py b/tools/linter/clang_tidy/generate_build_files.py index 9e3db664ab0d9d..95ff98c30011b2 100644 --- a/tools/linter/clang_tidy/generate_build_files.py +++ b/tools/linter/clang_tidy/generate_build_files.py @@ -51,8 +51,7 @@ def run_autogen() -> None: "tools/setup_helpers/generate_code.py", "--native-functions-path", "aten/src/ATen/native/native_functions.yaml", - "--nn-path", - "aten/src", + "--gen_lazy_ts_backend", ] ) diff --git a/tools/linter/flake8_hook.py b/tools/linter/flake8_hook.py deleted file mode 100755 index b9ebd5b4793123..00000000000000 --- a/tools/linter/flake8_hook.py +++ /dev/null @@ -1,13 +0,0 @@ -#!/usr/bin/env python3 - -import sys - -from flake8.main import git # type: ignore[import] - -if __name__ == '__main__': - sys.exit( - git.hook( - strict=True, - lazy=git.config_for('lazy'), - ) - ) diff --git a/tools/onnx/update_default_opset_version.py b/tools/onnx/update_default_opset_version.py new file mode 100755 index 00000000000000..358bbfdfe39ce1 --- /dev/null +++ b/tools/onnx/update_default_opset_version.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 + +"""Updates the default value of opset_version. + +The current policy is that the default should be set to the +latest released version as of 18 months ago. + +Usage: +Run with no arguments. 
+""" + +import datetime +import os +import pathlib +import re +import sys +import subprocess +from subprocess import DEVNULL + + +pytorch_dir = pathlib.Path(__file__).parent.parent.parent.resolve() +onnx_dir = pytorch_dir / "third_party" / "onnx" +os.chdir(onnx_dir) + +date = datetime.datetime.now() - datetime.timedelta(days=18 * 30) +onnx_commit = subprocess.check_output(("git", "log", f"--until={date}", "--max-count=1", "--format=%H"), + encoding="utf-8").strip() +onnx_tags = subprocess.check_output(("git", "tag", "--list", f"--contains={onnx_commit}"), encoding="utf-8") +tag_tups = [] +semver_pat = re.compile(r"v(\d+)\.(\d+)\.(\d+)") +for tag in onnx_tags.splitlines(): + match = semver_pat.match(tag) + if match: + tag_tups.append(tuple(int(x) for x in match.groups())) + +version_str = "{}.{}.{}".format(*min(tag_tups)) + +print("Using ONNX release", version_str) + +head_commit = subprocess.check_output(("git", "log", "--max-count=1", "--format=%H", "HEAD"), + encoding="utf-8").strip() + +new_default = None + +subprocess.check_call(("git", "checkout", f"v{version_str}"), stdout=DEVNULL, stderr=DEVNULL) +try: + from onnx import helper # type: ignore[import] + for version in helper.VERSION_TABLE: + if version[0] == version_str: + new_default = version[2] + print("found new default opset_version", new_default) + break + if not new_default: + sys.exit(f"failed to find version {version_str} in onnx.helper.VERSION_TABLE at commit {onnx_commit}") +finally: + subprocess.check_call(("git", "checkout", head_commit), stdout=DEVNULL, stderr=DEVNULL) + +os.chdir(pytorch_dir) + + +def read_sub_write(path: str, prefix_pat: str) -> None: + with open(path, encoding="utf-8") as f: + content_str = f.read() + content_str = re.sub(prefix_pat, r"\g<1>{}".format(new_default), content_str) + with open(path, "w", encoding="utf-8") as f: + f.write(content_str) + print("modified", path) + +read_sub_write(os.path.join("torch", "onnx", "symbolic_helper.py"), + r"(_default_onnx_opset_version = )\d+") +read_sub_write(os.path.join("torch", "onnx", "__init__.py"), + r"(opset_version \(int, default )\d+") + +print("Updating operator .expect files") +subprocess.check_call(("python", "setup.py", "develop"), + stdout=DEVNULL, stderr=DEVNULL) +subprocess.check_call(("python", os.path.join("test", "onnx", "test_operators.py"), "--accept"), + stdout=DEVNULL, stderr=DEVNULL) diff --git a/tools/pyi/gen_pyi.py b/tools/pyi/gen_pyi.py index 73cc5fb2cbdeb4..6325bedffaaf67 100644 --- a/tools/pyi/gen_pyi.py +++ b/tools/pyi/gen_pyi.py @@ -4,7 +4,8 @@ from tools.codegen.model import Variant from tools.codegen.api.python import (PythonSignatureGroup, - PythonSignatureNativeFunctionPair) + PythonSignatureNativeFunctionPair, + returns_named_tuple_pyi) from tools.codegen.gen import parse_native_yaml from tools.codegen.utils import FileManager from typing import Sequence, List, Dict @@ -77,6 +78,7 @@ def should_bind_method(python_func: PythonSignatureNativeFunctionPair) -> bool: 'range', # defined in functional 'einsum', + 'histogramdd', # reduction argument; these bindings don't make sense 'binary_cross_entropy_with_logits', 'ctc_loss', @@ -397,7 +399,7 @@ def gen_pyi(native_yaml_path: str, deprecated_yaml_path: str, fm: FileManager) - name = group.signature.name unsorted_function_hints[name] += generate_type_hints(group) - named_tuple = group.signature.returns.named_tuple_pyi() + named_tuple = returns_named_tuple_pyi(group.signature) if named_tuple is not None and not group.signature.deprecated: # deprecated namedtuples are currently not 
included for torch functions tuple_name, tuple_def = named_tuple @@ -468,6 +470,7 @@ def gen_pyi(native_yaml_path: str, deprecated_yaml_path: str, fm: FileManager) - '_is_view': ['def _is_view(self) -> _bool: ...'], 'is_cuda': ['is_cuda: _bool'], 'is_leaf': ['is_leaf: _bool'], + 'is_nested': ['is_nested: _bool'], 'is_sparse': ['is_sparse: _bool'], 'is_sparse_csr' : ['is_sparse_csr: _bool'], 'is_quantized': ['is_quantized: _bool'], @@ -475,6 +478,7 @@ def gen_pyi(native_yaml_path: str, deprecated_yaml_path: str, fm: FileManager) - 'is_ort': ['is_ort: _bool'], 'is_mkldnn': ['is_mkldnn: _bool'], 'is_vulkan': ['is_vulkan: _bool'], + 'is_ipu': ['is_ipu: _bool'], 'storage_offset': ['def storage_offset(self) -> _int: ...'], 'to': ['def to(self, dtype: _dtype, non_blocking: _bool=False, copy: _bool=False) -> Tensor: ...', 'def to(self, device: Optional[Union[_device, str]]=None, dtype: Optional[_dtype]=None, ' @@ -524,7 +528,7 @@ def gen_pyi(native_yaml_path: str, deprecated_yaml_path: str, fm: FileManager) - name = group.signature.name unsorted_tensor_method_hints[name] += generate_type_hints(group) - named_tuple = group.signature.returns.named_tuple_pyi() + named_tuple = returns_named_tuple_pyi(group.signature) if named_tuple is not None and not group.signature.deprecated: # deprecated namedtuples are currently not included for torch functions tuple_name, tuple_def = named_tuple @@ -615,6 +619,10 @@ def gen_pyi(native_yaml_path: str, deprecated_yaml_path: str, fm: FileManager) - 'generated_comment': '@' + 'generated from torch/_C/_VariableFunctions.pyi.in', **env, }) + fm.write_with_template('torch/return_types.pyi', 'torch/_C/return_types.pyi.in', lambda: { + 'generated_comment': '@' + 'generated from torch/_C/return_types.pyi', + **env, + }) gen_nn_functional(fm) diff --git a/tools/setup_helpers/BUILD.bazel b/tools/setup_helpers/BUILD.bazel new file mode 100644 index 00000000000000..f7239029a0911b --- /dev/null +++ b/tools/setup_helpers/BUILD.bazel @@ -0,0 +1,16 @@ +py_binary( + name = "generate_code", + srcs = ["generate_code.py"], + deps = [ + "//:tools_jit", + "//tools/autograd", + "//tools/codegen", + ], + visibility = ["//:__pkg__"], +) + +py_binary( + name = "gen_version_header", + srcs = ["gen_version_header.py"], + visibility = ["//:__pkg__"], +) diff --git a/tools/setup_helpers/generate_code.py b/tools/setup_helpers/generate_code.py index ef90acc3935a15..9d176e45c91065 100644 --- a/tools/setup_helpers/generate_code.py +++ b/tools/setup_helpers/generate_code.py @@ -27,7 +27,6 @@ def all_generator_source() -> List[str]: def generate_code(ninja_global: Optional[str] = None, - nn_path: Optional[str] = None, native_functions_path: Optional[str] = None, install_dir: Optional[str] = None, subset: Optional[str] = None, @@ -135,7 +134,6 @@ def get_selector( def main() -> None: parser = argparse.ArgumentParser(description='Autogenerate code') parser.add_argument('--native-functions-path') - parser.add_argument('--nn-path') parser.add_argument('--ninja-global') parser.add_argument('--install_dir') parser.add_argument( @@ -162,11 +160,20 @@ def main() -> None: help='force it to generate schema-only registrations for ops that are not' 'listed on --selected-op-list' ) + parser.add_argument( + '--gen_lazy_ts_backend', + action='store_true', + help='Enable generation of the torch::lazy TorchScript backend' + ) + parser.add_argument( + '--per_operator_headers', + action='store_true', + help='Build lazy tensor ts backend with per-operator ATen headers, must match how ATen was built' + ) options = 
parser.parse_args() generate_code( options.ninja_global, - options.nn_path, options.native_functions_path, options.install_dir, options.subset, @@ -176,6 +183,34 @@ def main() -> None: operator_selector=get_selector(options.selected_op_list_path, options.operators_yaml_path), ) + if options.gen_lazy_ts_backend: + aten_path = os.path.dirname(os.path.dirname(options.native_functions_path)) + ts_backend_yaml = os.path.join(aten_path, 'native/ts_native_functions.yaml') + ts_native_functions = "torch/csrc/lazy/ts_backend/ts_native_functions.cpp" + ts_node_base = "torch/csrc/lazy/ts_backend/ts_node.h" + if options.install_dir is None: + options.install_dir = "torch/csrc" + lazy_install_dir = os.path.join(options.install_dir, "lazy/generated") + if not os.path.exists(lazy_install_dir): + os.makedirs(lazy_install_dir) + + assert os.path.isfile(ts_backend_yaml), f"Unable to access ts_backend_yaml: {ts_backend_yaml}" + assert os.path.isfile(ts_native_functions), f"Unable to access {ts_native_functions}" + from tools.codegen.gen_lazy_tensor import run_gen_lazy_tensor + from tools.codegen.dest.lazy_ir import TSLazyIR + run_gen_lazy_tensor(aten_path=aten_path, + source_yaml=ts_backend_yaml, + backend_name="TorchScript", + output_dir=lazy_install_dir, + dry_run=False, + impl_path=ts_native_functions, + node_base="TsNode", + node_base_hdr=ts_node_base, + build_in_tree=True, + lazy_ir_cls=TSLazyIR, + per_operator_headers=options.per_operator_headers, + gen_forced_fallback_code=True) + if __name__ == "__main__": main() diff --git a/tools/stats/export_slow_tests.py b/tools/stats/export_slow_tests.py index b9d71cfb6cb7a2..6659438479c233 100644 --- a/tools/stats/export_slow_tests.py +++ b/tools/stats/export_slow_tests.py @@ -12,6 +12,7 @@ SLOW_TESTS_FILE = '.pytorch-slow-tests.json' SLOW_TEST_CASE_THRESHOLD_SEC = 60.0 RELATIVE_DIFFERENCE_THRESHOLD = 0.1 +IGNORED_JOBS = ["asan", "periodic"] def get_test_case_times() -> Dict[str, float]: reports: List[Report] = get_previous_reports_for_branch('origin/viable/strict', "") @@ -21,6 +22,10 @@ def get_test_case_times() -> Dict[str, float]: if report.get('format_version', 1) != 2: # type: ignore[misc] raise RuntimeError("S3 format currently handled is version 2 only") v2report = cast(Version2Report, report) + + if any(job_name in str(report['build_job']) for job_name in IGNORED_JOBS): + continue + for test_file in v2report['files'].values(): for suitename, test_suite in test_file['suites'].items(): for casename, test_case in test_suite['cases'].items(): diff --git a/tools/stats/import_test_stats.py b/tools/stats/import_test_stats.py index 375f7181b4583e..1b6c1907a98ab4 100644 --- a/tools/stats/import_test_stats.py +++ b/tools/stats/import_test_stats.py @@ -10,13 +10,14 @@ def get_disabled_issues() -> List[str]: pr_body = os.getenv('PR_BODY', '') + commit_messages = os.getenv('COMMIT_MESSAGES', '') # The below regex is meant to match all *case-insensitive* keywords that # GitHub has delineated would link PRs to issues, more details here: # https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue. # E.g., "Close #62851", "fixES #62851" and "RESOLVED #62851" would all match, but not # "closes #62851" --> extra space, "fixing #62851" --> not a keyword, nor "fix 62851" --> no # - regex = '(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) #([0-9]+)' - issue_numbers = [x[4] for x in re.findall(regex, pr_body)] + regex = '(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) 
(#|https://github.com/pytorch/pytorch/issues/)([0-9]+)' + issue_numbers = [x[5] for x in re.findall(regex, pr_body + commit_messages)] print("Ignoring disabled issues: ", issue_numbers) return issue_numbers diff --git a/tools/stats/print_test_stats.py b/tools/stats/print_test_stats.py index b1887c8d277c31..836ee5f81cc51a 100755 --- a/tools/stats/print_test_stats.py +++ b/tools/stats/print_test_stats.py @@ -107,8 +107,14 @@ def plural(n: int) -> str: def get_base_commit(sha1: str) -> str: + default_branch = os.environ.get('GIT_DEFAULT_BRANCH') + # capture None and "" cases + if not default_branch: + default_branch = "master" + + default_remote = f"origin/{default_branch}" return subprocess.check_output( - ["git", "merge-base", sha1, "origin/master"], + ["git", "merge-base", sha1, default_remote], encoding="ascii", ).strip() @@ -206,7 +212,7 @@ def analyze( base_reports: Dict[Commit, List[SimplerReport]], ) -> List[SuiteDiff]: nonempty_shas = [sha for sha, reports in base_reports.items() if reports] - # most recent master ancestor with at least one S3 report, + # most recent main ancestor with at least one S3 report, # or empty list if there are none (will show all tests as added) base_report = base_reports[nonempty_shas[0]] if nonempty_shas else [] @@ -525,7 +531,7 @@ def regression_info( and its test times. Since Python dicts maintain insertion order (guaranteed as part of the language spec since 3.7), the base_reports argument must list the head's several most recent - master commits, from newest to oldest (so the merge-base is + main commits, from newest to oldest (so the merge-base is list(base_reports)[0]). """ simpler_head = simplify(head_report) @@ -570,6 +576,10 @@ def __init__(self, dom: Any) -> None: self.class_name = str(dom.attributes['classname'].value) self.name = str(dom.attributes['name'].value) self.time = float(dom.attributes['time'].value) + # The following attribute is currently ONLY used in process_intentional_test_runs for validation + # reasons. The test filename that populates TestFile is calculated and passed down through the test report path. 
+ # The reason we don't just use this attribute is because it doesn't exist for cpp tests, e.g., in test_libtorch + self.file = str(dom.attributes['file'].value) if dom.hasAttribute('file') else 'N/A - probably a cpp test' error_elements = dom.getElementsByTagName('error') # DISCLAIMER: unexpected successes and expected failures are currently not reported in assemble_s3_object self.expected_failure = False @@ -595,9 +605,9 @@ def __repr__(self) -> str: return self.__str__() def __str__(self) -> str: - return f'[TestCase name: {self.name} | class_name: {self.class_name} | time: {self.time} | ' \ + return f'[TestCase name: {self.name} | class_name: {self.class_name} | file: {self.file} | time: {self.time} | ' \ f'expected_failure: {self.expected_failure} | skipped: {self.skipped} | errored: {self.errored} | ' \ - f'unexpected_success: {self.unexpected_success} | failed: {self.failed}]' + f'unexpected_success: {self.unexpected_success} | failed: {self.failed}]\n' class TestSuite: def __init__(self, name: str) -> None: @@ -638,6 +648,17 @@ def update(self, test_case: TestCase) -> None: self.test_cases[name].expected_failure |= test_case.expected_failure +# Tests that spawn duplicates (usually only twice) intentionally +MULTITESTS = [ + 'test_cpp_extensions_aot', + 'distributed/test_distributed_spawn', + 'distributed\\test_distributed_spawn', # for windows + 'distributed/test_c10d_gloo', + 'distributed\\test_c10d_gloo', # for windows + 'cpp' # The caffe2 cpp tests spawn duplicate test cases as well. +] + + DuplicatedDict = Dict[str, Dict[str, List[TestCase]]] class TestFile: @@ -647,27 +668,20 @@ def __init__(self, name: str) -> None: self.test_suites: Dict[str, TestSuite] = dict() def append(self, test_case: TestCase, test_type: str, duplicated_tests_dict: DuplicatedDict) -> None: - is_multi_test = self.name == 'test_cpp_extensions_aot' or \ - self.name == 'distributed/test_distributed_spawn' or \ - self.name == 'distributed/test_c10d_gloo' or \ - self.name == 'cpp' # The caffe2 cpp tests spawn duplicate test cases as well. 
- if is_multi_test: - suite_name = test_case.class_name + '__' + test_type - else: - suite_name = test_case.class_name + suite_name = test_case.class_name if suite_name not in self.test_suites: self.test_suites[suite_name] = TestSuite(suite_name) if test_case.name in self.test_suites[suite_name].test_cases: - if is_multi_test: + if self.name in MULTITESTS: self.test_suites[suite_name].update(test_case) self.total_time += test_case.time - else: - # Gather up duplicated test cases - if suite_name not in duplicated_tests_dict: - duplicated_tests_dict[suite_name] = dict() - if test_case.name not in duplicated_tests_dict[suite_name]: - duplicated_tests_dict[suite_name][test_case.name] = [self.test_suites[suite_name].test_cases[test_case.name]] - duplicated_tests_dict[suite_name][test_case.name].append(test_case) + + # Gather up duplicated test cases to parse for flaky reruns + if suite_name not in duplicated_tests_dict: + duplicated_tests_dict[suite_name] = dict() + if test_case.name not in duplicated_tests_dict[suite_name]: + duplicated_tests_dict[suite_name][test_case.name] = [self.test_suites[suite_name].test_cases[test_case.name]] + duplicated_tests_dict[suite_name][test_case.name].append(test_case) else: self.test_suites[suite_name].append(test_case) self.total_time += test_case.time @@ -737,17 +751,9 @@ def process_intentional_test_runs(runs: List[TestCase]) -> Tuple[int, int]: else: num_pass += 1 - REPEAT_TEST_FOR_TYPES_TESTS = [ - "test_data_parallel_module", - "test_data_parallel_module_kwargs_only", - "test_data_parallel_module_kwargs_only_empty_list", - "test_data_parallel_module_kwargs_only_empty_dict", - "test_data_parallel_module_kwargs_only_empty_tuple" - ] - - # Do not run checks for tests that use repeat_test_for_types decorator as they do not go well with our retry - # functionality. Once issue https://github.com/pytorch/pytorch/issues/69865 is fixed, we should remove the exception - if not any([x in test_run.name for x in REPEAT_TEST_FOR_TYPES_TESTS]): + # Do not run duplication checks for test files that spawn duplicate tests intentionally + # and are not necessarily flaky test reruns. + if not any(x in test_run.file for x in MULTITESTS): err_msg = f'Warning: unintentional test case duplicates found for {test_run.name} in suite {test_run.class_name}.' 
report_only = os.getenv('PYTORCH_OVERRIDE_FLAKY_SIGNAL') != '1' if report_only and num_fail + num_errored + num_unexpected_success < 1 or not report_only and num_expected_fail < 1: @@ -774,7 +780,7 @@ def assemble_flaky_test_stats(duplicated_tests_by_file: Dict[str, DuplicatedDict for suite_name, testcase_to_runs in suite_to_dict.items(): for testcase_name, list_of_runs in testcase_to_runs.items(): num_green, num_red = process_intentional_test_runs(list_of_runs) - if num_green > 0: # Otherwise, it's likely just a failing test + if num_green > 0 and num_red > 0: # Flaky tests show different results in consecutive reruns flaky_tests.append({ "name": testcase_name, "suite": suite_name, @@ -790,6 +796,7 @@ def assemble_flaky_test_stats(duplicated_tests_by_file: Dict[str, DuplicatedDict # write to S3 to go to Rockset as well import uuid for flaky_test in flaky_tests: + flaky_test["job_id"] = os.environ["GHA_WORKFLOW_JOB_ID"] flaky_test["workflow_id"] = workflow_id key = f"flaky_tests/{workflow_id}/{uuid.uuid4()}.json" obj = get_S3_object_from_bucket("ossci-raw-job-status", key) @@ -943,7 +950,7 @@ def print_regressions(head_report: Report, *, num_prev_commits: int) -> None: encoding="ascii", )) - # if current commit is already on master, we need to exclude it from + # if current commit is already on main, we need to exclude it from # this history; otherwise we include the merge-base commits = subprocess.check_output( ["git", "rev-list", f"--max-count={num_prev_commits+1}", base], diff --git a/tools/stats/upload_test_stats.py b/tools/stats/upload_test_stats.py new file mode 100644 index 00000000000000..899fc0495948c6 --- /dev/null +++ b/tools/stats/upload_test_stats.py @@ -0,0 +1,206 @@ +import argparse +import os +import requests +import shutil +import zipfile +import xml.etree.ElementTree as ET +from pathlib import Path +from typing import Dict, List, Any + +import rockset # type: ignore[import] +import boto3 # type: ignore[import] + +PYTORCH_REPO = "https://api.github.com/repos/pytorch/pytorch" +GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] +REQUEST_HEADERS = { + "Accept": "application/vnd.github.v3+json", + "Authorization": "token " + GITHUB_TOKEN, +} +S3_RESOURCE = boto3.resource("s3") +TEMP_DIR = Path(os.environ["RUNNER_TEMP"]) / "tmp-test-stats" + + +def parse_xml_report( + report: Path, workflow_id: int, workflow_run_attempt: int +) -> List[Dict[str, Any]]: + """Convert a test report xml file into a JSON-serializable list of test cases.""" + # [Job id in artifacts] + # Retrieve the job id from the report path. In our GHA workflows, we append + # the job id to the end of the report name, so `report` looks like: + # unzipped-test-reports-foo_5596745227/test/test-reports/foo/TEST-foo.xml + # and we want to get `5596745227` out of it. + job_id = int(report.parts[0].rpartition("_")[2]) + + print(f"Parsing test report: {report}, job id: {job_id}") + root = ET.parse(report) + + test_cases = [] + for test_case in root.findall("testcase"): + case = process_xml_element(test_case) + case["workflow_id"] = workflow_id + case["workflow_run_attempt"] = workflow_run_attempt + case["job_id"] = job_id + test_cases.append(case) + + return test_cases + + +def process_xml_element(element: ET.Element) -> Dict[str, Any]: + """Convert a test suite element into a JSON-serializable dict.""" + ret: Dict[str, Any] = {} + + # Convert attributes directly into dict elements. + # e.g. 
+ # + # becomes: + # {"name": "test_foo", "classname": "test_bar"} + ret.update(element.attrib) + + # By default, all attributes are strings. Apply a few special conversions + # here for well-known attributes so that they are the right type in Rockset. + line = ret.get("line") + if line: + ret["line"] = int(line) + time = ret.get("time") + if time: + ret["time"] = float(time) + + # Convert inner and outer text into special dict elements. + # e.g. + # my_inner_text my_tail + # becomes: + # {"text": "my_inner_text", "tail": " my_tail"} + if element.text and element.text.strip(): + ret["text"] = element.text + if element.tail and element.tail.strip(): + ret["tail"] = element.tail + + # Convert child elements recursively, placing them at a key: + # e.g. + # + # hello + # + # becomes + # {"foo": {"text": "hello"}} + for child in element: + ret[child.tag] = process_xml_element(child) + return ret + + +def get_artifact_urls(workflow_run_id: int) -> Dict[Path, str]: + """Get all workflow artifacts with 'test-report' in the name.""" + response = requests.get( + f"{PYTORCH_REPO}/actions/runs/{workflow_run_id}/artifacts?per_page=100", + ) + artifacts = response.json()["artifacts"] + while "next" in response.links.keys(): + response = requests.get(response.links["next"]["url"], headers=REQUEST_HEADERS) + artifacts.extend(response.json()["artifacts"]) + + artifact_urls = {} + for artifact in artifacts: + if "test-report" in artifact["name"]: + artifact_urls[Path(artifact["name"])] = artifact["archive_download_url"] + return artifact_urls + + +def unzip(p: Path) -> None: + """Unzip the provided zipfile to a similarly-named directory. + + Returns None if `p` is not a zipfile. + + Looks like: /tmp/test-reports.zip -> /tmp/unzipped-test-reports/ + """ + assert p.is_file() + unzipped_dir = p.with_name("unzipped-" + p.stem) + + with zipfile.ZipFile(p, "r") as zip: + zip.extractall(unzipped_dir) + + +def download_and_extract_artifact( + artifact_name: Path, artifact_url: str, workflow_run_attempt: int +) -> None: + # [Artifact run attempt] + # All artifacts on a workflow share a single namespace. However, we can + # re-run a workflow and produce a new set of artifacts. To avoid name + # collisions, we add `-runattempt1-` somewhere in the artifact name. + # + # This code parses out the run attempt number from the artifact name. If it + # doesn't match the one specified on the command line, skip it. 
+ atoms = str(artifact_name).split("-") + for atom in atoms: + if atom.startswith("runattempt"): + found_run_attempt = int(atom[len("runattempt") :]) + if workflow_run_attempt != found_run_attempt: + print(f"Skipping {artifact_name} as it is an invalid run attempt.") + + print(f"Downloading and extracting {artifact_name}") + + response = requests.get(artifact_url, headers=REQUEST_HEADERS) + with open(artifact_name, "wb") as f: + f.write(response.content) + unzip(artifact_name) + + +def download_and_extract_s3_reports( + workflow_run_id: int, workflow_run_attempt: int +) -> None: + bucket = S3_RESOURCE.Bucket("gha-artifacts") + objs = bucket.objects.filter( + Prefix=f"pytorch/pytorch/{workflow_run_id}/{workflow_run_attempt}/artifact/test-reports" + ) + + for obj in objs: + p = Path(Path(obj.key).name) + print(f"Downloading and extracting {p}") + with open(p, "wb") as f: + f.write(obj.get()["Body"].read()) + unzip(p) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Upload test stats to Rockset") + parser.add_argument( + "--workflow-run-id", + required=True, + help="id of the workflow to get artifacts from", + ) + parser.add_argument( + "--workflow-run-attempt", + required=True, + help="which retry of the workflow this is", + ) + args = parser.parse_args() + + if TEMP_DIR.exists(): + print("rm: ", TEMP_DIR) + shutil.rmtree(TEMP_DIR) + + print("mkdir: ", TEMP_DIR) + TEMP_DIR.mkdir() + print("cd to ", TEMP_DIR) + os.chdir(TEMP_DIR) + + # Download and extract all the reports (both GHA and S3) + download_and_extract_s3_reports(args.workflow_run_id, args.workflow_run_attempt) + artifact_urls = get_artifact_urls(args.workflow_run_id) + for name, url in artifact_urls.items(): + download_and_extract_artifact(Path(name), url, args.workflow_run_attempt) + + # Parse the reports and transform them to JSON + test_cases = [] + for xml_report in Path(".").glob("**/*.xml"): + test_cases.extend( + parse_xml_report( + xml_report, int(args.workflow_run_id), int(args.workflow_run_attempt) + ) + ) + + # Write the JSON to rockset + print(f"Writing {len(test_cases)} test cases to Rockset") + client = rockset.Client( + api_server="api.rs2.usw2.rockset.com", api_key=os.environ["ROCKSET_API_KEY"] + ) + client.Collection.retrieve("test_run").add_docs(test_cases) + print("Done!") diff --git a/tools/test/test_gen_backend_stubs.py b/tools/test/test_gen_backend_stubs.py index ee2ee8a0f0b9f9..9dae08c366068f 100644 --- a/tools/test/test_gen_backend_stubs.py +++ b/tools/test/test_gen_backend_stubs.py @@ -208,7 +208,7 @@ def test_unrecognized_key(self) -> None: - abs invalid_key: invalid_val''' output_error = self.get_errors_from_gen_backend_stubs(yaml_str) - self.assertExpectedInline(output_error, ''' contains unexpected keys: invalid_key. Only the following keys are supported: backend, cpp_namespace, extra_headers, supported, autograd, full_codegen''') # noqa: B950 + self.assertExpectedInline(output_error, ''' contains unexpected keys: invalid_key. 
Only the following keys are supported: backend, class_name, cpp_namespace, extra_headers, supported, autograd, full_codegen''') # noqa: B950 # if use_out_as_primary is provided, it must be a bool def test_use_out_as_primary_non_bool(self) -> None: diff --git a/tools/test/test_import_test_stats.py b/tools/test/test_import_test_stats.py new file mode 100644 index 00000000000000..5a43a7d45e8a97 --- /dev/null +++ b/tools/test/test_import_test_stats.py @@ -0,0 +1,51 @@ +import os +import unittest +from tools.stats.import_test_stats import get_disabled_issues +from typing import List +from unittest.mock import patch + +class TestGetDisabledIssues(unittest.TestCase): + + def run_assert_disabled_issues(self, pr_body: str, commit_messages: str, expected: List[str]) -> None: + with patch.dict(os.environ, {"PR_BODY": pr_body, "COMMIT_MESSAGES": commit_messages}): + disabled_issues = get_disabled_issues() + self.assertEqual(disabled_issues, expected) + + # test variations of close in PR_BODY + def test_closes_pr_body(self) -> None: + pr_body = 'closes #123 Close #143 ClOsE #345 closed #10283' + self.run_assert_disabled_issues(pr_body, '', ['123', '143', '345', '10283']) + + # test variations of fix in COMMIT_MESSAGES + def test_fixes_commit_messages(self) -> None: + commit_messages = 'fix #123 FixEd #143 fixes #345 FiXeD #10283' + self.run_assert_disabled_issues('', commit_messages, ['123', '143', '345', '10283']) + + # test variations of resolve in PR_BODY and COMMIT_MESSAGES + def test_resolves_pr_commits(self) -> None: + pr_body = 'resolve #123 resolveS #143' + commit_messages = 'REsolved #345 RESOLVES #10283' + self.run_assert_disabled_issues(pr_body, commit_messages, ['123', '143', '345', '10283']) + + # test links + def test_issue_links(self) -> None: + pr_body = 'closes https://github.com/pytorch/pytorch/issues/75198 fixes https://github.com/pytorch/pytorch/issues/75123' + self.run_assert_disabled_issues(pr_body, '', ['75198', '75123']) + + # test strange spacing + def test_spacing(self) -> None: + pr_body = 'resolve #123,resolveS #143Resolved #345\nRESOLVES #10283' + commit_messages = 'Fixed #2348fixes https://github.com/pytorch/pytorch/issues/75123resolveS #2134' + self.run_assert_disabled_issues(pr_body, commit_messages, ['123', '143', '345', '10283', '2348', '75123', '2134']) + + # test bad things + def test_not_accepted(self) -> None: + pr_body = 'fixes189 fixeshttps://github.com/pytorch/pytorch/issues/75123 ' \ + 'closedhttps://githubcom/pytorch/pytorch/issues/75123' + commit_messages = 'fix 234, fixes # 45, fixing #123, close 234, closes#45, closing #123 resolve 234, ' \ + 'resolves #45, resolving #123' + self.run_assert_disabled_issues(pr_body, commit_messages, []) + + +if __name__ == '__main__': + unittest.main() diff --git a/tools/testing/test_selections.py b/tools/testing/test_selections.py index c83b0619f03067..f09b87ac1a26bd 100644 --- a/tools/testing/test_selections.py +++ b/tools/testing/test_selections.py @@ -156,7 +156,8 @@ def _query_failure_test_module(reports: List[Tuple["Report", str]]) -> List[str] def _query_changed_test_files() -> List[str]: - cmd = ["git", "diff", "--name-only", "origin/master", "HEAD"] + default_branch = f"origin/{os.environ.get('GIT_DEFAULT_BRANCH', 'master')}" + cmd = ["git", "diff", "--name-only", default_branch, "HEAD"] proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) if proc.returncode != 0: diff --git a/torch/CMakeLists.txt b/torch/CMakeLists.txt index 00892ea09eae7d..4dddf7b33d71bf 100644 --- a/torch/CMakeLists.txt +++ 
b/torch/CMakeLists.txt @@ -44,6 +44,9 @@ set(TORCH_PYTHON_SRCS ) append_filelist("libtorch_python_core_sources" TORCH_PYTHON_SRCS) +list(APPEND TORCH_PYTHON_SRCS + ${TORCH_SRC_DIR}/csrc/init_flatbuffer_module.cpp) + # NB: This has to match the condition under which the JIT test directory # is included (at the time of writing that's in caffe2/CMakeLists.txt). if(BUILD_TEST) @@ -190,6 +193,7 @@ add_custom_target(torch_python_stubs DEPENDS "${TORCH_SRC_DIR}/_C/__init__.pyi" "${TORCH_SRC_DIR}/_C/_VariableFunctions.pyi" "${TORCH_SRC_DIR}/nn/functional.pyi" + "${TORCH_SRC_DIR}/utils/data/datapipes/datapipe.pyi" ) add_custom_command( OUTPUT @@ -210,6 +214,18 @@ add_custom_command( WORKING_DIRECTORY "${TORCH_ROOT}" ) +file(GLOB_RECURSE datapipe_files "${TORCH_SRC_DIR}/utils/data/datapipes/*.py") +add_custom_command( + OUTPUT + "${TORCH_SRC_DIR}/utils/data/datapipes/datapipe.pyi" + COMMAND + "${PYTHON_EXECUTABLE}" ${TORCH_SRC_DIR}/utils/data/datapipes/gen_pyi.py + DEPENDS + "${TORCH_SRC_DIR}/utils/data/datapipes/datapipe.pyi.in" + ${datapipe_files} + WORKING_DIRECTORY + "${TORCH_ROOT}" +) if(USE_DISTRIBUTED) if(WIN32) append_filelist("libtorch_python_distributed_core_sources" TORCH_PYTHON_SRCS) @@ -376,6 +392,9 @@ set_source_files_properties( # Disable certain warnings for GCC-9.X if(CMAKE_COMPILER_IS_GNUCXX AND (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 9.0.0)) set_source_files_properties(${TORCH_SRC_DIR}/csrc/Module.cpp PROPERTIES COMPILE_FLAGS "-Wno-cast-function-type") + set_source_files_properties( + ${TORCH_SRC_DIR}/csrc/init_flatbuffer_module.cpp + PROPERTIES COMPILE_FLAGS "-Wno-cast-function-type") set_source_files_properties(${TORCH_SRC_DIR}/csrc/autograd/python_variable.cpp PROPERTIES COMPILE_FLAGS "-Wno-cast-function-type") endif() diff --git a/torch/_C/_VariableFunctions.pyi.in b/torch/_C/_VariableFunctions.pyi.in index 1b3a760c8cbd49..75d566f131ab59 100644 --- a/torch/_C/_VariableFunctions.pyi.in +++ b/torch/_C/_VariableFunctions.pyi.in @@ -5,13 +5,11 @@ from typing import List, Tuple, Optional, Union, Any, ContextManager, Callable, from typing_extensions import Literal from torch._six import inf -from torch.types import _int, _float, _bool, Number, _dtype, _device, _qscheme, _size, _layout +from torch.types import _int, _float, _bool, Number, _dtype, _device, _qscheme, _size, _layout, SymInt +import torch import builtins -# REDUNDANT! -${namedtuple_defs} - ${function_hints} ${all_directive} diff --git a/torch/_C/__init__.pyi.in b/torch/_C/__init__.pyi.in index db093932f1c8e7..e252e9025778ed 100644 --- a/torch/_C/__init__.pyi.in +++ b/torch/_C/__init__.pyi.in @@ -12,7 +12,7 @@ from typing import ( from typing_extensions import Literal from torch._six import inf -from torch.types import _int, _float, _bool, _dtype, _device, _qscheme, _size, _layout, Device, Number, Storage +from torch.types import _int, _float, _bool, _dtype, _device, _qscheme, _size, _layout, Device, Number, Storage, SymInt from torch.storage import _TypedStorage import builtins @@ -22,6 +22,8 @@ import builtins from . import _nn as _nn from . import _onnx as _onnx from . import _VariableFunctions as _VariableFunctions +from . import _lazy as _lazy +from . import _lazy_ts_backend as _lazy_ts_backend T = TypeVar('T') @@ -214,6 +216,7 @@ def _jit_pass_propagate_shapes_on_graph(Graph) -> None: ... def _jit_erase_non_input_shape_information(Graph) -> None: ... def _jit_pass_common_expression_hoisting(Graph) -> None: ... def _jit_get_schemas_for_operator(name :str) -> List[FunctionSchema]: ... 
+def _jit_get_all_schemas() -> List[FunctionSchema]: ... def _jit_check_alias_annotation(g: Graph, args: Tuple[Any, ...], unqualified_op_name: str): ... def _jit_can_fuse_on_cpu() -> _bool: ... def _jit_can_fuse_on_gpu() -> _bool: ... @@ -233,7 +236,7 @@ def _jit_set_te_must_use_llvm_cpu(use_llvm: _bool): ... def _jit_set_nvfuser_enabled(enable: _bool) -> _bool: ... def _jit_cat_wo_conditionals(optimize_cat: _bool): ... def _jit_opt_conditionals(opt_conds: _bool): ... -def _jit_pass_canonicalize(graph: Graph): ... +def _jit_pass_canonicalize(graph: Graph, keep_unique_names: _bool = True): ... def _jit_pass_erase_shape_information(graph: Graph): ... def _jit_pass_fold_convbn(module: 'torch.jit.ScriptModule'): ... def _jit_pass_insert_observers(module: 'torch.jit.ScriptModule', @@ -260,7 +263,7 @@ ResolutionCallback = Callable[[str], Callable[..., Any]] # Defined in torch/csrc/jit/python/script_init.cpp # and torch/csrc/jit/python/init.cpp -def _create_function_from_graph(qualname: str, graph: Graph) -> Graph: ... +def _create_function_from_graph(qualname: str, graph: Graph) -> ScriptFunction: ... def _debug_set_autodiff_subgraph_inlining(disabled: _bool) -> None: ... def _ivalue_tags_match(lhs: ScriptModule, rhs: ScriptModule) -> _bool: ... def _jit_assert_is_instance(obj: Any, type: JitType): ... @@ -281,7 +284,7 @@ def _get_model_ops_and_info_from_buffer(buffer: BinaryIO): ... def _get_mobile_model_contained_types(filename: Union[str, Path]): ... def _get_mobile_model_contained_types_from_buffer(buffer: BinaryIO): ... def _logging_set_logger(logger: LoggerBase) -> LoggerBase: ... -def _get_graph_executor_optimize() -> _bool: ... +def _get_graph_executor_optimize(optimize: Optional[_bool] = None) -> _bool: ... def _set_graph_executor_optimize(optimize: _bool): ... def _export_opnames(module: ScriptModule) -> List[str]: ... def _create_function_from_trace( @@ -318,7 +321,7 @@ def _jit_pass_onnx_assign_output_shape(graph: Graph, tensors: List[Tensor], desc def _jit_pass_onnx_remove_inplace_ops_for_onnx(graph: Graph, module: Module) -> None: ... def _jit_pass_remove_inplace_ops(graph: Graph) -> None: ... def _jit_pass_canonicalize_graph_fuser_ops(graph: Graph) -> None: ... -def _jit_pass_peephole(graph: Graph, addmm_fusion_enabled: _bool) -> None: ... +def _jit_pass_peephole(graph: Graph, disable_shape_peepholes: _bool = False) -> None: ... def _jit_pass_fuse_addmm(graph: Graph) -> None: ... def _jit_pass_onnx_preprocess(graph: Graph) -> None: ... def _jit_pass_prepare_division_for_onnx(graph: Graph) -> None: ... @@ -345,6 +348,10 @@ def _jit_pass_onnx_function_substitution(graph: Graph) -> None: ... def _jit_pass_onnx_function_extraction(graph: Graph, module_names : Set[str], param_names : List[str]) -> Dict[Node, Dict[str, str]]: ... def _jit_pass_onnx_clear_scope_records() -> None: ... def _jit_pass_onnx_track_scope_attributes(graph: Graph, onnx_attrs: Dict[str, Any]) -> None: ... +def _jit_is_onnx_log_enabled() -> _bool: ... +def _jit_set_onnx_log_enabled(enabled: _bool) -> None: ... +def _jit_set_onnx_log_output_stream(stream_name: str) -> None: ... +def _jit_onnx_log(*args: Any) -> None: ... def _jit_pass_lower_graph(graph: Graph, m: Module) -> Tuple[Graph, List[IValue]]: ... def _jit_pass_inline_fork_wait(graph: Graph) -> None: ... def _jit_pass_onnx_deduplicate_initializers(graph: Graph, params_dict: Dict[str, IValue], is_train: _bool) -> Dict[str, IValue]: ... @@ -466,7 +473,8 @@ class Graph: def setInsertPoint(self, n: Union[Block, Node]) -> None: ... 
def insert_point_guard(self, n: Union[Block, Node]) -> _InsertPoint: ... def insertPoint(self) -> Node: ... - def insertGraph(sellf, callee: Graph, inputs: List[Value]) -> List[Value]: ... + def insertGraph(self, callee: Graph, inputs: List[Value]) -> List[Value]: ... + def makeMultiOutputIntoTuple(self) -> None: ... ... @@ -481,6 +489,8 @@ class Argument: class FunctionSchema: arguments: List[Argument] returns: List[Argument] + name: str + overload_name: str ... class _UpgraderEntry: @@ -817,9 +827,6 @@ class ThroughputBenchmark(object): def run_once(self, *args: Any, **kwargs: Any) -> Any: ... def benchmark(self, config: BenchmarkConfig) -> BenchmarkExecutionStats: ... -# IDK if these are actually exposed here, hope they are -${namedtuple_defs} - # Defined in torch/csrc/generic/Storage.cpp ${legacy_storage_base_hints} @@ -1134,6 +1141,9 @@ class TensorType(JitType): def getInferred(cls) -> TensorType: ... def with_sizes(self, other: Optional[List[Optional[_int]]]) -> TensorType: ... def sizes(self) -> Optional[List[_int]]: ... + def strides(self) -> Optional[List[_int]]: ... + def device(self) -> Optional[_device]: ... + def dtype(self) -> Optional[_dtype]: ... @staticmethod def create_from_tensor(t: Tensor) -> TensorType: ... diff --git a/torch/_C/_autograd.pyi b/torch/_C/_autograd.pyi index 38ac7ccaea0c8d..9cdf801dd7602b 100644 --- a/torch/_C/_autograd.pyi +++ b/torch/_C/_autograd.pyi @@ -87,6 +87,7 @@ def _prepare_profiler(config: ProfilerConfig, activities: Set[ProfilerActivity]) def _disable_profiler() -> _ProfilerResult: ... def _profiler_enabled() -> bool: ... def _add_metadata_json(key: str, value: str) -> None: ... +def _kineto_step() -> None: ... def kineto_available() -> bool: ... def _record_function_with_args_enter(name: str, args: List[Any]) -> torch.Tensor: ... def _record_function_with_args_exit(handle: torch.Tensor) -> None: ... diff --git a/torch/_C/_distributed_rpc.pyi b/torch/_C/_distributed_rpc.pyi index d89f614123e1c4..58d555297929f7 100644 --- a/torch/_C/_distributed_rpc.pyi +++ b/torch/_C/_distributed_rpc.pyi @@ -85,7 +85,7 @@ class TensorPipeAgent(RpcAgent): store: Store, name: str, worker_id: int, - world_size: int, + world_size: Optional[int], opts: _TensorPipeRpcBackendOptionsBase, reverse_device_maps: Dict[str, Dict[torch.device, torch.device]], devices: List[torch.device], diff --git a/torch/_C/_lazy.pyi b/torch/_C/_lazy.pyi new file mode 100644 index 00000000000000..5b4cf101234a44 --- /dev/null +++ b/torch/_C/_lazy.pyi @@ -0,0 +1,17 @@ +from typing import List +from torch import Tensor + +#defined in torch/csrc/lazy/python/init.cpp +def _mark_step(device: str, devices: List[str], wait: bool): ... +def _wait_device_ops(devices: List[str]): ... +def _reset_metrics(): ... +def _counter_names() -> List[str]: ... +def _counter_value(name: str) -> int: ... +def _get_graph_hash(tensors: List[Tensor]) -> str: ... +def _sync_multi(tensors: List[Tensor], devices: List[str], wait: bool = True, sync_ltc_data: bool = True): ... +def _get_tensor_id(tensor: Tensor) -> int: ... +def _get_tensors_text(tensors: List[Tensor]) -> str: ... +def _get_tensors_dot(tensors: List[Tensor]) -> str: ... +def _get_tensors_backend(tensors: List[Tensor]) -> str: ... +def _get_force_fallback() -> str: ... +def _set_force_fallback(newval: str): ... 
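For orientation, the Python-facing wrappers over these torch._C._lazy bindings are added later in this patch (torch/_lazy/__init__.py, torch/_lazy/metrics.py, torch/_lazy/ts_backend.py). A minimal sketch of how they might be exercised, assuming a build with the lazy TorchScript backend available; the workload, tensor shapes, and use of torch.randn here are illustrative only and not taken from this patch:

import torch
import torch._lazy
import torch._lazy.metrics as metrics
import torch._lazy.ts_backend as ts_backend

ts_backend.init()                        # register the lazy TorchScript backend (assumed to be required before use)

x = torch.randn(2, 2).to(device="lazy")  # moving a tensor to the "lazy" device starts IR tracing
y = (x @ x).sum()                        # ops on lazy tensors build IR instead of executing eagerly

torch._lazy.mark_step(wait=True)         # lower/compile the traced graph and execute it
print(metrics.counter_names())           # counters (e.g. aten:: fallbacks) recorded during tracing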
diff --git a/torch/_C/_lazy_ts_backend.pyi b/torch/_C/_lazy_ts_backend.pyi new file mode 100644 index 00000000000000..91575fe939bfa2 --- /dev/null +++ b/torch/_C/_lazy_ts_backend.pyi @@ -0,0 +1,8 @@ +#defined in torch/csrc/lazy/python/init.cpp + +from typing import List, Tuple, Any +from torch import Tensor + +def _init(): ... +def _get_tensors_ts_device_data_node(tensors: List[Tensor]) -> Tuple[List[int], List[Any]]: ... +def _run_cached_graph(hash_str: str, graph_inputs: List[Any]) -> List[Tensor]: ... diff --git a/torch/_C/build.bzl b/torch/_C/build.bzl new file mode 100644 index 00000000000000..230124eb69aa81 --- /dev/null +++ b/torch/_C/build.bzl @@ -0,0 +1,6 @@ +def define_targets(rules): + rules.filegroup( + name = "pyi.in", + srcs = rules.glob(["*.pyi.in"]), + visibility = ["//visibility:public"], + ) diff --git a/torch/_C/return_types.pyi.in b/torch/_C/return_types.pyi.in new file mode 100644 index 00000000000000..aa540ea328b5d9 --- /dev/null +++ b/torch/_C/return_types.pyi.in @@ -0,0 +1,10 @@ +# ${generated_comment} + +from torch import Tensor, Generator, strided, memory_format, contiguous_format, strided +from typing import List, Tuple, Optional, Union, Any, ContextManager, Callable, overload, Iterator, NamedTuple, Sequence, TypeVar +from typing_extensions import Literal +from torch._six import inf + +from torch.types import _int, _float, _bool, Number, _dtype, _device, _qscheme, _size, _layout + +${namedtuple_defs} diff --git a/torch/_C_flatbuffer/__init__.pyi b/torch/_C_flatbuffer/__init__.pyi new file mode 100644 index 00000000000000..3a2ff059b0ed9d --- /dev/null +++ b/torch/_C_flatbuffer/__init__.pyi @@ -0,0 +1,10 @@ +from torch._C import LiteScriptModule, ScriptModule + +def _load_mobile_module_from_file(filename: str): ... +def _load_mobile_module_from_bytes(bytes_: bytes): ... +def _load_jit_module_from_file(filename: str): ... +def _load_jit_module_from_bytes(bytes_: bytes): ... +def _save_mobile_module(m: LiteScriptModule, filename: str): ... +def _save_jit_module(m: ScriptModule, filename: str): ... +def _save_mobile_module_to_bytes(m: LiteScriptModule) -> bytes: ... +def _save_jit_module_to_bytes(m: ScriptModule) -> bytes: ... diff --git a/torch/__init__.py b/torch/__init__.py index 64827961c30cac..7011dc4e3b963d 100644 --- a/torch/__init__.py +++ b/torch/__init__.py @@ -39,6 +39,7 @@ 'no_grad', 'enable_grad', 'rand', 'randn', 'inference_mode', 'DoubleStorage', 'FloatStorage', 'LongStorage', 'IntStorage', 'ShortStorage', 'CharStorage', 'ByteStorage', 'BoolStorage', + '_TypedStorage', 'DoubleTensor', 'FloatTensor', 'LongTensor', 'IntTensor', 'ShortTensor', 'CharTensor', 'ByteTensor', 'BoolTensor', 'Tensor', 'lobpcg', 'use_deterministic_algorithms', @@ -594,7 +595,7 @@ def is_warn_always_enabled(): ################################################################################ from ._tensor import Tensor -from .storage import _StorageBase, _TypedStorage +from .storage import _StorageBase, _TypedStorage, _LegacyStorage # NOTE: New Storage classes should never be added. When adding a new # dtype, use torch.storage._TypedStorage directly. 
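The torch._C_flatbuffer stubs above describe a save/load surface for flatbuffer-serialized JIT and mobile modules, backed by the init_flatbuffer_module.cpp source added to torch/CMakeLists.txt earlier in this diff. A hedged sketch of the JIT round trip they imply; whether _save_jit_module expects the underlying torch._C.ScriptModule (scripted._c) rather than the Python wrapper is an assumption here, and the file path is a placeholder:

import torch
import torch._C_flatbuffer as ff  # assumes the flatbuffer extension is compiled into this build

class AddOne(torch.nn.Module):
    def forward(self, x):
        return x + 1

scripted = torch.jit.script(AddOne())
# Assumption: the stub takes the underlying C++ module handle (scripted._c), not the Python wrapper.
ff._save_jit_module(scripted._c, "/tmp/add_one.ff")        # placeholder path
loaded = ff._load_jit_module_from_file("/tmp/add_one.ff")  # stub declares no return type; treated as opaque here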
@@ -602,87 +603,87 @@ def is_warn_always_enabled(): class _UntypedStorage(_C.ByteStorageBase, _StorageBase): pass -class ByteStorage(_TypedStorage): +class ByteStorage(_LegacyStorage): @classproperty def dtype(self): return torch.uint8 -class DoubleStorage(_TypedStorage): +class DoubleStorage(_LegacyStorage): @classproperty def dtype(self): return torch.double -class FloatStorage(_TypedStorage): +class FloatStorage(_LegacyStorage): @classproperty def dtype(self): return torch.float -class HalfStorage(_TypedStorage): +class HalfStorage(_LegacyStorage): @classproperty def dtype(self): return torch.half -class LongStorage(_TypedStorage): +class LongStorage(_LegacyStorage): @classproperty def dtype(self): return torch.long -class IntStorage(_TypedStorage): +class IntStorage(_LegacyStorage): @classproperty def dtype(self): return torch.int -class ShortStorage(_TypedStorage): +class ShortStorage(_LegacyStorage): @classproperty def dtype(self): return torch.short -class CharStorage(_TypedStorage): +class CharStorage(_LegacyStorage): @classproperty def dtype(self): return torch.int8 -class BoolStorage(_TypedStorage): +class BoolStorage(_LegacyStorage): @classproperty def dtype(self): return torch.bool -class BFloat16Storage(_TypedStorage): +class BFloat16Storage(_LegacyStorage): @classproperty def dtype(self): return torch.bfloat16 -class ComplexDoubleStorage(_TypedStorage): +class ComplexDoubleStorage(_LegacyStorage): @classproperty def dtype(self): return torch.cdouble -class ComplexFloatStorage(_TypedStorage): +class ComplexFloatStorage(_LegacyStorage): @classproperty def dtype(self): return torch.cfloat -class QUInt8Storage(_TypedStorage): +class QUInt8Storage(_LegacyStorage): @classproperty def dtype(self): return torch.quint8 -class QInt8Storage(_TypedStorage): +class QInt8Storage(_LegacyStorage): @classproperty def dtype(self): return torch.qint8 -class QInt32Storage(_TypedStorage): +class QInt32Storage(_LegacyStorage): @classproperty def dtype(self): return torch.qint32 -class QUInt4x2Storage(_TypedStorage): +class QUInt4x2Storage(_LegacyStorage): @classproperty def dtype(self): return torch.quint4x2 -class QUInt2x4Storage(_TypedStorage): +class QUInt2x4Storage(_LegacyStorage): @classproperty def dtype(self): return torch.quint2x4 @@ -692,6 +693,7 @@ def dtype(self): ShortStorage, CharStorage, ByteStorage, HalfStorage, BoolStorage, QUInt8Storage, QInt8Storage, QInt32Storage, BFloat16Storage, ComplexFloatStorage, ComplexDoubleStorage, QUInt4x2Storage, QUInt2x4Storage, + _TypedStorage } # The _tensor_classes set is initialized by the call to _C._initialize_tensor_type_bindings() @@ -715,7 +717,7 @@ def manager_path(): raise RuntimeError("Unable to find torch_shm_manager at " + path) return path.encode('utf-8') -from .autocast_mode import autocast +from torch.amp import autocast # Shared memory manager needs to know the exact location of manager executable _C._initExtension(manager_path()) @@ -819,9 +821,6 @@ def _assert(condition, message): from torch import __future__ as __future__ from torch import profiler as profiler -from torch.nested._nestedtensor import NestedTensor -from torch.nested._nestedtensor import nested_tensor - _C._init_names(list(torch._storage_classes)) # attach docstrings to torch and tensor functions diff --git a/torch/_jit_internal.py b/torch/_jit_internal.py index ba570b35391e4e..3c067d5c1c53a3 100644 --- a/torch/_jit_internal.py +++ b/torch/_jit_internal.py @@ -18,6 +18,7 @@ import typing import io import pickle +import threading # This is needed. 
`torch._jit_internal` is imported before `torch.distributed.__init__`. # Explicitly ask to import `torch.distributed.__init__` first. # Otherwise, "AttributeError: module 'torch' has no attribute 'distributed'" is raised. @@ -1251,6 +1252,8 @@ def persistent_id(self, obj): return "" if isinstance(obj, torch.cuda.Event): return "" + if isinstance(obj, threading.Thread): + return "" return None diff --git a/torch/_lazy/__init__.py b/torch/_lazy/__init__.py new file mode 100644 index 00000000000000..ff4e90c0edf237 --- /dev/null +++ b/torch/_lazy/__init__.py @@ -0,0 +1,33 @@ +import torch._C._lazy + + +def mark_step(device: str = "lazy:0", wait=False): + """Triggers a mark step, which amounts to + - collecting a group of 'live' lazy tensors to index into the compilation cache + (lowering/compiling their IR graphs if not cached) + - kicking off execution of the compiled function + - (optionally, wait=True) waiting for cpu-side execution to complete (does not sync the accelerator) + """ + # TODO(whc) expand this to include backend hooks and align with XLA backend needs + torch._C._lazy._mark_step(device, [], wait=wait) + +def wait_device_ops(devices=None): + """Waits for all the async operations on the given devices to complete. + Args: + devices (string..., optional): The devices whose async ops need to be waited + for. If empty, all the local devices will be waited for. + """ + if devices is None: + devices = [] + torch._C._lazy._wait_device_ops(devices=devices) + +def sync_multi(tensors, devices): + """ + Sync the list of lazy tensors so their IR gets lowered for the active backend + and the compiled computation graph gets cached. + """ + torch._C._lazy._sync_multi(tensors, devices) + +def get_tensor_id(tensor): + """Return a unique id of the lazy tensor maintained by LTC""" + return torch._C._lazy._get_tensor_id(tensor) diff --git a/torch/_lazy/computation.py b/torch/_lazy/computation.py new file mode 100644 index 00000000000000..7dd57cd7238d45 --- /dev/null +++ b/torch/_lazy/computation.py @@ -0,0 +1,23 @@ +import torch._C._lazy +import torch._C._lazy_ts_backend + +def get_tensors_ts_device_data_node(tensors): + """Return tensor ids and eager tensors for DeviceData nodes in the + IR for the passed-in lazy tensors. + + TODO: This API is currently ts backend specific. We are working on + generalizing it to all backends including XLA. + """ + return torch._C._lazy_ts_backend._get_tensors_ts_device_data_node(tensors) + +def get_graph_hash(tensors): + """Return the graph hash for the passed-in lazy tensors""" + return torch._C._lazy._get_graph_hash(tensors) + +def run_cached_graph(hash_str, graph_inputs): + """Run the cached computation graph with the given inputs. + + TODO: This API is currently ts backend specific. We are working on + generalizing it to all backends including XLA.
+ """ + return torch._C._lazy_ts_backend._run_cached_graph(hash_str, graph_inputs) diff --git a/torch/_lazy/config.py b/torch/_lazy/config.py new file mode 100644 index 00000000000000..acff69da4e5a35 --- /dev/null +++ b/torch/_lazy/config.py @@ -0,0 +1,9 @@ +import torch._C._lazy + +def get_force_fallback(): + """Get the config used to force LTC fallback""" + return torch._C._lazy._get_force_fallback() + +def set_force_fallback(configval): + """Set the config used to force LTC fallback""" + torch._C._lazy._set_force_fallback(configval) diff --git a/torch/_lazy/debug.py b/torch/_lazy/debug.py new file mode 100644 index 00000000000000..882056ca9c0f3b --- /dev/null +++ b/torch/_lazy/debug.py @@ -0,0 +1,20 @@ +import torch._C._lazy + + +def render_ir_graph(tensors): + """Return a text dump of the LTC IR graph in dot format for the tensors. + The text can be processed by tools like dot to be rendered in pdf, png, etc.""" + return torch._C._lazy._get_tensors_dot(tensors) + +def dump_ir(tensors, ir_format): + """Return a dump of the tensors in the specified format. + Valid formats are + - text: for LTC IR + - backend: for the active backend IR + """ + if ir_format == "text": + return torch._C._lazy._get_tensors_text(tensors) + elif ir_format == "backend": + return torch._C._lazy._get_tensors_backend(tensors) + else: + raise RuntimeError(f"Unrecognized IR format: {ir_format}") diff --git a/torch/_lazy/extract_compiled_graph.py b/torch/_lazy/extract_compiled_graph.py new file mode 100644 index 00000000000000..37d0e67f31f3f3 --- /dev/null +++ b/torch/_lazy/extract_compiled_graph.py @@ -0,0 +1,199 @@ +import torch._lazy.metrics as metrics +from torch._lazy.tensor_factory_functions import tensor_factory_functions +from torch._lazy import computation +from torch._lazy import debug as lazy_debug +import torch._lazy as lazy +import dataclasses +from typing import List, Dict, Any, Callable +import copy +from torch import fx +import torch +import itertools +import os + +debug = os.environ.get("debug_extract_compiled_graph") is not None + +@dataclasses.dataclass +class GraphInputMatcher: + """ + The GraphInputMatcher class sets up the graph inputs for future calls after lazy tracing. + Specifically, those graph inputs corresponding to method parameters should be replaced with the + arguments for the current call. + + tensor_id_to_arg_idx maps the tensor id to the parameter index. + graph_input_tensor_ids, graph_input_ivalues list the tensor_id and ivalue for each of the + TS/XLA graph inputs. + """ + tensor_id_to_arg_idx: Dict[int, int] + graph_input_tensor_ids: List[int] + # there are 2 categories of graph_input_tensors. + # Category 1: those whose ids are not found in tensor_id_to_arg_idx. These are + # most likely const tensors and we can get their content from graph_input_tensors + # Category 2: those whose ids are found in tensor_id_to_arg_idx.
We should get + # the tensors from the method arguments + graph_input_ivalues: List[Any] + + # get the real graph input tensors + def __call__(self, args): + real_input = [] + for tensor_id, traced_ivalue in zip(self.graph_input_tensor_ids, self.graph_input_ivalues): + arg_idx = self.tensor_id_to_arg_idx.get(tensor_id, None) + if arg_idx is None: + inp = traced_ivalue + else: + inp = args[arg_idx] + real_input.append(inp) + return real_input + +class ReturnValueHandler: + r""" + When ltc_sync_multi is called on multiple tensors, the compiled graph + will contain output only for unique tensors - if a tensor appears multiple + times in the input to _ltc_sync_multi, only the first occurrence matters. + + However, from the Python level we still expect multiple tensors returned, with duplication, + even if the TS graph dedups the output, e.g. for the method: + + def forward(self, a): + return a, a + + the TS graph captured by LTC will return a single tensor, but the Python method expects 2. + + This class dedups the lazy tensors first to get the indices that will be used + to duplicate the eager tensors later. + """ + def __init__(self, lazy_out_list): + self.index: List[List[int]] = [] + self.total_count = len(lazy_out_list) + + tensor_id_to_idx: Dict[int, int] = dict() + for dup_idx, lazy_tensor in enumerate(lazy_out_list): + uniq_idx = tensor_id_to_idx.get(id(lazy_tensor), None) + if uniq_idx is not None: + self.index[uniq_idx].append(dup_idx) + else: + uniq_idx = len(self.index) + self.index.append([dup_idx]) + tensor_id_to_idx[id(lazy_tensor)] = uniq_idx + + def duplicate_eager_tensors(self, eager_tensor_list): + duplicated_list = [None] * self.total_count + assert len(eager_tensor_list) == len(self.index) + + for uniq_idx, eager_tensor in enumerate(eager_tensor_list): + for dup_idx in self.index[uniq_idx]: + duplicated_list[dup_idx] = eager_tensor + return duplicated_list + +def force_lazy_device(model: fx.GraphModule): + """ + Factory methods in an Fx graph may create tensors on specific eager devices. + If we take no action, those eager tensors will be mixed with lazy tensors and + cause a crash. This method overwrites those eager devices with the lazy device. + """ + def tolazydevice(dev): + if isinstance(dev, torch.device): + return torch.device("lazy", index=dev.index) + return dev + + def hasDeviceArg(args, kwargs): + return any(isinstance(arg, torch.device) for arg in itertools.chain(args, kwargs.values())) + + for nd in model.graph.nodes: + nd.args = tuple(tolazydevice(arg) for arg in nd.args) + nd.kwargs = {k: tolazydevice(v) for k, v in nd.kwargs.items()} + + # For torchbench models like yolov3 and hf_Bart, dynamo generates an Fx graph that returns + # eager tensors on the default device + # (check https://gist.github.com/shunting314/eabdf6c769c59bc384469717b8f9bb7f for yolov3, + # and https://gist.github.com/shunting314/8d5e2d9348a3258959d3954186c48814 for hf_Bart). + # To force those tensors onto the lazy device, we cannot simply override + # the device argument since there is no explicit device argument. + # What we are doing here is, for the list of covered tensor factory methods, + # we add a lazy device argument explicitly. + # + # TODO: This solution is not ideal since we may miss some factory methods. In the future, + # when we support lazy mode, this method can be replaced by that. + if nd.target in tensor_factory_functions and not hasDeviceArg(nd.args, nd.kwargs): + kwargs = dict(nd.kwargs) # nd.kwargs is immutable. make a mutable copy.
+ kwargs["device"] = torch.device("lazy") + nd.kwargs = kwargs + + model.recompile() + +def get_fallback_ops(): + fallback_ops = [] + for opname in metrics.counter_names(): + if "aten::" not in opname: + continue + val = int(metrics.counter_value(opname)) + if val > 0: + fallback_ops.append(f"{opname}={val}") + + return fallback_ops + +def extract_compiled_graph(model: fx.GraphModule, example_inputs) -> Callable: + """ + Optimize an eager model with LTC and return a wrapper to execute the + compiled graph directly without retracing. It depends on other mechanisms + like TorchDynamo guards to guarantee the returned wrapper is only called + when it's safe. + """ + lazy_args = [arg.to(device="lazy") for arg in example_inputs] + args_tensor_ids = [lazy.get_tensor_id(lazy_arg) for lazy_arg in lazy_args] + tensor_id_to_arg_idx = {tensor_id: i for i, tensor_id in enumerate(args_tensor_ids)} + lazy_model = copy.deepcopy(model).to(device=torch.device("lazy")) + force_lazy_device(lazy_model) + + # This line executes lazy tracing and enables us to extract the compiled graph later + metrics.reset() + lazy_out = lazy_model(*lazy_args) + fallback_ops = get_fallback_ops() + metrics.reset() + + if len(fallback_ops) > 0: + raise RuntimeError(f"Failed to extract the compiled graph because of fallback: {','.join(fallback_ops)}") + + if not isinstance(lazy_out, (tuple, list)): + lazy_out = (lazy_out,) + + args_and_out = tuple(lazy_args) + tuple(lazy_out) + return_value_handler = ReturnValueHandler(args_and_out) + if debug: + print("Fx code:\n", model.code) + print("LTC IR:", lazy_debug.dump_ir(args_and_out, "text")) + + # TODO: this part is TS backend specific for now and will be generalized to + # support XLA + graph_input_tensor_ids, graph_input_ivalues = computation.get_tensors_ts_device_data_node(args_and_out) + assert len(graph_input_tensor_ids) == len(graph_input_ivalues) + graph_input_matcher = GraphInputMatcher(tensor_id_to_arg_idx, graph_input_tensor_ids, graph_input_ivalues) + + graph_hash = computation.get_graph_hash(args_and_out) + + if debug: + print("graph_hash", graph_hash) + print(f"args_tensor_ids {args_tensor_ids}") + print("tensor ids from device data:", graph_input_tensor_ids) + + # sync the list of output tensors so the computation graph for these + # tensors will be cached. Those computation graphs can be retrieved + # by graph hash later.
+ lazy.sync_multi(args_and_out, []) + + def optimized_mod(*args): + if len(args_and_out) == 0: + return () + graph_input = graph_input_matcher(args) + res = return_value_handler.duplicate_eager_tensors(computation.run_cached_graph(graph_hash, graph_input)) + + assert len(res) == len(args_and_out) + for i, arg in enumerate(args): + # only copy those tensors that get in-place updated + if arg is not res[i]: + arg.copy_(res[i]) + + # skip the args + return res[len(args):] + + return optimized_mod diff --git a/torch/_lazy/metrics.py b/torch/_lazy/metrics.py new file mode 100644 index 00000000000000..043db981bb71ed --- /dev/null +++ b/torch/_lazy/metrics.py @@ -0,0 +1,13 @@ +import torch._C._lazy + +def reset(): + """Resets all metric counters.""" + torch._C._lazy._reset_metrics() + +def counter_names(): + """Retrieves all the currently active counter names.""" + return torch._C._lazy._counter_names() + +def counter_value(name: str): + """Return the value of the counter with the specified name""" + return torch._C._lazy._counter_value(name) diff --git a/torch/_lazy/tensor_factory_functions.py b/torch/_lazy/tensor_factory_functions.py new file mode 100644 index 00000000000000..47aa9c500466da --- /dev/null +++ b/torch/_lazy/tensor_factory_functions.py @@ -0,0 +1,48 @@ +import torch + +""" +tensor_factory_functions defines the list of torch functions that create tensors. +The list is grabbed by searching through native_functions.yaml by the following +regular expression: + + cat native_functions.yaml | grep 'func:' | grep -v "Tensor.*->" | grep "[-]>.*Tensor" + +It's possible that new tensor factory functions are added, making this list stale. +Use at your own risk or regenerate the list. +""" +tensor_factory_functions = ( + torch._cudnn_init_dropout_state, + torch.arange, + torch.bartlett_window, + torch.blackman_window, + torch._empty_affine_quantized, + torch.empty_strided, + torch.eye, + torch.full, + torch.from_file, + torch.hann_window, + torch.hamming_window, + torch.kaiser_window, + torch.linspace, + torch.logspace, + torch.ones, + torch.scalar_tensor, + torch.rand, + torch.randint, + torch.randn, + torch.randperm, + torch.range, + torch._efficientzerotensor, + torch.zeros, + torch.tril_indices, + torch.triu_indices, + # Note: the following functions match the regular expression search above but + # they are not available in the torch module. Commented out. + # torch._sparse_coo_tensor_with_dims, + # torch.fft_fftfreq, + # torch.fft_rfftfreq, +) + ( + # torch.tensor is special since it's not in native_functions.yaml + # add it separately + torch.tensor, +) diff --git a/torch/_lazy/ts_backend.py b/torch/_lazy/ts_backend.py new file mode 100644 index 00000000000000..118de2dbefca00 --- /dev/null +++ b/torch/_lazy/ts_backend.py @@ -0,0 +1,5 @@ +import torch._C._lazy_ts_backend + +def init(): + """Initializes the lazy TorchScript backend""" + torch._C._lazy_ts_backend._init() diff --git a/torch/_lobpcg.py b/torch/_lobpcg.py index 560d9579e61f90..f6d53c5ae7c8e0 100644 --- a/torch/_lobpcg.py +++ b/torch/_lobpcg.py @@ -652,17 +652,16 @@ class LOBPCG(object): """ def __init__(self, - A, # type: Optional[Tensor] - B, # type: Optional[Tensor] - X, # type: Tensor - iK, # type: Optional[Tensor] - iparams, # type: Dict[str, int] - fparams, # type: Dict[str, float] - bparams, # type: Dict[str, bool] - method, # type: str - tracker # type: None - ): - # type: (...)
-> None + A: Optional[Tensor], + B: Optional[Tensor], + X: Tensor, + iK: Optional[Tensor], + iparams: Dict[str, int], + fparams: Dict[str, float], + bparams: Dict[str, bool], + method: str, + tracker: None + ) -> None: # constant parameters self.A = A @@ -681,10 +680,10 @@ def __init__(self, self.E = torch.zeros((n, ), dtype=X.dtype, device=X.device) self.R = torch.zeros((m, n), dtype=X.dtype, device=X.device) self.S = torch.zeros((m, 3 * n), dtype=X.dtype, device=X.device) - self.tvars = {} # type: Dict[str, Tensor] - self.ivars = {'istep': 0} # type: Dict[str, int] - self.fvars = {'_': 0.0} # type: Dict[str, float] - self.bvars = {'_': False} # type: Dict[str, bool] + self.tvars: Dict[str, Tensor] = {} + self.ivars: Dict[str, int] = {'istep': 0} + self.fvars: Dict[str, float] = {'_': 0.0} + self.bvars: Dict[str, bool] = {'_': False} def __str__(self): lines = ['LOPBCG:'] @@ -947,11 +946,10 @@ def _get_rayleigh_ritz_transform(self, S): return Rinv * d_col def _get_svqb(self, - U, # Tensor - drop, # bool - tau # float - ): - # type: (Tensor, bool, float) -> Tensor + U: Tensor, # Tensor + drop: bool, # bool + tau: float # float + ) -> Tensor: """Return B-orthonormal U. .. note:: When `drop` is `False` then `svqb` is based on the diff --git a/torch/_masked/__init__.py b/torch/_masked/__init__.py index e3ed37af4436a9..e28aeec93d54d0 100644 --- a/torch/_masked/__init__.py +++ b/torch/_masked/__init__.py @@ -163,9 +163,12 @@ def _generate_docstring(func): prod=(('dim',), ('keepdim=False', 'dtype=None', 'mask=None')), amin=(('dim',), ('keepdim=False', 'dtype=None', 'mask=None')), amax=(('dim',), ('keepdim=False', 'dtype=None', 'mask=None')), + argmin=(('dim__as_int',), ('keepdim=False', 'dtype=None', 'mask=None')), + argmax=(('dim__as_int',), ('keepdim=False', 'dtype=None', 'mask=None')), mean=(('dim',), ('keepdim=False', 'dtype=None', 'mask=None')), norm=(('ord', 'dim',), ('keepdim=False', 'dtype=None', 'mask=None')), var=(('dim', 'unbiased'), ('keepdim=False', 'dtype=None', 'mask=None')), + std=(('dim', 'unbiased'), ('keepdim=False', 'dtype=None', 'mask=None')), softmax=(('dim__as_int',), ('dtype=None', 'mask=None')), log_softmax=(('dim__as_int',), ('dtype=None', 'mask=None')), softmin=(('dim__as_int',), ('dtype=None', 'mask=None')), @@ -226,9 +229,12 @@ def _generate_docstring(func): prod='product', amax='maximum', amin='minimum', + argmax='argmax', + argmin='argmin', mean='mean', norm='norm', - var='variance') + var='variance', + std='standard_deviation') normalization_names = dict( softmax='softmax', @@ -248,7 +254,7 @@ def _generate_docstring(func): if func.__name__ in {'norm', 'normalize'}: example_args = (2.0, example_dim) example_input = example_input.to(dtype=torch.float32) - elif func.__name__ in {'var'}: + elif func.__name__ in {'var', 'std'}: example_args = (example_dim, False) else: example_args = (example_dim,) @@ -343,12 +349,12 @@ def _reduction_identity(op_name: str, input: Tensor, *args): return torch.tensor(0, dtype=dtype, device=device) elif op_name == 'prod': return torch.tensor(1, dtype=dtype, device=device) - elif op_name == 'amax': + elif op_name in {'amax', 'argmax'}: if torch.is_floating_point(input): return torch.tensor(-torch.inf, dtype=dtype, device=device) elif torch.is_signed(input) or dtype == torch.uint8: return torch.tensor(torch.iinfo(dtype).min, dtype=dtype, device=device) - elif op_name == 'amin': + elif op_name in {'amin', 'argmin'}: if torch.is_floating_point(input): return torch.tensor(torch.inf, dtype=dtype, device=device) elif torch.is_signed(input) or 
dtype == torch.uint8: @@ -366,7 +372,7 @@ def _reduction_identity(op_name: str, input: Tensor, *args): assert torch.is_floating_point(input), input.dtype return torch.tensor(torch.inf, dtype=dtype, device=device) return torch.tensor(0, dtype=dtype, device=device) - elif op_name == 'var': + elif op_name in {'var', 'std'}: return None raise NotImplementedError(f'identity of {op_name} on {dtype} input') @@ -375,6 +381,12 @@ def _canonical_dim(dim: DimOrDims, ndim: int) -> Tuple[int, ...]: """Return dim argument as a tuple of sorted dim values. """ dims: List[int] = [] + if dim == (): + # Currently, `dim=()` in reductions operations means "reduce + # over all dimensions" while in future, it will read "no + # reduce". See https://github.com/pytorch/pytorch/issues/29137 + # When gh-29137 is resolved, this if-block must be deleted. + dim = None if dim is None: return tuple(range(ndim)) ndim = max(ndim, 1) @@ -388,30 +400,252 @@ def _canonical_dim(dim: DimOrDims, ndim: int) -> Tuple[int, ...]: return tuple(sorted(dims)) +def _sparse_coo_flatten_indices(indices: Tensor, shape: tuple): + # Flatted N-D indices to 1-D indices + flat_indices = indices.new_zeros(indices.size(1)) + for d, sz in enumerate(shape): + flat_indices.mul_(sz) + flat_indices.add_(indices[d]) + return flat_indices + + +def _any(input: Tensor, dim: tuple, keepdim: bool): + # Support torch.any with tuple dim argument. + # Workaround of https://github.com/pytorch/pytorch/issues/56586 + r = input + for d in reversed(dim): + r = r.any(dim=d, keepdim=keepdim) + return r + + +def _sparse_coo_where(mask: Tensor, input: Tensor, fill_value: Tensor) -> Tensor: + """Sparse variant of torch.where. Supports sparse COO and hybrid sparse COO tensors. + + _sparse_coo_where implements the following invariant: + + _sparse_coo_where(mask, input, fill_value).to_dense(fill_value) == + torch.where(mask.to_dense(), input.to_dense(), torch.full(input.shape, fill_value)) + + where `a == b` means `assertEqual(a, b)`, mask is boolean sparse + tensor, and `to_dense(fill_value)` is like `to_dense()` except + that the unspecified elements are mapped to `fill_value` rather + than to `0`. + + Returns a sparse COO tensor with the following features: + + - all specified elements correspond to masked-in elements that + have the values of the input tensor. If there exists a masked-in + element (as specified by mask) that is not specified in the + input, in the result tensor, the corresponding element has value + 0. In the dense part of the sparse tensor, the masked-out + elements are replaced with fill_value. + + - all unspecified elements correspond to masked-out elements. + """ + + assert input.layout == torch.sparse_coo + assert mask.layout == input.layout + assert mask.shape == input.shape + assert mask.dense_dim() == input.dense_dim() # TODO: eliminate this restriction + + input = input.coalesce() + + # For set operations on sparse tensor indices, we'll convert + # multi-dimensional indices to 1-D indices for efficiency. 
+ input_flat_indices = _sparse_coo_flatten_indices(input.indices(), input.shape[:input.sparse_dim()]) + mask_flat_indices = _sparse_coo_flatten_indices(mask.indices(), mask.shape[:mask.sparse_dim()]) + + # the set of mask flat indices that define masked-in elements: + if mask.dense_dim() > 0: + mask_values = _any(mask.values(), tuple(range(1, input.sparse_dim() + 1)), False) + else: + mask_values = mask.values() + maskin_flat_indices = mask_flat_indices[mask_values.nonzero()[:, 0]] + + def intersection(i1, i2): + union, counts = torch.cat([i1, i2]).unique(return_counts=True) + return union, torch.where(counts.gt(1)) + + def minus(i1, i2): + union, counts = torch.cat([i1, i2]).unique(return_counts=True) + return intersection(union[torch.where(counts.eq(1))], i1) + + def _apply(a): + obj, w = a + return obj[w] + + # the set of input flat indices of specified and masked-in elements: + maskin_input_flat_indices = _apply(intersection(maskin_flat_indices, input_flat_indices)) + _, w = intersection(input_flat_indices, maskin_input_flat_indices) + + # the indices and values of masked-in elements + where_input_indices = input.indices()[(slice(None),) + w] + where_input_values = input.values()[w] + + if mask.dense_dim() > 0: + # apply mask to the dense part of the input values: + _, w1 = intersection(mask_flat_indices, maskin_input_flat_indices) + where_mask_values = mask.values()[w1] + where_input_values = torch.where(where_mask_values, where_input_values, + where_input_values.new_full([], fill_value.item())) + + # the set of flat indices of unspecified input and masked-in elements: + maskin_zero_flat_indices = _apply(minus(maskin_flat_indices, maskin_input_flat_indices)) + + # the indices of masked-in zero elements + _, w = intersection(mask_flat_indices, maskin_zero_flat_indices) + where_zero_indices = mask.indices()[(slice(None),) + w] + + # construct result + n = where_zero_indices.size(1) + if n == 0: + # the input is coalesced, hence input_flat_indices are ordered + # and the result is guaranteed to be coalesced: + result = torch.sparse_coo_tensor(where_input_indices, where_input_values, input.shape) + return result._coalesced_(True) + + where_indices = torch.cat([where_input_indices, where_zero_indices], dim=1) + where_values = torch.cat([where_input_values, where_input_values.new_zeros((n,) + where_input_values.shape[1:])]) + result = torch.sparse_coo_tensor(where_indices, where_values, input.shape) + + # appending zero elements leads to uncoalesced sparse tensor + return result.coalesce() + + +def _sparse_csr_where(mask: Tensor, input: Tensor, fill_value: Tensor) -> Tensor: + """Sparse variant of torch.where. Supports sparse CSR tensors. + """ + # TODO: implement sparse CSR specific where operator for efficiency + return _sparse_coo_where(mask.to_sparse_coo(), input.to_sparse_coo(), fill_value).to_sparse_csr() + + +def _where(mask: Tensor, input: Tensor, fill_value: Tensor) -> Tensor: + """torch.where with sparse inputs support. + + _where implements the following invariant: + + _where(mask, input, fill_value).to_dense(fill_value) == + torch.where(mask.to_dense(), input.to_dense(), torch.full(input.shape, fill_value)) + + where `a == b` means `assertEqual(a, b)`, mask is boolean sparse + tensor, and `to_dense(fill_value)` is like `to_dense()` except + that the unspecified elements are mapped to `fill_value` rather + than to `0`. + + Returns a sparse tensor with the following features: + + - all specified elements correspond to masked-in elements that + have the values of the input tensor. 
If there exists a masked-in + element (as specified by mask) that is not specified in the + input, in the result tensor, the corresponding element has value + 0. In the dense part of the sparse tensor, the masked-out + elements are replaced with fill_value. + + - all unspecified elements correspond to masked-out elements. + """ + if mask.layout == torch.strided: + if fill_value.dtype == torch.bool: + # Workaround internal assert failure in + # test_nvfuser_correctness__masked_mean_cuda_bool: We + # don't have an op for aten::new_full but it isn't a + # special case. Argument types: Tensor, int[], bool, int, + # int, Device, bool + fill = input.new_full([], int(fill_value.item())).to(dtype=torch.bool) + else: + fill = input.new_full([], fill_value.item()) + return torch.where(mask, input, fill) + elif mask.layout == torch.sparse_coo: + return _sparse_coo_where(mask, input, fill_value) + elif mask.layout == torch.sparse_csr: + return _sparse_csr_where(mask, input, fill_value) + else: + raise ValueError(f'_where expects strided or sparse COO or sparse CSR tensor but got {mask.layout}') + + def _input_mask(input: Tensor, *args, **kwargs) -> Tensor: """Return canonical input mask. - Canonical input mask is a boolean tensor with the same shape as - input and with (broadcasted) content of mask, if specified. + + A canonical input mask is defined as a boolean mask tensor that + shape and layout matches with the shape and the layout of the + input. + + The canonical input mask is computed from the :attr:`mask` tensor + content to meet the following criteria: + + 1. The shape of the canonical input mask is the same as the shape + of :attr:`input` tensor. If the mask tensor has a smaller shape + than the shape of the :attr:`input`, broadcasting rules will be + applied. Downcasting of mask is not supported. + + 2. The layout of the canonical input mask is the same as the + layout of the :attr:`input` tensor. If the mask has different + layout, it will be converted to the expected layout. In the + case of sparse COO layout, the canonical input mask will be + coalesced. + + 3. The dtype of the canonical input mask is torch.bool. If the + mask dtype is not bool then it will be converted to bool dtype + using `.to(dtype=bool)` method call. + + 4. The elements of the canonical input mask have boolean values + copied from the content of the :attr:`mask` tensor (after + possible broadcasting and dtype conversion transforms). In + general, the sparsity pattern of the sparse canonical input + mask need not to be the same as the sparsity pattern of the + sparse :attr:`input` tensor. 
+ """ + if input.layout not in {torch.strided, torch.sparse_coo, torch.sparse_csr}: + raise ValueError(f'_input_mask expects strided or sparse COO or sparse CSR tensor but got {input.layout}') + mask = kwargs.get('mask') + + # default mask if mask is None: - inmask = input.new_ones(input.shape, dtype=torch.bool) - elif mask.ndim < input.ndim: - inmask = torch.broadcast_to(mask.clone(), input.shape).to(dtype=torch.bool) - elif mask.ndim > input.ndim: - raise IndexError("_input_mask expected broadcastable mask (got mask dimensionality higher than of the input)") - elif mask.shape != input.shape: - inmask = torch.broadcast_to(mask.clone(), input.shape).to(dtype=torch.bool) - else: - inmask = mask.to(dtype=torch.bool) - return inmask + raise ValueError('_input_mask requires explicit mask') + + # mask shape must match with input shape + if mask.shape != input.shape: + if mask.ndim > input.ndim: + raise IndexError("_input_mask expected broadcastable mask (got mask dimensionality higher than of the input)") + if mask.layout == torch.strided: + mask = torch.broadcast_to(mask.clone(), input.shape).to(dtype=torch.bool) + elif mask.layout == torch.sparse_coo: + mask = torch._sparse_broadcast_to(mask, input.shape) + else: + assert mask.layout == torch.sparse_csr + # Broadcasting of CSR tensors is not implemented. Working + # around by using COO layout. + mask = torch._sparse_broadcast_to(mask.to_sparse(), input.shape).to_sparse_csr() + + # mask layout must match with input layout + if mask.layout != input.layout: + if input.layout == torch.strided: + mask = mask.to_dense() + elif input.layout == torch.sparse_coo: + if mask.layout == torch.strided: + mask = mask.to_sparse(input.sparse_dim()) + else: + mask = mask.to_sparse() + else: + assert input.layout == torch.sparse_csr + mask = mask.to_sparse_csr() + + # sparse mask must be coalesced + if mask.layout == torch.sparse_coo: + mask = mask.coalesce() + + # mask is a boolean tensor + mask = mask.to(dtype=torch.bool) + + return mask def _output_mask(op, input: Tensor, *args, **kwargs) -> Tensor: """Return output mask of masked operation applied to given arguments. """ if callable(op): - is_reduction = op.__name__ in {'sum', 'prod', 'amax', 'amin', 'mean', 'norm', 'var'} + is_reduction = op.__name__ in {'sum', 'prod', 'amax', 'amin', 'argmax', 'argmin', 'mean', 'norm', 'var', 'std'} is_normalization = op.__name__ in {'softmax', 'log_softmax', 'softmin', 'normalize'} if is_reduction: if op.__name__ == 'norm': @@ -421,10 +655,7 @@ def _output_mask(op, input: Tensor, *args, **kwargs) -> Tensor: outmask = _input_mask(input, *args, **kwargs) keepdim = kwargs.get('keepdim', False) dim_ = _canonical_dim(dim, input.ndim) - # Workaround https://github.com/pytorch/pytorch/issues/56586 - for d in reversed(dim_): - outmask = outmask.any(dim=d, keepdim=bool(keepdim)) - return outmask + return _any(outmask, dim_, bool(keepdim)) elif is_normalization: return _input_mask(input, *args, **kwargs) else: @@ -433,6 +664,19 @@ def _output_mask(op, input: Tensor, *args, **kwargs) -> Tensor: raise ValueError(f'_output_mask expected masked operation (got {type(op).__name__} object)') +def _combine_input_and_mask(op, input: Tensor, mask, *args) -> Tensor: + """Return input with masked-out elements eliminated for the given operations. 
+ """ + if mask is None: + return input + canonical_mask = _input_mask(input, mask=mask) + if callable(op): + fill_value = _reduction_identity(op.__name__, input, *args) + return _where(canonical_mask, input, fill_value) + else: + raise ValueError(f'_combine_input_and_mask expected masked operation (got {type(op).__name__} object)') + + @_apply_docstring_templates def sum(input: Tensor, dim: DimOrDims = None, @@ -443,15 +687,43 @@ def sum(input: Tensor, # __doc__ is generated by _apply_docstring_templates decorator if dtype is None: dtype = input.dtype - # TODO: What follows is a reference implementation of a masked sum - # operation that is to be replaced with an optimized one and - # extended to support other layouts. + dim_ = _canonical_dim(dim, input.ndim) + + mask_input = _combine_input_and_mask(sum, input, mask) if input.layout == torch.strided: - mask_input = input if mask is None else torch.where(mask, input, input.new_zeros([])) - dim_ = _canonical_dim(dim, input.ndim) return torch.sum(mask_input, dim_, bool(keepdim), dtype=dtype) + + elif input.layout == torch.sparse_coo: + if mask_input.ndim == 0: + # Workaround https://github.com/pytorch/pytorch/issues/65400 + dim_ = () + + result = torch.sparse.sum(mask_input, dim=list(dim_), dtype=dtype) + if result.dtype != dtype: + # https://github.com/pytorch/pytorch/issues/65392 + # https://github.com/pytorch/pytorch/pull/66153 + result = result.to(dtype) + + if result.ndim == 0 and result.layout == torch.strided: + result = result.to_sparse() + + if keepdim and mask_input.ndim > 0: + # torch.sparse.sum does not support keepdim argument, so, + # here we restore the squeezed dimensions + if mask_input.dense_dim() > 0: + raise NotImplementedError('torch._masked.sum on hybrid COO sparse tensor') + indices = result._indices().new_zeros((mask_input.ndim, result._nnz())) + original_dims = tuple(i for i in range(mask_input.ndim) if i not in dim_) + indices[original_dims, ] = result._indices() + shape = tuple((1 if i in dim_ else mask_input.shape[i]) for i in range(mask_input.ndim)) + result = torch.sparse_coo_tensor(indices, result._values(), shape, dtype=result.dtype, device=result.device) + + return result + + elif input.layout == torch.sparse_csr: + return torch._sparse_csr_sum(mask_input, dim=list(dim_), keepdim=bool(keepdim), dtype=dtype) else: - raise ValueError(f'masked sum expects strided tensor (got {input.layout} tensor)') + raise ValueError(f'masked sum expects strided, sparse_coo, or sparse_csr tensor (got {input.layout} tensor)') @_apply_docstring_templates @@ -462,10 +734,9 @@ def prod(input: Tensor, dtype: Optional[DType] = None, mask: Optional[Tensor] = None) -> Tensor: # __doc__ is generated by _apply_docstring_templates decorator + mask_input = _combine_input_and_mask(prod, input, mask) if input.layout == torch.strided: - mask_input = input if mask is None else torch.where(mask, input, torch.ones_like(input)) dim_ = _canonical_dim(dim, input.ndim) - # Workaround https://github.com/pytorch/pytorch/issues/56586 result = mask_input for d in reversed(dim_): @@ -496,12 +767,8 @@ def amax(input: Tensor, {reduction_example}""" if dtype is None: dtype = input.dtype + mask_input = _combine_input_and_mask(amax, input, mask) if input.layout == torch.strided: - if mask is None: - mask_input = input - else: - identity = input.new_full([], _reduction_identity('amax', input)) - mask_input = torch.where(mask, input, identity) dim_ = _canonical_dim(dim, mask_input.ndim) return torch.amax(mask_input, dim_, bool(keepdim)).to(dtype=dtype) else: @@ 
-527,18 +794,58 @@ def amin(input: Tensor, {reduction_example}""" if dtype is None: dtype = input.dtype + mask_input = _combine_input_and_mask(amin, input, mask) if input.layout == torch.strided: - if mask is None: - mask_input = input - else: - identity = input.new_full([], _reduction_identity('amin', input)) - mask_input = torch.where(mask, input, identity) dim_ = _canonical_dim(dim, mask_input.ndim) return torch.amin(mask_input, dim_, bool(keepdim)).to(dtype=dtype) else: raise ValueError(f'masked amin expects strided tensor (got {input.layout} tensor)') +@_apply_docstring_templates +def argmax(input: Tensor, + dim: int = None, + *, + keepdim: Optional[bool] = False, + dtype: Optional[DType] = None, + mask: Optional[Tensor] = None) -> Tensor: + """\ +{reduction_signature} +{reduction_descr} +{reduction_identity_dtype} +{reduction_args} +{reduction_example}""" + if dtype is None: + dtype = input.dtype + mask_input = _combine_input_and_mask(argmax, input, mask) + if input.layout == torch.strided: + return torch.argmax(mask_input, dim, bool(keepdim)).to(dtype=dtype) + else: + raise ValueError(f'masked argmax expects strided tensor (got {input.layout} tensor)') + + +@_apply_docstring_templates +def argmin(input: Tensor, + dim: int = None, + *, + keepdim: Optional[bool] = False, + dtype: Optional[DType] = None, + mask: Optional[Tensor] = None) -> Tensor: + """\ +{reduction_signature} +{reduction_descr} +{reduction_identity_dtype} +{reduction_args} +{reduction_example}""" + if dtype is None: + dtype = input.dtype + mask_input = _combine_input_and_mask(argmin, input, mask) + if input.layout == torch.strided: + return torch.argmin(mask_input, dim, bool(keepdim)).to(dtype=dtype) + else: + raise ValueError(f'masked argmin expects strided tensor (got {input.layout} tensor)') + + @_apply_docstring_templates def mean(input: Tensor, dim: DimOrDims = None, @@ -564,9 +871,14 @@ def mean(input: Tensor, if dtype is None: dtype = input.dtype if input.layout == torch.strided: - inmask = _input_mask(input, mask=mask) - count = sum(inmask.new_ones(input.shape, dtype=torch.int64), dim, keepdim=keepdim, mask=inmask) - total = sum(input, dim, keepdim=keepdim, dtype=dtype, mask=inmask) + if mask is None: + # TODO: compute count analytically + count = sum(torch.ones(input.shape, dtype=torch.int64, device=input.device), dim, keepdim=keepdim) + total = sum(input, dim, keepdim=keepdim, dtype=dtype) + else: + inmask = _input_mask(input, mask=mask) + count = sum(inmask.new_ones(input.shape, dtype=torch.int64), dim, keepdim=keepdim, mask=inmask) + total = sum(input, dim, keepdim=keepdim, dtype=dtype, mask=inmask) return total / count else: raise ValueError(f'masked sum expects strided tensor (got {input.layout} tensor)') @@ -594,35 +906,22 @@ def norm(input: Tensor, {reduction_example}""" if dtype is None: dtype = input.dtype + mask_input = _combine_input_and_mask(norm, input, mask, ord) if input.layout == torch.strided: - identity = input.new_full([], _reduction_identity('norm', input, ord)) - mask_input = input if mask is None else torch.where(mask, input, identity) dim_ = _canonical_dim(dim, input.ndim) return torch.linalg.vector_norm(mask_input, ord, dim_, bool(keepdim), dtype=dtype) else: raise ValueError(f'masked norm expects strided tensor (got {input.layout} tensor)') -@_apply_docstring_templates -def var(input: Tensor, - dim: DimOrDims = None, - unbiased: Optional[bool] = False, - *, - keepdim: Optional[bool] = False, - dtype: Optional[DType] = None, - mask: Optional[Tensor] = None) -> Tensor: - """\ 
-{reduction_signature} - -{reduction_descr} - -The identity value of sample variance operation is undefined. The -elements of output tensor with strided layout, that correspond to -fully masked-out elements, have ``nan`` values. - -{reduction_args} - -{reduction_example}""" +def std_var(input: Tensor, + dim: DimOrDims = None, + unbiased: Optional[bool] = False, + *, + keepdim: Optional[bool] = False, + dtype: Optional[DType] = None, + mask: Optional[Tensor] = None, + take_sqrt: Optional[bool] = False) -> Tensor: if dtype is None: dtype = input.dtype if not (dtype.is_floating_point or dtype.is_complex): @@ -631,23 +930,88 @@ def var(input: Tensor, if not (compute_dtype.is_floating_point or compute_dtype.is_complex): compute_dtype = torch.float32 if input.layout == torch.strided: - inmask = _input_mask(input, mask=mask) - count = sum(inmask.new_ones(input.shape, dtype=torch.int64), dim, keepdim=True, mask=inmask) - sample_total = sum(input, dim, keepdim=True, dtype=dtype, mask=inmask) + if mask is None: + # TODO: compute count analytically + count = sum(torch.ones(input.shape, dtype=torch.int64, device=input.device), dim, keepdim=True) + sample_total = sum(input, dim, keepdim=True, dtype=dtype) + else: + inmask = _input_mask(input, mask=mask) + count = sum(inmask.new_ones(input.shape, dtype=torch.int64), dim, keepdim=True, mask=inmask) + sample_total = sum(input, dim, keepdim=True, dtype=dtype, mask=inmask) # TODO: replace torch.subtract/divide/square/maximum with # masked subtract/divide/square/maximum when these will be # available. sample_mean = torch.divide(sample_total, count) x = torch.subtract(input, sample_mean) - total = sum(x * x.conj(), dim, keepdim=keepdim, dtype=compute_dtype, mask=inmask) + if mask is None: + total = sum(x * x.conj(), dim, keepdim=keepdim, dtype=compute_dtype) + else: + total = sum(x * x.conj(), dim, keepdim=keepdim, dtype=compute_dtype, mask=inmask) if not keepdim: count = count.reshape(total.shape) if unbiased: count = torch.subtract(count, 1) count = torch.maximum(count, count.new_zeros([])) - return torch.divide(total, count).to(dtype=dtype) + output = torch.divide(total, count).to(dtype=dtype) + if take_sqrt: + output = torch.sqrt(output) + return output else: - raise ValueError(f'masked var expects strided tensor (got {input.layout} tensor)') + raise ValueError(f'masked std/var expects strided tensor (got {input.layout} tensor)') + + +@_apply_docstring_templates +def var(input: Tensor, + dim: DimOrDims = None, + unbiased: Optional[bool] = False, + *, + keepdim: Optional[bool] = False, + dtype: Optional[DType] = None, + mask: Optional[Tensor] = None) -> Tensor: + """\ +{reduction_signature} +{reduction_descr} +The identity value of sample variance operation is undefined. The +elements of output tensor with strided layout, that correspond to +fully masked-out elements, have ``nan`` values. +{reduction_args} +{reduction_example}""" + return std_var( + input=input, + dim=dim, + unbiased=unbiased, + keepdim=keepdim, + dtype=dtype, + mask=mask, + take_sqrt=False, + ) + + +@_apply_docstring_templates +def std(input: Tensor, + dim: DimOrDims = None, + unbiased: Optional[bool] = False, + *, + keepdim: Optional[bool] = False, + dtype: Optional[DType] = None, + mask: Optional[Tensor] = None) -> Tensor: + """\ +{reduction_signature} +{reduction_descr} +The identity value of sample standard deviation operation is undefined. The +elements of output tensor with strided layout, that correspond to +fully masked-out elements, have ``nan`` values. 
+{reduction_args} +{reduction_example}""" + return std_var( + input=input, + dim=dim, + unbiased=unbiased, + keepdim=keepdim, + dtype=dtype, + mask=mask, + take_sqrt=True + ) @_apply_docstring_templates @@ -659,10 +1023,8 @@ def softmax(input: Tensor, if dtype is None: dtype = input.dtype dim_ = _canonical_dim(dim, input.ndim)[0] + mask_input = _combine_input_and_mask(amax, input, mask) if input.layout == torch.strided: - fill = input.new_full([], _reduction_identity('amax', input)) - inmask = _input_mask(input, mask=mask) - mask_input = torch.where(inmask, input, fill) return torch.nn.functional.softmax(mask_input, dim_, dtype=dtype) else: raise ValueError(f'masked softmax expects strided tensor (got {input.layout} tensor)') @@ -677,10 +1039,8 @@ def log_softmax(input: Tensor, if dtype is None: dtype = input.dtype dim_ = _canonical_dim(dim, input.ndim)[0] + mask_input = _combine_input_and_mask(amax, input, mask) if input.layout == torch.strided: - fill = input.new_full([], _reduction_identity('amax', input)) - inmask = _input_mask(input, mask=mask) - mask_input = torch.where(inmask, input, fill) return torch.nn.functional.log_softmax(mask_input, dim_, dtype=dtype) else: raise ValueError(f'masked log_softmax expects strided tensor (got {input.layout} tensor)') @@ -695,10 +1055,8 @@ def softmin(input: Tensor, if dtype is None: dtype = input.dtype dim_ = _canonical_dim(dim, input.ndim)[0] + mask_input = _combine_input_and_mask(amin, input, mask) if input.layout == torch.strided: - fill = input.new_full([], _reduction_identity('amin', input)) - inmask = _input_mask(input, mask=mask) - mask_input = torch.where(inmask, input, fill) return torch.nn.functional.softmin(mask_input, dim_, dtype=dtype) else: raise ValueError(f'masked softmin expects strided tensor (got {input.layout} tensor)') @@ -715,13 +1073,12 @@ def normalize(input: Tensor, if dtype is None: dtype = input.dtype dim_ = _canonical_dim(dim, input.ndim)[0] + # TODO: eliminate mask_input as unnecessary when using masked divide. + mask_input = _combine_input_and_mask(sum, input, mask) if input.layout == torch.strided: nrm_ = norm(input, ord, dim, keepdim=True, dtype=dtype, mask=mask) # TODO: replace torch.maximum with masked maximum when available. denom = torch.maximum(nrm_, nrm_.new_full([], eps)) - # TODO: eliminate mask_input as unnecessary when using masked divide. - inmask = _input_mask(input, mask=mask) - mask_input = input if mask is None else torch.where(inmask, input, input.new_zeros([])) # TODO: replace torch.divide with masked divide when available. return torch.divide(mask_input, denom) else: diff --git a/torch/_masked/_docs.py b/torch/_masked/_docs.py index b8519b5f8f7b55..40b58ed8123d06 100644 --- a/torch/_masked/_docs.py +++ b/torch/_masked/_docs.py @@ -149,6 +149,136 @@ tensor([ -3, 9223372036854775807]) """ +argmax_docstring = """argmax(input, dim, *, keepdim=False, dtype=None, mask=None) -> Tensor +Returns argmax of all the elements in the :attr:`input` +tensor along the given dimension(s) :attr:`dim` while the :attr:`input` +elements are masked out according to the boolean tensor +:attr:`mask`. +The identity value of argmax operation, which is used to start the +reduction, depends on input dtype. For instance, for float32, uint8, +and int32 dtypes, the identity values are ``-inf``, ``0``, and ``-2147483648``, respectively. +If :attr:`keepdim` is ``True``, the output tensor is of the same size +as :attr:`input` except in the dimension(s) :attr:`dim` where it is of +size 1. 
Otherwise, :attr:`dim` is squeezed (see
+:func:`torch.squeeze`), resulting in the output tensor having 1 (or
+``len(dim)``) fewer dimension(s).
+
+The boolean tensor :attr:`mask` defines the "validity" of
+:attr:`input` tensor elements: if :attr:`mask` element is True
+then the corresponding element in :attr:`input` tensor will be
+included in argmax computation, otherwise the element is
+ignored.
+
+When all elements of :attr:`input` along the given dimension
+:attr:`dim` are ignored (fully masked-out), the corresponding element
+of the output tensor will have undefined value: it may or may not
+correspond to the identity value of argmax operation; the
+choice may correspond to the value that leads to the most efficient
+storage of :attr:`output` tensor.
+
+The mask of the output tensor can be computed as
+``torch.any(torch.broadcast_to(mask, input.shape), dim, keepdim=keepdim,
+dtype=torch.bool)``.
+
+The shapes of the :attr:`mask` tensor and the :attr:`input` tensor
+don't need to match, but they must be :ref:`broadcastable
+<broadcasting-semantics>` and the dimensionality of the :attr:`mask`
+tensor must not be greater than that of the :attr:`input` tensor.
+
+Args:
+    input (Tensor): the input tensor
+    dim (int): the dimension along which argmax is computed.
+
+Keyword args:
+    keepdim (bool, optional): whether the output tensor has
+      :attr:`dim` retained or not. Default: False.
+    dtype (:class:`torch.dtype`, optional): the desired data type
+      of returned tensor. If specified, the input tensor is
+      cast to :attr:`dtype` before the operation is
+      performed. Default: None.
+    mask (:class:`torch.Tensor`, optional): the boolean tensor
+      containing the binary mask of validity of input tensor
+      elements.
+      Default: None, which is equivalent to ``torch.ones(input.shape, dtype=torch.bool)``.
+Example::
+
+    >>> input = tensor([[-3, -2, -1], [ 0, 1, 2]])
+    >>> input
+    tensor([[-3, -2, -1],
+            [ 0,  1,  2]])
+    >>> mask = tensor([[ True, False, True], [False, False, False]])
+    >>> mask
+    tensor([[ True, False,  True],
+            [False, False, False]])
+    >>> torch._masked.argmax(input, 1, mask=mask)
+    tensor([2, 0])
+"""
+
+argmin_docstring = """argmin(input, dim, *, keepdim=False, dtype=None, mask=None) -> Tensor
+Returns argmin of all the elements in the :attr:`input`
+tensor along the given dimension(s) :attr:`dim` while the :attr:`input`
+elements are masked out according to the boolean tensor
+:attr:`mask`.
+The identity value of argmin operation, which is used to start the
+reduction, depends on input dtype. For instance, for float32, uint8,
+and int32 dtypes, the identity values are ``inf``, ``255``, and ``2147483647``, respectively.
+If :attr:`keepdim` is ``True``, the output tensor is of the same size
+as :attr:`input` except in the dimension(s) :attr:`dim` where it is of
+size 1. Otherwise, :attr:`dim` is squeezed (see
+:func:`torch.squeeze`), resulting in the output tensor having 1 (or
+``len(dim)``) fewer dimension(s).
+
+The boolean tensor :attr:`mask` defines the "validity" of
+:attr:`input` tensor elements: if :attr:`mask` element is True
+then the corresponding element in :attr:`input` tensor will be
+included in argmin computation, otherwise the element is
+ignored.
+
+When all elements of :attr:`input` along the given dimension
+:attr:`dim` are ignored (fully masked-out), the corresponding element
+of the output tensor will have undefined value: it may or may not
+correspond to the identity value of argmin operation; the
+choice may correspond to the value that leads to the most efficient
+storage of :attr:`output` tensor.
+
+The mask of the output tensor can be computed as
+``torch.any(torch.broadcast_to(mask, input.shape), dim, keepdim=keepdim,
+dtype=torch.bool)``.
+
+The shapes of the :attr:`mask` tensor and the :attr:`input` tensor
+don't need to match, but they must be :ref:`broadcastable
+<broadcasting-semantics>` and the dimensionality of the :attr:`mask`
+tensor must not be greater than that of the :attr:`input` tensor.
+
+Args:
+    input (Tensor): the input tensor
+    dim (int): the dimension along which argmin is computed.
+
+Keyword args:
+    keepdim (bool, optional): whether the output tensor has
+      :attr:`dim` retained or not. Default: False.
+    dtype (:class:`torch.dtype`, optional): the desired data type
+      of returned tensor. If specified, the input tensor is
+      cast to :attr:`dtype` before the operation is
+      performed. Default: None.
+    mask (:class:`torch.Tensor`, optional): the boolean tensor
+      containing the binary mask of validity of input tensor
+      elements.
+      Default: None, which is equivalent to ``torch.ones(input.shape, dtype=torch.bool)``.
+Example::
+
+    >>> input = tensor([[-3, -2, -1], [ 0, 1, 2]])
+    >>> input
+    tensor([[-3, -2, -1],
+            [ 0,  1,  2]])
+    >>> mask = tensor([[ True, False, True], [False, False, False]])
+    >>> mask
+    tensor([[ True, False,  True],
+            [False, False, False]])
+    >>> torch._masked.argmin(input, 1, mask=mask)
+    tensor([0, 0])
+"""
+
 log_softmax_docstring = """log_softmax(input, dim, *, dtype=None, mask=None) -> Tensor

 Returns log_softmax of all the slices in the :attr:`input` tensor
@@ -593,6 +723,74 @@
         [ nan, nan, nan]])
 """

+std_docstring = """std(input, dim, unbiased, *, keepdim=False, dtype=None, mask=None) -> Tensor
+Returns standard_deviation of all the elements in the :attr:`input`
+tensor along the given dimension(s) :attr:`dim` while the :attr:`input`
+elements are masked out according to the boolean tensor
+:attr:`mask`.
+The identity value of sample standard deviation operation is undefined. The
+elements of output tensor with strided layout, that correspond to
+fully masked-out elements, have ``nan`` values.
+If :attr:`keepdim` is ``True``, the output tensor is of the same size
+as :attr:`input` except in the dimension(s) :attr:`dim` where it is of
+size 1. Otherwise, :attr:`dim` is squeezed (see
+:func:`torch.squeeze`), resulting in the output tensor having 1 (or
+``len(dim)``) fewer dimension(s).
+
+The boolean tensor :attr:`mask` defines the "validity" of
+:attr:`input` tensor elements: if :attr:`mask` element is True
+then the corresponding element in :attr:`input` tensor will be
+included in standard_deviation computation, otherwise the element is
+ignored.
+
+When all elements of :attr:`input` along the given dimension
+:attr:`dim` are ignored (fully masked-out), the corresponding element
+of the output tensor will have undefined value: it may or may not
+correspond to the identity value of standard_deviation operation; the
+choice may correspond to the value that leads to the most efficient
+storage of :attr:`output` tensor.
+
+The mask of the output tensor can be computed as
+``torch.any(torch.broadcast_to(mask, input.shape), dim, keepdim=keepdim,
+dtype=torch.bool)``.
+
+The shapes of the :attr:`mask` tensor and the :attr:`input` tensor
+don't need to match, but they must be :ref:`broadcastable
+<broadcasting-semantics>` and the dimensionality of the :attr:`mask`
+tensor must not be greater than that of the :attr:`input` tensor.
+
+Args:
+    input (Tensor): the input tensor
+    dim (int or tuple of ints, optional): the dimension or dimensions to reduce.
+      Default: None, which is equivalent to ``tuple(range(input.ndim))``.
+    unbiased (bool): when True, use Bessel’s correction, otherwise, compute
+      the uncorrected sample variance.
+
+Keyword args:
+    keepdim (bool, optional): whether the output tensor has
+      :attr:`dim` retained or not. Default: False.
+    dtype (:class:`torch.dtype`, optional): the desired data type
+      of returned tensor. If specified, the input tensor is
+      cast to :attr:`dtype` before the operation is
+      performed. Default: None.
+    mask (:class:`torch.Tensor`, optional): the boolean tensor
+      containing the binary mask of validity of input tensor
+      elements.
+      Default: None, which is equivalent to ``torch.ones(input.shape, dtype=torch.bool)``.
+Example::
+
+    >>> input = tensor([[-3, -2, -1], [ 0, 1, 2]])
+    >>> input
+    tensor([[-3, -2, -1],
+            [ 0,  1,  2]])
+    >>> mask = tensor([[ True, False, True], [False, False, False]])
+    >>> mask
+    tensor([[ True, False,  True],
+            [False, False, False]])
+    >>> torch._masked.std(input, 1, False, mask=mask)
+    tensor([1., nan])
+"""
+
 sum_docstring = """sum(input, dim, *, keepdim=False, dtype=None, mask=None) -> Tensor

 Returns sum of all the elements in the :attr:`input`
diff --git a/torch/_ops.py b/torch/_ops.py
index 13470bd8558256..645b309bfb3024 100644
--- a/torch/_ops.py
+++ b/torch/_ops.py
@@ -32,13 +32,17 @@ def __init__(self, overloadpacket, op, schema):
         self._op = op
         self._schema = schema
         self._overloadpacket = overloadpacket
+        self._overloadname = 'default' if schema.overload_name == '' else schema.overload_name
+        self.__name__ = "{}.{}".format(self._schema.name.split("::")[1], self._overloadname)
+        self.__module__ = overloadpacket.__module__
+        op.__module__ = overloadpacket.__module__

     # it's a no-op since OpOverload object is immutable and must be unique for a given op overload.
     def __deepcopy__(self, memo=None):
         return self

-    def __str__(self):
-        return "OpOverload(op='{}.{}', overload='{}')".format(*self._schema.name.split("::"), self.overload_name)
+    def __repr__(self):
+        return "<OpOverload(op='{}.{}', overload='{}')>".format(*self._schema.name.split("::"), self._overloadname)

     def __call__(self, *args, **kwargs):
         return self._op(*args, **kwargs or {})
@@ -46,17 +50,15 @@ def __call__(self, *args, **kwargs):
     def __getattr__(self, key):
         return getattr(self._op, key)

-    # `my_namespace::my_op`
-    @property
-    def name(self):
-        return "{}.{}".format(*self._schema.name.split("::"))
+    def __hash__(self):
+        return hash(self._op)

-    @property
-    def overload_name(self):
-        return self._schema.overload_name
+    # `my_namespace.my_op_name.overload_name`
+    def __str__(self):
+        return "{}.{}.{}".format(*self._schema.name.split("::"), self._overloadname)

     @property
-    def overload_packet(self):
+    def overloadpacket(self):
         return self._overloadpacket

     @property
@@ -72,23 +74,21 @@ def __init__(self, qualified_op_name, op_name, op):
         # These attributes are accessible on the object through the properties
         # defined below but are immutable
         self._qualified_op_name = qualified_op_name
-        self._op_name = op_name
+        self.__name__ = op_name
         self._op = op

     # it's a no-op since OpOverloadPacket object is immutable and must be unique for a given op.
    def __deepcopy__(self, memo=None):
        return self

-    def __str__(self):
-        return "OpOverloadPacket(op='{}.{}')".format(*self._qualified_op_name.split("::"))
+    def __repr__(self):
+        return "<OpOverloadPacket(op='{}.{}')>".format(*self._qualified_op_name.split("::"))

-    @property
-    def qualified_op_name(self):
-        return "{}.{}".format(*self._qualified_op_name.split("::"))
+    def __hash__(self):
+        return hash(self._op)

-    @property
-    def op_name(self):
-        return self._op_name
+    def __str__(self):
+        return "{}.{}".format(*self._qualified_op_name.split("::"))

     @property
     def op(self):
diff --git a/torch/_python_dispatcher.py b/torch/_python_dispatcher.py
index aa19a18efb3b56..fe0c6253fdd34a 100644
--- a/torch/_python_dispatcher.py
+++ b/torch/_python_dispatcher.py
@@ -15,9 +15,9 @@
 - CPU/AutogradCPU: represents in-tree backends which we usually have dedicated inference &
   autograd kernel in pytorch core library. E.g. CPU, CUDA
-- QuantizedCPU/AutogradOther: represents in-tree backends which we usually have backend specific
+- FPGA/AutogradOther: represents in-tree backends which we usually have backend specific
   inference kernels, but they share the same autograd kernel specified in AutogradOther.
-  E.g. QuantizedCPU, QuantizedCUDA
+  E.g. FPGA, SparseCsrCPU
 - XLA/AutogradXLA: represents out-of-tree backends which we don't have either inference or autograd
   kernel defined in pytorch core library. Backend owner is responsible for registering both
   inference & autograd kernels in their extensions(e.g. torch-xla) for the operators they support.
@@ -53,7 +53,7 @@ class PythonDispatcher:
     name = "foo"
     runtime_keys = [
         "CPU", "AutogradCPU",
-        "QuantizedCPU", "AutogradOther",
+        "FPGA", "AutogradOther",
         "XLA", "AutogradXLA",
         "Lazy", "AutogradLazy",
     ]
diff --git a/torch/_tensor.py b/torch/_tensor.py
index 6a50a029ae769e..cc853162a19d01 100644
--- a/torch/_tensor.py
+++ b/torch/_tensor.py
@@ -202,11 +202,7 @@ def storage(self):
         if self.dtype not in torch.storage._dtype_to_storage_type_map():
             raise RuntimeError(f'unsupported Storage type: {self.dtype}')

-        storage = self._storage()
-        storage_name = torch.storage._dtype_to_storage_type_map()[self.dtype]
-        storage_class = eval(type(storage).__module__ + '.' + storage_name)
-        storage = storage_class(wrap_storage=storage)
-        return storage
+        return torch._TypedStorage(wrap_storage=self._storage(), dtype=self.dtype)

     def _reduce_ex_internal(self, proto):
         check_serializing_named_tensor(self)
@@ -223,7 +219,7 @@ def _reduce_ex_internal(self, proto):
         # 2. Python list is not a good fit due to performance reason.
         #    `tolist()` converts every single element in the tensor into python objects
         #    and serialize them one by one.
-        if self.device.type in ['xla', 'ort', 'mlc']:
+        if self.device.type in ['xla', 'ort', 'mlc', 'hpu']:
             return (torch._utils._rebuild_device_tensor_from_numpy, (self.cpu().numpy(),
                                                                      self.dtype,
                                                                      str(self.device),
@@ -659,7 +655,7 @@ def __rmod__(self, other):
     def __format__(self, format_spec):
         if has_torch_function_unary(self):
             return handle_torch_function(Tensor.__format__, (self,), self, format_spec)
-        if self.dim() == 0:
+        if self.dim() == 0 and not self.is_meta:
             return self.item().__format__(format_spec)
         return object.__format__(self, format_spec)

@@ -866,10 +862,10 @@ def storage_type(self):
         Returns the type of the underlying storage.
""" - # NB: this returns old fashioned _TypedStorage, e.g., FloatStorage, as it - # would be pretty pointless otherwise (it would always return - # _UntypedStorage) - return type(self.storage()) + if has_torch_function_unary(self): + return handle_torch_function(Tensor.storage_type, (self,), self) + + return self.storage()._get_legacy_storage_class() def refine_names(self, *names): r"""Refines the dimension names of :attr:`self` according to :attr:`names`. @@ -1067,53 +1063,7 @@ def to_sparse_coo(self): 25 """ - if self.is_sparse: - return self - if self.is_sparse_csr: - crow_indices = self.crow_indices() - col_indices = self.col_indices() - indices = torch._convert_indices_from_csr_to_coo(crow_indices, col_indices, - out_int32=crow_indices.dtype == torch.int32) - return torch.sparse_coo_tensor(indices, - self.values(), - size=self.shape, - dtype=self.dtype, - device=self.device) - else: - return self.to_sparse() - - def to_sparse_csr(self): - """ Convert a tensor to compressed row storage format. Only works with 2D tensors. - - Examples:: - - >>> dense = torch.randn(5, 5) - >>> sparse = dense.to_sparse_csr() - >>> sparse._nnz() - 25 - - """ - shape = self.size() - fill_value = 0 - if len(shape) != 2: - raise RuntimeError("Only 2D tensors can be converted to the CSR format but got shape: ", shape) - - if self.is_sparse: - coalesced_self = self.coalesce() - row_indices = coalesced_self.indices()[0] - device = coalesced_self.values().device - crow_indices = torch._convert_indices_from_coo_to_csr( - row_indices, self.shape[0], out_int32=row_indices.dtype == torch.int32) - return torch.sparse_csr_tensor(crow_indices, - coalesced_self.indices()[1].contiguous(), - coalesced_self.values(), - size=coalesced_self.shape, - dtype=coalesced_self.dtype, - device=device) - elif self.is_sparse_csr: - return self - else: - return self.to_sparse().to_sparse_csr() + return self.to_sparse() def _update_names(self, names, inplace): if has_torch_function_unary(self): diff --git a/torch/_tensor_docs.py b/torch/_tensor_docs.py index 7ff5da2c2f4e41..49e43c502861e2 100644 --- a/torch/_tensor_docs.py +++ b/torch/_tensor_docs.py @@ -1060,6 +1060,24 @@ def add_docstr_all(method, docstr): {memory_format} """.format(**common_args)) +add_docstr_all('ipu', + r""" +ipu(device=None, non_blocking=False, memory_format=torch.preserve_format) -> Tensor + +Returns a copy of this object in IPU memory. + +If this object is already in IPU memory and on the correct device, +then no copy is performed and the original object is returned. + +Args: + device (:class:`torch.device`): The destination IPU device. + Defaults to the current IPU device. + non_blocking (bool): If ``True`` and the source is in pinned memory, + the copy will be asynchronous with respect to the host. + Otherwise, the argument has no effect. Default: ``False``. 
+ {memory_format} +""".format(**common_args)) + add_docstr_all('xpu', r""" xpu(device=None, non_blocking=False, memory_format=torch.preserve_format) -> Tensor @@ -3374,11 +3392,68 @@ def callable(a, b) -> number """.format(**reproducibility_notes)) -add_docstr_all('scatter_reduce', r""" -scatter_reduce(input, dim, index, reduce, *, output_size=None) -> Tensor +add_docstr_all('scatter_reduce_', r""" +scatter_reduce_(dim, index, src, reduce, *, include_self=True) -> Tensor -See :func:`torch.scatter_reduce` -""") +Reduces all values from the :attr:`src` tensor to the indices specified in +the :attr:`index` tensor in the :attr:`self` tensor using the applied reduction +defined via the :attr:`reduce` argument (:obj:`"sum"`, :obj:`"prod"`, :obj:`"mean"`, +:obj:`"amax"`, :obj:`"amin"`). For each value in :attr:`src`, it is reduced to an +index in :attr:`self` which is specified by its index in :attr:`src` for +``dimension != dim`` and by the corresponding value in :attr:`index` for +``dimension = dim``. If :obj:`include_self="True"`, the values in the :attr:`self` +tensor are included in the reduction. + +:attr:`self`, :attr:`index` and :attr:`src` should all have +the same number of dimensions. It is also required that +``index.size(d) <= src.size(d)`` for all dimensions ``d``, and that +``index.size(d) <= self.size(d)`` for all dimensions ``d != dim``. +Note that ``index`` and ``src`` do not broadcast. + +For a 3-D tensor with :obj:`reduce="sum"` and :obj:`include_self=True` the +output is given as:: + + self[index[i][j][k]][j][k] += src[i][j][k] # if dim == 0 + self[i][index[i][j][k]][k] += src[i][j][k] # if dim == 1 + self[i][j][index[i][j][k]] += src[i][j][k] # if dim == 2 + +Note: + {forward_reproducibility_note} + +.. note:: + + The backward pass is implemented only for ``src.shape == index.shape``. + +.. warning:: + + This function is in beta and may change in the near future. + +Args: + dim (int): the axis along which to index + index (LongTensor): the indices of elements to scatter and reduce. + src (Tensor): the source elements to scatter and reduce + reduce (str): the reduction operation to apply for non-unique indices + (:obj:`"sum"`, :obj:`"prod"`, :obj:`"mean"`, :obj:`"amax"`, :obj:`"amin"`) + include_self (bool): whether elements from the :attr:`self` tensor are + included in the reduction + +Example:: + + >>> src = torch.tensor([1., 2., 3., 4., 5., 6.]) + >>> index = torch.tensor([0, 1, 0, 1, 2, 1]) + >>> input = torch.tensor([1., 2., 3., 4.]) + >>> input.scatter_reduce(0, index, src, reduce="sum") + tensor([5., 14., 8., 4.]) + >>> input.scatter_reduce(0, index, src, reduce="sum", include_self=False) + tensor([4., 12., 5., 4.]) + >>> input2 = torch.tensor([5., 4., 3., 2.]) + >>> input2.scatter_reduce(0, index, src, reduce="amax") + tensor([5., 6., 5., 2.]) + >>> input2.scatter_reduce(0, index, src, reduce="amax", include_self=False) + tensor([3., 6., 5., 2.]) + + +""".format(**reproducibility_notes)) add_docstr_all('select', r""" @@ -4146,6 +4221,20 @@ def callable(a, b) -> number size=(3, 3), nnz=1, layout=torch.sparse_coo) """) +add_docstr_all('to_sparse_csr', + r""" +to_sparse_csr() -> Tensor +Convert a tensor to compressed row storage format. Only works with 2D tensors. 
+ +Example:: + + >>> dense = torch.randn(5, 5) + >>> sparse = dense.to_sparse_csr() + >>> sparse._nnz() + 25 + +""") + add_docstr_all('to_mkldnn', r""" to_mkldnn() -> Tensor @@ -4752,6 +4841,13 @@ def callable(a, b) -> number Out-of-place version of :meth:`torch.Tensor.scatter_add_` """) +add_docstr_all('scatter_reduce', + r""" +scatter_reduce(dim, index, src, reduce, *, include_self=True) -> Tensor + +Out-of-place version of :meth:`torch.Tensor.scatter_reduce_` +""") + add_docstr_all('masked_scatter', r""" masked_scatter(mask, tensor) -> Tensor @@ -4868,6 +4964,11 @@ def callable(a, b) -> number Is ``True`` if the Tensor is stored on the GPU, ``False`` otherwise. """) +add_docstr_all('is_ipu', + r""" +Is ``True`` if the Tensor is stored on the IPU, ``False`` otherwise. +""") + add_docstr_all('is_xpu', r""" Is ``True`` if the Tensor is stored on the XPU, ``False`` otherwise. diff --git a/torch/_tensor_str.py b/torch/_tensor_str.py index b0bb6e93aaeecc..1c97505b0781b8 100644 --- a/torch/_tensor_str.py +++ b/torch/_tensor_str.py @@ -298,14 +298,14 @@ def get_summarized_data(self): return torch.stack([get_summarized_data(x) for x in self]) def _str_intern(inp): - prefix = 'tensor(' + self, tangent = torch.autograd.forward_ad.unpack_dual(inp) + prefix = "nested_tensor(" if self.is_nested else 'tensor(' indent = len(prefix) suffixes = [] # This is used to extract the primal value and thus disable the forward AD # within this function. # TODO(albanD) This needs to be updated when more than one level is supported - self, tangent = torch.autograd.forward_ad.unpack_dual(inp) # Note [Print tensor device]: # A general logic here is we only print device when it doesn't match @@ -380,6 +380,11 @@ def _str_intern(inp): suffixes.append('zero_point=' + str(self.q_per_channel_zero_points())) suffixes.append('axis=' + str(self.q_per_channel_axis())) tensor_str = _tensor_str(self.dequantize(), indent) + elif self.is_nested: + def indented_str(s, indent): + return "\n".join(f" {line}" for line in s.split("\n")) + strs = ",\n".join(indented_str(str(t), indent + 1) for t in torch.ops.aten.unbind.int(self, 0)) + tensor_str = f"[\n{strs}\n]" else: if self.is_meta: suffixes.append('size=' + str(tuple(self.shape))) diff --git a/torch/_torch_docs.py b/torch/_torch_docs.py index 4ba8d92b5834fb..10626fe72aa156 100644 --- a/torch/_torch_docs.py +++ b/torch/_torch_docs.py @@ -1031,9 +1031,6 @@ def merge_dicts(*dicts): CPU device, and not share its memory. .. seealso:: - :func:`torch.as_tensor` creates a tensor that always shares memory if the input is a - tensor or a NumPy array, copying otherwise. - :func:`torch.tensor` creates a tensor that always copies the data from the input object. :func:`torch.from_numpy` creates a tensor that always shares memory from NumPy arrays. @@ -8548,57 +8545,10 @@ def merge_dicts(*dicts): """) add_docstr(torch.scatter_reduce, r""" -scatter_reduce(input, dim, index, reduce, *, output_size=None) -> Tensor - -Reduces all values from the :attr:`input` tensor to the indices specified in -the :attr:`index` tensor. For each value in :attr:`input`, its output index is -specified by its index in :attr:`input` for ``dimension != dim`` and by the -corresponding value in :attr:`index` for ``dimension = dim``. -The applied reduction for non-unique indices is defined via the :attr:`reduce` -argument (:obj:`"sum"`, :obj:`"prod"`, :obj:`"mean"`, :obj:`"amax"`, :obj:`"amin"`). 
-For non-existing indices, the output will be filled with the identity of the -applied reduction (1 for :obj:`"prod"` and 0 otherwise). - -It is also required that ``index.size(d) == input.size(d)`` for all dimensions ``d``. -Moreover, if :attr:`output_size` is defined the the values of :attr:`index` must be -between ``0`` and ``output_size - 1`` inclusive. - - -For a 3-D tensor with :obj:`reduce="sum"`, the output is given as:: - - out[index[i][j][k]][j][k] += input[i][j][k] # if dim == 0 - out[i][index[i][j][k]][k] += input[i][j][k] # if dim == 1 - out[i][j][index[i][j][k]] += input[i][j][k] # if dim == 2 +scatter_reduce(input, dim, index, src, reduce, *, include_self=True) -> Tensor -Note: - This out-of-place operation is similar to the in-place versions of - :meth:`~torch.Tensor.scatter_` and :meth:`~torch.Tensor.scatter_add_`, - in which the output tensor is automatically created according to the - maximum values in :attr:`index` and filled based on the identity of the - applied reduction. - -Note: - {forward_reproducibility_note} - -Args: - input (Tensor): the input tensor - dim (int): the axis along which to index - index (LongTensor): the indices of elements to scatter and reduce. - src (Tensor): the source elements to scatter and reduce - reduce (str): the reduction operation to apply for non-unique indices - (:obj:`"sum"`, :obj:`"prod"`, :obj:`"mean"`, :obj:`"amax"`, :obj:`"amin"`) - output_size (int, optional): the size of the output at dimension :attr:`dim`. - If set to :obj:`None`, will get automatically inferred according to - :obj:`index.max() + 1` - -Example:: - - >>> input = torch.tensor([1, 2, 3, 4, 5, 6]) - >>> index = torch.tensor([0, 1, 0, 1, 2, 1]) - >>> torch.scatter_reduce(input, 0, index, reduce="sum", output_size=3) - tensor([4, 12, 5]) - -""".format(**reproducibility_notes)) +Out-of-place version of :meth:`torch.Tensor.scatter_reduce_` +""") add_docstr(torch.select, r""" @@ -9800,10 +9750,10 @@ def merge_dicts(*dicts): r""" roll(input, shifts, dims=None) -> Tensor -Roll the tensor along the given dimension(s). Elements that are shifted beyond the -last position are re-introduced at the first position. If a dimension is not -specified, the tensor will be flattened before rolling and then restored -to the original shape. +Roll the tensor :attr:`input` along the given dimension(s). Elements that are +shifted beyond the last position are re-introduced at the first position. If +:attr:`dims` is `None`, the tensor will be flattened before rolling and then +restored to the original shape. 
Args: {input} @@ -9821,6 +9771,11 @@ def merge_dicts(*dicts): [3, 4], [5, 6], [7, 8]]) + >>> torch.roll(x, 1) + tensor([[8, 1], + [2, 3], + [4, 5], + [6, 7]]) >>> torch.roll(x, 1, 0) tensor([[7, 8], [1, 2], diff --git a/torch/amp/__init__.py b/torch/amp/__init__.py new file mode 100644 index 00000000000000..e4fe09f55632e4 --- /dev/null +++ b/torch/amp/__init__.py @@ -0,0 +1 @@ +from .autocast_mode import autocast diff --git a/torch/autocast_mode.py b/torch/amp/autocast_mode.py similarity index 93% rename from torch/autocast_mode.py rename to torch/amp/autocast_mode.py index daf2a34383fb43..e9edae02819aad 100644 --- a/torch/autocast_mode.py +++ b/torch/amp/autocast_mode.py @@ -3,7 +3,7 @@ import warnings from typing import Any, Optional -from .types import _dtype +from torch.types import _dtype def autocast_decorator(autocast_instance, func): @functools.wraps(func) @@ -47,7 +47,7 @@ class autocast(object): loss.backward() optimizer.step() - See the :ref:`Automatic Mixed Precision examples` for usage (along with gradient scaling) + See the :ref:`CUDA Automatic Mixed Precision examples` for usage (along with gradient scaling) in more complex scenarios (e.g., gradient penalty, multiple models/losses, custom autograd functions). :class:`autocast` can also be used as a decorator, e.g., on the ``forward`` method of your model:: @@ -102,6 +102,23 @@ def forward(self, input): # After exiting autocast, calls f_float16.float() to use with d_float32 g_float32 = torch.mm(d_float32, f_bfloat16.float()) + Example to use with jit trace in inference:: + + class TestModel(nn.Module): + def __init__(self, input_size, num_classes): + super(TestModel, self).__init__() + self.fc1 = nn.Linear(input_size, num_classes) + def forward(self, x): + return self.fc1(x) + + input_size = 2 + num_classes = 2 + model = TestModel(input_size, num_classes).eval() + + with torch.cpu.amp.autocast(cache_enabled=False): + model = torch.jit.trace(model, torch.randn(1, input_size)) + print(model.graph_for(torch.randn(1, input_size))) + Type mismatch errors *in* an autocast-enabled region are a bug; if this is what you observe, please file an issue. 
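The move of ``autocast_mode.py`` under ``torch/amp`` (with the new ``torch/amp/__init__.py`` re-exporting ``autocast``) makes the device-generic context manager importable as ``torch.amp.autocast``. A minimal usage sketch, assuming the relocated class keeps the existing ``device_type``-based signature; the ``nn.Linear`` model and shapes below are illustrative only::

    import torch
    from torch.amp import autocast  # re-exported by the new torch/amp/__init__.py

    model = torch.nn.Linear(4, 2)   # any float32 module works the same way
    x = torch.randn(8, 4)

    # On CPU, autocast-eligible ops such as linear/matmul run in bfloat16 inside the region.
    with autocast(device_type="cpu", dtype=torch.bfloat16):
        y = model(x)

    print(y.dtype)  # expected: torch.bfloat16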
diff --git a/torch/ao/ns/_numeric_suite.py b/torch/ao/ns/_numeric_suite.py index 2db70b87a56aa6..2a54535678b271 100644 --- a/torch/ao/ns/_numeric_suite.py +++ b/torch/ao/ns/_numeric_suite.py @@ -436,6 +436,8 @@ def get_matching_activations( quantized_dict = get_logger_dict(q_module) act_dict: Dict[str, Dict] = {} for key in quantized_dict: + if len(quantized_dict[key]["tensor_val"]) == 0: + continue match_key = _find_match(sorted(float_dict, reverse=True), key, "stats") if match_key is not None: act_dict[key] = {} diff --git a/torch/ao/ns/fx/mappings.py b/torch/ao/ns/fx/mappings.py index c31261913ad358..5c3574c108a277 100644 --- a/torch/ao/ns/fx/mappings.py +++ b/torch/ao/ns/fx/mappings.py @@ -26,8 +26,10 @@ def get_base_name_to_sets_of_related_ops() -> Dict[str, Set[NSNodeTargetType]]: nn.Conv1d, nnq.Conv1d, nnqd.Conv1d, + nnqat.Conv1d, nniqat.ConvBn1d, nniqat.ConvBnReLU1d, + nniqat.ConvReLU1d, nniq.ConvReLU1d, nni.ConvReLU1d, ]), @@ -74,6 +76,7 @@ def get_base_name_to_sets_of_related_ops() -> Dict[str, Set[NSNodeTargetType]]: nn.Linear, nnq.Linear, nni.LinearReLU, + nni.LinearBn1d, nniq.LinearReLU, nniqd.LinearReLU, nnqat.Linear, @@ -447,10 +450,10 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: F.dropout, F.silu, F.mish, - # TODO(future PR): implement shadowing for binary ops and - # uncomment below - # operator.add, - # operator.mul, + operator.add, + torch.add, + operator.mul, + torch.mul, torch.sum, ]) @@ -513,6 +516,7 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: torch.squeeze, torch.stack, torch.unsqueeze, + operator.add, ]) MODS_IO_TYPE_FP32: Set[NSNodeTargetType] = set([ @@ -527,6 +531,7 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: nnqd.Conv1d, nnqd.Conv2d, nnqd.Conv3d, + nnqat.Conv1d, nnqat.Conv2d, nnqat.Conv3d, nnqat.Embedding, @@ -561,6 +566,7 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: nni.ConvReLU2d, nni.ConvReLU3d, nni.LinearReLU, + nni.LinearBn1d, nni.ConvBn1d, nni.ConvBn2d, nni.ConvBn3d, @@ -570,6 +576,7 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: nniqat.ConvBnReLU1d, nniqat.ConvBnReLU2d, nniqat.ConvBnReLU3d, + nniqat.ConvReLU1d, nniqat.ConvReLU2d, nniqat.ConvReLU3d, nniqat.LinearReLU, @@ -581,7 +588,6 @@ def get_node_type_to_io_type_map() -> Dict[str, Set[NSNodeTargetType]]: nnq.Linear, nnq.Conv1d, nnq.Conv2d, - nniq.ConvReLU2d, nnq.Conv3d, nnq.BatchNorm2d, nnq.BatchNorm3d, diff --git a/torch/ao/ns/fx/pattern_utils.py b/torch/ao/ns/fx/pattern_utils.py index b0adb5faf95d15..96569789bde4b6 100644 --- a/torch/ao/ns/fx/pattern_utils.py +++ b/torch/ao/ns/fx/pattern_utils.py @@ -8,7 +8,7 @@ from torch.ao.quantization.utils import getattr_from_fqn from .ns_types import NSNodeTargetType -from torch.ao.quantization.fx.pattern_utils import get_default_quant_patterns +from torch.ao.quantization.fx.backend_config.utils import get_native_quant_patterns from torch.ao.quantization import ( ObserverBase, FakeQuantizeBase, @@ -66,9 +66,18 @@ def get_reversed_fusions() -> List[Tuple[NSFusionType, int]]: # * multiple ops: (torch.nn.ReLU, torch.nn.Conv2d) # For fusions, we only care about patterns composed of multiple ops. # TODO(future PR): allow customizations from default patterns. 
- all_quant_patterns = get_default_quant_patterns() + all_quant_patterns = get_native_quant_patterns() + default_base_op_idx = 0 for quant_pattern, _quant_handler in all_quant_patterns.items(): + # TODO: this is a temporary hack to flatten the patterns from quantization so + # that it works with the ns matcher function, maybe we should use `is_match` + # in torch.ao.quantization.fx.match_utils to match the patterns + if isinstance(quant_pattern, tuple) and len(quant_pattern) == 2 and \ + isinstance(quant_pattern[1], tuple) and len(quant_pattern[1]) == 2: + # flatten the pattern with form (nn.ReLU, (nn.BatchNorm2d, nn.Conv2d)) + quant_pattern = (quant_pattern[0], quant_pattern[1][0], quant_pattern[1][1]) + # Only patterns of multiple ops are fusions, ignore # patterns which contain a single ops (they get matched # without caring about fusions). diff --git a/torch/ao/ns/fx/weight_utils.py b/torch/ao/ns/fx/weight_utils.py index 36e183efe1d8ec..4dba8461957efd 100644 --- a/torch/ao/ns/fx/weight_utils.py +++ b/torch/ao/ns/fx/weight_utils.py @@ -189,6 +189,7 @@ def get_op_to_type_to_weight_extraction_fn() -> Dict[str, Dict[Callable, Callabl nnqat.Linear: mod_weight_detach, nnqd.Linear: mod_weight_bias_0, nniqat.LinearReLU: mod_weight_detach, + nniqat.LinearBn1d: mod_weight_detach, nn.modules.linear.NonDynamicallyQuantizableLinear: mod_weight_detach, # LSTM nn.LSTM: get_lstm_weight, diff --git a/torch/ao/quantization/_quantize_fx_do_not_use.py b/torch/ao/quantization/_quantize_fx_do_not_use.py deleted file mode 100644 index d39abe299393b3..00000000000000 --- a/torch/ao/quantization/_quantize_fx_do_not_use.py +++ /dev/null @@ -1,34 +0,0 @@ -import torch -from torch.fx import GraphModule -from typing import Dict, Any, Optional -from .quantize_fx import ( - _check_is_graph_module, - check_is_valid_convert_custom_config_dict -) -from .fx._convert_do_not_use import _convert_do_not_use - -def _convert_fx_do_not_use( - graph_module: GraphModule, is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None, - _remove_qconfig: bool = True, - backend_config_dict: Optional[Dict[str, Any]] = None) -> torch.nn.Module: - """ - Please do not use, this is a temporary function to migrate convert_fx - to a new implementation - """ - assert is_reference - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - - _check_is_graph_module(graph_module) - check_is_valid_convert_custom_config_dict(convert_custom_config_dict) - - quantized = _convert_do_not_use( - graph_module, is_reference, convert_custom_config_dict, - False, _remove_qconfig_flag=_remove_qconfig, - backend_config_dict=backend_config_dict) - - preserved_attributes = convert_custom_config_dict.get("preserved_attributes", []) - for attr_name in preserved_attributes: - setattr(quantized, attr_name, getattr(graph_module, attr_name)) - return quantized diff --git a/torch/ao/quantization/fake_quantize.py b/torch/ao/quantization/fake_quantize.py index 9e49a8392e3ea2..ec8b9ffd3b2084 100644 --- a/torch/ao/quantization/fake_quantize.py +++ b/torch/ao/quantization/fake_quantize.py @@ -6,11 +6,9 @@ import torch from torch.nn import Module from torch.ao.quantization.observer import ( - MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver, MovingAveragePerChannelMinMaxObserver, - PerChannelMinMaxObserver, FixedQParamsObserver, default_affine_fixed_qparams_observer, default_symmetric_fixed_qparams_observer, @@ -123,15 +121,25 @@ class FakeQuantize(FakeQuantizeBase): scale: torch.Tensor zero_point: torch.Tensor - def 
__init__(self, observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=255, **observer_kwargs): + def __init__(self, observer=MovingAverageMinMaxObserver, quant_min=None, quant_max=None, **observer_kwargs): super().__init__() - assert quant_min <= quant_max, \ - 'quant_min must be less than or equal to quant_max' - self.quant_min = quant_min - self.quant_max = quant_max + # Populate quant_min/quant_max to observer_kwargs if valid + if quant_min is not None and quant_max is not None: + assert quant_min <= quant_max, \ + 'quant_min must be less than or equal to quant_max' + dtype = observer_kwargs.get("dtype", torch.quint8) + if hasattr(observer, "p"): + # In case observer is _PartialWrapper, dtype can be stored in + # observer.p.keywords["dtype"] + dtype = getattr(getattr(observer, "p", {}), "keywords", {}).get( + "dtype", dtype + ) + assert torch.iinfo(dtype).min <= quant_min, 'quant_min out of bound' + assert quant_max <= torch.iinfo(dtype).max, 'quant_max out of bound' + observer_kwargs.update({"quant_min": quant_min, "quant_max": quant_max}) self.activation_post_process = observer(**observer_kwargs) - assert torch.iinfo(self.activation_post_process.dtype).min <= quant_min, 'quant_min out of bound' - assert quant_max <= torch.iinfo(self.activation_post_process.dtype).max, 'quant_max out of bound' + self.quant_min = self.activation_post_process.quant_min + self.quant_max = self.activation_post_process.quant_max if _is_float_qparams(self.activation_post_process.qscheme): zero_point_dtype = torch.float else: @@ -335,10 +343,11 @@ def forward(self, X: torch.Tensor) -> torch.Tensor: dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, reduce_range=False) """ Default fake_quant for weights. +Observer is memoryless since averaging_constant is 1. """ -default_dynamic_fake_quant = FakeQuantize.with_args(observer=MinMaxObserver, quant_min=0, quant_max=255, - dtype=torch.quint8, memoryless=True) +default_dynamic_fake_quant = FakeQuantize.with_args(observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=255, + dtype=torch.quint8, averaging_constant=1) """ Default dynamic fake_quant for activations. """ @@ -355,23 +364,25 @@ def forward(self, X: torch.Tensor) -> torch.Tensor: ch_axis=0) """ Default fake_quant for per-channel weights. +Observer is memoryless since averaging_constant is 1. """ -default_embedding_fake_quant = FakeQuantize.with_args(observer=PerChannelMinMaxObserver, +default_embedding_fake_quant = FakeQuantize.with_args(observer=MovingAveragePerChannelMinMaxObserver, qscheme=torch.per_channel_affine_float_qparams, dtype=torch.quint8, quant_min=0, quant_max=255, ch_axis=0, - memoryless=True) + averaging_constant=1) """ Default fake_quant for embeddings. +Observer is memoryless since averaging_constant is 1. """ -default_embedding_fake_quant_4bit = FakeQuantize.with_args(observer=PerChannelMinMaxObserver, +default_embedding_fake_quant_4bit = FakeQuantize.with_args(observer=MovingAveragePerChannelMinMaxObserver, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0, dtype=torch.quint4x2, - memoryless=True) + averaging_constant=1) default_histogram_fake_quant = FakeQuantize.with_args(observer=HistogramObserver, quant_min=0, @@ -411,6 +422,27 @@ def forward(self, X: torch.Tensor) -> torch.Tensor: Fused version of `default_per_channel_weight_fake_quant`, with improved performance. 
""" +fused_wt_fake_quant_range_neg_127_to_127 = FusedMovingAvgObsFakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=-127, + quant_max=127, + dtype=torch.qint8, + qscheme=torch.per_tensor_symmetric, + eps=2 ** -12) +""" +Fused version of `default_weight_fake_quant`, with the 8-bit values restricted to [-127, +127], excluding -128. +""" + +fused_per_channel_wt_fake_quant_range_neg_127_to_127 = FusedMovingAvgObsFakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=-127, + quant_max=127, + dtype=torch.qint8, + qscheme=torch.per_channel_symmetric, + eps=2 ** -12) +""" +Fused version of `default_per_channel_weight_fake_quant`, with the 8-bit values restricted to [-127, +127], excluding -128. +""" + + def _is_fake_quant_script_module(mod): ''' Returns true if given mod is an instance of FakeQuantize script module. ''' diff --git a/torch/ao/quantization/fuse_modules.py b/torch/ao/quantization/fuse_modules.py index f276eea3c871ff..1f7027f5c8d574 100644 --- a/torch/ao/quantization/fuse_modules.py +++ b/torch/ao/quantization/fuse_modules.py @@ -7,6 +7,7 @@ # for backward compatiblity from torch.ao.quantization.fuser_method_mappings import fuse_conv_bn # noqa: F401 from torch.ao.quantization.fuser_method_mappings import fuse_conv_bn_relu # noqa: F401 +from torch.nn.utils.parametrize import type_before_parametrizations from typing import List, Optional @@ -41,7 +42,7 @@ def fuse_known_modules(mod_list, is_qat, additional_fuser_method_mapping=None): For these sequences, the first element in the output module list performs the fused operation. The rest of the elements are set to nn.Identity() """ - types = tuple(type(m) for m in mod_list) + types = tuple(type_before_parametrizations(m) for m in mod_list) fuser_method = get_fuser_method(types, additional_fuser_method_mapping) if fuser_method is None: raise NotImplementedError("Cannot fuse modules: {}".format(types)) diff --git a/torch/ao/quantization/fuser_method_mappings.py b/torch/ao/quantization/fuser_method_mappings.py index f152c30b616f99..a2882f1360479c 100644 --- a/torch/ao/quantization/fuser_method_mappings.py +++ b/torch/ao/quantization/fuser_method_mappings.py @@ -33,8 +33,6 @@ def fuse_conv_bn(is_qat, conv, bn): } if is_qat: - # TODO: remove the assert later - assert conv.training, "qat is only supported when conv.training is True currently" assert bn.num_features == conv.out_channels, 'Output channel of Conv2d must match num_features of BatchNorm2d' assert bn.affine, 'Only support fusing BatchNorm2d with affine set to True' assert bn.track_running_stats, 'Only support fusing BatchNorm2d with tracking_running_stats set to True' @@ -66,8 +64,6 @@ def fuse_conv_bn_relu(is_qat, conv, bn, relu): "Conv and BN both must be in the same mode (train or eval)." fused_module : Optional[Type[nn.Sequential]] = None if is_qat: - # TODO: remove the assert later - assert conv.training, "qat is only supported when conv.training is True currently" map_to_fused_module_train = { nn.Conv1d: nni.ConvBnReLU1d, nn.Conv2d: nni.ConvBnReLU2d, @@ -113,8 +109,6 @@ def fuse_linear_bn(is_qat, linear, bn): "Linear and BN both must be in the same mode (train or eval)." 
if is_qat: - # TODO: remove the assert later - assert linear.training, "qat is only supported when linear.training is True currently" assert bn.num_features == linear.out_features,\ "Output features of Linear must match num_features of BatchNorm1d" assert bn.affine, "Only support fusing BatchNorm1d with affine set to True" @@ -142,8 +136,7 @@ def fuse_convtranspose_bn(is_qat, convt, bn): "ConvTranspose and BN both must be in the same mode (train or eval)." if is_qat: - assert convt.training, "qat is only supported when convt.training is True currently" - raise Exception("Fusing ConvTranspose+BatchNorm not yet supported in training.") + raise Exception("Fusing ConvTranspose+BatchNorm not yet supported in QAT.") else: return nn.utils.fusion.fuse_conv_bn_eval(convt, bn, transpose=True) diff --git a/torch/ao/quantization/fx/_convert_do_not_use.py b/torch/ao/quantization/fx/_convert_do_not_use.py deleted file mode 100644 index 3d5aea83953cd9..00000000000000 --- a/torch/ao/quantization/fx/_convert_do_not_use.py +++ /dev/null @@ -1,332 +0,0 @@ -from typing import Any, Dict, List, Optional, Set, Callable -import torch -from torch.fx import ( - GraphModule, -) -from torch.fx.graph import ( - Graph, - Node, -) -from ..qconfig import QConfigAny -from ..utils import ( - activation_is_int8_quantized, - weight_is_statically_quantized, - get_qparam_dict, - _parent_name, -) -from .backend_config.utils import get_quantized_reference_module_mapping - -from .graph_module import ( - QuantizedGraphModule, - is_observed_standalone_module, -) -from ._equalize import update_obs_for_equalization, convert_eq_obs -from .utils import ( - get_custom_module_class_keys, - get_quantize_node_info, - create_getattr_from_value, -) - -from torch.ao.quantization.quantize import ( - _remove_qconfig, - is_activation_post_process, -) - -from .convert import restore_state - -# these are tuples so that they can work with isinstance(module, tuple_of_classes) -FUSED_MODULE_CLASSES = ( - torch.nn.intrinsic.LinearReLU, - torch.nn.intrinsic.ConvReLU1d, - torch.nn.intrinsic.ConvReLU2d, - torch.nn.intrinsic.ConvReLU3d, -) - -QAT_MODULE_CLASSES = ( - torch.nn.qat.Linear, - torch.nn.qat.Conv2d, - torch.nn.qat.Conv3d, - torch.nn.intrinsic.qat.LinearReLU, - torch.nn.intrinsic.qat.ConvBn2d, - torch.nn.intrinsic.qat.ConvBnReLU2d, - torch.nn.intrinsic.qat.ConvReLU2d, - torch.nn.intrinsic.qat.ConvBn3d, - torch.nn.intrinsic.qat.ConvBnReLU3d, - torch.nn.intrinsic.qat.ConvReLU3d -) - -def insert_dequantize_node( - node: Node, - graph: Graph): - """ Inserts dequantize node for `node` in `graph` - """ - with graph.inserting_after(node): - dequantize_node = graph.call_method("dequantize", (node,)) - for user_node in dict(node.users): - if user_node is not dequantize_node: - user_node.replace_input_with(node, dequantize_node) - - -def convert_standalone_module( - node: Node, - modules: Dict[str, torch.nn.Module], - model: torch.fx.GraphModule, - is_reference: bool, - backend_config_dict: Dict[str, Any]): - convert = torch.ao.quantization._quantize_fx_do_not_use._convert_do_not_use # type: ignore[attr-defined] - # We know that observed standalone module is a GraphModule since - # it's produced by us - observed_standalone_module : GraphModule = modules[str(node.target)] # type: ignore[assignment] - sm_input_quantized_idxs = \ - observed_standalone_module \ - ._standalone_module_input_quantized_idxs\ - .tolist() # type: ignore[operator] - # remove the dequantize nodes for inputs - args = list(node.args) - for idx in range(len(args)): - if idx in 
sm_input_quantized_idxs: - arg = args[idx] - if arg.op == "call_method" and arg.target == "dequantize": # type: ignore[union-attr] - quantize_node = arg.args[0] # type: ignore[union-attr] - node.replace_input_with(arg, quantize_node) - if len(arg.users) == 0: # type: ignore[union-attr] - model.graph.erase_node(arg) - # add dequantize node for output - sm_output_quantized_idxs = \ - observed_standalone_module \ - ._standalone_module_output_quantized_idxs \ - .tolist() # type: ignore[operator] - if len(sm_output_quantized_idxs) > 0: - assert sm_output_quantized_idxs[0] == 0, "Currently only quantized" - "output idxs = [0] is supported" - - # if it's non-empty, then it means the output is kept in quantized form - # we'll just add a dequantize node after this node - insert_dequantize_node(node, model.graph) - - # TODO: allow convert_custom_config_dict to override backend_config_dict - # for standalone module - quantized_standalone_module = convert( - observed_standalone_module, - is_reference=True, - backend_config_dict=backend_config_dict) - parent_name, name = _parent_name(node.target) - # update the modules dict - setattr(modules[parent_name], name, quantized_standalone_module) - modules[str(node.target)] = quantized_standalone_module - -def convert_weighted_module( - node: Node, - modules: Dict[str, torch.nn.Module], - observed_node_names: Set[str], - quantized_reference_module_mapping: Dict[Callable, Any]): - original_module = modules[str(node.target)] - qconfig = original_module.qconfig - - is_observed = node.name in observed_node_names - is_activation_quantized = activation_is_int8_quantized(qconfig) - is_weight_quantized = weight_is_statically_quantized(qconfig) - # TODO: rename weight_is_statically_quantized to weight_is_int8_quantized - if qconfig is None or \ - not is_observed or \ - not is_weight_quantized or \ - not is_activation_quantized: - return - - float_module = original_module - fused_module = None - if isinstance( - original_module, - QAT_MODULE_CLASSES): - # case 1. converting qat module to - # a float module, we need to attch - # weight fake_quant to the module, - # weight fake_quant is assumed to be run during - # QAT so we don't need to run it again here - float_module = original_module.to_float() # type: ignore[operator] - # change qat conv to conv - parent_name, name = _parent_name(node.target) - setattr(modules[parent_name], name, float_module) - if isinstance(float_module, torch.nn.intrinsic._FusedModule): - fused_module = float_module - float_module = fused_module[0] - weight_post_process = original_module.weight_fake_quant - else: - # case 2. 
converting a float module/fused float module - # to float module, we need to attach - # weight observer to the conv module and run it - # with conv weight - if isinstance(original_module, torch.nn.intrinsic._FusedModule): - fused_module = original_module - float_module = fused_module[0] # type: ignore[index] - assert qconfig is not None - weight_post_process = qconfig.weight() # type: ignore[union-attr, operator] - # run weight observer - weight_post_process(float_module.weight) # type: ignore[operator] - weight_qparams = get_qparam_dict(weight_post_process) - # TODO: may need to change the mapping when we support dynamic quantization - ref_qmodule_cls = quantized_reference_module_mapping.get(type(float_module), None) - assert ref_qmodule_cls is not None, f"No reference quantized module class configured for {type(float_module)}" - ref_qmodule = ref_qmodule_cls.from_float(float_module, weight_qparams) # type: ignore[attr-defined] - if fused_module is not None: - fused_module[0] = ref_qmodule - else: - parent_name, name = _parent_name(node.target) - setattr(modules[parent_name], name, ref_qmodule) - -def _convert_do_not_use( - model: GraphModule, is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None, - is_standalone_module: bool = False, - _remove_qconfig_flag: bool = True, - backend_config_dict: Optional[Dict[str, Any]] = None) -> torch.nn.Module: - """ - We will convert an observed model (a module with observer calls) to a reference - quantized model, the rule is simple: - 1. for each observer module call in the graph, we'll convert it to calls to - quantize and dequantize functions based on the observer instance - 2. for weighted operations like linear/conv, we need to convert them to reference - quantized module, this requires us to know whether the dtype configured for the - weight is supported in the backend, this is done in prepare step and the result - is stored in observed_node_names, we can decide whether we need to swap the - module based on this set - - standalone_module means it a submodule that is not inlined in - parent module, and will be quantized separately as one unit. 
- - Returns a quantized standalone module, whether input/output is quantized is - specified by prepare_custom_config_dict, with - input_quantized_idxs, output_quantized_idxs, please - see docs for prepare_fx for details - """ - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - patterns, node_name_to_scope, prepare_custom_config_dict, observed_node_names = restore_state(model) - qconfig_map: Dict[str, QConfigAny] = model._qconfig_map # type: ignore[assignment] - - assert is_reference, "_convert_do_not_use only supports reference option" - - # mapping from fully qualified module name to module instance - # for example, - # { - # '': Model(...), - # 'linear': Linear(...), - # 'linear.weight_fake_quant': PerChannelMinMaxObserver(...), - # } - # We use remove_duplicate=False here because torch.cat uses - # the same activation_post_process module instance but different names - modules = dict(model.named_modules(remove_duplicate=False)) - - custom_module_classes = get_custom_module_class_keys( - convert_custom_config_dict, - "observed_to_quantized_custom_module_class") - - if model._equalization_qconfig_map is not None: - # If we want to do equalization then do the following: - # Calculate the equalization scale, update the observers with the scaled - # inputs, and scale the weight - weight_eq_obs_dict = update_obs_for_equalization(model, modules) - convert_eq_obs(model, modules, weight_eq_obs_dict) - - graph_inputs: List[str] = [] - for node in model.graph.nodes: - if node.op == 'placeholder': - graph_inputs.append(node.name) - - def replace_observer_with_quantize_dequantize_node(graph: Graph, node: Node, modules: Dict[str, torch.nn.Module]) -> None: - """ Replace activation_post_process module call node with quantize and - dequantize node - - Before: - ... -> observer_0(x) -> ... - After: - ... -> torch.quantize_per_tensor(x, ...) -> x.dequantize() -> ... - """ - assert modules is not None - assert isinstance(node.target, str) - observer_module = modules[node.target] - root_module = modules[""] - if observer_module.dtype == torch.float32: - # remove the node for now - # TODO: support dynamic quant - with graph.inserting_before(node): - node.replace_all_uses_with(node.args[0]) - graph.erase_node(node) - elif observer_module.dtype in [torch.quint8, torch.qint8, torch.float16]: - node_type, quantize_op, qparams = get_quantize_node_info(observer_module) - # replace observer node with quant - dequant node - with graph.inserting_before(node): - input_node = node.args[0] - inputs = [input_node] - for key, value in qparams.items(): - if key in ['_scale_', '_zero_point_']: - # For scale and zero_point values we register them as buffers in the root module. - # TODO: maybe need more complex attr name here - qparam_node = create_getattr_from_value(root_module, graph, key, value) - inputs.append(qparam_node) - else: - # for qparams that are not scale/zero_point (like axis, dtype) we store them as literals in the graph. 
- inputs.append(value) - - quantized_node = graph.create_node(node_type, quantize_op, tuple(inputs), {}) - dequantized_node = graph.call_method("dequantize", args=(quantized_node,)) - node.replace_all_uses_with(dequantized_node) - graph.erase_node(node) - - - # additional state to override inputs to be quantized, if specified - # by the user - placeholder_node_seen_cnt = 0 - output_node_seen_cnt = 0 - input_quantized_idxs: List[int] = prepare_custom_config_dict.get( - "input_quantized_idxs", []) - output_quantized_idxs: List[int] = prepare_custom_config_dict.get( - "output_quantized_idxs", []) - - if backend_config_dict is None: - backend_config_dict = {} - quantized_reference_module_mapping = get_quantized_reference_module_mapping(backend_config_dict) - # convert tuples so that it can work with isinstance(module, tuple_of_classes) - weighted_module_classes = tuple(quantized_reference_module_mapping.keys()) - - for node in list(model.graph.nodes): - if node.op == 'placeholder': - cur_placeholder_node_idx = placeholder_node_seen_cnt - placeholder_node_seen_cnt += 1 - if cur_placeholder_node_idx in input_quantized_idxs: - # Inputs are assumed to be quantized if the user specifid the - # input_quantized_idxs override. - # we need to dequantize the inputs since all operators took - # floating point inputs in reference quantized models - insert_dequantize_node(node, model.graph) - elif node.op == "output": - cur_output_node_idx = output_node_seen_cnt - output_node_seen_cnt += 1 - if cur_output_node_idx in output_quantized_idxs: - # Result are kept quantized if the user specified the - # output_quantized_idxs override. - # Remove the dequantize operator in the end - maybe_dequantize_node = node.args[0] - if isinstance(maybe_dequantize_node, Node) and \ - maybe_dequantize_node.op == "call_method" and \ - maybe_dequantize_node.target == "dequantize": - quantize_node = maybe_dequantize_node.args[0] - maybe_dequantize_node.replace_all_uses_with(quantize_node) - model.graph.erase_node(maybe_dequantize_node) - elif node.op == "call_module": - if is_activation_post_process(modules[node.target]): - replace_observer_with_quantize_dequantize_node(model.graph, node, modules) - elif is_observed_standalone_module(modules[node.target]): - # TODO: move this to a separate function - convert_standalone_module(node, modules, model, is_reference, backend_config_dict) - - elif type(modules[node.target]) in set( - weighted_module_classes).union(QAT_MODULE_CLASSES).union(FUSED_MODULE_CLASSES): - convert_weighted_module(node, modules, observed_node_names, quantized_reference_module_mapping) - - # removes qconfig and activation_post_process modules - if _remove_qconfig_flag: - _remove_qconfig(model) - preserved_attributes = set(convert_custom_config_dict.get("preserved_attributes", [])) - model = QuantizedGraphModule(model, model.graph, preserved_attributes) - return model diff --git a/torch/ao/quantization/fx/_lower_to_native_backend.py b/torch/ao/quantization/fx/_lower_to_native_backend.py index 8b66370cb2a364..fdd0a5c172b75c 100644 --- a/torch/ao/quantization/fx/_lower_to_native_backend.py +++ b/torch/ao/quantization/fx/_lower_to_native_backend.py @@ -1,31 +1,28 @@ -import itertools import torch -from torch.fx import map_arg +from torch.fx import map_arg, Node from torch.fx.graph import Graph import torch.nn as nn import torch.nn.functional as F import torch.nn.intrinsic as nni import torch.nn.intrinsic.quantized as nniq +import torch.nn.intrinsic.quantized.dynamic as nniqd import torch.nn.quantized as nnq +import 
torch.nn.quantized.dynamic as nnqd import torch.nn.quantized._reference as nnqr from torch.nn.quantized.modules.utils import WeightedQuantizedModule -from . import subgraph_rewriter_FORKED_DO_NOT_USE from .graph_module import QuantizedGraphModule -from .quantized_fusion_patterns_and_replacements import get_fbgemm_patterns_and_replacements -from .match_utils import is_match, MatchAllNode -from .quantization_types import Pattern from .utils import ( collect_producer_nodes, get_linear_prepack_op_for_dtype, get_new_attr_name_with_prefix, + get_qconv_prepack_op, graph_module_from_producer_nodes, ) from ..utils import _parent_name from ..qconfig import QConfigAny from ..quantization_mappings import get_quantized_operator from .utils import create_node_from_old_node_preserve_meta -from typing import Dict, Tuple, Type, List, Callable, Any, Union -from torch.fx import Node +from typing import Dict, Tuple, Type, List, Callable, Any, Union, Set, Optional import operator QOP_TO_ARG_NAMES_TO_SKIP = { @@ -85,6 +82,10 @@ def is_default_node(node, modules): torch.nn.InstanceNorm3d, torch.nn.LayerNorm, torch.nn.Dropout, + torch.nn.BatchNorm2d, + torch.nn.BatchNorm3d, + torch.nn.intrinsic.BNReLU2d, + torch.nn.intrinsic.BNReLU3d, ] return _is_node_in_list(node, modules, func_list, method_list, module_type_list) @@ -179,9 +180,13 @@ def is_special_pattern_node(node, modules): res_module = res_module or is_call_module return res_function, res_method, res_module - def is_dequantize_node(node): - return isinstance(node, Node) and node.op == 'call_method' and node.target == 'dequantize' + return isinstance(node, Node) and node.op == "call_method" and node.target == "dequantize" + +def is_getattr_tensor_metadata_node(node): + return node.op == "call_function" and \ + node.target == getattr and \ + node.args[1] in ["shape"] def should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigAny]): """ @@ -192,16 +197,32 @@ def should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigA """ return op.name in qconfig_map and qconfig_map[op.name] is None -# Mapping from reference module class to the replacement quantized module class for lowering -LOWER_MODULE_MAP: Dict[Type[nn.Module], Type[WeightedQuantizedModule]] = { +# Mapping from reference module class to the replacement static quantized module class for lowering +STATIC_LOWER_MODULE_MAP: Dict[Type[nn.Module], Type[WeightedQuantizedModule]] = { nnqr.Linear: nnq.Linear, nnqr.Conv1d: nnq.Conv1d, nnqr.Conv2d: nnq.Conv2d, nnqr.Conv3d: nnq.Conv3d, } -# TODO: merge with LOWER_MODULE_MAP after we merge -# _lower_weighted_ref_module and special_pattern_replacement +# Mapping from reference module class to the replacement dynamic quantized module class for lowering +DYNAMIC_LOWER_MODULE_MAP: Dict[Type[nn.Module], Type[nn.Module]] = { + nnqr.Linear: nnqd.Linear, + nnqr.GRUCell: nnqd.GRUCell, + nnqr.LSTMCell: nnqd.LSTMCell, + nnqr.RNNCell: nnqd.RNNCell, + nnqr.LSTM: nnqd.LSTM, +} + +# Mapping from reference module class to the replacement weight only quantized module class for lowering +# TODO: correct the namespace for these modules +WEIGHT_ONLY_LOWER_MODULE_MAP: Dict[Type[nn.Module], Type[nn.Module]] = { + nnqr.Embedding: nnq.Embedding, + nnqr.EmbeddingBag: nnq.EmbeddingBag, +} + +# TODO: merge with STATIC_LOWER_MODULE_MAP after we merge +# _lower_static_weighted_ref_module and special_pattern_replacement SPECIAL_PATTERN_LOWER_MODULE_MAP = { nn.BatchNorm2d: nnq.BatchNorm2d, nn.BatchNorm3d: nnq.BatchNorm3d, @@ -215,26 +236,38 @@ def 
should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigA nn.InstanceNorm3d: nnq.InstanceNorm3d, nn.LayerNorm: nnq.LayerNorm, nn.Dropout: nnq.Dropout, + nni.BNReLU2d: nniq.BNReLU2d, + nni.BNReLU3d: nniq.BNReLU3d, } # Mapping from fused module class to a 2-tuple of: # 1) The inner reference module class -# 2) The replacement quantized module class for lowering -LOWER_FUSED_MODULE_MAP: Dict[Type[nn.Module], Tuple[Type[nn.Module], Type[WeightedQuantizedModule]]] = { +# 2) The replacement static quantized module class for lowering +STATIC_LOWER_FUSED_MODULE_MAP: Dict[Type[nn.Module], Tuple[Type[nn.Module], Type[WeightedQuantizedModule]]] = { nni.LinearReLU: (nnqr.Linear, nniq.LinearReLU), nni.ConvReLU1d: (nnqr.Conv1d, nniq.ConvReLU1d), nni.ConvReLU2d: (nnqr.Conv2d, nniq.ConvReLU2d), nni.ConvReLU3d: (nnqr.Conv3d, nniq.ConvReLU3d), } +# Mapping from fused module class to a 2-tuple of: +# 1) The inner reference module class +# 2) The replacement dynamic quantized module class for lowering +DYNAMIC_LOWER_FUSED_MODULE_MAP: Dict[Type[nn.Module], Tuple[Type[nn.Module], Type[nn.Module]]] = { + nni.LinearReLU: (nnqr.Linear, nniqd.LinearReLU), +} + # Mapping from a functional to lower to a 2-tuple of # 1) The quantized version of the op # 2) The quantized version of the op fused with relu, if it exists, else None -LOWER_FUNCTIONAL_MAP = { +STATIC_LOWER_FUNCTIONAL_MAP: Dict[Callable, Tuple[Callable, Callable]] = { F.linear: (torch.ops.quantized.linear, torch.ops.quantized.linear_relu), + F.conv1d: (torch.ops.quantized.conv1d, torch.ops.quantized.conv1d_relu), + F.conv2d: (torch.ops.quantized.conv2d, torch.ops.quantized.conv2d_relu), + F.conv3d: (torch.ops.quantized.conv3d, torch.ops.quantized.conv3d_relu), } -WEIGHT_PREPACK_OPS = { +WEIGHT_PREPACK_OPS: Set[Callable] = { torch._ops.ops.quantized.linear_prepack, torch._ops.ops.quantized.linear_prepack_fp16, torch._ops.ops.quantized.conv1d_prepack, @@ -242,9 +275,39 @@ def should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigA torch._ops.ops.quantized.conv3d_prepack, } +# Mapping from a functional to a dictionary, where the key is a 2-tuple of +# (activation_compute_dtype, weight_dtype) and the value is a 2-tuple of +# 1) The dynamically quantized version of the op +# 2) The dynamically quantized version of the op fused with relu, if it exists, else None +DYNAMIC_LOWER_FUNCTIONAL_MAP: Dict[Callable, Dict[Tuple[torch.dtype, torch.dtype], Tuple[Callable, Optional[Callable]]]] = { + F.linear: { + (torch.quint8, torch.qint8): (torch.ops.quantized.linear_dynamic, + torch.ops.quantized.linear_relu_dynamic), + (torch.float16, torch.float16): (torch.ops.quantized.linear_dynamic_fp16, + torch.ops.quantized.linear_relu_dynamic_fp16) + }, + # dynamic conv + relu is not available yet + F.conv1d: { + (torch.quint8, torch.qint8): (torch.ops.quantized.conv1d_dynamic, None), + }, + F.conv2d: { + (torch.quint8, torch.qint8): (torch.ops.quantized.conv2d_dynamic, None), + }, + F.conv3d: { + (torch.quint8, torch.qint8): (torch.ops.quantized.conv3d_dynamic, None), + }, +} + +CONV_FUNCTIONAL_OPS: Set[Callable] = { + F.conv1d, + F.conv2d, + F.conv3d, +} + def fold_weight( - quantized: QuantizedGraphModule, - node_name_to_scope: Dict[str, Tuple[str, type]]) -> QuantizedGraphModule: + quantized: QuantizedGraphModule, + node_name_to_scope: Dict[str, Tuple[str, type]] +) -> QuantizedGraphModule: """ Trace back from the weight node util we hit getattr, reconstruct the graph module with the traced nodes and run the graph module to pack the @@ 
-295,186 +358,404 @@ def load_arg(a): else: # copy other nodes env[node.name] = folded_graph.node_copy(node, load_arg) - quantized = QuantizedGraphModule(quantized_root, folded_graph, quantized_root.preserved_attr_names) - return quantized + return QuantizedGraphModule(quantized_root, folded_graph, quantized_root.preserved_attr_names) -def _lower_weighted_ref_module(model: QuantizedGraphModule) -> QuantizedGraphModule: +def _get_module(node: Node, modules: Dict[str, nn.Module]) -> Optional[nn.Module]: + """ + Return the `torch.nn.Module` that corresponds to the specified node's target. + If no such node exists, return None. + """ + if node.op == "call_module" and str(node.target) in modules: + return modules[str(node.target)] + else: + return None + +def _match_static_pattern( + node: Node, + modules: Dict[str, nn.Module], + qconfig_map: Dict[str, QConfigAny], + matching_modules_or_ops: List[Callable], + dequantize_node_arg_indices: List[int] +) -> Union[Tuple[Node, Node, Node], Tuple[None, None, None]]: + """ + Match the pattern (dequantize - ref node - quantize) against the node provided. + + If there is a match, return a 3-tuple of: + 1) q_node: the quantize node, + 2) relu_node: a relu node wrapping the ref_node, and + 3) ref_node: a reference module or functional node to replace with its quantized counterpart + Otherwise, if there is no match, return a 3-tuple of (None, None, None). + + Parameters: + node: The `torch.fx.Node` to match against. + modules: A mapping from node names to modules in the model graph, used for module lookup. + qconfig_map: A mapping from node names to the qconfigs associated with the nodes. + If the corresponding qconfig for the reference node is None, then return no match. + matching_modules_or_ops: Either a list of functions or a list of `torch.nn.Module`s. + If the reference node is not in this list, then return no match. + dequantize_node_arg_indices: A list of indices in the reference node args where dequantize + nodes may be present. An empty list means skipping the check for dequantize nodes. + """ + SKIP_LOWERING_VALUE = (None, None, None) + + # Match quantize node + if node.op != "call_function" or node.target != torch.quantize_per_tensor: + return SKIP_LOWERING_VALUE + q_node = node + ref_node = q_node.args[0] + assert(isinstance(ref_node, Node)) + + # Handle cases where the node is wrapped in a ReLU + if (ref_node.op == "call_function" and ref_node.target in (F.relu, torch.relu)) or\ + (ref_node.op == "call_module" and type(_get_module(ref_node, modules)) == nn.ReLU): + relu_node = ref_node + ref_node = relu_node.args[0] + assert(isinstance(ref_node, Node)) + else: + relu_node = None + if should_skip_lowering(ref_node, qconfig_map): + return SKIP_LOWERING_VALUE + + # Match reference module or functional + if isinstance(matching_modules_or_ops[0], type) and issubclass(matching_modules_or_ops[0], nn.Module): + expected_op = "call_module" + match_key = type(_get_module(ref_node, modules)) + else: + expected_op = "call_function" + match_key = ref_node.target + if ref_node.op != expected_op or match_key not in matching_modules_or_ops: + return SKIP_LOWERING_VALUE + + # Match dequantize node(s). 
Both of the following conditions must pass: + # (1) All `torch.fx.Node`s at the matching indices must be a dequantize node + # (2) There must be at least one dequantize node + matched_dequantize = False + for i in dequantize_node_arg_indices: + assert i < len(ref_node.args),\ + "Dequantize index %s exceeded reference node's arg length %s" % (i, len(ref_node.args)) + arg = ref_node.args[i] + if is_dequantize_node(arg): + matched_dequantize = True + elif isinstance(arg, Node): + return SKIP_LOWERING_VALUE + if not matched_dequantize: + return SKIP_LOWERING_VALUE + + return (q_node, relu_node, ref_node) + +def _lower_static_weighted_ref_module( + model: QuantizedGraphModule, + qconfig_map: Dict[str, QConfigAny]): """ Traverse the graph and find dequantize - ref module - quantize patterns and replace them with the quantized version of the ref module. """ - for ref_class in list(LOWER_MODULE_MAP.keys()) + list(LOWER_FUSED_MODULE_MAP.keys()): - pattern = (torch.quantize_per_tensor, - (ref_class, "dequantize"), - MatchAllNode, MatchAllNode, MatchAllNode) - modules = dict(model.named_modules(remove_duplicate=False)) - nodes = list(model.graph.nodes) - # TODO: maybe orgnize this better (e.g. break down to more functions) - # to make this function more readable - for n in model.graph.nodes: - if not is_match(modules, n, pattern): - continue - q_node = n - ref_node = q_node.args[0] - dq_node = ref_node.args[0] - # get output scale/zero_point/dtype from the quantize node - scale_node = q_node.args[1] - zero_point_node = q_node.args[2] - dtype = q_node.args[3] - - # this can be removed if we add support for "get_attr" in is_match - if scale_node.op != "get_attr" or zero_point_node.op != "get_attr": - print("Find the pattern but scale_node and zero_point node are not `get_attr`," - f"got: {scale_node.format_node} {zero_point_node.format_node()}") + modules = dict(model.named_modules(remove_duplicate=False)) + nodes = list(model.graph.nodes) + for n in model.graph.nodes: + # Step 0: Find nodes that match this pattern (dequantize - ref module - quantize) + matching_modules = list(STATIC_LOWER_MODULE_MAP.keys()) + list(STATIC_LOWER_FUSED_MODULE_MAP.keys()) + (q_node, relu_node, ref_node) = _match_static_pattern( + n, modules, qconfig_map, matching_modules, dequantize_node_arg_indices=[0]) # type: ignore[arg-type] + if q_node is None: + continue + assert(ref_node is not None) + (_, scale_node, zero_point_node, _) = q_node.args + ref_module = _get_module(ref_node, modules) + ref_class = type(ref_module) + assert(isinstance(scale_node, Node)) + assert(isinstance(zero_point_node, Node)) + assert(issubclass(ref_class, nn.Module)) + + # Step 1: Change this pattern to use the corresponding quantized module + # For fused modules, we also check whether the inner module is a reference module + # If so, we replace the entire fused module with the corresponding quantized module + if ref_class in STATIC_LOWER_FUSED_MODULE_MAP: + inner_ref_class, q_class = STATIC_LOWER_FUSED_MODULE_MAP[ref_class] + if type(ref_module[0]) != inner_ref_class: # type: ignore[index] continue + else: + q_class = STATIC_LOWER_MODULE_MAP[ref_class] + output_scale = getattr(model, scale_node.target) + output_zero_point = getattr(model, zero_point_node.target) + q_module = q_class.from_reference(ref_module, output_scale, output_zero_point) + # replace reference module with quantized module + parent_name, module_name = _parent_name(ref_node.target) + setattr(modules[parent_name], module_name, q_module) + + # Step 2: Remove dq_node, q_node and its 
args + dq_node = ref_node.args[0] + assert(isinstance(dq_node, Node)) + dq_node.replace_all_uses_with(dq_node.args[0]) + model.graph.erase_node(dq_node) + q_node.replace_all_uses_with(ref_node) + model.graph.erase_node(q_node) + model.graph.erase_node(scale_node) + model.graph.erase_node(zero_point_node) - # this can be removed if we add support for constants in is_match - if dtype != torch.quint8: - print(f"Only qint8 output for quantized op is supported, got: {dtype}") +def _lower_dynamic_weighted_ref_module(model: QuantizedGraphModule): + """ + Traverse the graph and find quantize_per_tensor_dynamic - dequantize - ref_module patterns + and replace them with the dynamically quantized version of the ref module. + """ + named_modules = dict(model.named_modules(remove_duplicate=False)) + for n in model.graph.nodes: + if n.op != "call_module" or \ + type(named_modules[str(n.target)]) not in \ + set(DYNAMIC_LOWER_MODULE_MAP.keys()).union( + set(DYNAMIC_LOWER_FUSED_MODULE_MAP.keys())): + continue + ref_node = n + dq_node = ref_node.args[0] + if dq_node.op != "call_method" or dq_node.target != "dequantize": + continue + # don't support lowering the pattern when the result of dequantize is used by + # multiple nodes + if len(dq_node.users) > 1: + continue + + input_dynamic_q_node = dq_node.args[0] + # don't support lowering the pattern when the result of quantize is used by + # multiple nodes + if len(input_dynamic_q_node.users) > 1: + continue + + if input_dynamic_q_node.op != "call_function" or \ + input_dynamic_q_node.target != torch.quantize_per_tensor_dynamic: + continue + + activation_compute_dtype = input_dynamic_q_node.args[1] + is_fp16 = activation_compute_dtype == torch.float16 + is_int8 = activation_compute_dtype in [torch.quint8, torch.qint8] + if not is_int8 and not is_fp16: + continue + + ref_module = named_modules[str(ref_node.target)] + ref_class = type(ref_module) + if ref_class in DYNAMIC_LOWER_FUSED_MODULE_MAP: + inner_ref_class, q_class = DYNAMIC_LOWER_FUSED_MODULE_MAP[ref_class] + if type(ref_module[0]) != inner_ref_class: continue + else: + q_class = DYNAMIC_LOWER_MODULE_MAP.get(ref_class) # type: ignore[assignment] + # TODO: maybe define a WeightedDynamicallyQuantizedModule + q_module = q_class.from_reference(ref_module) # type: ignore[attr-defined] - # change this pattern to use the corresponding quantized module - ref_module = modules[ref_node.target] - output_scale = getattr(model, scale_node.target) - output_zero_point = getattr(model, zero_point_node.target) - # For fused modules, we also check whether the inner module is a reference module - # If so, we replace the entire fused module with the corresponding quantized module - if ref_class in LOWER_FUSED_MODULE_MAP: - inner_ref_class, q_class = LOWER_FUSED_MODULE_MAP[ref_class] - if type(ref_module[0]) != inner_ref_class: - continue - else: - q_class = LOWER_MODULE_MAP[type(ref_module)] - assert issubclass(q_class, WeightedQuantizedModule) # suppress mypy warnings - q_module = q_class.from_reference(ref_module, output_scale, output_zero_point) - - # replace reference module with quantized module - parent_name, module_name = _parent_name(ref_node.target) - setattr(modules[parent_name], module_name, q_module) - # remove dq node: - dq_node_input = dq_node.args[0] - - dq_node.replace_all_uses_with(dq_node_input) - model.graph.erase_node(dq_node) + # replace reference moduel with dynamically quantized module + parent_name, module_name = _parent_name(ref_node.target) + setattr(named_modules[parent_name], module_name, 
q_module) - # remove q node and args: - q_node.replace_all_uses_with(ref_node) - model.graph.erase_node(q_node) - model.graph.erase_node(scale_node) - model.graph.erase_node(zero_point_node) - return model + # remove q - dq node + dq_node.replace_all_uses_with(input_dynamic_q_node) + model.graph.erase_node(dq_node) + input_dynamic_q_node.replace_all_uses_with(input_dynamic_q_node.args[0]) + model.graph.erase_node(input_dynamic_q_node) -def _lower_weighted_ref_functional( - model: QuantizedGraphModule, - qconfig_map: Dict[str, QConfigAny] -) -> QuantizedGraphModule: +def _lower_weight_only_weighted_ref_module(model: QuantizedGraphModule): + """ + Traverse the graph and find ref_module patterns + and replace them with the weight only quantized version of the ref module. + """ + named_modules = dict(model.named_modules(remove_duplicate=False)) + for n in model.graph.nodes: + if n.op != "call_module" or \ + type(named_modules[str(n.target)]) not in \ + set(WEIGHT_ONLY_LOWER_MODULE_MAP.keys()): + continue + ref_node = n + ref_module = named_modules[str(ref_node.target)] + ref_class = type(ref_module) + q_class = WEIGHT_ONLY_LOWER_MODULE_MAP.get(ref_class) + # TODO: WeightedQuantizedModule is currently assuming static quant apis + # with output_scale, output_zero_point in from_reference, we may want to + # relax that, or rename this + # TODO: maybe define a WeightedWeightOnlyQuantizedModule + q_module = q_class.from_reference(ref_module) # type: ignore[union-attr] + + # replace reference module with weight only quantized module + parent_name, module_name = _parent_name(ref_node.target) + setattr(named_modules[parent_name], module_name, q_module) + +def _lower_static_weighted_ref_functional( + model: QuantizedGraphModule, + qconfig_map: Dict[str, QConfigAny]): """ Traverse the graph and replace functional reference patterns with their quantized versions. """ - for ref_func, (q_func, q_relu_func) in LOWER_FUNCTIONAL_MAP.items(): - configurations = itertools.product( - (False, True), # is_relu: whether ref_func is wrapped in a relu op - (False, True), # has_bias: whether bias is passed as an extra argument to ref_func - ) - for is_relu, has_bias in configurations: - if is_relu and q_relu_func is None: - continue + modules = dict(model.named_modules(remove_duplicate=False)) + nodes = list(model.graph.nodes) + for n in model.graph.nodes: + # Step 0: Find nodes that match this pattern (dequantize - functional op - quantize) + matching_ops = list(STATIC_LOWER_FUNCTIONAL_MAP.keys()) + (q_node, relu_node, func_node) = _match_static_pattern( + n, modules, qconfig_map, matching_ops, dequantize_node_arg_indices=[0, 1]) + if q_node is None: + continue + assert(func_node is not None) + (_, output_scale_node, output_zp_node, _) = q_node.args + (input_dq_node, weight_dq_node, *remaining_func_args) = func_node.args + assert(isinstance(output_zp_node, Node)) + assert(isinstance(input_dq_node, Node)) + assert(isinstance(weight_dq_node, Node)) + quantized_weight = weight_dq_node.args[0] + assert(isinstance(quantized_weight, Node)) + if quantized_weight.op != "call_function" or\ + quantized_weight.target not in (torch.quantize_per_tensor, torch.quantize_per_channel): + continue - # Set up match pattern: (dequantize - [relu_op - ] func_op - quantize) - # Func args: (dequantized inputs, dequantized weights[, bias]) - # Quantize args: (func, scale, zp, dtype) - func_pattern: Tuple[Any, ...] 
= () - if has_bias: - func_pattern = (ref_func, "dequantize", "dequantize", MatchAllNode) - else: - func_pattern = (ref_func, "dequantize", "dequantize") - if is_relu: - func_pattern = (F.relu, func_pattern) - pattern = (torch.quantize_per_tensor, func_pattern, MatchAllNode, MatchAllNode, MatchAllNode) - - # Iterate through nodes in the graph to find a match - # If there is a match, replace the above pattern with the corresponding quantized op - modules = dict(model.named_modules(remove_duplicate=False)) - nodes = list(model.graph.nodes) - for n in model.graph.nodes: - if not is_match(modules, n, pattern): - continue - q_node = n - (func_node, output_scale_node, output_zp_node, dtype) = q_node.args - if is_relu: - relu_node = func_node - func_node = relu_node.args[0] - else: - relu_node = None - input_dq_node = func_node.args[0] - weight_dq_node = func_node.args[1] - - if should_skip_lowering(func_node, qconfig_map): - continue - - # Step 1: Replace quantized weights with packed weights, which will be folded later - quantized_weight = weight_dq_node.args[0] - weight_dtype = quantized_weight.args[-1] - if has_bias: - bias = func_node.args[2] - else: - bias = func_node.kwargs.get("bias", None) - prepack_args = (quantized_weight, bias) - if ref_func == F.linear: - prepack_op = get_linear_prepack_op_for_dtype(weight_dtype) - else: - raise ValueError("Lowering for functional currently only supports linear op") - insert_prepack_after = bias if has_bias else quantized_weight - with model.graph.inserting_after(insert_prepack_after): - packed_weight = model.graph.create_node("call_function", prepack_op, prepack_args, {}) - - # Step 2: Replace reference pattern with the corresponding quantized op - func_node.args = (input_dq_node.args[0], packed_weight, output_scale_node, output_zp_node) - func_node.target = q_relu_func if is_relu else q_func - q_node.replace_all_uses_with(func_node) - output_zp_node.append(func_node) - - # Clean up: Remove dequantize and quantize nodes and the old func node - for dqn in [input_dq_node, weight_dq_node]: - dqn_input = dqn.args[0] - dqn.replace_all_uses_with(dqn_input) - model.graph.erase_node(dqn) - model.graph.erase_node(q_node) - if is_relu: - model.graph.erase_node(relu_node) - return model + # Step 1: Replace quantized weights with packed weights, which will be folded later + # Use the right prepack op and prepare the corresponding args + # Linear prepack args: (quantized weights[, bias]) + # Conv prepack args: (quantized weights[, bias, stride, padding, dilation, groups]) + prepack_args = [quantized_weight] + remaining_func_args + if func_node.target == F.linear: + weight_dtype = quantized_weight.args[-1] + prepack_op = get_linear_prepack_op_for_dtype(weight_dtype) + elif func_node.target in CONV_FUNCTIONAL_OPS: + prepack_op = get_qconv_prepack_op(func_node.target) # type: ignore[arg-type] + # For conv1d, the stride, padding, and dilation args may be ints, + # in which case we need to convert them to tuples + if func_node.target == F.conv1d: + for i in [2, 3, 4]: + if len(prepack_args) > i and isinstance(prepack_args[i], int): + prepack_args[i] = (prepack_args[i],) + else: + raise ValueError("Lowering is not supported for op '%s'" % func_node.target) + with model.graph.inserting_before(output_scale_node): + packed_weight = model.graph.create_node("call_function", prepack_op, tuple(prepack_args), {}) + + # Step 2: Replace reference pattern with the corresponding quantized op + (q_func, q_relu_func) = STATIC_LOWER_FUNCTIONAL_MAP[func_node.target] # type: 
ignore[index] + func_node.target = q_relu_func if relu_node is not None else q_func + func_node.args = (input_dq_node.args[0], packed_weight, output_scale_node, output_zp_node) + q_node.replace_all_uses_with(func_node) + # Move func_node after output_zp_node in the graph + output_zp_node.append(func_node) + + # Clean up: Remove dequantize and quantize nodes, and the relu node if it exists + for dqn in [input_dq_node, weight_dq_node]: + dqn_input = dqn.args[0] + dqn.replace_all_uses_with(dqn_input) + model.graph.erase_node(dqn) + model.graph.erase_node(q_node) + if relu_node is not None: + model.graph.erase_node(relu_node) -def _lower_quantized_binary_op( - model: QuantizedGraphModule, - qconfig_map: Dict[str, QConfigAny] -) -> QuantizedGraphModule: +def _lower_dynamic_weighted_ref_functional( + model: QuantizedGraphModule, + qconfig_map: Dict[str, QConfigAny]): + """ + Traverse the graph and replace functional reference patterns with their dynamically + quantized versions. + Examples: + quantize_per_tensor_dynamic - dequantize - functional linear --> linear_dynamic + to(torch.float16) - dequantize - functional linear --> linear_dynamic_fp16 + """ modules = dict(model.named_modules(remove_duplicate=False)) + nodes = list(model.graph.nodes) + # we want to search in reversed order so that we can match the larger patterns first + # e.g. we want to match linear - relu before linear. + for n in reversed(model.graph.nodes): + + # Step 0: Find nodes that match this pattern + # (quantize_per_tensor_dynamic - dequantize - dynamically quantized op) + # We search for the pattern backwards, starting with the quantize node + # Quantize node args: (func, scale, zp, dtype) + func_node = n + # Handle cases where the functional op is wrapped in a ReLU + if func_node.op == "call_function" and func_node.target == F.relu or \ + func_node.op == "call_module" and \ + type(modules[str(func_node.target)]) == torch.nn.ReLU: + relu_node = func_node + func_node = relu_node.args[0] + else: + relu_node = None + if should_skip_lowering(func_node, qconfig_map): + continue + # Linear args: (dequantized inputs, dequantized weights[, bias]) + # Conv args: (dequantized inputs, dequantized weights[, bias, stride, padding, dilation, groups]) + if func_node.op != "call_function" or func_node.target not in DYNAMIC_LOWER_FUNCTIONAL_MAP: + continue + (input_dq_node, weight_dq_node, *remaining_func_args) = func_node.args + if input_dq_node.op != "call_method" or input_dq_node.target != "dequantize" or \ + weight_dq_node.op != "call_method" or weight_dq_node.target != "dequantize": + continue - def get_bop_patterns(bop: Any) -> List[Pattern]: - patterns: List[Pattern] = [] - bop_pattern = (bop, MatchAllNode, MatchAllNode) - for relu_op in [torch.relu, torch.nn.functional.relu, torch.nn.ReLU]: - patterns.append( - (torch.quantize_per_tensor, - (relu_op, bop_pattern), - MatchAllNode, MatchAllNode, MatchAllNode)) - patterns.append( - (torch.quantize_per_tensor, - bop_pattern, - MatchAllNode, MatchAllNode, MatchAllNode)) - return patterns - - patterns: List[Pattern] = [] - for bop in [operator.add, torch.add, operator.mul, torch.mul]: - patterns.extend(get_bop_patterns(bop)) - patterns.extend( - [ - (torch.quantize_per_tensor, - (torch.matmul, "dequantize", "dequantize"), - MatchAllNode, MatchAllNode, MatchAllNode) - ] - ) + input_dynamic_q_node = input_dq_node.args[0] + # don't support lowering the pattern when the result of quantize is used by + # multiple nodes + if len(input_dynamic_q_node.users) > 1: + continue + + if 
input_dynamic_q_node.op != "call_function" or \ + input_dynamic_q_node.target != torch.quantize_per_tensor_dynamic: + continue + + reduce_range_node = None + (pattern_input, activation_compute_dtype, reduce_range_node) = input_dynamic_q_node.args + is_fp16 = activation_compute_dtype == torch.float16 + is_int8 = activation_compute_dtype in [torch.quint8, torch.qint8] + if not is_int8 and not is_fp16: + continue + + quantized_weight = weight_dq_node.args[0] + weight_dtype = quantized_weight.args[-1] + + # Step 1: Try to select reference pattern with the corresponding quantized op + dynamic_quant_dtype_key = (activation_compute_dtype, weight_dtype) + if dynamic_quant_dtype_key not in DYNAMIC_LOWER_FUNCTIONAL_MAP[func_node.target]: + print(f"Didn't find dtype combination {dynamic_quant_dtype_key} during " + f"dynamic quantized op lowering for {func_node.target}") + continue + (q_func, q_relu_func) = DYNAMIC_LOWER_FUNCTIONAL_MAP[func_node.target][dynamic_quant_dtype_key] + + if q_func is None or q_relu_func is None: + print("Didn't find corresponding quantized function or quantized relu function " + f"for {func_node.target}, {dynamic_quant_dtype_key}") + continue + + # Step 2: Replace quantized weights with packed weights, which will be folded later + # Use the right prepack op and prepare the corresponding args + # Linear prepack args: (quantized weights[, bias]) + # Conv prepack args: (quantized weights[, bias, stride, padding, dilation, groups]) + prepack_args = [quantized_weight] + remaining_func_args + if func_node.target == F.linear: + prepack_op = get_linear_prepack_op_for_dtype(weight_dtype) + elif func_node.target in CONV_FUNCTIONAL_OPS: + prepack_op = get_qconv_prepack_op(func_node.target) + # For conv1d, the stride, padding, and dilation args may be ints, + # in which case we need to convert them to tuples + if func_node.target == F.conv1d: + for i in [2, 3, 4]: + if len(prepack_args) > i and isinstance(prepack_args[i], int): + prepack_args[i] = (prepack_args[i],) + else: + raise ValueError("Lowering is not supported for op '%s'" % func_node.target) + with model.graph.inserting_before(func_node): + packed_weight = model.graph.create_node("call_function", prepack_op, tuple(prepack_args), {}) + + # Step 3: Replace reference pattern with the corresponding quantized op + func_node.target = q_relu_func if relu_node is not None else q_func + if is_int8: + func_node.args = (pattern_input, packed_weight, reduce_range_node) + else: + func_node.args = (pattern_input, packed_weight) + if relu_node is not None: + relu_node.replace_all_uses_with(func_node) + + # Step 4: Remove dequantize and quantize nodes, and the relu node if it exists + for dqn in [input_dq_node, weight_dq_node]: + dqn_input = dqn.args[0] + dqn.replace_all_uses_with(dqn_input) + model.graph.erase_node(dqn) + model.graph.erase_node(input_dynamic_q_node) + if relu_node is not None: + model.graph.erase_node(relu_node) + +def _lower_quantized_binary_op( + model: QuantizedGraphModule, + qconfig_map: Dict[str, QConfigAny]): qbin_op_mapping: Dict[Union[Callable, str], Callable] = { operator.add: torch.ops.quantized.add, torch.add: torch.ops.quantized.add, @@ -488,94 +769,62 @@ def get_bop_patterns(bop: Any) -> List[Pattern]: operator.mul: torch.ops.quantized.mul_relu, torch.mul: torch.ops.quantized.mul_relu, } - for pattern in patterns: - for n in model.graph.nodes: - if not is_match(modules, n, pattern): - continue - q_node = n - is_quantize = q_node.target == torch.quantize_per_tensor - is_to_fp16 = q_node.op == "call_method" and 
q_node.target == "to" and q_node.args[1] == torch.float16 - if not (is_quantize or is_to_fp16): - continue - - # start tracing back from quantize node - node = q_node.args[0] - if not isinstance(node, Node): - continue - relu_node = None - if ( - node.op == 'call_function' and - node.target in (torch.nn.functional.relu, torch.relu) - ) or ( - node.op == 'call_module' and - isinstance(modules[str(node.target)], torch.nn.ReLU) - ): - relu_node = node - node = node.args[0] - - # binary operator node, e.g. torch.add(x, y) - bop_node = node - if bop_node.op != "call_function" or \ - bop_node.target not in set([torch.add, operator.add, torch.mul, operator.mul, torch.matmul]): - continue - - if should_skip_lowering(bop_node, qconfig_map): - continue + binary_ops_to_lower: List[Callable] = [operator.add, torch.add, operator.mul, torch.mul, torch.matmul] + modules = dict(model.named_modules(remove_duplicate=False)) + for n in model.graph.nodes: + # Step 0: Find nodes that match this pattern (dequantize - ref module - quantize) + (q_node, relu_node, bop_node) = _match_static_pattern( + n, modules, qconfig_map, binary_ops_to_lower, dequantize_node_arg_indices=[0, 1]) + if q_node is None: + continue + assert(bop_node is not None) + (_, scale_node, zero_point_node, _) = q_node.args - # remove dequant node - arg0 = bop_node.args[0] - arg1 = bop_node.args[1] - dq_node0, dq_node1 = None, None - if is_dequantize_node(arg0): - dq_node0 = arg0 - if is_dequantize_node(arg1): - dq_node1 = arg1 - if dq_node0 is None and dq_node1 is None: + # Step 1: Remove dequant nodes + num_dq_nodes = 0 + for arg in bop_node.args: + if not is_dequantize_node(arg): continue - for dq_node in [dq_node0, dq_node1]: - if dq_node is None: - continue - # dequantize node is only used once, this is enforced by `is_match` - dn_input = dq_node.args[0] - dq_node.replace_all_uses_with(dn_input) - model.graph.erase_node(dq_node) - - # swap binary op to quantized binary op - assert bop_node.target in qbin_op_mapping - binop_to_qbinop = qbin_op_mapping if relu_node is None else qbin_relu_op_mapping - qbin_op = binop_to_qbinop[bop_node.target] - # prepare the args for quantized bianry op - # (x, y) - qop_node_args = list(bop_node.args) - # (x, y, scale, zero_point) - # add scale and zero_point arguments for Tensor - Tensor operation - if dq_node0 is not None and dq_node1 is not None: - qop_node_args.extend([q_node.args[1], q_node.args[2]]) - - # insert a call to quantized binary op and remove the original binary op - with model.graph.inserting_after(q_node): - qop_node = create_node_from_old_node_preserve_meta( - model.graph, - ("call_function", qbin_op, tuple(qop_node_args), {}), - bop_node) - q_node.replace_all_uses_with(qop_node) - - # remove quantize node - model.graph.erase_node(q_node) - # remove relu node if any - if relu_node is not None: - model.graph.erase_node(relu_node) - # remove binary op node - model.graph.erase_node(bop_node) - - return model + dq_node = arg + assert(isinstance(dq_node, Node)) + dn_input = dq_node.args[0] + dq_node.replace_all_uses_with(dn_input) + model.graph.erase_node(dq_node) + num_dq_nodes += 1 + assert(num_dq_nodes > 0) + + # Step 2: Swap binary op to quantized binary op + assert bop_node.target in qbin_op_mapping + binop_to_qbinop = qbin_op_mapping if relu_node is None else qbin_relu_op_mapping + qbin_op = binop_to_qbinop[bop_node.target] + # prepare the args for quantized bianry op + # (x, y) + qop_node_args = list(bop_node.args) + # (x, y, scale, zero_point) + # add scale and zero_point arguments for 
Tensor - Tensor operation + if num_dq_nodes == 2: + qop_node_args.extend([scale_node, zero_point_node]) + # insert a call to quantized binary op and remove the original binary op + with model.graph.inserting_after(q_node): + qop_node = create_node_from_old_node_preserve_meta( + model.graph, + ("call_function", qbin_op, tuple(qop_node_args), {}), + bop_node) + q_node.replace_all_uses_with(qop_node) + + # Step 3: Remove quantize node, binary op node, and relu node if any + model.graph.erase_node(q_node) + if relu_node is not None: + model.graph.erase_node(relu_node) + model.graph.erase_node(bop_node) -def special_pattern_replacement(model: QuantizedGraphModule) -> QuantizedGraphModule: +def special_pattern_replacement(model: QuantizedGraphModule): modules = dict(model.named_modules(remove_duplicate=False)) for n in model.graph.nodes: q_node = n is_quantize = q_node.target == torch.quantize_per_tensor - is_to_fp16 = q_node.op == "call_method" and q_node.target == "to" and q_node.args[1] == torch.float16 + is_to_fp16 = q_node.op == "call_method" and q_node.target == "to" and \ + len(q_node.args) == 2 and q_node.args[1] == torch.float16 if not (is_quantize or is_to_fp16): continue ref_node = q_node.args[0] @@ -677,6 +926,20 @@ def special_pattern_replacement(model: QuantizedGraphModule) -> QuantizedGraphMo return model +def _lower_getattr_tensor_metadta_op(model: QuantizedGraphModule): + """ Modifies the graph of the model in place, to skip the extra dequantize op before + the general tensor shape ops when possible + """ + for n in model.graph.nodes: + if is_getattr_tensor_metadata_node(n): + maybe_dq = n.args[0] + if maybe_dq.op != "call_method" or maybe_dq.target != "dequantize": + continue + # skip the dequantize node + args = list(n.args) + args[0] = n.args[0].args[0] + n.args = tuple(args) + def _lower_to_native_backend( model: QuantizedGraphModule, qconfig_map: Dict[str, QConfigAny], @@ -686,13 +949,16 @@ def _lower_to_native_backend( to the native backend in PyTorch (fbgemm/qnnpack), both backends shares the same operator signature so they can be lowered with the same function """ - model = _lower_weighted_ref_module(model) - model = _lower_weighted_ref_functional(model, qconfig_map) - for pattern, replacement in get_fbgemm_patterns_and_replacements(): - subgraph_rewriter_FORKED_DO_NOT_USE.replace_pattern(model, pattern, replacement) + _lower_static_weighted_ref_module(model, qconfig_map) + _lower_dynamic_weighted_ref_module(model) + _lower_weight_only_weighted_ref_module(model) + _lower_static_weighted_ref_functional(model, qconfig_map) + _lower_dynamic_weighted_ref_functional(model, qconfig_map) _lower_quantized_binary_op(model, qconfig_map) + _lower_getattr_tensor_metadta_op(model) special_pattern_replacement(model) model = fold_weight(model, node_name_to_scope) + model.graph.eliminate_dead_code() model.recompile() model.graph.lint() return model diff --git a/torch/ao/quantization/fx/backend_config/__init__.py b/torch/ao/quantization/fx/backend_config/__init__.py index b595b660344e9c..3fc6762815763c 100644 --- a/torch/ao/quantization/fx/backend_config/__init__.py +++ b/torch/ao/quantization/fx/backend_config/__init__.py @@ -1,4 +1,5 @@ from .tensorrt import get_tensorrt_backend_config_dict +from .native import get_native_backend_config_dict # TODO: add more validations def validate_backend_config_dict(backend_config_dict): diff --git a/torch/ao/quantization/fx/backend_config/native.py b/torch/ao/quantization/fx/backend_config/native.py new file mode 100644 index 
00000000000000..e18465a19cf039 --- /dev/null +++ b/torch/ao/quantization/fx/backend_config/native.py @@ -0,0 +1,618 @@ +from collections import namedtuple +from typing import List, Dict, Any +import operator +import torch +from .observation_type import ObservationType +import torch.nn.functional as F +import torch.nn as nn +import torch.nn.intrinsic as nni +import torch.nn.intrinsic.qat as nniqat +import torch.nn.qat as nnqat +import torch.nn.quantized._reference as nnqr +from ...observer import ( + default_affine_fixed_qparams_observer, + default_symmetric_fixed_qparams_observer, +) +from ...fake_quantize import FixedQParamsFakeQuantize +from ...fuser_method_mappings import ( + reverse_sequential_wrapper2, + reverse2, + reverse3, + fuse_conv_bn, + fuse_conv_bn_relu, + fuse_linear_bn, + fuse_convtranspose_bn, +) + +_ConvMetadata = namedtuple( + "_ConvMetadata", + ["root", "transpose", "bn", "reference", "qat", "relu", "relu_qat", "bn_qat", + "bn_relu_qat", "func"]) +_Conv1dMetadata = _ConvMetadata( + nn.Conv1d, nn.ConvTranspose1d, nn.BatchNorm1d, nnqr.Conv1d, nnqat.Conv1d, nni.ConvReLU1d, + nniqat.ConvReLU1d, nniqat.ConvBn1d, nniqat.ConvBnReLU1d, F.conv1d) +_Conv2dMetadata = _ConvMetadata( + nn.Conv2d, nn.ConvTranspose2d, nn.BatchNorm2d, nnqr.Conv2d, nnqat.Conv2d, nni.ConvReLU2d, + nniqat.ConvReLU2d, nniqat.ConvBn2d, nniqat.ConvBnReLU2d, F.conv2d) +_Conv3dMetadata = _ConvMetadata( + nn.Conv3d, nn.ConvTranspose3d, nn.BatchNorm3d, nnqr.Conv3d, nnqat.Conv3d, nni.ConvReLU3d, + nniqat.ConvReLU3d, nniqat.ConvBn3d, nniqat.ConvBnReLU3d, F.conv3d) + +# =================== +# | DTYPE CONFIGS | +# =================== + +# weighted op int8 dtype config +# this is config for ops that has quantized weights, like linear, conv +weighted_op_int8_dtype_config = { + # optional, input activation dtype + "input_dtype": torch.quint8, + # optional, weight dtype + "weight_dtype": torch.qint8, + # optional, bias dtype + "bias_dtype": torch.float, + # optional, output activation dtype + "output_dtype": torch.quint8 +} + +default_op_quint8_dtype_config = { + # optional, input activation dtype + "input_dtype": torch.quint8, + # optional, output activation dtype + "output_dtype": torch.quint8, +} + +default_op_fp16_dtype_config = { + # optional, input activation dtype + "input_dtype": torch.float16, + # optional, weight dtype + "weight_dtype": torch.float16, + # optional, output activation dtype + "output_dtype": torch.float16, +} + +default_dynamic_int8_dtype_config = { + "input_dtype": torch.quint8, + "weight_dtype": torch.qint8, + "output_dtype": torch.quint8, + # currently the dtype check is not yet enabled, so we provided the dtype_configs but + # it is not really used yet, + # we will enable it a bit later after we moved everything to backend_config_dict + "is_dynamic": True, +} + +weight_only_quint8_dtype_config = { + "input_dtype": torch.float, + "weight_dtype": torch.quint8, + "output_dtype": torch.float, +} + +weight_only_quint4x2_dtype_config = { + "input_dtype": torch.float, + "weight_dtype": torch.quint4x2, + "output_dtype": torch.float, +} + +# ====================== +# | OPERATOR CONFIGS | +# ====================== + +def _get_default_op_backend_config(op, dtype_configs): + return { + "pattern": op, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + "dtype_configs": dtype_configs, + } + +_DEFAULT_OP_INT8_CONFIGS = [ + _get_default_op_backend_config(op, [default_op_quint8_dtype_config]) for op in [ + torch.nn.ConvTranspose1d, + torch.nn.ConvTranspose2d, + torch.nn.ELU, + 
torch.nn.LeakyReLU, + torch.nn.Hardswish, + torch.nn.InstanceNorm1d, + torch.nn.InstanceNorm2d, + torch.nn.InstanceNorm3d, + torch.nn.LayerNorm, + torch.nn.Dropout, + torch.nn.functional.elu, + torch.nn.functional.hardswish, + torch.nn.functional.instance_norm, + torch.nn.functional.leaky_relu, + torch.nn.functional.dropout, + torch.nn.functional.layer_norm, + ]] + +def _get_linear_configs(): + """ + Return all configs related to linear modules and ops. + """ + observation_type = ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT + dtype_configs = [weighted_op_int8_dtype_config] + linear_configs = [] + + # (1) Single linear modules/functions + # ------------------------------------- + # linear module + linear_configs.append({ + # Please see README under this folder for pattern format + "pattern": torch.nn.Linear, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + # the root module for the pattern, used to query the reference quantized module + # e.g. for a (torch.nn.ReLU, torch.nn.Linear) pattern, the root will be torch.nn.Linear + "root_module": torch.nn.Linear, + # the corresponding reference quantized module for the root module + "reference_quantized_module_for_root": nnqr.Linear, + "qat_module": nnqat.Linear, + }) + # linear qat module + linear_configs.append({ + "pattern": nnqat.Linear, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": torch.nn.Linear, + "reference_quantized_module_for_root": nnqr.Linear, + }) + # functional linear + linear_configs.append({ + "pattern": torch.nn.functional.linear, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + + # (2) Linear + relu + # ------------------- + # 2.1 linear module + relu fusion config + # linear relu, linear module + relu module + linear_configs.append({ + "pattern": (torch.nn.ReLU, torch.nn.Linear), + "dtype_configs": dtype_configs, + "fuser_method": reverse_sequential_wrapper2(nni.LinearReLU), + }) + # linear relu, linear module + functional relu + linear_configs.append({ + "pattern": (torch.nn.functional.relu, torch.nn.Linear), + "dtype_configs": dtype_configs, + "fuser_method": reverse_sequential_wrapper2(nni.LinearReLU), + }) + + # 2.2 linear module + relu, fused module configs + # linear relu, fused module + linear_configs.append({ + "pattern": nni.LinearReLU, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": torch.nn.Linear, + "reference_quantized_module_for_root": nnqr.Linear, + "qat_module": nniqat.LinearReLU, + }) + # linear relu, qat fused module + linear_configs.append({ + "pattern": nniqat.LinearReLU, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": torch.nn.Linear, + "reference_quantized_module_for_root": nnqr.Linear, + }) + # 2.3 functional linear + relu configs + # linear relu, functional linear + relu module + linear_configs.append({ + "pattern": (torch.nn.ReLU, F.linear), + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + # linear relu, functional linear + functional relu + linear_configs.append({ + "pattern": (F.relu, F.linear), + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + + # (3) Linear + batchnorm + # ------------------------ + # 3.1 linear bn fusion + linear_configs.append({ + "pattern": (nn.BatchNorm1d, nn.Linear), + "dtype_configs": dtype_configs, + "fuser_method": reverse2(fuse_linear_bn) + }) + + # 3.2 linear bn quantization + # linear bn, fused module + 
linear_configs.append({ + "pattern": nni.LinearBn1d, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": torch.nn.Linear, + "reference_quantized_module_for_root": nnqr.Linear, + "qat_module": nniqat.LinearBn1d, + }) + # linear bn, qat fused module + linear_configs.append({ + "pattern": nniqat.LinearBn1d, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": torch.nn.Linear, + "reference_quantized_module_for_root": nnqr.Linear, + }) + return linear_configs + +def _get_conv_configs(): + """ + Return all configs related to conv modules and ops. + """ + conv_configs = [] + observation_type = ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT + dtype_configs = [weighted_op_int8_dtype_config] + for convs in [_Conv1dMetadata, _Conv2dMetadata, _Conv3dMetadata]: + + # (1) Single conv modules/functions + # ----------------------------------- + # conv module + conv_configs.append({ + "pattern": convs.root, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + "qat_module": convs.qat, + }) + # conv qat module + conv_configs.append({ + "pattern": convs.qat, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + }) + # functional conv + conv_configs.append({ + "pattern": convs.func, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + + # (2) Conv + relu + # ----------------- + # 2.1 conv module + relu fusion configs + # conv relu fusion, conv module + relu module + conv_configs.append({ + "pattern": (torch.nn.ReLU, convs.root), + "dtype_configs": dtype_configs, + "fuser_method": reverse_sequential_wrapper2(convs.relu), + }) + # conv relu fusion, conv module + functional relu + conv_configs.append({ + "pattern": (F.relu, convs.root), + "dtype_configs": dtype_configs, + "fuser_method": reverse_sequential_wrapper2(convs.relu), + }) + # 2.2 conv module + relu fused module configs + # conv relu, fused module + conv_configs.append({ + "pattern": convs.relu, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + "qat_module": convs.relu_qat, + }) + # conv relu, qat fused module + conv_configs.append({ + "pattern": convs.relu_qat, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + }) + # 2.3 functional conv + relu configs + # conv relu, functional conv + relu module + conv_configs.append({ + "pattern": (torch.nn.ReLU, convs.func), + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + # conv relu, functional conv + functional relu + conv_configs.append({ + "pattern": (F.relu, convs.func), + "observation_type": observation_type, + "dtype_configs": dtype_configs, + }) + + # (3) Conv + batchnorm (+ relu) + # ------------------------------- + # 3.1 conv bn fusion configs + # conv + bn fusion + conv_configs.append({ + "pattern": (convs.bn, convs.root), + "dtype_configs": dtype_configs, + "fuser_method": reverse2(fuse_conv_bn), + }) + # conv + bn + relu module fusion + conv_configs.append({ + "pattern": (nn.ReLU, (convs.bn, convs.root)), + "dtype_configs": dtype_configs, + "fuser_method": reverse3(fuse_conv_bn_relu), + }) + # conv + bn + relu functional 
fusion + conv_configs.append({ + "pattern": (F.relu, (convs.bn, convs.root)), + "dtype_configs": dtype_configs, + "root_module": convs.root, + "fuser_method": reverse3(fuse_conv_bn_relu), + }) + # TODO: we can add fusion for torch.relu as well + + # 3.2 conv + bn (+ relu) fused module configs + # conv bn, qat fused module + conv_configs.append({ + "pattern": convs.bn_qat, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + }) + # conv bn relu, qat fused module + conv_configs.append({ + "pattern": convs.bn_relu_qat, + "observation_type": observation_type, + "dtype_configs": dtype_configs, + "root_module": convs.root, + "reference_quantized_module_for_root": convs.reference, + }) + + # (4) conv transpose fusion + conv_configs.append({ + "pattern": (convs.bn, convs.transpose), + "dtype_configs": dtype_configs, + "fuser_method": reverse2(fuse_convtranspose_bn), + }) + + return conv_configs + +def _get_binary_op_configs(): + binary_op_configs: List[Dict[str, Any]] = [] + num_tensor_args_to_observation_type_mapping = { + # TODO: this is not used right now since we have extra check in prepare + # will need to change this to NO_OBSERVER later after we implemented + # Tensor dtype inference properly + 0: ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + 1: ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT, + 2: ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + } + dtype_configs = [ + weighted_op_int8_dtype_config, + ] + for op_with_quantized_bop_scalar_variant in [ + operator.add, torch.add, operator.mul, torch.mul]: + binary_op_configs.append({ + "pattern": (torch.nn.ReLU, op_with_quantized_bop_scalar_variant), + "num_tensor_args_to_observation_type": num_tensor_args_to_observation_type_mapping, + "dtype_configs": dtype_configs, + }) + binary_op_configs.append({ + "pattern": (torch.nn.functional.relu, op_with_quantized_bop_scalar_variant), + "num_tensor_args_to_observation_type": num_tensor_args_to_observation_type_mapping, + "dtype_configs": dtype_configs, + }) + binary_op_configs.append({ + "pattern": (torch.relu, op_with_quantized_bop_scalar_variant), + "num_tensor_args_to_observation_type": num_tensor_args_to_observation_type_mapping, + "dtype_configs": dtype_configs, + }) + binary_op_configs.append({ + "pattern": op_with_quantized_bop_scalar_variant, + "num_tensor_args_to_observation_type": num_tensor_args_to_observation_type_mapping, + "dtype_configs": dtype_configs, + }) + return binary_op_configs + + +def _get_fixed_qparams_op_configs(): + fixed_qparams_op_configs = [] + for fixed_qparam_op, output_observer in [ + (torch.nn.Hardsigmoid, default_affine_fixed_qparams_observer), + (torch.nn.functional.hardsigmoid, default_affine_fixed_qparams_observer), + ("hardsigmoid", default_affine_fixed_qparams_observer), + ("hardsigmoid_", default_affine_fixed_qparams_observer), + (torch.nn.Sigmoid, default_affine_fixed_qparams_observer), + (torch.sigmoid, default_affine_fixed_qparams_observer), + ("sigmoid", default_affine_fixed_qparams_observer), + ("sigmoid_", default_affine_fixed_qparams_observer), + (torch.nn.Tanh, default_symmetric_fixed_qparams_observer), + (torch.tanh, default_symmetric_fixed_qparams_observer), + ("tanh", default_symmetric_fixed_qparams_observer), + ("tanh_", default_symmetric_fixed_qparams_observer), + ]: + fixed_qparams_op_configs.append({ + "pattern": fixed_qparam_op, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + # 
TODO: The following two keys are temporary, since we don't want to put observer in the configs + # we expect that it's provided by user + # What we want to put here is the requirement on observers, in this case dtype, + # quant_min, quant_max etc., but we need to first move all configs to + # backend_config_dict to do that, we'll remove these keys after we fully migrated + # everything to use backend_config_dict + "_overwrite_output_fake_quantizer": FixedQParamsFakeQuantize.with_args(observer=output_observer), + "_overwrite_output_observer": output_observer, + "dtype_configs": [ + weighted_op_int8_dtype_config, + ], + }) + return fixed_qparams_op_configs + +_CAT_CONFIG = { + "pattern": torch.cat, + "observation_type": ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT, + "dtype_configs": [ + default_op_quint8_dtype_config, + ] +} + +def _get_bn_configs(): + """ Get configs related to batchnorm + """ + bn_configs = [] + bn_to_fused_bn = { + torch.nn.BatchNorm2d: nni.BNReLU2d, + torch.nn.BatchNorm3d: nni.BNReLU3d, + } + for bn in bn_to_fused_bn.keys(): + # bn module + relu module fusion config + bn_configs.append({ + "pattern": (torch.nn.ReLU, bn), + "dtype_configs": default_op_quint8_dtype_config, + "fuser_method": reverse_sequential_wrapper2(bn_to_fused_bn[bn]), + }) + # bn module + F.relu fusion config + bn_configs.append({ + "pattern": (torch.nn.functional.relu, bn), + "dtype_configs": default_op_quint8_dtype_config, + "fuser_method": reverse_sequential_wrapper2(bn_to_fused_bn[bn]), + }) + bn_configs.append({ + "pattern": bn, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + "dtype_configs": default_op_quint8_dtype_config, + }) + + # fused bn configs + for fused_bn in bn_to_fused_bn.values(): + bn_configs.append({ + "pattern": fused_bn, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + "dtype_configs": default_op_quint8_dtype_config, + }) + return bn_configs + +def _get_share_qparams_op_configs(): + """ Get the operator config for the operators that works for both float and quantized input + if input is quantized, the output Tensor shares the same quantization parameter + with input. 
+ Example operator: avgpool2d, reshape, transpose, maxpool2d + Example observed operator: + observer_0 - avgpool2d - observer_0 (same observer instance as input) + """ + + def _get_share_qprams_op_backend_config(op): + return { + "pattern": op, + "observation_type": ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT, + "dtype_configs": [default_op_quint8_dtype_config], + } + + share_qparams_ops = [ + torch.nn.AdaptiveAvgPool1d, + torch.nn.AdaptiveAvgPool2d, + torch.nn.AdaptiveAvgPool3d, + torch.nn.AvgPool1d, + torch.nn.AvgPool2d, + torch.nn.AvgPool3d, + torch.nn.Hardtanh, + torch.nn.Identity, + torch.nn.MaxPool1d, + torch.nn.MaxPool2d, + torch.nn.MaxPool3d, + torch.nn.ReLU, + torch.nn.ReLU6, + torch.adaptive_avg_pool1d, + torch.nn.functional.adaptive_avg_pool2d, + torch.nn.functional.adaptive_avg_pool3d, + torch.nn.functional.hardtanh, + torch.nn.functional.hardtanh_, + torch.nn.functional.interpolate, + torch.nn.functional.max_pool1d, + torch.nn.functional.max_pool2d, + torch.nn.functional.max_pool3d, + torch.nn.functional.relu, + torch.nn.functional.relu6, + torch.avg_pool1d, + torch._C._nn.avg_pool2d, + torch._C._nn.avg_pool3d, + torch.clamp, + torch.flatten, + torch.mean, + torch.repeat_interleave, + torch.transpose, + torch.squeeze, + torch.stack, + torch.unsqueeze, + operator.floordiv, + "contiguous", + "clamp", + "detach", + "detach_", + "mean", + "permute", + "repeat", + "repeat_interleave", + "reshape", + "resize_", + "relu", + "relu_", + "shape", + "size", + "squeeze", + "squeeze_", + "transpose", + "unsqueeze", + "unsqueeze_", + "view" + ] + return [_get_share_qprams_op_backend_config(op) for op in share_qparams_ops] + +def _get_rnn_op_configs(): + rnn_op_configs = [] + for rnn_op in [ + torch.nn.GRUCell, + torch.nn.LSTMCell, + torch.nn.RNNCell, + torch.nn.LSTM, + ]: + rnn_op_configs.append({ + "pattern": rnn_op, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + "dtype_configs": [default_dynamic_int8_dtype_config], + }) + return rnn_op_configs + +def _get_embedding_op_configs(): + embedding_op_configs = [] + for embedding_op in [ + torch.nn.Embedding, + torch.nn.EmbeddingBag, + nnqat.Embedding, + nnqat.EmbeddingBag, + ]: + embedding_op_configs.append({ + "pattern": embedding_op, + "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT, + "dtype_configs": [ + weight_only_quint8_dtype_config, + weight_only_quint4x2_dtype_config + ], + # This is temporary, and will be removed soon + "_input_output_observed": False + }) + return embedding_op_configs + +def get_native_backend_config_dict(): + """ Get backend_config_dict for PyTorch Native backend (fbgemm/qnnpack). 
""" + return { + # optional + "name": "native", + "configs": [ + *_DEFAULT_OP_INT8_CONFIGS, + *_get_linear_configs(), + *_get_conv_configs(), + *_get_binary_op_configs(), + *_get_fixed_qparams_op_configs(), + _CAT_CONFIG, + *_get_bn_configs(), + *_get_share_qparams_op_configs(), + *_get_rnn_op_configs(), + *_get_embedding_op_configs(), + ], + } diff --git a/torch/ao/quantization/fx/backend_config/quantize_handler.py b/torch/ao/quantization/fx/backend_config/quantize_handler.py index fe932e31bd214a..b836cc3bf149af 100644 --- a/torch/ao/quantization/fx/backend_config/quantize_handler.py +++ b/torch/ao/quantization/fx/backend_config/quantize_handler.py @@ -1,18 +1,67 @@ import torch -from typing import Dict -from torch.fx.graph import Node +from typing import Dict, Callable, Any, Optional from .observation_type import ObservationType from ..quantization_patterns import QuantizeHandler +from ..quantization_types import Pattern, NodePattern +from ...utils import ( + activation_dtype, +) -def get_quantize_handler_cls(observation_type, dtype_configs): +def get_quantize_handler_cls( + observation_type, + dtype_configs, + num_tensor_args_to_observation_type, + overwrite_output_fake_quantizer, + overwrite_output_observer, + input_output_observed): class ConfigurableQuantizeHandler(QuantizeHandler): - def __init__(self, node: Node, modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - self.observation_type = observation_type + def __init__( + self, + node_pattern: NodePattern, + modules: Dict[str, torch.nn.Module], + root_node_getter: Callable = None): + super().__init__(node_pattern, modules, root_node_getter) + if num_tensor_args_to_observation_type: + assert self.num_tensor_args in num_tensor_args_to_observation_type, \ + f"Must provide observation_type config for tensor number {self.num_tensor_args}" \ + f" in num_tensor_args_to_observation_type for {node_pattern}" + self.observation_type = num_tensor_args_to_observation_type[self.num_tensor_args] + else: + self.observation_type = observation_type self.dtype_configs = dtype_configs + self.overwrite_output_fake_quantizer = overwrite_output_fake_quantizer + self.overwrite_output_observer = overwrite_output_observer + self.input_output_observed_ = input_output_observed def is_general_tensor_value_op(self) -> bool: - return observation_type == ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT + return self.observation_type == ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT + + # TODO: change this to output activation + def get_activation_ctr( + self, + qconfig: Any, + pattern: Pattern, + is_training: bool, + ) -> Optional[Callable]: + """ + Returns the constructor for the activation observer which should be + used for the pattern matched to this handler. Some handlers override + this to a different value than what is specified in the qconfig. 
+ """ + act_dtype = activation_dtype(qconfig) + # TODO: change to is_qat + if is_training: + if act_dtype == torch.quint8 and self.overwrite_output_fake_quantizer is not None: + return self.overwrite_output_fake_quantizer + else: + if act_dtype == torch.quint8 and self.overwrite_output_observer is not None: + return self.overwrite_output_observer + return qconfig.activation + + # This is temporary, and will be removed soon + def input_output_observed(self): + return self.input_output_observed_ + return ConfigurableQuantizeHandler diff --git a/torch/ao/quantization/fx/backend_config/utils.py b/torch/ao/quantization/fx/backend_config/utils.py index 04f080289c8d30..b45641c4012c45 100644 --- a/torch/ao/quantization/fx/backend_config/utils.py +++ b/torch/ao/quantization/fx/backend_config/utils.py @@ -1,8 +1,12 @@ +from typing import Dict, Any, List, Callable, Union + import torch +from torch.ao.quantization.utils import get_combined_dict +from torch.ao.quantization.fx.pattern_utils import get_default_quant_patterns, sorted_patterns_dict import torch.nn as nn from .quantize_handler import get_quantize_handler_cls from .fuse_handler import get_fuse_handler_cls -from typing import Dict, Any, List, Callable, Union +from .native import get_native_backend_config_dict from ..quantization_types import Pattern, QuantizerCls def get_pattern_to_quantize_handlers( @@ -16,10 +20,20 @@ def get_pattern_to_quantize_handlers( pattern_to_quantize_handlers = dict() for config in backend_config_dict.get("configs", []): pattern = config["pattern"] - observation_type = config["observation_type"] + observation_type = config.get("observation_type", None) dtype_configs = config["dtype_configs"] + num_tensor_args_to_observation_type = config.get("num_tensor_args_to_observation_type", {}) + overwrite_fake_quantizer = config.get("_overwrite_output_fake_quantizer", None) + overwrite_observer = config.get("_overwrite_output_observer", None) + input_output_observed = config.get("_input_output_observed", True) pattern_to_quantize_handlers[pattern] = \ - get_quantize_handler_cls(observation_type, dtype_configs) + get_quantize_handler_cls( + observation_type, + dtype_configs, + num_tensor_args_to_observation_type, + overwrite_fake_quantizer, + overwrite_observer, + input_output_observed) return pattern_to_quantize_handlers @@ -125,3 +139,18 @@ def extra_inputs_getter(pattern) -> List[Any]: extra_inputs_getter_mapping[pattern] = extra_inputs_getter return extra_inputs_getter_mapping + +def get_native_quant_patterns(additional_quant_patterns: Dict[Pattern, QuantizerCls] = None) -> Dict[Pattern, QuantizerCls]: + """ + Return a map from pattern to quantize handlers based on the default patterns and the native backend_config_dict. + The returned map is sorted such that longer patterns will be encountered first when iterating through it. 
+ """ + patterns = get_default_quant_patterns() + if additional_quant_patterns is not None: + patterns = get_combined_dict(patterns, additional_quant_patterns) + # TODO: currently we just extend the quantize handlers generated from + # `get_native_backend_config_dict` + # in the future we can just assign backend_config_dict when everything is defined + for pattern, quantize_handler in get_pattern_to_quantize_handlers(get_native_backend_config_dict()).items(): + patterns[pattern] = quantize_handler + return sorted_patterns_dict(patterns) diff --git a/torch/ao/quantization/fx/common_quantization_patterns.py b/torch/ao/quantization/fx/common_quantization_patterns.py index a6e687cc6e91ba..a863c18a383e14 100644 --- a/torch/ao/quantization/fx/common_quantization_patterns.py +++ b/torch/ao/quantization/fx/common_quantization_patterns.py @@ -1,73 +1,8 @@ -import torch -from torch.fx.graph import ( - Node, - Graph, -) - -from ..utils import ( - get_qconfig_dtypes, - activation_dtype, -) - -from .utils import ( - quantize_node, -) - from .quantization_patterns import ( QuantizeHandler, ) - -from ..qconfig import QConfigAny - -from typing import Any, Callable, Dict, Tuple - +# TODO: remove class CommonQuantizeHandler(QuantizeHandler): """ Common quantized op, first input and first output will be quantized """ - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - if node.op == "call_function" or node.op == "call_method": - self.op = node.target - elif node.op == "call_module": - self.op = type(modules[str(node.target)]) - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - if not self.all_node_args_are_tensors: - return NotImplemented - assert node.op in ['call_module', 'call_function'], 'Only call_module and ' + \ - 'call_function are handled in DefaultNode' - assert is_reference - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - additional_static_quant_mapping = convert_custom_config_dict.get("static", {}) - - dtypes = get_qconfig_dtypes(qconfig) - # We can produce reference for a dtypes including - # (torch.quint8, torch.qint8, torch.qint32, torch.float16) - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - # make sure the input is quantized to act_dtype - load_arg(quantized={0: act_dtype})(node.args) - args = load_arg(quantized=torch.float)(node.args) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) + pass diff --git a/torch/ao/quantization/fx/convert.py b/torch/ao/quantization/fx/convert.py index 717ad46529f0bb..5bb8b16910f7c8 100644 --- a/torch/ao/quantization/fx/convert.py +++ b/torch/ao/quantization/fx/convert.py @@ -1,29 +1,26 @@ -from typing import Any, Dict, Tuple, List, Callable, Optional, Union, Set -from collections import defaultdict -import copy +from typing import Any, Dict, List, Optional, Set, Callable, Tuple import 
torch +import copy +import warnings from torch.fx import ( GraphModule, - Proxy, - map_arg ) from torch.fx.graph import ( Graph, Node, + Argument, ) -from torch.fx.node import Argument -from .quantization_types import Pattern -from ..qconfig import QConfigAny, qconfig_equals -from .match_utils import ( - find_matches, -) -from .graph_module import ( - is_observed_module, - is_observed_standalone_module, - QuantizedGraphModule, +from ..utils import ( + activation_is_statically_quantized, + weight_is_quantized, + get_qparam_dict, + _parent_name, + get_swapped_custom_module_class, + get_quant_type, ) -from .quantization_patterns import ( - QuantizeHandler, +from ..qconfig import ( + QConfigAny, + qconfig_equals ) from ..qconfig_dict_utils import ( convert_dict_to_ordered_dict, @@ -34,64 +31,138 @@ compare_prepare_convert_qconfig_dict, update_qconfig_for_fusion, ) +from ..quantization_mappings import DEFAULT_REFERENCE_STATIC_QUANT_MODULE_MAPPINGS +from .backend_config.utils import get_quantized_reference_module_mapping +from .graph_module import ( + QuantizedGraphModule, + is_observed_module, + is_observed_standalone_module, +) from ._equalize import update_obs_for_equalization, convert_eq_obs from .utils import ( - is_get_tensor_info_node, - node_return_type_is_int, - quantize_node, + get_custom_module_class_keys, + get_quantize_node_info, + create_getattr_from_value, collect_producer_nodes, graph_module_from_producer_nodes, - get_custom_module_class_keys, WEIGHT_INDEX_DICT, ) +from .quantization_patterns import ( + QuantizeHandler, +) +from .quantization_types import Pattern +from ..quant_type import QuantType from torch.ao.quantization.quantize import ( _remove_qconfig, is_activation_post_process, ) -from ..utils import ( - activation_is_statically_quantized, - activation_dtype, +from .lower_to_fbgemm import lower_to_fbgemm + +# these are tuples so that they can work with isinstance(module, tuple_of_classes) +FUSED_MODULE_CLASSES = ( + torch.nn.intrinsic.LinearReLU, + torch.nn.intrinsic.LinearBn1d, + torch.nn.intrinsic.ConvReLU1d, + torch.nn.intrinsic.ConvReLU2d, + torch.nn.intrinsic.ConvReLU3d, ) -from .lower_to_fbgemm import lower_to_fbgemm -from ..quantization_mappings import ( - DEFAULT_QAT_MODULE_MAPPINGS, +FLOAT_WEIGHTED_MODULE_CLASSES = ( + torch.nn.Linear, + torch.nn.Conv1d, + torch.nn.Conv2d, + torch.nn.Conv3d, +) + +QAT_MODULE_CLASSES = ( + torch.nn.qat.Linear, + torch.nn.qat.Conv2d, + torch.nn.qat.Conv3d, + torch.nn.intrinsic.qat.LinearReLU, + torch.nn.intrinsic.qat.LinearBn1d, + torch.nn.intrinsic.qat.ConvBn1d, + torch.nn.intrinsic.qat.ConvBnReLU1d, + torch.nn.intrinsic.qat.ConvReLU1d, + torch.nn.intrinsic.qat.ConvBn2d, + torch.nn.intrinsic.qat.ConvBnReLU2d, + torch.nn.intrinsic.qat.ConvReLU2d, + torch.nn.intrinsic.qat.ConvBn3d, + torch.nn.intrinsic.qat.ConvBnReLU3d, + torch.nn.intrinsic.qat.ConvReLU3d +) + +WEIGHT_ONLY_MODULE_CLASSES = ( + torch.nn.Embedding, + torch.nn.EmbeddingBag, ) +DYNAMIC_MODULE_CLASSES = ( + torch.nn.GRUCell, + torch.nn.LSTMCell, + torch.nn.RNNCell, + torch.nn.LSTM, +) + +def restore_state( + observed: torch.nn.Module +) -> Tuple[Dict[Pattern, QuantizeHandler], + Dict[str, Tuple[str, type]], + Dict[str, Any], + Set[str]]: + assert is_observed_module(observed), \ + 'incoming model must be produced by prepare_fx' + prepare_custom_config_dict: Dict[str, Any] = \ + observed._prepare_custom_config_dict # type: ignore[assignment] + node_name_to_scope: Dict[str, Tuple[str, type]] = observed._node_name_to_scope # type: ignore[assignment] + patterns: 
Dict[Pattern, QuantizeHandler] = observed._patterns # type: ignore[assignment] + observed_node_names: Set[str] = observed._observed_node_names # type: ignore[assignment] + return patterns, node_name_to_scope, prepare_custom_config_dict, observed_node_names + +def has_none_qconfig(node: Argument, qconfig_map: Dict[str, QConfigAny]) -> bool: + """ Check if a node has a qconfig of None, i.e. user requested to not quantize + the node + """ + return isinstance(node, Node) and node.name in qconfig_map and qconfig_map[node.name] is None + def run_weight_observers(observed: GraphModule) -> None: - r''' Extract the subgraph that produces the weight for dynamic quant + """ Extract the subgraph that produces the weight for dynamic quant or weight only quant node and run the subgraph to observe the weight. Note that the observers of dynamic quant or weight only quant ops are run during the convert step. - ''' + """ for node in observed.graph.nodes: - if node.op == 'call_function' and node.target in WEIGHT_INDEX_DICT: - for i, node_arg in enumerate(node.args): - if i in WEIGHT_INDEX_DICT[node.target]: - # node_arg is weight - weight_observer_nodes = collect_producer_nodes(node_arg) - if weight_observer_nodes is not None: - weight_observer_module = \ - graph_module_from_producer_nodes( - observed, weight_observer_nodes) - # run the weight observer - weight_observer_module() - -def remove_quant_dequant_pairs(quantized: QuantizedGraphModule) -> QuantizedGraphModule: + if node.op != 'call_function' or node.target not in WEIGHT_INDEX_DICT: + continue + for i, node_arg in enumerate(node.args): + if i not in WEIGHT_INDEX_DICT[node.target]: + continue + # node_arg is weight + weight_observer_nodes = collect_producer_nodes(node_arg) + if weight_observer_nodes is None: + continue + weight_observer_module = \ + graph_module_from_producer_nodes( + observed, weight_observer_nodes) + # run the weight observer + weight_observer_module() + +# this method is temporary will be removed soon +def duplicate_quantize_dynamic_node(quantized: QuantizedGraphModule) -> QuantizedGraphModule: quantized_root = quantized for node in quantized.graph.nodes: - if node.op == "call_function" and node.target in [torch.quantize_per_tensor, torch.quantize_per_channel]: + if (node.op == "call_function" and node.target == torch.quantize_per_tensor_dynamic): users = list(node.users) - user = users[0] if users else None - if len(users) == 1 and user.op == "call_method" and user.target == "dequantize": - user.replace_all_uses_with(node.args[0]) - quantized.graph.erase_node(user) - orig_args = list(node.args) + if len(users) > 1: + for user in users: + with quantized.graph.inserting_before(node): + new_node = quantized.graph.create_node( + "call_function", + torch.quantize_per_tensor_dynamic, + node.args, + node.kwargs) + user.replace_input_with(node, new_node) quantized.graph.erase_node(node) - for arg in orig_args: - if isinstance(arg, Node) and len(list(arg.users)) == 0: - quantized.graph.erase_node(arg) quantized = QuantizedGraphModule(quantized_root, quantized.graph, quantized_root.preserved_attr_names) return quantized @@ -138,28 +209,376 @@ def remove_extra_dequantize(quantized: QuantizedGraphModule) -> QuantizedGraphMo quantized = QuantizedGraphModule(quantized_root, quantized.graph, quantized_root.preserved_attr_names) return quantized +def remove_quant_dequant_pairs(quantized: QuantizedGraphModule) -> QuantizedGraphModule: + quantized_root = quantized + for node in quantized.graph.nodes: + if node.op == "call_function" and node.target 
in [torch.quantize_per_tensor, torch.quantize_per_channel]: + users = list(node.users) + user = users[0] if users else None + if len(users) == 1 and user.op == "call_method" and user.target == "dequantize": + user.replace_all_uses_with(node.args[0]) + quantized.graph.erase_node(user) + orig_args = list(node.args) + quantized.graph.erase_node(node) + for arg in orig_args: + if isinstance(arg, Node) and len(list(arg.users)) == 0: + quantized.graph.erase_node(arg) -def restore_state( - observed: torch.nn.Module -) -> Tuple[Dict[Pattern, QuantizeHandler], - Dict[str, Tuple[str, type]], - Dict[str, Any], - Set[str]]: - assert is_observed_module(observed), \ - 'incoming model must be produced by prepare_fx' - prepare_custom_config_dict: Dict[str, Any] = \ - observed._prepare_custom_config_dict # type: ignore[assignment] - node_name_to_scope: Dict[str, Tuple[str, type]] = observed._node_name_to_scope # type: ignore[assignment] - patterns: Dict[Pattern, QuantizeHandler] = observed._patterns # type: ignore[assignment] - observed_node_names: Set[str] = observed._observed_node_names # type: ignore[assignment] - return patterns, node_name_to_scope, prepare_custom_config_dict, observed_node_names + quantized = QuantizedGraphModule(quantized_root, quantized.graph, quantized_root.preserved_attr_names) + return quantized -def convert(model: GraphModule, is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None, - is_standalone_module: bool = False, - _remove_qconfig_flag: bool = True, - convert_qconfig_dict: Dict[str, Any] = None) -> torch.nn.Module: - """ standalone_module means it a submodule that is not inlined in +def maybe_recursive_remove_dequantize(arg: Any, node: Node, graph: Graph): + """ If the arg is a dequantize Node, or a list/tuple/dict of dequantize Node, + we'll recursively remove the dequantize Node + """ + if isinstance(arg, Node) and \ + arg.op == "call_method" and \ + arg.target == "dequantize": + quantize_node = arg.args[0] + # we only replace the specific use since dequantize could be used by other nodes + # as well + node.replace_input_with(arg, quantize_node) + elif isinstance(arg, (list, tuple)): + for arg_element in arg: + maybe_recursive_remove_dequantize(arg_element, node, graph) + elif isinstance(arg, dict): + for arg_element in arg.values(): + maybe_recursive_remove_dequantize(arg_element, node, graph) + else: + warnings.warn(f"Unsupported node type in recursive remove dequantize: {type(arg)}") + +def get_module_path_and_prefix( + obs_node: Node, + node_name_to_scope: Dict[str, Tuple[str, type]], + qconfig_map: Dict[str, QConfigAny]): + """ Given and observer node, get the `Scope` or the fully qualified name for + the submodule containing the observed node, also return a prefix of "_input" + when the observed node is an input of a F.linear op, and not the output of another + quantized op. 
+ TODO: this logic is hacky, we should think about how to remove it or make it more + general + """ + observed_node = obs_node.args[0] + # an observer can be inserted for both input of the next operator or output of the previous + # operator (they can be the same) + # this flag identifies if the observer is inserted only because the observed node is + # the input of the next operator + assert isinstance(observed_node, Node), \ + f"Expecting observed node to be a Node, but got {observed_node}" + is_input_observer_only = qconfig_map[observed_node.name] is None if observed_node.name in qconfig_map else None + if is_input_observer_only: + # if the quantize function is at the input of op, then we find the first user of the observer_node + # to get the path. If a linear call_function is in the user list, we return the first instance + # of linear node to get the FQN. + users = list(obs_node.users) + first_linear_use_or_first_use = users[0] if users else None + linear_node = None + for n in users: + if n.op == "call_function" and n.target == torch.nn.functional.linear: + linear_node = n + break + if linear_node: + first_linear_use_or_first_use = linear_node + prefix = "_input" + else: + # if the quantize function is at the output of the op, we use the observer input node to get the path + first_linear_use_or_first_use = observed_node + prefix = "" + + if first_linear_use_or_first_use and first_linear_use_or_first_use.name in node_name_to_scope: + module_path, _ = node_name_to_scope[first_linear_use_or_first_use.name] + else: + # TODO: it's not used, so actually we can skip quantization + # but this requires changing return type of quantize_node + # we can fix it later if needed + module_path = "" + return module_path, prefix + +def insert_dequantize_node( + node: Node, + graph: Graph): + """ Inserts dequantize node for `node` in `graph` + """ + with graph.inserting_after(node): + dequantize_node = graph.call_method("dequantize", (node,)) + for user_node in dict(node.users): + if user_node is not dequantize_node: + user_node.replace_input_with(node, dequantize_node) + +def maybe_get_observer_for_node( + node: Node, + modules: Dict[str, torch.nn.Module] +) -> Optional[torch.nn.Module]: + """ + If the node is observed, return the observer + instance. Otherwise, return None. + """ + for maybe_obs_node, _ in node.users.items(): + if maybe_obs_node.op == 'call_module': + maybe_obs = modules[str(maybe_obs_node.target)] + if is_activation_post_process(maybe_obs): + return maybe_obs + return None + +def convert_standalone_module( + node: Node, + modules: Dict[str, torch.nn.Module], + model: torch.fx.GraphModule, + is_reference: bool, + backend_config_dict: Optional[Dict[str, Any]]): + """ Converts an observed standalone module to a quantized standalone module by calling + the fx convert api, currently using the same `is_reference` flag as parent, but we may + change this behavior in the future (e.g.
separating quantization and lowering for + standalone module as well) + + Args: + - node: The call_module node of the observed standalone module + - modules: named_module of original model + - model: original model + - is_reference: a flag from parent provided by user to decide if we want to + produce a reference model or a fbgemm/qnnpack model + - backend_config_dict: backend configuration of the target backend of quantization + """ + convert = torch.ao.quantization.quantize_fx.convert_fx # type: ignore[attr-defined] + # We know that observed standalone module is a GraphModule since + # it's produced by us + observed_standalone_module : GraphModule = modules[str(node.target)] # type: ignore[assignment] + sm_input_quantized_idxs = \ + observed_standalone_module \ + ._standalone_module_input_quantized_idxs\ + .tolist() # type: ignore[operator] + # remove the dequantize nodes for inputs + args = list(node.args) + for idx in range(len(args)): + if idx in sm_input_quantized_idxs: + arg = args[idx] + if arg.op == "call_method" and arg.target == "dequantize": # type: ignore[union-attr] + quantize_node = arg.args[0] # type: ignore[union-attr] + node.replace_input_with(arg, quantize_node) + if len(arg.users) == 0: # type: ignore[union-attr] + model.graph.erase_node(arg) + # add dequantize node for output + sm_output_quantized_idxs = \ + observed_standalone_module \ + ._standalone_module_output_quantized_idxs \ + .tolist() # type: ignore[operator] + if len(sm_output_quantized_idxs) > 0: + assert sm_output_quantized_idxs[0] == 0, "Currently only quantized" + "output idxs = [0] is supported" + + # if it's non-empty, then it means the output is kept in quantized form + # we'll just add a dequantize node after this node + insert_dequantize_node(node, model.graph) + + # TODO: allow convert_custom_config_dict to override backend_config_dict + # for standalone module + # TODO: think about how to handle `is_reference` here + quantized_standalone_module = convert( + observed_standalone_module, + is_reference=is_reference, + backend_config_dict=backend_config_dict) + parent_name, name = _parent_name(node.target) + # update the modules dict + setattr(modules[parent_name], name, quantized_standalone_module) + modules[str(node.target)] = quantized_standalone_module + +def convert_weighted_module( + node: Node, + modules: Dict[str, torch.nn.Module], + observed_node_names: Set[str], + quantized_reference_module_mapping: Dict[Callable, Any], + qconfig_map: Dict[str, QConfigAny]): + """ Convert a weighted module to reference quantized module in the model + If the QConfig of a QAT module is not set, the module will still be converted to + a float module. + + Args: + - node: The call_module node of the observed standalone module + - modules: named_module of original model + - observed_node_names: names for the set of observed fx node, we can skip + this conversion if the node is not observed + - quantized_reference_module_mapping: module mapping from floating point module class + to quantized reference module class, e.g. 
nn.Conv2d to nn.quantized._reference.Conv2d + """ + original_module = modules[str(node.target)] + float_module = original_module + weight_post_process = None + + if isinstance( + original_module, + QAT_MODULE_CLASSES): + # Converting qat module to a float module, we need to attach + # weight fake_quant to the module, weight fake_quant is assumed to be run during + # QAT so we don't need to run it again here + float_module = original_module.to_float() # type: ignore[operator] + # change qat module to float module + parent_name, name = _parent_name(node.target) + setattr(modules[parent_name], name, float_module) + weight_post_process = original_module.weight_fake_quant + + qconfig = original_module.qconfig + is_observed = node.name in observed_node_names + # If a qconfig is not defined for this node, then skip converting to a reference module + if qconfig is None or has_none_qconfig(node, qconfig_map) or not is_observed: + return + + # TODO: rename weight_is_statically_quantized to weight_is_int8_quantized + is_weight_quantized = weight_is_quantized(qconfig) + quant_type = get_quant_type(qconfig) + + # skip reference module swapping for embedding when quantization mode does not + # match + # TODO: we need a more systematic way to handle this after we migrate to use + # backend_config_dict everywhere + if isinstance(original_module, WEIGHT_ONLY_MODULE_CLASSES) and \ + quant_type != QuantType.WEIGHT_ONLY: + return + + if isinstance(original_module, DYNAMIC_MODULE_CLASSES) and \ + quant_type != QuantType.DYNAMIC: + return + + # the condition for swapping the module to reference quantized module is: + # weights need to be quantized + if not is_weight_quantized: + return + + fused_module = None + # extract the individual float_module and fused module + if isinstance(float_module, torch.nn.intrinsic._FusedModule): + fused_module = float_module + float_module = fused_module[0] # type: ignore[index] + + # TODO: expose this through backend_config_dict + # weight_qparams or weight_qparams dict + wq_or_wq_dict = {} + if isinstance(float_module, torch.nn.RNNCellBase): + weight_post_process_ih = qconfig.weight() # type: ignore[union-attr, operator] + weight_post_process_hh = qconfig.weight() # type: ignore[union-attr, operator] + weight_post_process_ih(float_module.weight_ih) + weight_post_process_hh(float_module.weight_hh) + weight_qparams_ih = get_qparam_dict(weight_post_process_ih) + weight_qparams_hh = get_qparam_dict(weight_post_process_hh) + wq_or_wq_dict = { + "weight_ih": weight_qparams_ih, + "weight_hh": weight_qparams_hh, + } + elif isinstance(float_module, torch.nn.LSTM): + # format for wq_or_wq_dict (flattened attributes): + # {"weight_ih_l0_scale": ..., "weight_ih_l0_qscheme": ..., ...} + for wn in float_module._flat_weights_names: + if hasattr(float_module, wn) and wn.startswith("weight"): + weight = getattr(float_module, wn) + weight_post_process = qconfig.weight() # type: ignore[union-attr, operator] + if weight_post_process.dtype == torch.qint8: + weight_post_process(weight) + wq_or_wq_dict[wn] = get_qparam_dict(weight_post_process) + else: + # weight_post_process is None means the original module is not a QAT module + # we need to get weight_post_process from qconfig in this case + if weight_post_process is None: + weight_post_process = qconfig.weight() # type: ignore[union-attr, operator] + # run weight observer + # TODO: This is currently a hack for QAT to get the right shapes for scale and zero point.
+ # In the future, we should require the user to calibrate the model after calling prepare + # Issue: https://github.com/pytorch/pytorch/issues/73941 + weight_post_process(float_module.weight) # type: ignore[operator] + wq_or_wq_dict = get_qparam_dict(weight_post_process) + + # We use the same reference module for all modes of quantization: static, dynamic, weight_only + ref_qmodule_cls = quantized_reference_module_mapping.get(type(float_module), None) + assert ref_qmodule_cls is not None, f"No reference quantized module class configured for {type(float_module)}" + ref_qmodule = ref_qmodule_cls.from_float(float_module, wq_or_wq_dict) # type: ignore[attr-defined] + if fused_module is not None: + fused_module[0] = ref_qmodule + else: + parent_name, name = _parent_name(node.target) + setattr(modules[parent_name], name, ref_qmodule) + +def convert_custom_module( + node: Node, + graph: Graph, + modules: Dict[str, torch.nn.Module], + custom_module_class_mapping: Dict[Callable, Callable], + statically_quantized_custom_module_nodes: Set[Node]): + """ Converts an observed custom module to a quantized custom module based on + `custom_module_class_mapping` + For static quantization, we'll also remove the previous `dequantize` node and + attach the observer node for output to the module, the observer for the node + will be converted to a dequantize node instead of quantize-dequantize pairs + later in the graph. In the end we would have a quantized custom module that + has the same interface as a default quantized module in nn.quantized namespace, + i.e. quantized input and quantized output. + + Args: + - node: The call_module node of the observed standalone module + - graph: The graph containing the node + - modules: named_module of original model + - custom_module_class_mapping: mapping from observed custom module class to + quantized custom module class, used to swap custom modules + - statically_quantized_custom_module_nodes: we'll add the custom module node + if we find it is statically quantized, this will be used later when converting + observers to quant/dequant node pairs, if the observed node is a statically + quantized custom module nodes, we'll convert the observer to a dequantize node, + this is to keep the interface the same as the default quantized module. + TODO: maybe we want to redesign this part to align with reference model design + as well, but there has been some discussions around the interface, so we can do + it later. 
+ """ + observed_custom_module = modules[str(node.target)] + maybe_obs = maybe_get_observer_for_node(node, modules) + qconfig = observed_custom_module.qconfig + if activation_is_statically_quantized(qconfig): + statically_quantized_custom_module_nodes.add(node) + # remove the previous dequant node + prev_node = node.args[0] + # expecting the input node for a custom module node to be a Node + assert isinstance(prev_node, Node), \ + f"Expecting the argument for custom module node to be a Node, but got {prev_node}" + if prev_node.op == "call_method" and prev_node.target == "dequantize": + # change the connection for custom module, we'll change the input + # of custom module node to quantize node: + # Before: quantize - dequantize - custom - module + # After: quantize - custom - module + # \ - dequantize + node.replace_input_with(prev_node, prev_node.args[0]) + + # Remove the dequantize node if it doesn't have other users + if len(prev_node.users) == 0: + graph.erase_node(prev_node) + + # absorb the following observer into the module conversion + activation_post_process = maybe_get_observer_for_node(node, modules) + assert activation_post_process is not None + observed_custom_module.activation_post_process = activation_post_process + + # swap the observed custom module to quantized custom module + quantized_custom_module_class = get_swapped_custom_module_class( + observed_custom_module, custom_module_class_mapping, qconfig) + quantized_custom_module = \ + quantized_custom_module_class.from_observed(observed_custom_module) + parent_name, name = _parent_name(node.target) + setattr(modules[parent_name], name, quantized_custom_module) + +def convert( + model: GraphModule, is_reference: bool = False, + convert_custom_config_dict: Dict[str, Any] = None, + is_standalone_module: bool = False, + _remove_qconfig_flag: bool = True, + convert_qconfig_dict: Dict[str, Any] = None, + backend_config_dict: Optional[Dict[str, Any]] = None) -> torch.nn.Module: + """ + We will convert an observed model (a module with observer calls) to a reference + quantized model, the rule is simple: + 1. for each observer module call in the graph, we'll convert it to calls to + quantize and dequantize functions based on the observer instance + 2. for weighted operations like linear/conv, we need to convert them to reference + quantized module, this requires us to know whether the dtype configured for the + weight is supported in the backend, this is done in prepare step and the result + is stored in observed_node_names, we can decide whether we need to swap the + module based on this set + + standalone_module means it a submodule that is not inlined in parent module, and will be quantized separately as one unit. Returns a quantized standalone module, whether input/output is quantized is @@ -169,7 +588,7 @@ def convert(model: GraphModule, is_reference: bool = False, """ if convert_custom_config_dict is None: convert_custom_config_dict = {} - patterns, node_name_to_scope, prepare_custom_config_dict, _ = restore_state(model) + patterns, node_name_to_scope, prepare_custom_config_dict, observed_node_names = restore_state(model) qconfig_map: Dict[str, QConfigAny] = model._qconfig_map # type: ignore[assignment] # TODO this should be removed now that gpu support for quantization is being supported. 
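# A minimal, illustrative sketch (not part of this patch) of the flow the new convert()
# docstring above describes: an observed model becomes a reference quantized model.
# The toy module, qconfig_dict, and calibration tensor are assumptions for illustration;
# prepare_fx/convert_fx and the is_reference flag are the FX graph mode quantization
# entry points that this file backs.
import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.nn.functional.relu(self.linear(x))

model = ToyModel().eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}
prepared = prepare_fx(model, qconfig_dict)  # inserts observers for matched patterns
prepared(torch.randn(2, 4))                 # calibration run populates the observers
# Rule 1: each observer call is rewritten into quantize_per_tensor/dequantize nodes.
# Rule 2: weighted modules such as nn.Linear are swapped for reference quantized modules.
reference_model = convert_fx(prepared, is_reference=True)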
@@ -198,9 +617,7 @@ def convert(model: GraphModule, is_reference: bool = False, modules_copy = copy.deepcopy(modules) convert_dict_to_ordered_dict(convert_qconfig_dict) if model._is_qat: - additional_qat_module_mapping = prepare_custom_config_dict.get( - "additional_qat_module_mapping", {}) - convert_qconfig_dict = update_qconfig_for_qat(convert_qconfig_dict, additional_qat_module_mapping) + convert_qconfig_dict = update_qconfig_for_qat(convert_qconfig_dict, {}) convert_qconfig_dict = update_qconfig_for_fusion(model, convert_qconfig_dict) compare_prepare_convert_qconfig_dict(prepare_qconfig_dict, convert_qconfig_dict) # type: ignore[arg-type] @@ -217,10 +634,7 @@ def convert(model: GraphModule, is_reference: bool = False, custom_module_classes = get_custom_module_class_keys( convert_custom_config_dict, "observed_to_quantized_custom_module_class") - matches = find_matches( - model.graph, modules, patterns, - qconfig_map, - custom_module_classes=custom_module_classes) + custom_module_class_mapping = convert_custom_config_dict.get("observed_to_quantized_custom_module_class", {}) if model._equalization_qconfig_map is not None: # If we want to do equalization then do the following: @@ -233,353 +647,167 @@ def convert(model: GraphModule, is_reference: bool = False, # for dynamic quant ops or weight only quant ops run_weight_observers(model) - quantized_graph = Graph() - env: Dict[str, Dict[Optional[torch.dtype], Node]] = defaultdict(lambda: defaultdict(Node)) # type: ignore[arg-type] - graph_inputs: List[str] = [] for node in model.graph.nodes: if node.op == 'placeholder': graph_inputs.append(node.name) - def load_non_quantized(n: Node) -> Node: - assert n.name in env, \ - 'trying to load float node but did not find ' + \ - 'node:' + n.name + \ - ' in env: ' + \ - str(env) - dtype_to_node = env[n.name] - if torch.float in dtype_to_node: - return dtype_to_node[torch.float] - elif None in dtype_to_node: - return dtype_to_node[None] - else: - quantized_node = None - for dtype in [torch.quint8, torch.qint8, torch.float16]: - if dtype in dtype_to_node: - quantized_node = dtype_to_node[dtype] - break - assert quantized_node is not None, "Did not find a supported quantized dtype:{}".format(dtype_to_node) - env[n.name][torch.float] = Proxy(quantized_node).dequantize().node - return env[n.name][torch.float] - - def load_quantized(dtype: torch.dtype): - def load_quantized_impl(n: Node): - assert n.name in env, \ - 'trying to load quantized node but did not find node:' + \ - n.name + ' in environment:' + str(env) - dtype_to_node = env[n.name] - local_dtype : Optional[torch.dtype] = dtype - if local_dtype == torch.float and local_dtype not in dtype_to_node: - local_dtype = None - if local_dtype in [torch.float, None]: - return load_non_quantized(n) - assert local_dtype in dtype_to_node, f'Expecting {dtype} in {dtype_to_node}' - return dtype_to_node[local_dtype] - - return load_quantized_impl - - def load_x(n: Node) -> Node: - assert n.name in env, \ - 'node ' + n.name + ' does not exist in environment' - dtype_to_node = env[n.name] - dtypes = [torch.quint8, torch.qint8, torch.float16, torch.float32, None] - for dtype in dtypes: - if dtype in dtype_to_node: - return dtype_to_node[dtype] - raise Exception(f'dtype {dtype} not found in environment: {dtype_to_node} for node {n.name}') - - def load_arg( - quantized: Optional[Union[List[int], Dict[int, torch.dtype], torch.dtype, Tuple[int, ...]]] - ) -> Callable[[Node], Argument]: + # TODO: move this outside of this function + def 
replace_observer_with_quantize_dequantize_node( + model: torch.nn.Module, + graph: Graph, + node: Node, + modules: Dict[str, torch.nn.Module], + node_name_to_scope: Dict[str, Tuple[str, type]], + qconfig_map: Dict[str, QConfigAny]) -> None: + """ Replace activation_post_process module call node with quantize and + dequantize node + + Before: + ... -> observer_0(x) -> ... + After: + ... -> torch.quantize_per_tensor(x, ...) -> x.dequantize() -> ... """ - Input: quantized, which can be None, torch.dtype, list or tuple - - if quantized is None, then we'll load the node as long as it - exists - - if quantized is a dtype, then all args will be - quantized to the specific dtype - - if quantized is an empty list or tuple, then it is the same as load_arg(quantized=torch.float) - - if quantized is a list or tuple, then arg should be a list and - the args with corresponding indexes will be quantized to torch.quint8 - - - Output: fn which takes arg_or_args, and loads them from the - corresponding environment depending on the value of quantized. - """ - assert quantized is None or \ - isinstance(quantized, (tuple, list, dict, torch.dtype)), type(quantized) - if isinstance(quantized, (tuple, list, dict)) and len(quantized) == 0: - # empty tuple or list means nothing is quantized - quantized = torch.float - - def load_arg_impl(arg_or_args): - # we'll update the format of `quantized` - # to better match arg_or_args - updated_quantized: Optional[Union[List[int], torch.dtype, Dict[int, torch.dtype], Tuple[int, ...]]] = quantized - - if isinstance(quantized, (tuple, list)) and \ - len(quantized) == 1 and isinstance(arg_or_args, Node): - # when argument is one Node instead of tuple, we just need to check - # 0 is in the quantized list - if 0 in quantized: - updated_quantized = torch.quint8 - - if updated_quantized is None: - return map_arg(arg_or_args, load_x) - if isinstance(updated_quantized, torch.dtype): - return map_arg( - arg_or_args, - load_quantized(updated_quantized)) - elif isinstance(updated_quantized, (tuple, list)): - assert isinstance(arg_or_args, (tuple, list)), arg_or_args - loaded_args = [] - # for now, we only support quantizing positional arguments - for i, a in enumerate(arg_or_args): - if i in updated_quantized: - # Currently it's hardcoded to torch.quint8, we can extend this - # in the future to support all quantized - # dtypes - loaded_args.append(map_arg(a, load_quantized(torch.quint8))) - else: - loaded_args.append(map_arg(a, load_non_quantized)) - return type(arg_or_args)(loaded_args) - elif isinstance(updated_quantized, dict): - loaded_args = [] - for i, a in enumerate(arg_or_args): - if i in updated_quantized: - loaded_args.append(map_arg(a, load_quantized(updated_quantized[i]))) - else: - loaded_args.append(map_arg(a, load_non_quantized)) - return type(arg_or_args)(loaded_args) - return load_arg_impl - - def node_arg_is_quantized(node_arg: Any) -> bool: - if isinstance(node_arg, Node): - assert node_arg.name in env, \ - 'Expecting node_arg to be in the environment' - if node_arg.name in env: - dtype_to_node = env[node_arg.name] - return any([x in dtype_to_node for x in [torch.quint8, torch.qint8, torch.float16]]) - else: - return False - elif isinstance(node_arg, list): - quantized = map(node_arg_is_quantized, node_arg) - if all(quantized): - return True - elif not any(quantized): - return False - else: - raise Exception( - "partially quantized inputs in list not handled yet") - else: - return False - - def is_output_quantized( - node: Node, obj: QuantizeHandler, qconfig: 
QConfigAny, - modules: Dict[str, torch.nn.Module]) -> bool: - """ Check if output node is quantized or not """ - assert modules is not None - # for some ops the output is quantized only when `is_reference` is True - # and when `is_reference` is False, it has limited qconfig - # support, for example `add` - # ideally this check should not happen here, it should happen either in - # prepare or during lowering, we don't need this check - # after the default path is changed to produce reference patterns - quantized = obj.is_output_quantized(qconfig) - - # Need to get correct quantized/non-quantized state forn the output - # of FixedQParamsQuantizeHandler - # TODO: we may want to try to remove the special case here - # as well - if obj.should_mark_output_quantized_from_input_quantized_status(qconfig): - assert node.op in [ - 'call_module', - 'call_function', - 'call_method'], \ - 'FixedQParamsQuantizeHandler of type ' + node.op + ' is not handled' - # TODO: need to extend this to consider all relevant args instead of just arg[0] - quantized = node_arg_is_quantized(node.args[0]) - - # the output is unquantized if the node is not a CopyNode - # or the activation is not statically quantized - if not activation_is_statically_quantized(qconfig) or \ - not obj.input_output_observed(): - quantized = False - if node_return_type_is_int(node): - quantized = False - - return quantized - - def insert_quantize_node(node: Node, modules: Dict[str, torch.nn.Module]) -> None: - """ Given a activation_post_process module call node, insert a - quantize node""" assert modules is not None assert isinstance(node.target, str) + module_path, prefix = get_module_path_and_prefix(node, node_name_to_scope, qconfig_map) observer_module = modules[node.target] - prev_node = node.args[0] - if observer_module.dtype == torch.float32: - # copy the observer for fp32 dtype - env[node.name][torch.float] = quantized_graph.node_copy( - node, load_non_quantized) - elif isinstance(prev_node, Node) and prev_node.name in env: - # if previous node is already quantized, we'll just remove the - # activation_post_process - prev_dtype_to_node: Dict[Optional[torch.dtype], Node] = env[prev_node.name] - current_dtype: Optional[torch.dtype] = observer_module.dtype # type: ignore[assignment] - if current_dtype in prev_dtype_to_node: - env[node.name][current_dtype] = prev_dtype_to_node[current_dtype] - else: - root_module = modules[""] - assert isinstance(prev_node, Node) - observer_dtype: torch.dtype = observer_module.dtype # type: ignore[assignment] - env[node.name][observer_dtype] = \ - quantize_node( - load_non_quantized(prev_node), - observer_module, node, modules, quantized_graph, - node_name_to_scope, is_input=True) + maybe_quantize_node_info = get_quantize_node_info(observer_module) + # Skip replacing observers to quant/dequant nodes if the qconfigs of all + # consumers and producers of this observer are None + skip_replacement = all([ + has_none_qconfig(n, qconfig_map) for n in + list(node.args) + list(node.users.keys())]) + if skip_replacement or maybe_quantize_node_info is None: + # didn't find correponding quantize op and info for the observer_module + # so we just remove the observer + with graph.inserting_before(node): + node.replace_all_uses_with(node.args[0]) + graph.erase_node(node) else: - # replace activation post process with quantization ops - root_module = modules[""] - assert isinstance(node.args[0], Node) - dtype: torch.dtype = observer_module.dtype # type: ignore[assignment] - env[node.name][dtype] = \ - quantize_node( - 
load_non_quantized(node.args[0]), - observer_module, node, modules, - quantized_graph, - node_name_to_scope, is_input=True) + # otherwise, we can convert the observer moduel call to quantize/dequantize node + node_type, quantize_op, qparams = maybe_quantize_node_info + # replace observer node with quant - dequant node + with graph.inserting_before(node): + input_node = node.args[0] + inputs = [input_node] + for key, value in qparams.items(): + # TODO: we can add the information of whether a value needs to + # be registered as an attribute in qparams dict itself + if key in ['_scale_', '_zero_point_']: + # For scale and zero_point values we register them as buffers in the root module. + # TODO: maybe need more complex attr name here + qparam_node = create_getattr_from_value(model, graph, module_path + prefix + key, value) + inputs.append(qparam_node) + else: + # for qparams that are not scale/zero_point (like axis, dtype) we store them as literals in the graph. + inputs.append(value) + + quantized_node = graph.create_node(node_type, quantize_op, tuple(inputs), {}) + dequantized_node = graph.call_method("dequantize", args=(quantized_node,)) + node.replace_all_uses_with(dequantized_node) + graph.erase_node(node) + + # this is a temporary hack for custom module, we may want to implement + # this properly after the custom module class design is finalized + def replace_observer_with_dequantize_node(node: Node, graph: Graph): + call_custom_module_node = node.args[0] + assert isinstance(call_custom_module_node, Node), \ + f"Expecting the for call custom module node to be a Node, but got {call_custom_module_node}" + node.replace_all_uses_with(call_custom_module_node) + graph.erase_node(node) + insert_dequantize_node(call_custom_module_node, graph) # additional state to override inputs to be quantized, if specified # by the user placeholder_node_seen_cnt = 0 - output_node_seen_cnt = 0 input_quantized_idxs: List[int] = prepare_custom_config_dict.get( "input_quantized_idxs", []) output_quantized_idxs: List[int] = prepare_custom_config_dict.get( "output_quantized_idxs", []) - for node in model.graph.nodes: - if node.op == "output": - cur_output_node_idx = output_node_seen_cnt - output_node_seen_cnt += 1 - if cur_output_node_idx in output_quantized_idxs: - # Result are kept quantized if the user specified the - # output_quantized_idxs override. - graph_output = map_arg(node.args[0], load_x) - else: - graph_output = map_arg(node.args[0], load_non_quantized) - quantized_graph.output(graph_output) - continue - root_node, matched, matched_pattern, obj, qconfig = \ - matches.get(node.name, (None, None, None, None, None)) - if root_node is node: - is_observed_standalone_module_node = ( - node.op == 'call_module' and - is_observed_standalone_module( - modules[node.target]) - ) - if qconfig is None and not is_observed_standalone_module_node: - result = quantized_graph.node_copy( - node, load_non_quantized) - quantized = False - # If there are QAT swapped modules in the graph that we don't want to quantize, rever them back to FP32 ones. 
- if node.op == 'call_module' and type(modules[node.target]) in DEFAULT_QAT_MODULE_MAPPINGS.values(): - float_mod = modules[node.target].to_float() - setattr(model, node.name, float_mod) - with model.graph.inserting_before(node): - new_float_node = model.graph.create_node('call_module', node.name, node.args, node.kwargs) - else: - assert obj is not None - # We will get whether the output is quantized or not before - # convert for standalone module and after convert - # for non-standalone module, since _standalone_module_output_quantized_idxs - # is only available in observed standalone module - if is_observed_standalone_module_node: - out_quant_idxs = modules[node.target]._standalone_module_output_quantized_idxs.tolist() # noqa: B950 - assert len(out_quant_idxs) <= 1, "Currently standalone only support one output" - quantized = 0 in out_quant_idxs - - qconfig = qconfig_map[node.name] - # Note: load_arg can be overwritten in the convert method when used to - # create Node in graph - result = obj.convert( - node, qconfig, modules, quantized_graph, node_name_to_scope, load_arg, is_reference=is_reference, - convert_custom_config_dict=convert_custom_config_dict) - if not is_observed_standalone_module_node: - quantized = is_output_quantized(node, obj, qconfig, modules) - - if quantized: - env[node.name][activation_dtype(qconfig)] = result - else: - env[node.name][torch.float] = result - continue - elif root_node is not None: - if qconfig is None: - # This branch is hit if all of these conditions are met: - # 1. we are in a fusion pattern of multiple nodes (i.e. add-relu) - # 2. the current node is not the "root_node" of the pattern - # 3. quantization for this pattern is disabled - # - # In this case, we need to make sure to populate the env with - # intermediate nodes manually, because the QuantizeHandler.convert - # function will not be called. - result = quantized_graph.node_copy( - node, load_non_quantized) - env[node.name][torch.float] = result - continue + if backend_config_dict is None: + quantized_reference_module_mapping = copy.deepcopy(DEFAULT_REFERENCE_STATIC_QUANT_MODULE_MAPPINGS) + else: + quantized_reference_module_mapping = get_quantized_reference_module_mapping(backend_config_dict) + # convert tuples so that it can work with isinstance(module, tuple_of_classes) + weighted_module_classes = tuple(quantized_reference_module_mapping.keys()) + statically_quantized_custom_module_nodes: Set[Node] = set() - # handle activation post process calls - if node.op == 'call_module' and \ - is_activation_post_process(modules[node.target]): - insert_quantize_node(node, modules) - elif node.op == 'placeholder': + for node in list(model.graph.nodes): + if node.op == 'placeholder': cur_placeholder_node_idx = placeholder_node_seen_cnt placeholder_node_seen_cnt += 1 if cur_placeholder_node_idx in input_quantized_idxs: - env[node.name][torch.quint8] = quantized_graph.node_copy( - node, load_non_quantized) + # Inputs are assumed to be quantized if the user specifid the + # input_quantized_idxs override. + # we need to dequantize the inputs since all operators took + # floating point inputs in reference quantized models + insert_dequantize_node(node, model.graph) + elif node.op == "output": + # If the argument is empty we don't need to do anything + if len(output_quantized_idxs) == 0: + continue + # Result are kept quantized if the user specified the + # output_quantized_idxs override. 
+ # Remove the dequantize operator for the node in the end if any + return_node = node + output = node.args[0] + # outputs can be Node, list, tuple, dict, other cases are not supported yet + if isinstance(output, (list, tuple)): + for idx in output_quantized_idxs: + maybe_recursive_remove_dequantize(output[idx], return_node, model.graph) + elif isinstance(output, (Node, dict)): + # we treat dict as a single argument currently, but it can be extended + # to support {"key": dtype} after we change output_quantized_idxs to + # dict + if 0 in output_quantized_idxs: + maybe_recursive_remove_dequantize(output, return_node, model.graph) else: - env[node.name][torch.float] = \ - quantized_graph.node_copy(node, load_non_quantized) - else: - # copy quantized or non-quantized node - # get_tensor_info_node like shape works for both - # quantized and non-quantized input and output a non-Tensor - # (we use None for dtype currently for non-Tensors) - if is_get_tensor_info_node(node): - env[node.name][None] = \ - quantized_graph.node_copy(node, load_x) - else: - env[node.name][torch.float] = \ - quantized_graph.node_copy(node, load_non_quantized) + warnings.warn(f"Unsupported node type for output_quantized_idxs: {type(output)}") + elif node.op == "call_module": + if is_activation_post_process(modules[node.target]): + observed_node = node.args[0] + if observed_node in statically_quantized_custom_module_nodes: + replace_observer_with_dequantize_node(node, model.graph) + else: + replace_observer_with_quantize_dequantize_node( + model, model.graph, node, modules, node_name_to_scope, + qconfig_map) + elif is_observed_standalone_module(modules[node.target]): + convert_standalone_module( + node, modules, model, is_reference, backend_config_dict) + elif type(modules[node.target]) in set( + weighted_module_classes).union(QAT_MODULE_CLASSES).union(FUSED_MODULE_CLASSES): + # extra check for fused module classes to make sure they are fused module classes + # of target modules + if type(modules[node.target]) in FUSED_MODULE_CLASSES and \ + type(modules[node.target][0]) not in FLOAT_WEIGHTED_MODULE_CLASSES: + continue + convert_weighted_module( + node, modules, observed_node_names, quantized_reference_module_mapping, qconfig_map) + elif type(modules[node.target]) in custom_module_classes: + convert_custom_module( + node, model.graph, modules, custom_module_class_mapping, + statically_quantized_custom_module_nodes) - # remove activation post process - act_post_process_removed_graph = Graph() - remove_env: Dict[str, Node] = {} - - def load_arg_remove(a: Argument) -> Argument: - return map_arg(a, lambda node: remove_env[node.name]) + preserved_attributes = set(convert_custom_config_dict.get("preserved_attributes", [])) + model = QuantizedGraphModule(model, copy.deepcopy(model.graph), preserved_attributes) - for node in quantized_graph.nodes: - if node.op == 'output': - act_post_process_removed_graph.output( - map_arg(node.args[0], load_arg_remove)) - continue - if node.op == 'call_module' and \ - is_activation_post_process(modules[node.target]): - # remove activation post process node - remove_env[node.name] = remove_env[node.args[0].name] - else: - remove_env[node.name] = act_post_process_removed_graph.node_copy( - node, load_arg_remove) + # remove deadcode after converting observers to quant/dequant ops + model.graph.eliminate_dead_code() + model.recompile() - # removes qconfig and activation_post_process modules - if _remove_qconfig_flag: - _remove_qconfig(model) - preserved_attributes = 
set(convert_custom_config_dict.get("preserved_attributes", [])) - model = QuantizedGraphModule(model, act_post_process_removed_graph, preserved_attributes) + # TODO: maybe move this to quantize_fx.py if not is_reference: model = duplicate_dequantize_node(model) + model = duplicate_quantize_dynamic_node(model) model = lower_to_fbgemm(model, qconfig_map, node_name_to_scope) model = remove_quant_dequant_pairs(model) model = remove_extra_dequantize(model) + # TODO: this looks hacky, we want to check why we need this and see if we can + # remove this + # removes qconfig and activation_post_process modules + if _remove_qconfig_flag: + _remove_qconfig(model) return model diff --git a/torch/ao/quantization/fx/fuse.py b/torch/ao/quantization/fx/fuse.py index a8d48420c8d34b..c7f4444c6a0317 100644 --- a/torch/ao/quantization/fx/fuse.py +++ b/torch/ao/quantization/fx/fuse.py @@ -4,9 +4,6 @@ map_arg ) from torch.fx.graph import Graph -from ..utils import ( - get_combined_dict -) from .graph_module import ( FusedGraphModule ) @@ -15,13 +12,14 @@ MatchAllNode, ) from .pattern_utils import ( - get_default_fusion_patterns, + sorted_patterns_dict, ) from .backend_config.utils import get_fusion_pattern_to_fuse_handler_cls from .backend_config.utils import get_fuser_method_mapping from .backend_config.utils import get_fusion_pattern_to_root_node_getter from .backend_config.utils import get_fusion_pattern_to_extra_inputs_getter +from .backend_config import get_native_backend_config_dict from .fusion_patterns import * # noqa: F401,F403 @@ -42,21 +40,14 @@ def fuse( input_graph = model.graph named_modules = dict(input_root.named_modules()) - # TODO: remove this branch after we define the configurations for the - # default/native backend if backend_config_dict is None: - additional_fusion_patterns = \ - fuse_custom_config_dict.get("additional_fusion_pattern", {}) - fusion_pattern_to_fuse_handler_cls = get_combined_dict( - get_default_fusion_patterns(), additional_fusion_patterns) - fuser_method_mapping = None - fusion_pattern_to_root_node_getter = {} - fusion_pattern_to_extra_inputs_getter = {} - else: - fusion_pattern_to_fuse_handler_cls = get_fusion_pattern_to_fuse_handler_cls(backend_config_dict) - fuser_method_mapping = get_fuser_method_mapping(backend_config_dict) - fusion_pattern_to_root_node_getter = get_fusion_pattern_to_root_node_getter(backend_config_dict) - fusion_pattern_to_extra_inputs_getter = get_fusion_pattern_to_extra_inputs_getter(backend_config_dict) + backend_config_dict = get_native_backend_config_dict() + + fusion_pattern_to_fuse_handler_cls = sorted_patterns_dict(get_fusion_pattern_to_fuse_handler_cls(backend_config_dict)) + fuser_method_mapping = get_fuser_method_mapping(backend_config_dict) + fusion_pattern_to_root_node_getter = get_fusion_pattern_to_root_node_getter(backend_config_dict) + fusion_pattern_to_extra_inputs_getter = get_fusion_pattern_to_extra_inputs_getter(backend_config_dict) + # find fusion fusion_pairs = _find_matches( input_root, input_graph, fusion_pattern_to_fuse_handler_cls) @@ -111,6 +102,7 @@ def _find_matches( # a map from node to the matched subpattern node_to_subpattern: Dict[Node, Any] = {} + # TODO: dedup with quantization matching function in match_utils.py def apply_match(pattern, node, match, matched_node_pattern, node_to_subpattern): if isinstance(pattern, tuple): s, *args = pattern @@ -122,10 +114,13 @@ def apply_match(pattern, node, match, matched_node_pattern, node_to_subpattern): else: # the first pattern matches will take precedence if node.name not in 
match_map: - node_to_subpattern[node] = pattern matched_node_pattern.append(node) - root_node, pattern, handler = match - match_map[node.name] = (root_node, pattern, matched_node_pattern, handler, node_to_subpattern) + # MatchAllNode here is actually MatchAllInputNode which should not + # be added to match_map + if pattern is not MatchAllNode: + node_to_subpattern[node] = pattern + root_node, pattern, handler = match + match_map[node.name] = (root_node, pattern, matched_node_pattern, handler, node_to_subpattern) for node in reversed(graph.nodes): if node.name not in match_map: @@ -133,5 +128,6 @@ def apply_match(pattern, node, match, matched_node_pattern, node_to_subpattern): matched_node_pattern: List[Node] = [] if is_match(modules, node, pattern): apply_match(pattern, node, (node, pattern, value(node)), matched_node_pattern, node_to_subpattern) + break return match_map diff --git a/torch/ao/quantization/fx/fusion_patterns.py b/torch/ao/quantization/fx/fusion_patterns.py index aa4d39c831562b..70a2701e5ac174 100644 --- a/torch/ao/quantization/fx/fusion_patterns.py +++ b/torch/ao/quantization/fx/fusion_patterns.py @@ -1,8 +1,5 @@ import torch from torch.fx.graph import Node, Graph -from .pattern_utils import ( - register_fusion_pattern, -) from ..utils import _parent_name from .quantization_types import NodePattern, Pattern from ..fuser_method_mappings import get_fuser_method_new @@ -34,31 +31,7 @@ def fuse(self, is_qat: bool) -> Node: pass -@register_fusion_pattern((torch.nn.ReLU, torch.nn.Conv1d)) -@register_fusion_pattern((torch.nn.ReLU, torch.nn.Conv2d)) -@register_fusion_pattern((torch.nn.ReLU, torch.nn.Conv3d)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.Conv1d)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.Conv2d)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.Conv3d)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.Linear)) -@register_fusion_pattern((torch.nn.ReLU, torch.nn.Linear)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.BatchNorm2d)) -@register_fusion_pattern((torch.nn.ReLU, torch.nn.BatchNorm2d)) -@register_fusion_pattern((torch.nn.functional.relu, torch.nn.BatchNorm3d)) -@register_fusion_pattern((torch.nn.ReLU, torch.nn.BatchNorm3d)) -@register_fusion_pattern((torch.nn.BatchNorm1d, torch.nn.Conv1d)) -@register_fusion_pattern((torch.nn.BatchNorm2d, torch.nn.Conv2d)) -@register_fusion_pattern((torch.nn.BatchNorm3d, torch.nn.Conv3d)) -@register_fusion_pattern((torch.nn.BatchNorm1d, torch.nn.Linear)) -@register_fusion_pattern((torch.nn.ReLU, (torch.nn.BatchNorm1d, torch.nn.Conv1d))) -@register_fusion_pattern((torch.nn.ReLU, (torch.nn.BatchNorm2d, torch.nn.Conv2d))) -@register_fusion_pattern((torch.nn.ReLU, (torch.nn.BatchNorm3d, torch.nn.Conv3d))) -@register_fusion_pattern((torch.nn.functional.relu, (torch.nn.BatchNorm1d, torch.nn.Conv1d))) -@register_fusion_pattern((torch.nn.functional.relu, (torch.nn.BatchNorm2d, torch.nn.Conv2d))) -@register_fusion_pattern((torch.nn.functional.relu, (torch.nn.BatchNorm3d, torch.nn.Conv3d))) -@register_fusion_pattern((torch.nn.BatchNorm1d, torch.nn.ConvTranspose1d)) -@register_fusion_pattern((torch.nn.BatchNorm2d, torch.nn.ConvTranspose2d)) -@register_fusion_pattern((torch.nn.BatchNorm3d, torch.nn.ConvTranspose3d)) +# TODO: move this to backend_config.fuse_handler class DefaultFuseHandler(FuseHandler): def __init__( self, @@ -75,11 +48,9 @@ def fuse(self, fuse_custom_config_dict: Dict[str, Any], fuser_method_mapping: Optional[Dict[Pattern, 
Union[torch.nn.Sequential, Callable]]], is_qat: bool) -> Node: - additional_fuser_method_mapping = fuse_custom_config_dict.get("additional_fuser_method_mapping", {}) assert root_node.op == "call_module", "Expecting module node to be a call_module Node" root_module = named_modules[str(root_node.target)] - assert len(additional_fuser_method_mapping) == 0, "Fusion implementation is " - "undergoing changes, additoinal_fuser_method_mapping is not supported currently." + def get_modules(pattern): """ Given a node pattern, extract the corresponding modules e.g. input: (relu_node, (bn_node, conv_node)) diff --git a/torch/ao/quantization/fx/graph_module.py b/torch/ao/quantization/fx/graph_module.py index ef43a42d030ff7..2e37e4a557e47e 100644 --- a/torch/ao/quantization/fx/graph_module.py +++ b/torch/ao/quantization/fx/graph_module.py @@ -18,7 +18,7 @@ def __init__(self, root: Union[torch.nn.Module, Dict[str, Any]], graph: Graph, p def __deepcopy__(self, memo): fake_mod = torch.nn.Module() fake_mod.__dict__ = copy.deepcopy(self.__dict__) - return FusedGraphModule(fake_mod, self.graph, self.preserved_attr_names) + return FusedGraphModule(fake_mod, copy.deepcopy(self.graph), copy.deepcopy(self.preserved_attr_names)) class ObservedGraphModule(GraphModule): @@ -45,7 +45,7 @@ def __init__(self, root: Union[torch.nn.Module, Dict[str, Any]], graph: Graph, p def __deepcopy__(self, memo): fake_mod = torch.nn.Module() fake_mod.__dict__ = copy.deepcopy(self.__dict__) - return ObservedGraphModule(fake_mod, self.graph, self.preserved_attr_names) + return ObservedGraphModule(fake_mod, copy.deepcopy(self.graph), copy.deepcopy(self.preserved_attr_names)) def is_observed_module(module: Any) -> bool: return isinstance(module, ObservedGraphModule) @@ -60,7 +60,7 @@ def __init__(self, root: Union[torch.nn.Module, Dict[str, Any]], graph: Graph, p def __deepcopy__(self, memo): fake_mod = torch.nn.Module() fake_mod.__dict__ = copy.deepcopy(self.__dict__) - return ObservedStandaloneGraphModule(fake_mod, self.graph, self.preserved_attr_names) + return ObservedStandaloneGraphModule(fake_mod, copy.deepcopy(self.graph), copy.deepcopy(self.preserved_attr_names)) def is_observed_standalone_module(module: Any) -> bool: return isinstance(module, ObservedStandaloneGraphModule) @@ -104,4 +104,4 @@ def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, def __deepcopy__(self, memo): fake_mod = torch.nn.Module() fake_mod.__dict__ = copy.deepcopy(self.__dict__) - return QuantizedGraphModule(fake_mod, self.graph, self.preserved_attr_names) + return QuantizedGraphModule(fake_mod, copy.deepcopy(self.graph), copy.deepcopy(self.preserved_attr_names)) diff --git a/torch/ao/quantization/fx/match_utils.py b/torch/ao/quantization/fx/match_utils.py index 876bc39d547132..a1217ec2f8973c 100644 --- a/torch/ao/quantization/fx/match_utils.py +++ b/torch/ao/quantization/fx/match_utils.py @@ -7,8 +7,6 @@ from .quantization_types import Pattern from .quantization_patterns import ( QuantizeHandler, - CustomModuleQuantizeHandler, - StandaloneModuleQuantizeHandler, ) from ..qconfig import ( QConfigAny, @@ -76,6 +74,7 @@ def find_matches( graph: Graph, modules: Dict[str, torch.nn.Module], patterns: Dict[Pattern, QuantizeHandler], + root_node_getter_mapping: Dict[Pattern, Callable], qconfig_map: Dict[str, QConfigAny], standalone_module_names: List[str] = None, standalone_module_classes: List[Callable] = None, @@ -114,29 +113,80 @@ def find_matches( match_map: Dict[str, MatchResult] = {} all_matched : Set[str] = set() - def 
record_match(pattern, node, matched): + def _recursive_record_node_in_match_map( + last_node, + match_map, + node_pattern, + matched_node_pattern, + pattern, + match_value, + qconfig): + if isinstance(node_pattern, Node): + match_map[node_pattern.name] = ( + last_node, matched_node_pattern, pattern, match_value, qconfig) + else: + for n in node_pattern: + _recursive_record_node_in_match_map(last_node, match_map, n, matched_node_pattern, pattern, match_value, qconfig) + + # TODO: 1. merge with fuse matcher 2. document the code + def record_match( + pattern, + node, + last_node, + matched_node_pattern, + match_map): if isinstance(pattern, tuple): s, *args = pattern - record_match(s, node, matched) + current_node_pattern: List[Node] = [] + record_match( + s, + node, + last_node, + matched_node_pattern, + match_map) if pattern[0] is not getattr: for subpattern, arg in zip(args, node.args): - record_match(subpattern, arg, matched) + record_match( + subpattern, + arg, + node, + current_node_pattern, + match_map) + if len(current_node_pattern) > 1: + matched_node_pattern.append(tuple(current_node_pattern)) + else: + matched_node_pattern.append(current_node_pattern[0]) else: - matched.append(node) + matched_node_pattern.append(node) - cache_for_no_tensor_check: Dict[Node, bool] = dict() for node in reversed(graph.nodes): if node.name not in match_map and node.name not in all_matched: - for pattern, value in patterns.items(): - if is_match(modules, node, pattern): - matched: List[Any] = [] - record_match(pattern, node, matched) - for n in matched: - match_map[n.name] = ( - node, matched, pattern, value(node, modules), # type: ignore[operator] - qconfig_map[n.name]) - all_matched.add(n.name) - # break after finding the first match + for pattern, quantize_handler_cls in patterns.items(): + root_node_getter = root_node_getter_mapping.get(pattern, None) + if is_match(modules, node, pattern) and node.name not in match_map: + matched_node_pattern: List[Node] = [] + record_match( + pattern, + node, + node, + matched_node_pattern, + match_map) + quantize_handler = quantize_handler_cls( # type: ignore[operator] + matched_node_pattern, + modules, + root_node_getter) + last_node = node + # record the match for all nodes in the pattern + _recursive_record_node_in_match_map( + last_node, + match_map, + # we need to record all nodes in the matched pattern in the match_map + matched_node_pattern, + # this is a part of the value corresponding to the node + matched_node_pattern, + pattern, + quantize_handler, + qconfig_map[node.name]) break # add custom module instances to the match result @@ -146,7 +196,7 @@ def record_match(pattern, node, matched): type(modules[node.target]) in custom_module_classes: custom_module_qconfig = qconfig_map[node.name] match_map[node.name] = ( - node, [node], None, CustomModuleQuantizeHandler(node, modules), + node, node, None, QuantizeHandler(node, modules, is_custom_module=True), custom_module_qconfig) def is_standalone_module(node_target: str, modules: Dict[str, torch.nn.Module]): @@ -162,10 +212,10 @@ def is_standalone_module(node_target: str, modules: Dict[str, torch.nn.Module]): (is_standalone_module(node.target, modules) or is_observed_standalone_module(modules[node.target])): # add node to matched nodes - custom_module_qconfig = qconfig_map[node.name] + standalone_module_qconfig = qconfig_map[node.name] match_map[node.name] = ( - node, [node], None, - StandaloneModuleQuantizeHandler(node, modules), - custom_module_qconfig) + node, node, None, + QuantizeHandler(node, modules, 
is_standalone_module=True), + standalone_module_qconfig) return match_map diff --git a/torch/ao/quantization/fx/pattern_utils.py b/torch/ao/quantization/fx/pattern_utils.py index bba17d730d6ac2..7c8c034108c4fd 100644 --- a/torch/ao/quantization/fx/pattern_utils.py +++ b/torch/ao/quantization/fx/pattern_utils.py @@ -8,7 +8,7 @@ from ..fake_quantize import FixedQParamsFakeQuantize # from .quantization_patterns import BinaryOpQuantizeHandler from ..observer import ObserverBase - +import copy # TODO(future PR): fix the typing on QuantizeHandler (currently a circular dependency) QuantizeHandler = Any @@ -25,7 +25,7 @@ def insert(fn): return insert def get_default_fusion_patterns() -> Dict[Pattern, QuantizeHandler]: - return DEFAULT_FUSION_PATTERNS + return copy.copy(DEFAULT_FUSION_PATTERNS) DEFAULT_QUANTIZATION_PATTERNS = OrderedDict() @@ -47,15 +47,15 @@ def insert(fn): # Get patterns for both static quantization and qat def get_default_quant_patterns() -> Dict[Pattern, QuantizeHandler]: - return DEFAULT_QUANTIZATION_PATTERNS + return copy.copy(DEFAULT_QUANTIZATION_PATTERNS) # a map from pattern to output activation post process constructor # e.g. torch.sigmoid -> default_affine_fixed_qparam_fake_quant def get_default_output_activation_post_process_map(is_training) -> Dict[Pattern, ObserverBase]: if is_training: - return DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP + return copy.copy(DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP) else: - return DEFAULT_OUTPUT_OBSERVER_MAP + return copy.copy(DEFAULT_OUTPUT_OBSERVER_MAP) # Example use of register pattern function: # @register_fusion_pattern(torch.nn.ReLU, (torch.nn.BatchNorm2d, torch.nn.Conv2d))) @@ -63,3 +63,27 @@ def get_default_output_activation_post_process_map(is_training) -> Dict[Pattern, # def __init__(...): # ... # + +def sorted_patterns_dict(patterns_dict: Dict[Pattern, QuantizeHandler]) -> Dict[Pattern, QuantizeHandler]: + """ + Return a sorted version of the patterns dictionary such that longer patterns are matched first, + e.g. match (F.relu, F.linear) before F.relu. + This works for current use cases, but we may need to have a more clever way to sort + things to address more complex patterns + """ + + def get_len(pattern): + """ this will calculate the length of the pattern by counting all the entries + in the pattern. 
+ this will make sure (nn.ReLU, (nn.BatchNorm, nn.Conv2d)) comes before + (nn.BatchNorm, nn.Conv2d) so that we can match the former first + """ + len = 0 + if isinstance(pattern, tuple): + for item in pattern: + len += get_len(item) + else: + len += 1 + return len + + return OrderedDict(sorted(patterns_dict.items(), key=lambda kv: -get_len(kv[0]) if isinstance(kv[0], tuple) else 1)) diff --git a/torch/ao/quantization/fx/prepare.py b/torch/ao/quantization/fx/prepare.py index 3c50565d60b856..f3a490258d1451 100644 --- a/torch/ao/quantization/fx/prepare.py +++ b/torch/ao/quantization/fx/prepare.py @@ -30,11 +30,12 @@ from .quantization_patterns import ( QuantizeHandler, - CustomModuleQuantizeHandler, - StandaloneModuleQuantizeHandler, ) -from .quantization_types import Pattern +from .quantization_types import ( + Pattern, + NodePattern +) from ._equalize import ( is_equalization_observer, @@ -48,7 +49,7 @@ from .pattern_utils import ( MatchResult, - get_default_quant_patterns, + sorted_patterns_dict, ) from .match_utils import ( @@ -60,7 +61,7 @@ get_custom_module_class_keys, all_node_args_have_no_tensors, assert_and_get_unique_device, - node_bool_tensor_arg_indexes, + get_non_observable_arg_indexes_and_types, get_new_attr_name_with_prefix, NON_QUANTIZABLE_WEIGHT_OPS, WEIGHT_INDEX_DICT, @@ -77,7 +78,6 @@ ) from ..utils import ( - get_combined_dict, get_qconfig_dtypes, get_swapped_custom_module_class, activation_is_statically_quantized, @@ -89,11 +89,16 @@ get_pattern_to_dtype_configs, get_pattern_to_input_type_to_index, get_module_to_qat_module, + get_native_quant_patterns, + get_fusion_pattern_to_root_node_getter, ) from typing import Any, Callable, Dict, List, Optional, Tuple, Union, Set from collections import defaultdict +# list of dtypes to not add observers to +DO_NOT_OBS_DTYPE_LIST = [int, float, torch.bool, None] + def is_activation_post_process_node(node: Node, modules: Dict[str, torch.nn.Module]) -> bool: return isinstance(node, torch.fx.Node) and node.op == "call_module" and \ is_activation_post_process(modules[str(node.target)]) @@ -125,7 +130,7 @@ def node_arg_is_bias(node: Node, arg: Any) -> bool: def is_input_arg_dtype_supported_by_backend( arg: Argument, node: Node, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], dtype_config: Dict[str, torch.dtype], ) -> bool: """ Check if the configured qconfig for the argument @@ -152,7 +157,7 @@ def is_input_arg_dtype_supported_by_backend( def is_output_dtype_supported_by_backend( node: Node, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], dtype_config: Dict[str, torch.dtype], ) -> bool: """ Check if the configured qconfig for the output @@ -169,15 +174,15 @@ def is_observer_in_same_graph(node, modules, node_name_to_target_dtype): in a different place rather than not observed. 
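A minimal sketch of the ordering this produces, assuming a toy registry (the handler strings stand in for the real QuantizeHandler/FuseHandler classes): nested tuple patterns with more leaf entries sort ahead of their sub-patterns, so the longest fusion is tried first.

```python
from collections import OrderedDict
import torch.nn as nn

def _pattern_len(pattern):
    # count every leaf entry in a (possibly nested) pattern tuple
    if isinstance(pattern, tuple):
        return sum(_pattern_len(p) for p in pattern)
    return 1

# toy registry: values would normally be handler classes, not strings
patterns = {
    nn.Conv2d: "conv_handler",
    (nn.BatchNorm2d, nn.Conv2d): "conv_bn_handler",
    (nn.ReLU, (nn.BatchNorm2d, nn.Conv2d)): "conv_bn_relu_handler",
}

ordered = OrderedDict(
    sorted(patterns.items(),
           key=lambda kv: -_pattern_len(kv[0]) if isinstance(kv[0], tuple) else 1))

# match order: (ReLU, (BN, Conv2d)) -> (BN, Conv2d) -> Conv2d,
# so the longest fusion is tried before its sub-patterns
print(list(ordered.keys()))
```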
""" node_output_dtype = get_arg_target_dtype_as_output(node, modules, node_name_to_target_dtype) - if isinstance(node.args[0], Node): + if len(node.args) > 0 and isinstance(node.args[0], Node): if node_output_dtype == torch.quint8 and node.args[0].op == 'placeholder': return False return True def is_pattern_dtype_config_supported_by_backend( pattern: Optional[Pattern], - matched_nodes: Optional[List[Node]], - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + matched_node_pattern: Optional[NodePattern], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], backend_config_dict: Optional[Dict[str, Any]] ) -> bool: """ Check is the dtype configuration of a pattern is supported by @@ -185,14 +190,15 @@ def is_pattern_dtype_config_supported_by_backend( """ if backend_config_dict is None or pattern is None: return True - assert matched_nodes is not None and len(matched_nodes) >= 1 + assert matched_node_pattern is not None and len(matched_node_pattern) >= 1 pattern_to_dtype_configs = get_pattern_to_dtype_configs(backend_config_dict) dtype_configs: List[Dict[str, torch.dtype]] = pattern_to_dtype_configs.get(pattern, []) - # TODO: this only checks one input and one output, need to generalize to multiple + # TODO: this only works for one input and one output patterns, need to generalize to multiple # inputs/output - input_node = matched_nodes[-1] - output_node = matched_nodes[0] + root_node = _default_root_node_getter(matched_node_pattern) + input_node = root_node + output_node = matched_node_pattern[0] for dtype_config in dtype_configs: # check if arg dtype are supported supported = True @@ -243,6 +249,19 @@ def qat_swap_modules( module_to_qat_module: Dict[Callable, Callable]) -> None: convert(root, mapping=module_to_qat_module, inplace=True, remove_qconfig=False) +def add_matched_node_name_to_set(matched_node_pattern: NodePattern, s: Set[str]): + if isinstance(matched_node_pattern, Node): + s.add(matched_node_pattern.name) + elif isinstance(matched_node_pattern, (list, tuple)): + for maybe_node in matched_node_pattern: + add_matched_node_name_to_set(maybe_node, s) + +# this is temporary, will be removed soon +def _default_root_node_getter(node_pattern): + while not isinstance(node_pattern, Node): + node_pattern = node_pattern[-1] + return node_pattern + # TODO: remove observed_op, looks like it's not used def insert_observer( node: Node, @@ -283,7 +302,7 @@ def get_target_activation_dtype_for_node( qhandler: Optional[QuantizeHandler], modules: Dict[str, torch.nn.Module], cache_for_no_tensor_check: Dict[Node, bool], -) -> Dict[str, Optional[torch.dtype]]: +) -> Dict[str, Optional[Union[torch.dtype, type]]]: """ Returns the expected dtype of the input and output of this node after convert. 
If the value is not None, it represents the dtype of the @@ -329,7 +348,7 @@ def get_target_activation_dtype_for_node( # get qconfig to determine the eventual dtype of this node if qconfig is not None: - if qhandler is not None and qhandler.input_output_observed() and qhandler.is_output_quantized(qconfig): + if qhandler is not None and qhandler.input_output_observed(): act_dtype, weight_dtype, act_compute_dtype = \ get_qconfig_dtypes(qconfig) bias_dtype = torch.float16 \ @@ -337,6 +356,7 @@ def get_target_activation_dtype_for_node( else torch.float return { "input_activation_dtype": act_dtype, + "input_activation_compute_dtype": act_compute_dtype, "weight_dtype": weight_dtype, "bias_dtype": bias_dtype, "output_activation_dtype": act_dtype, @@ -372,8 +392,8 @@ def get_target_activation_dtype_for_node( def get_arg_target_dtype_as_output( arg: Node, modules: Dict[str, torch.nn.Module], - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], -) -> Optional[torch.dtype]: + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], +) -> Optional[Union[torch.dtype, type]]: """ Get the target output activation dtype for the argumnet in the original graph, skipping inserted observers We are assuming that the observers are inserted correctly, and the dtype for @@ -391,8 +411,8 @@ def get_arg_target_dtype_as_input_to_node( arg: Node, node: Node, modules: Dict[str, torch.nn.Module], - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], -) -> Optional[torch.dtype]: + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], +) -> Optional[Union[torch.dtype, type]]: """ Get the target argument dtype for the argument `arg`, as input to node `node` """ @@ -410,6 +430,24 @@ def get_arg_target_dtype_as_input_to_node( else: return node_name_to_target_dtype[node.name]["bias_dtype"] +def get_arg_target_compute_dtype_as_input_to_node( + arg: Node, + node: Node, + modules: Dict[str, torch.nn.Module], + node_name_to_target_dtype: Dict[str, Dict[str, Union[torch.dtype, type, None]]], +) -> Union[torch.dtype, type, None]: + """ Get the target argument dtype for the argument `arg`, as input + to node `node` + """ + assert isinstance(arg, Node) + is_weight = node_arg_is_weight(node, arg) + is_bias = node_arg_is_bias(node, arg) + is_activation = not is_weight and not is_bias + if is_activation and \ + "input_activation_compute_dtype" in node_name_to_target_dtype[node.name]: + return node_name_to_target_dtype[node.name]["input_activation_compute_dtype"] + else: + return None def maybe_insert_input_observer_for_arg_or_kwarg( node: Union[Node, Any], @@ -418,7 +456,7 @@ def maybe_insert_input_observer_for_arg_or_kwarg( model: torch.nn.Module, modules: Dict[str, torch.nn.Module], graph: Graph, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], qhandler: Optional[QuantizeHandler], prepare_custom_config_dict: Dict[str, Any], backend_config_dict: Optional[Dict[str, Any]], @@ -447,8 +485,7 @@ def maybe_insert_input_observer_for_arg_or_kwarg( # default (no observer) new_arg = arg - is_standalone_module = qhandler is not None and \ - isinstance(qhandler, StandaloneModuleQuantizeHandler) + is_standalone_module = qhandler is not None and qhandler.is_standalone_module() assert qconfig is not None if not is_standalone_module: # regular flow for most nodes, except standalone modules @@ -461,6 +498,9 @@ def 
maybe_insert_input_observer_for_arg_or_kwarg( arg_as_output_target_dtype = get_arg_target_dtype_as_output(arg, modules, node_name_to_target_dtype) arg_as_input_target_dtype = get_arg_target_dtype_as_input_to_node(arg, node, modules, node_name_to_target_dtype) + arg_as_input_target_compute_dtype = \ + get_arg_target_compute_dtype_as_input_to_node( + arg, node, modules, node_name_to_target_dtype) needs_obs = ( # if the dtypes are different, we need an observer (arg_as_output_target_dtype != arg_as_input_target_dtype) and @@ -469,10 +509,16 @@ def maybe_insert_input_observer_for_arg_or_kwarg( # TODO(future PR): change this so a placeholder is inserted for # future dequants, to make the logic easier to understand (arg_as_input_target_dtype != torch.float) and - # if arg is a bool tensor or not a tensor, do not insert observer - (arg_as_output_target_dtype not in (torch.bool, None)) and + # if arg output dtype is in DO_NOT_OBS_DTYPE_LIST do not insert observer + (arg_as_output_target_dtype not in DO_NOT_OBS_DTYPE_LIST) and # if qconfig is reuse_input qconfig, we won't insert extra observer for input - not is_reuse_input_qconfig_ + not is_reuse_input_qconfig_ or + # need to add input observer for dynamic quantization + # only add observer for first input for now, we may need to extend + # qconfig_dict and backend_config_dict to support more general configurations + # of dynamic quantization, e.g. dynamically quantizing second input, third + # input etc. + (arg_as_input_target_compute_dtype in [torch.quint8, torch.int8, torch.float16]) and arg is node.args[0] ) else: @@ -544,7 +590,7 @@ def maybe_insert_input_observers_for_node( model: torch.nn.Module, modules: Dict[str, torch.nn.Module], graph: Graph, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], qhandler: Optional[QuantizeHandler], prepare_custom_config_dict: Dict[str, Any], backend_config_dict: Optional[Dict[str, Any]], @@ -599,7 +645,7 @@ def maybe_insert_input_equalization_observers_for_node( model: torch.nn.Module, modules: Dict[str, torch.nn.Module], graph: Graph, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], is_branch: bool, ) -> None: """ @@ -643,7 +689,7 @@ def maybe_insert_output_observer_for_node( modules: Dict[str, torch.nn.Module], graph: Graph, matches: Dict[str, MatchResult], - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], matched_pattern: Any, qhandler: Optional[QuantizeHandler], is_qat: bool, @@ -654,7 +700,7 @@ def maybe_insert_output_observer_for_node( If `node` does not need an output observer, returns None. 
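The observer-insertion condition above is hard to read inline, so here is a simplified standalone predicate with the same intent; all names are illustrative, and the real function operates on `torch.fx.Node` objects and the per-node target-dtype map rather than bare dtypes.

```python
import torch

DO_NOT_OBS_DTYPE_LIST = [int, float, torch.bool, None]

def needs_input_observer(arg_out_dtype, arg_in_dtype, compute_dtype, is_first_arg,
                         is_reuse_input_qconfig=False):
    """Simplified sketch: the static-quant condition plus the new dynamic-quant
    branch, where dynamically quantized ops observe only their first input
    based on the compute dtype taken from the qconfig."""
    static_case = (
        arg_out_dtype != arg_in_dtype
        and arg_in_dtype != torch.float
        and arg_out_dtype not in DO_NOT_OBS_DTYPE_LIST
        and not is_reuse_input_qconfig
    )
    dynamic_case = (
        compute_dtype in (torch.quint8, torch.int8, torch.float16)
        and is_first_arg
    )
    return static_case or dynamic_case

# float input feeding a dynamically quantized op: observe the first arg only
assert needs_input_observer(torch.float, torch.float, torch.quint8, is_first_arg=True)
assert not needs_input_observer(torch.float, torch.float, torch.quint8, is_first_arg=False)
```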
""" - root_node, matched_nodes, pattern, qhandler, qconfig = matches.get( + root_node, _, pattern, qhandler, qconfig = matches.get( node.name, (None, None, None, None, None)) if qhandler is None: @@ -663,13 +709,10 @@ def maybe_insert_output_observer_for_node( assert qconfig is not None assert node.op != 'output', 'observer insertion for outputs is handled elsewhere' - is_standalone_module = qhandler is not None and \ - isinstance(qhandler, StandaloneModuleQuantizeHandler) + is_standalone_module = qhandler is not None and qhandler.is_standalone_module() dtype = node_name_to_target_dtype[node.name]["output_activation_dtype"] - should_insert_observer = \ - qhandler.should_insert_observer_for_output( - qconfig, is_qat) and dtype not in (torch.bool, None, torch.float) + should_insert_observer = dtype not in DO_NOT_OBS_DTYPE_LIST + [torch.float] # TODO(future PR): move the following logic to # should_insert_observer_for_output should_insert_observer = should_insert_observer and \ @@ -696,7 +739,7 @@ def maybe_insert_output_observer_for_node( def maybe_insert_observers_before_graph_output( graph_output_node: Node, output_quantized_idxs: List[int], - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], qconfig_map: Dict[str, QConfigAny], model: torch.nn.Module, modules: Dict[str, torch.nn.Module], @@ -725,7 +768,7 @@ def maybe_insert_observers_before_graph_output( def _recursive_maybe_replace_node_with_obs( maybe_node: Argument, target_dtype: torch.dtype, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], qconfig_map: Dict[str, QConfigAny], model: torch.nn.Module, modules: Dict[str, torch.nn.Module], @@ -796,8 +839,8 @@ def _recursive_maybe_replace_node_with_obs( def maybe_propagate_dtype_for_node( node: Node, - target_dtype: torch.dtype, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + target_dtype: Union[torch.dtype, type], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], matches: Dict[str, MatchResult], ) -> None: """ @@ -809,9 +852,9 @@ def maybe_propagate_dtype_for_node( node_name_to_target_dtype[node.name]["input_activation_dtype"] = target_dtype node_name_to_target_dtype[node.name]["output_activation_dtype"] = target_dtype # if this is a copy node, propagate to first arg - root_node, matched_nodes, pattern, qhandler, qconfig = matches.get( + root_node, _, pattern, qhandler, qconfig = matches.get( node.name, (None, None, None, None, None)) - if qhandler is not None and qhandler.is_general_tensor_shape_op(): + if qhandler is not None and qhandler.is_general_tensor_value_op(): prev_node = node.args[0] if isinstance(prev_node, Node): maybe_propagate_dtype_for_node( @@ -819,7 +862,7 @@ def maybe_propagate_dtype_for_node( def propagate_dtypes_for_known_nodes( graph: Graph, - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]], + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]], matches: Dict[str, MatchResult], ) -> None: """ @@ -833,11 +876,26 @@ def propagate_dtypes_for_known_nodes( replace this with a better way to reason about dtypes of tensors. 
""" for node in graph.nodes: - bool_arg_idxs = node_bool_tensor_arg_indexes(node) - for bool_arg_idx in bool_arg_idxs: - cur_node = node.args[bool_arg_idx] - maybe_propagate_dtype_for_node( - cur_node, torch.bool, node_name_to_target_dtype, matches) + non_observable_arg_dict = get_non_observable_arg_indexes_and_types(node) + + for arg_type in non_observable_arg_dict: + non_observable_indices = non_observable_arg_dict[arg_type](node) + + for index in non_observable_indices: + arg = node.args[index] + + # when an argument is a tuple, it does not show up as another node so we need to go through + # all elements of the tuple manually + if isinstance(arg, tuple) or isinstance(arg, list): + arg_list = list(arg) + else: + arg_list = [arg] + + for cur_arg in arg_list: + # hard coded arguments show up but aren't `Node` typed and do not need dtype propgated + if isinstance(cur_arg, torch.fx.node.Node): + maybe_propagate_dtype_for_node( + cur_arg, arg_type, node_name_to_target_dtype, matches) def maybe_make_input_output_share_observers( node: Node, @@ -1021,7 +1079,7 @@ def insert_observers_for_model( # } # # TODO: rename this to node_name_to_target_dtype_info - node_name_to_target_dtype: Dict[str, Dict[str, Optional[torch.dtype]]] = defaultdict(dict) + node_name_to_target_dtype: Dict[str, Dict[str, Optional[Union[torch.dtype, type]]]] = defaultdict(dict) cache_for_no_tensor_check: Dict[Node, bool] = dict() inputs_seen_counter = 0 @@ -1033,7 +1091,7 @@ def insert_observers_for_model( # other nodes output dtype is specified by the qconfig modules = dict(model.named_modules(remove_duplicate=False)) for node in model.graph.nodes: - root_node, matched_nodes, pattern, qhandler, qconfig = matches.get( + root_node, _, pattern, qhandler, qconfig = matches.get( node.name, (None, None, None, None, None)) node_name_to_target_dtype[node.name] = get_target_activation_dtype_for_node( node, qconfig, inputs_seen_counter, outputs_seen_counter, @@ -1074,7 +1132,7 @@ def insert_observers_for_model( elif node.op in ('call_module', 'call_method', 'call_function', 'output'): # check for matches - root_node, matched_nodes, pattern, qhandler, qconfig = matches.get( + last_node, matched_node_pattern, pattern, qhandler, qconfig = matches.get( node.name, (None, None, None, None, None)) equalization_qconfig = equalization_config_map.get(node.name, None) @@ -1093,15 +1151,14 @@ def insert_observers_for_model( ) is_supported_by_backend = is_pattern_dtype_config_supported_by_backend( - pattern, matched_nodes, node_name_to_target_dtype, backend_config_dict) + pattern, matched_node_pattern, node_name_to_target_dtype, backend_config_dict) if not skip_inserting_observers and is_supported_by_backend: modules = dict(model.named_modules(remove_duplicate=False)) if node.op != 'output': - assert matched_nodes is not None + assert matched_node_pattern is not None # add matched nodes to the observed node name set - for n in matched_nodes: - observed_node_names.add(n.name) + add_matched_node_name_to_set(matched_node_pattern, observed_node_names) # This is currently only used for equalization. 
# Checks if the current node is in a branch in which the two @@ -1128,26 +1185,28 @@ def insert_observers_for_model( if user != node and is_user_quantized: is_quantized_branch = True - # this modifies node inplace - maybe_insert_input_observers_for_node( - node, qconfig, model, modules, graph, - node_name_to_target_dtype, - qhandler, - prepare_custom_config_dict, - backend_config_dict) - - # Insert equalization input observers if needed - maybe_insert_input_equalization_observers_for_node( - node, equalization_qconfig, model, modules, graph, - node_name_to_target_dtype, is_quantized_branch) - - is_last_node_of_pattern = root_node is node + # TODO: this only works for sequential fusion right now, extend it + # it to automatically detect all input nodes based on the pattern + # need to change find_matches function to return this information + root_node = _default_root_node_getter(matched_node_pattern) + is_input_node_of_the_pattern = node is root_node + if is_input_node_of_the_pattern: + # this modifies node inplace + maybe_insert_input_observers_for_node( + node, qconfig, model, modules, graph, + node_name_to_target_dtype, + qhandler, + prepare_custom_config_dict, + backend_config_dict) + + # Insert equalization input observers if needed + maybe_insert_input_equalization_observers_for_node( + node, equalization_qconfig, model, modules, graph, + node_name_to_target_dtype, is_quantized_branch) + + is_last_node_of_pattern = node is last_node is_general_tensor_value_op = \ (qhandler is not None and qhandler.is_general_tensor_value_op()) - - is_general_tensor_shape_op = \ - (qhandler is not None and qhandler.is_general_tensor_shape_op()) - is_reuse_input_qconfig_ = is_reuse_input_qconfig(qconfig) if is_last_node_of_pattern: @@ -1183,11 +1242,11 @@ def insert_observers_for_model( # to make all inputs and outputs use the first input's # observer if (is_general_tensor_value_op and is_observer_in_same_graph_) or \ - is_general_tensor_shape_op or is_reuse_input_qconfig_: + is_reuse_input_qconfig_: if not maybe_make_input_output_share_observers(node, model, modules): remove_output_observer(node, model, modules) - if isinstance(qhandler, CustomModuleQuantizeHandler): + if qhandler is not None and qhandler.is_custom_module(): swap_custom_module_to_observed(node, qconfig, modules, prepare_custom_config_dict) else: # output @@ -1226,11 +1285,11 @@ def run_prepare_fx_on_standalone_modules( """ for ( node_name, - (root_node, matched_nodes, pattern, qhandler, qconfig), + (root_node, _, pattern, qhandler, qconfig), ) in matches.items(): if qhandler is None: continue - elif not isinstance(qhandler, StandaloneModuleQuantizeHandler): + elif not qhandler.is_standalone_module(): continue sm_qconfig_dict, sm_prepare_config_dict, sm_backend_config_dict = \ @@ -1312,8 +1371,6 @@ def prepare( if equalization_qconfig_dict is None: equalization_qconfig_dict = {} - additional_quant_patterns = \ - prepare_custom_config_dict.get("additional_quant_pattern", {}) # mapping from a tuple of nodes in reverse order to uninitialized # QuantizeHandler subclass. 
For example, # { @@ -1324,13 +1381,14 @@ def prepare( # ((, ): # ), # } + # TODO: rename to pattern_to_quantize_handler patterns: Dict[Pattern, QuantizeHandler] = {} if backend_config_dict is None: - quant_patterns = get_default_quant_patterns() - patterns = get_combined_dict( - quant_patterns, additional_quant_patterns) + patterns = get_native_quant_patterns({}) + root_node_getter_mapping = {} else: patterns = get_pattern_to_quantize_handlers(backend_config_dict) + patterns = sorted_patterns_dict(patterns) # TODO: make WEIGHT_INDEX_DICT and BIAS_INDEX_DICT an argument to the functions that needs them # TODO: refactor this part to return WEIGHT_INDEX_DICT and BIAS_INDEX_DICT @@ -1350,27 +1408,27 @@ def prepare( else: index_dict[pattern] = [index] # type: ignore[index] + root_node_getter_mapping = \ + get_fusion_pattern_to_root_node_getter(backend_config_dict) + convert_dict_to_ordered_dict(qconfig_dict) convert_dict_to_ordered_dict(equalization_qconfig_dict) qconfig_dict = update_qconfig_for_fusion(model, qconfig_dict) equalization_qconfig_dict = update_qconfig_for_fusion(model, equalization_qconfig_dict) flattened_qconfig_dict = get_flattened_qconfig_dict(qconfig_dict) # TODO: support regex as well - propagate_qconfig_(model, flattened_qconfig_dict) + propagate_qconfig_(model, flattened_qconfig_dict, prepare_custom_config_dict) if is_qat: - additional_qat_module_mapping = prepare_custom_config_dict.get( - "additional_qat_module_mapping", {}) # this path will be deprecated after we fully migrate the convert path # of fbgemm/qnnpack to use the reference path, it will stay # here for a few months if backend_config_dict is None: - module_to_qat_module = get_combined_dict( - get_default_qat_module_mappings(), additional_qat_module_mapping) + module_to_qat_module = get_default_qat_module_mappings() else: module_to_qat_module = get_module_to_qat_module(backend_config_dict) qat_swap_modules(model, module_to_qat_module) - qconfig_dict = update_qconfig_for_qat(qconfig_dict, additional_qat_module_mapping) + qconfig_dict = update_qconfig_for_qat(qconfig_dict, {}) # mapping from fully qualified module name to module instance # for example, @@ -1396,8 +1454,8 @@ def prepare( custom_module_classes = get_custom_module_class_keys( prepare_custom_config_dict, "float_to_observed_custom_module_class") matches = find_matches( - model.graph, modules, patterns, qconfig_map, standalone_module_names, - standalone_module_classes, custom_module_classes) + model.graph, modules, patterns, root_node_getter_mapping, qconfig_map, + standalone_module_names, standalone_module_classes, custom_module_classes) input_quantized_idxs: List[int] = prepare_custom_config_dict.get( "input_quantized_idxs", []) diff --git a/torch/ao/quantization/fx/qconfig_utils.py b/torch/ao/quantization/fx/qconfig_utils.py index 80afa562a10f4a..188de460dbae2f 100644 --- a/torch/ao/quantization/fx/qconfig_utils.py +++ b/torch/ao/quantization/fx/qconfig_utils.py @@ -215,7 +215,6 @@ def check_is_valid_prepare_custom_config_dict(prepare_custom_config_dict: Option "non_traceable_module_class", "additional_fuser_method_mapping", "additional_qat__module_mapping", - "additional_fusion_pattern", "additional_quant_pattern", "input_quantized_idxs", "output_quantized_idxs", diff --git a/torch/ao/quantization/fx/quantization_patterns.py b/torch/ao/quantization/fx/quantization_patterns.py index 7f9947bccb39b1..486208d98bbc40 100644 --- a/torch/ao/quantization/fx/quantization_patterns.py +++ b/torch/ao/quantization/fx/quantization_patterns.py @@ -1,56 +1,25 @@ 
import torch -from torch.fx import GraphModule from torch.fx.graph import ( Node, - Graph, -) -from ..observer import ( - default_affine_fixed_qparams_observer, - default_symmetric_fixed_qparams_observer, -) - -from ..quantization_mappings import ( - get_static_quant_module_class, - get_dynamic_quant_module_class, -) -from ..utils import ( - get_swapped_custom_module_class, - activation_is_statically_quantized, - activation_is_int8_quantized, - weight_is_statically_quantized, - get_qconfig_dtypes, - activation_dtype, - get_qparam_dict, -) - -from torch.ao.quantization.quantize import ( - is_activation_post_process, ) -from .pattern_utils import ( - register_quant_pattern, - get_default_output_activation_post_process_map, - Pattern, -) -from ..utils import _parent_name from .utils import ( all_node_args_have_no_tensors, - quantize_node, - get_per_tensor_qparams, - get_linear_prepack_op_for_dtype, - create_qparam_nodes, - get_qconv_prepack_op, - get_qconv_op, - create_node_from_old_node_preserve_meta, ) - -from ..qconfig import QConfigAny +from .quantization_types import ( + Pattern, + NodePattern, +) from abc import ABC -import operator -import warnings +from typing import Any, Callable, Dict, Optional -from typing import Any, Callable, Dict, Union, Optional, Tuple, List +def _default_root_node_getter(node_pattern): + if node_pattern is None: + return node_pattern + while not isinstance(node_pattern, Node): + node_pattern = node_pattern[-1] + return node_pattern # ------------------------- # Pattern Registrations @@ -62,33 +31,37 @@ class QuantizeHandler(ABC): """ Base handler class for the quantizer patterns """ - def __init__(self, node: Node, modules: Dict[str, torch.nn.Module]): + def __init__( + self, + node_pattern: NodePattern, + modules: Dict[str, torch.nn.Module], + root_node_getter: Callable = None, + is_custom_module=False, + is_standalone_module=False): """ Records pattern information in __init__, which will be used in convert """ - # this is an indicator of whether all the inputs are Node or not - # since some op might be quantized differently depending on whether - # all inputs are tensors or not, e.g. add/mul - self.num_tensor_args = len(node.args) - self.all_node_args_are_tensors = True - # the last node of the matched pattern - self.last_node = node - - def _maybe_get_last_node_only_observer( - self, - modules: Dict[str, torch.nn.Module] - ) -> Optional[torch.nn.Module]: - """ - If the last node of the pattern is observed, return the observer - instance. Otherwise, return None. 
- """ - for maybe_obs_node, _ in self.last_node.users.items(): - if maybe_obs_node.op == 'call_module': - maybe_obs = modules[str(maybe_obs_node.target)] - if is_activation_post_process(maybe_obs): - return maybe_obs - return None - + self.node_pattern = node_pattern + self.modules = modules + if root_node_getter is None: + root_node_getter = _default_root_node_getter + self.root_node = root_node_getter(node_pattern) + self.is_custom_module_ = is_custom_module + self.is_standalone_module_ = is_standalone_module + self.num_tensor_args = 0 + # determine how many of the first two args are Tensors (versus scalars) + # this distinguishes things like "x + y" from "x + 2" or "2 + x" + if isinstance(self.root_node, Node): + cache_for_no_tensor_check: Dict[Node, bool] = dict() + for arg_idx in range(len(self.root_node.args)): + arg = self.root_node.args[arg_idx] + if isinstance(arg, Node) and ( + not all_node_args_have_no_tensors( + arg, self.modules, cache_for_no_tensor_check)): + self.num_tensor_args += 1 + + # TODO: can remove after the is_dynamic flag is defined, so that we can + # move embedding op to backend_config_dict def input_output_observed(self) -> bool: """ Returns True if the pattern matched to this qhandler could be @@ -100,44 +73,16 @@ def is_general_tensor_value_op(self) -> bool: """ Returns True if the operator works for both floating point and quantized input, and does some computation based on the input Tensor, + or the ops that only re-arranges the Tensor values or query some metadata + about the Tensor so we need to insert observer/fake_quant for the output of the - operator since the distribution of values is different for input and output - Tensors (for HistogramObserver) - while they share the same quantization parameters - Example: avgpool2d - """ - return False - - def is_general_tensor_shape_op(self) -> bool: - """ Similar to is_general_tensor_value_op, this is a check - for ops that works for both floating point and quantized input, - that only re-arranges the Tensor values or query some metadata about the Tensor - We don't insert observer/fake_quant for the output of these operators - Example: reshape, transpose, maxpool2d - """ - return False - - def should_insert_observer_for_output( - self, - qconfig: Any, - model_is_training: bool, - ) -> bool: - """ - Returns true if an observer should be inserted for the output of - the pattern matched to this QuantizeHandler instance during the - prepare step. - """ - # TODO(future PR): potentially clean up and deduplicate these - # mappings. - return self.all_node_args_are_tensors and self.input_output_observed() - - def should_mark_output_quantized_from_input_quantized_status( - self, - qconfig: QConfigAny - ) -> bool: - """ - Returns true if after convert, the output of the matched pattern is - quantized iff the first input is also quantized. 
+ operator (same observer instance as input) + since the distribution of values is different for input and output + Tensors (for HistogramObserver) while they share the same quantization + parameters + Example operator: avgpool2d, reshape, transpose, maxpool2d + Example observed operator: + observer_0 - avgpool2d - observer_0 (same observer instance as input) """ return False @@ -154,1510 +99,62 @@ def get_activation_ctr( """ return qconfig.activation - def is_output_quantized(self, qconfig): - """ Returns true if the output node of convert is quantized - when is_reference is False, we would return float node when a certain dtype - combination is not supported (since fbgemm/qnnpack only support certain dtype - combinations), so the output may be float, but when is_reference is True, - we support all dtype combinations so the output will always be quantized. - - TODO: This is fragile, whether output is quantized should not depend on `is_reference` since - we want to make sure whether a Tensor is quantized - should be the same in prepare and convert and is_reference - is only available in convert currently - - """ - return True - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - """ Convert the given node to a quantized node and insert - it to the quantized graph - """ - return NotImplemented - - -# Binary op configs - -# Supported combinations are: -# quant_type | activation (compute_type) | weight -# static quint8 qint8 - -# tuple (activation_dtype, weight_dtype, compute_dtype) -# these are supported types for common binary ops like add/mul etc. 
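A loose sketch of the "same observer instance on input and output" idea for value-passthrough ops such as avgpool2d: reusing one observer module guarantees the two sides end up with identical quantization parameters. This is only an illustration of the intent, not the graph rewrite itself.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import MinMaxObserver

# one observer instance shared by the input and output of a passthrough op
shared_obs = MinMaxObserver(dtype=torch.quint8)

x = torch.randn(1, 3, 8, 8)
shared_obs(x)                        # observe the input
y = nn.functional.avg_pool2d(x, 2)
shared_obs(y)                        # observe the output with the *same* instance
scale, zero_point = shared_obs.calculate_qparams()
```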
-all_dtypes = [ - (torch.qint8, torch.qint8, None), - (torch.quint8, torch.qint8, None), - (torch.float16, torch.float16, None), -] -fp16_dtypes = [ - (torch.float16, torch.float16, None) -] -int8_dtypes = [ - (torch.qint8, torch.qint8, None), - (torch.quint8, torch.qint8, None), -] -binary_op_supported_dtypes : Dict[Union[Callable, str], List[Tuple[torch.dtype, torch.dtype, None]]] = { - operator.add: all_dtypes, - torch.add: all_dtypes, - operator.mul: all_dtypes, - torch.mul: all_dtypes, - torch.bmm: fp16_dtypes, - torch.sub: fp16_dtypes, - operator.sub: fp16_dtypes, - torch.div: fp16_dtypes, - operator.truediv: fp16_dtypes, - torch.matmul: int8_dtypes, -} - -default_op_supported_dtypes = { - torch.nn.ConvTranspose1d: int8_dtypes, - torch.nn.ConvTranspose2d: int8_dtypes, - torch.nn.ELU: int8_dtypes, - torch.nn.LeakyReLU: int8_dtypes, - torch.nn.Hardswish: int8_dtypes, - torch.nn.InstanceNorm1d: int8_dtypes, - torch.nn.InstanceNorm2d: int8_dtypes, - torch.nn.InstanceNorm3d: int8_dtypes, - torch.nn.LayerNorm: all_dtypes, - torch.nn.SiLU: fp16_dtypes, - torch.nn.Mish: fp16_dtypes, - torch.nn.GELU: int8_dtypes, - torch.nn.Dropout: int8_dtypes, - torch.nn.Softmax: int8_dtypes, - torch.nn.functional.elu: int8_dtypes, - torch.nn.functional.hardswish: int8_dtypes, - torch.nn.functional.instance_norm: int8_dtypes, - torch.nn.functional.layer_norm: all_dtypes, - torch.nn.functional.leaky_relu: int8_dtypes, - torch.nn.functional.silu: fp16_dtypes, - torch.nn.functional.mish: fp16_dtypes, - torch.nn.functional.gelu: int8_dtypes, - torch.nn.functional.softmax: int8_dtypes, - torch.nn.functional.dropout: int8_dtypes, - torch.sum: fp16_dtypes, -} - -QAT_CONV_MODULE_CLASSES = \ - (torch.nn.qat.Conv2d, - torch.nn.qat.Conv3d, - torch.nn.intrinsic.qat.ConvBn1d, - torch.nn.intrinsic.qat.ConvBn2d, - torch.nn.intrinsic.qat.ConvBn3d, - torch.nn.intrinsic.qat.ConvBnReLU1d, - torch.nn.intrinsic.qat.ConvBnReLU2d, - torch.nn.intrinsic.qat.ConvBnReLU3d, - torch.nn.intrinsic.qat.ConvReLU2d, - torch.nn.intrinsic.qat.ConvReLU3d) - -########################## -# Helper Functions -########################## - -def _load_weight_qparams( - self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - key = prefix + "_weight_qparams" - if key in state_dict: - self._weight_qparams = state_dict[key] - state_dict.pop(key) + def is_custom_module(self): + return self.is_custom_module_ -def _save_weight_qparams(self, destination, prefix, keep_vars): - for attr_name in dir(self): - if "_weight_qparams" == attr_name and \ - isinstance(getattr(self, attr_name), dict): - weight_qparams = getattr(self, attr_name) - destination[prefix + attr_name] = weight_qparams + def is_standalone_module(self): + return self.is_standalone_module_ - -def _to_reference(float_module, weight_qparams): - """ Make a weighted float module (e.g. 
conv and linear )a reference module by - attaching _weight_qparams that records the qparams for weight - and change the name for the module so that it's recognized - when people print the model - """ - float_module._weight_qparams = weight_qparams - float_module._register_state_dict_hook(_save_weight_qparams) - float_module._register_load_state_dict_pre_hook(_load_weight_qparams, with_module=True) - - float_module_name = float_module._get_name() - - def _get_name(): - return float_module_name + "(Reference)" - - float_module._get_name = _get_name - -@register_quant_pattern(operator.add) -@register_quant_pattern(operator.sub) -@register_quant_pattern(operator.mul) -@register_quant_pattern(operator.truediv) -@register_quant_pattern(torch.add) -@register_quant_pattern(torch.sub) -@register_quant_pattern(torch.mul) -@register_quant_pattern(torch.div) -@register_quant_pattern(torch.bmm) -@register_quant_pattern((torch.nn.ReLU, operator.add)) -@register_quant_pattern((torch.nn.ReLU, operator.mul)) -@register_quant_pattern((torch.nn.ReLU, torch.add)) -@register_quant_pattern((torch.nn.ReLU, torch.mul)) -@register_quant_pattern((torch.nn.functional.relu, operator.add)) -@register_quant_pattern((torch.nn.functional.relu, operator.mul)) -@register_quant_pattern((torch.nn.functional.relu, torch.add)) -@register_quant_pattern((torch.nn.functional.relu, torch.mul)) -@register_quant_pattern((torch.relu, operator.add)) -@register_quant_pattern((torch.relu, operator.mul)) -@register_quant_pattern(torch.matmul) +# TODO: remove this class, this is still exposed in torch.quantization +# but we should be able to break bc class BinaryOpQuantizeHandler(QuantizeHandler): - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - self.relu_node = None - if ( - node.op == 'call_function' and - node.target in (torch.nn.functional.relu, torch.relu) - ) or ( - node.op == 'call_module' and - isinstance(modules[str(node.target)], torch.nn.ReLU) - ): - self.relu_node = node - node = node.args[0] # type: ignore[assignment] - self.binary_op_node = node - self.binary_op = node.target - - # determine how many of the first two args are Tensors (versus scalars) - # this distinguishes things like "x + y" from "x + 2" or "2 + x" - self.num_tensor_args = 0 - cache_for_no_tensor_check: Dict[Node, bool] = dict() - for arg_idx in range(len(self.binary_op_node.args)): - arg = self.binary_op_node.args[arg_idx] - if isinstance(arg, Node) and (not all_node_args_have_no_tensors(arg, modules, cache_for_no_tensor_check)): - self.num_tensor_args += 1 - self.all_node_args_are_tensors = \ - (self.num_tensor_args == len(self.binary_op_node.args)) - - def should_insert_observer_for_output( - self, - qconfig: Any, - model_is_training: bool, - ) -> bool: - """ - Returns true if an observer should be inserted for the output of - the pattern matched to this QuantizeHandler instance during the - prepare step. 
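The renaming trick described above can be reproduced on any module by overriding `_get_name`; the snippet below only illustrates the behavior, it is not the removed helper itself.

```python
import torch.nn as nn

conv = nn.Conv2d(3, 8, 3)
float_module_name = conv._get_name()

def _get_name():
    return float_module_name + "(Reference)"

# instance attribute shadows the class method, changing how the module prints
conv._get_name = _get_name
print(conv)  # Conv2d(Reference)(3, 8, kernel_size=(3, 3), stride=(1, 1))
```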
- """ - dtypes = get_qconfig_dtypes(qconfig) - if not (self.binary_op in binary_op_supported_dtypes and dtypes in binary_op_supported_dtypes[self.binary_op]): - return False - if self.num_tensor_args == 1: - return True - elif self.all_node_args_are_tensors and self.input_output_observed(): - return True - else: - return False - - def is_general_tensor_value_op(self) -> bool: - return self.num_tensor_args == 1 + pass - def input_output_observed(self): - # for x + y where x and y are scalars, we do not observe anything - return self.num_tensor_args > 0 - - def is_output_quantized(self, qconfig): - dtypes = get_qconfig_dtypes(qconfig) - return self.binary_op in binary_op_supported_dtypes and \ - dtypes in binary_op_supported_dtypes[self.binary_op] - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - - if self.num_tensor_args == 0: - # example: x + y, when x and y are scalars - return quantized_graph.node_copy( - node, load_arg(quantized=None)) - - dtypes = get_qconfig_dtypes(qconfig) - - act_dtype = activation_dtype(qconfig) - dtypes = get_qconfig_dtypes(qconfig) - if act_dtype == torch.float or \ - not (self.binary_op in binary_op_supported_dtypes and dtypes in binary_op_supported_dtypes[self.binary_op]): - if self.relu_node: - op_out = quantized_graph.node_copy(self.binary_op_node, load_arg(quantized=torch.float)) - relu_args = [op_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) - else: - return quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - else: - if self.num_tensor_args == 2: - # make sure both inputs are quantized to act_dtype - load_arg(quantized={0: act_dtype, 1: act_dtype})(self.binary_op_node.args) - args = load_arg(quantized=torch.float)(self.binary_op_node.args) - kwargs = load_arg(quantized=torch.float)(self.binary_op_node.kwargs) - op_out = quantized_graph.node_copy(self.binary_op_node, load_arg(quantized=torch.float)) - - def modified_load_arg(n: Node): - if n.name == self.binary_op_node.name: - return op_out - else: - return load_arg(quantized=torch.float)(n) - - if self.relu_node: - op_out = quantized_graph.node_copy(self.relu_node, modified_load_arg) - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - return quantize_node( - op_out, activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) - -@register_quant_pattern(torch.cat) class CatQuantizeHandler(QuantizeHandler): - def is_general_tensor_value_op(self) -> bool: - return True - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - if not self.all_node_args_are_tensors: - return NotImplemented - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: 
- activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - # make sure the first argument is quantized to act_dtype - load_arg(quantized={0: act_dtype})(node.args) - args = list(load_arg(quantized=torch.float)(node.args)) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, - activation_post_process, - node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False) + pass -# handle conv, maybe followed by relu -# NB: matching order is reversed, that is we match from the bottom of this list to the beginning -@register_quant_pattern(torch.nn.Conv1d) -@register_quant_pattern(torch.nn.Conv2d) -@register_quant_pattern(torch.nn.Conv3d) -@register_quant_pattern(torch.nn.functional.conv1d) -@register_quant_pattern(torch.nn.functional.conv2d) -@register_quant_pattern(torch.nn.functional.conv3d) -# TODO: add qat.Conv1d -@register_quant_pattern(torch.nn.qat.Conv2d) -@register_quant_pattern(torch.nn.qat.Conv3d) -@register_quant_pattern(torch.nn.intrinsic.ConvReLU1d) -@register_quant_pattern(torch.nn.intrinsic.ConvReLU2d) -@register_quant_pattern(torch.nn.intrinsic.ConvReLU3d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBn1d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBn2d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBn3d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBnReLU1d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBnReLU2d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvBnReLU3d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvReLU2d) -@register_quant_pattern(torch.nn.intrinsic.qat.ConvReLU3d) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.functional.conv1d)) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.functional.conv2d)) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.functional.conv3d)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.functional.conv1d)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.functional.conv2d)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.functional.conv3d)) -# just for error checks -@register_quant_pattern((torch.nn.ReLU, torch.nn.Conv1d)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.Conv2d)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.Conv3d)) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.Conv2d)) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.Conv3d)) -# TODO: rename Relu -> ReLU to be more consistent with other classes +# TODO: remove this class class ConvReluQuantizeHandler(QuantizeHandler): - def __init__(self, node: Node, modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - self.relu_node = None - if (node.op == 'call_function' and node.target is torch.nn.functional.relu) or \ - (node.op == 'call_module' and isinstance(modules[str(node.target)], torch.nn.ReLU)): - self.relu_node = node - node = node.args[0] # type: ignore[assignment] - self.conv_node = node - if node.op == "call_module": - self.conv = modules[str(self.conv_node.target)] - elif node.op == "call_function": - self.conv = node.target # type: ignore[assignment] - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - # 
Supported combinations are: - # quant_type | activation (compute_type) | weight - # static quint8 qint8 - - # tuple (activation_dtype, weight_dtype, compute_dtype) - supported_dtypes = [ - (torch.quint8, torch.qint8, None), - ] - - # TODO: is_reference option for conv module - dtypes = get_qconfig_dtypes(qconfig) - # leave the op unquantized if the dtype combination is not supported - if not is_reference and dtypes not in supported_dtypes: - warnings.warn( - "dtype combination: {} is not " - "supported by Conv " - "supported dtype combinations are: {}".format(dtypes, supported_dtypes)) - if self.relu_node: - conv_out = quantized_graph.node_copy(self.conv_node, load_arg(quantized=torch.float)) - relu_args = [conv_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) - else: - return quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - - activation_int8_quantized = activation_is_int8_quantized(qconfig) - - if self.conv_node.op == 'call_module': - # note that relu should already be fused into conv module in the fusion step - assert self.relu_node is None, 'conv module and relu fusion is not executed, ' \ - 'please make sure to run fusion before prepare' - output_activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert output_activation_post_process is not None - - module_types_supports_reference_pattern = [ - torch.nn.Conv1d, - torch.nn.Conv2d, - torch.nn.Conv3d, - torch.nn.intrinsic.ConvReLU1d, - torch.nn.intrinsic.ConvReLU2d, - torch.nn.intrinsic.ConvReLU3d, - ] - module_types_supports_reference_pattern.extend(list(QAT_CONV_MODULE_CLASSES)) - # We'll always produce reference pattern for torch.nn.Conv*d, - # will remove the else branch after we migrated all use cases - if is_reference or \ - type(self.conv) in module_types_supports_reference_pattern and \ - dtypes in [(torch.quint8, torch.qint8, None)]: - # produce dequant - float_op - quant pattern - dtype = torch.float - if activation_int8_quantized: - dtype = activation_dtype(qconfig) - activation = load_arg(quantized=dtype)(self.conv_node.args[0]) - args = load_arg(quantized=torch.float)(self.conv_node.args) - # Get the float conv and attach quantization scheme and quantization - # parameters of weight to the module - # and qparam is a dictionary of - # {"qscheme": ..., "scale": ..., "zero_point": ...} for per tensor quantization or - # {"qscheme": ..., "scale": ..., "zero_point": ..., "axis": ...} for per channel quantization - float_conv = self.conv - fused_conv = None - if isinstance( - float_conv, - QAT_CONV_MODULE_CLASSES): - # case 1. converting qat conv module to - # a float conv module, we need to attch - # weight fake_quant to the conv module, - # weight fake_quant is assumed to be run during - # QAT so we don't need to run it again here - float_conv = float_conv.to_float() # type: ignore[operator] - # change qat conv to conv - parent_name, name = _parent_name(self.conv_node.target) - setattr(modules[parent_name], name, float_conv) - if isinstance(float_conv, torch.nn.intrinsic._FusedModule): - fused_conv = float_conv - float_conv = fused_conv[0] - weight_post_process = self.conv.weight_fake_quant - else: - # case 2. 
converting a conv module/fused conv module - # to float conv module, we need to attach - # weight observer to the conv module and run it - # with conv weight - if isinstance(float_conv, torch.nn.intrinsic._FusedModule): - fused_conv = float_conv - float_conv = fused_conv[0] # type: ignore[index] - assert qconfig is not None - weight_post_process = qconfig.weight() - - # return early when we don't have a valid match - # this typically happens when we called the same conv multiple times in the - # same graph, and it is transformed in previous steps into a reference conv already - if type(float_conv) not in [torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]: - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ('call_module', self.conv_node.target, args, {}), - self.conv_node) - return op_out - - qconv_cls = get_static_quant_module_class( - type(float_conv), is_reference=True) - # run weight observer - # TODO: This is currently a hack for QAT to get the right shapes for scale and zero point. - # In the future, we should require the user to calibrate the model after calling prepare - weight_post_process(float_conv.weight) # type: ignore[operator] - weight_qparams = get_qparam_dict(weight_post_process) - # hardcoded for now, TODO: expose the api to user, - # we can have a map from module to reference module - # and allow user to register new ones - ref_conv = qconv_cls.from_float(float_conv, weight_qparams) # type: ignore[attr-defined] - # if the parent is a fused conv (Sequential), we can replace the first - # item to ref conv, otherwise we can update - # the conv instance in the module tree - if fused_conv is not None: - fused_conv[0] = ref_conv - parent_name, name = _parent_name(self.conv_node.target) - setattr(modules[parent_name], name, fused_conv) - else: - parent_name, name = _parent_name(self.conv_node.target) - setattr(modules[parent_name], name, ref_conv) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ('call_module', self.conv_node.target, args, {}), - self.conv_node) - if output_activation_post_process: - op_out = quantize_node( - op_out, - output_activation_post_process, - node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False) - return op_out - else: - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - additional_static_quant_mapping = convert_custom_config_dict.get("static", {}) - # 1. attach activation post process to module - self.conv.activation_post_process = output_activation_post_process - # 2. 
select quantized class - qconv_cls = get_static_quant_module_class( - type(self.conv), additional_static_quant_mapping, is_reference=is_reference) - quantized = qconv_cls.from_float(self.conv) - parent_name, name = _parent_name(self.conv_node.target) - setattr(modules[parent_name], name, quantized) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ( - 'call_module', - self.conv_node.target, - (load_arg(quantized=torch.quint8)(self.conv_node.args[0]),), - {}, - ), - self.conv_node) - else: # call_function - assert self.conv_node.op == "call_function" - if is_reference: - # make sure the input and weight are quantized to torch.quint8, torch.qint8, respectively - load_arg(quantized={0: torch.quint8, 1: torch.qint8})(self.conv_node.args) - args = load_arg(quantized=torch.float)(self.conv_node.args) - kwargs = load_arg(quantized=torch.float)(self.conv_node.kwargs) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", self.conv, args, kwargs), - self.conv_node) - if self.relu_node: - relu_args = [op_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) + pass - if activation_int8_quantized: - root_module = modules[''] - act_post_process_name = self.relu_node.name if self.relu_node else self.conv_node.name - act_post_process_node = self.relu_node if self.relu_node else self.conv_node - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - return quantize_node( - op_out, - activation_post_process, - act_post_process_node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False) - else: - # output for dynamically quantized conv op is not quantized - return op_out - else: - assert len(self.conv_node.args) >= 7, \ - "only conv2d calls with all arguments specified is supported right now in is_reference=False option" - # make sure the input and weight are quantized to torch.quint8, torch.qint8, respectively - args = load_arg(quantized={0: torch.quint8, 1: torch.qint8})(self.conv_node.args) - # pack weight - weight = load_arg(quantized=torch.qint8)(self.conv_node.args[1]) - other_args = load_arg(quantized=torch.float)(self.conv_node.args[2:]) - bias, stride, padding, dilation, groups = other_args - if self.conv == torch.nn.functional.conv1d: - # F.conv1d can take `int` as well as `list[int]` for stride, - # padding, dilation, but the prepack op cannot. Convert - # these to lists if needed. 
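# (Illustrative note, not in the original source: for a call such as
#  F.conv1d(x, w, b, 2, 1, 1, 1), the scalar stride/padding/dilation arguments
#  are wrapped below as [2], [1], [1] so that the quantized conv prepack op
#  always receives list-valued arguments.)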
- stride = [stride] if isinstance(stride, int) else stride - padding = [padding] if isinstance(padding, int) else padding - dilation = [dilation] if isinstance(dilation, int) else dilation - prepack_args = (weight, bias, stride, padding, dilation, groups) - prepack_op = get_qconv_prepack_op(self.conv) - packed_weight = quantized_graph.create_node( - "call_function", prepack_op, prepack_args, {}) - assert activation_int8_quantized, \ - "currently only static quantization is supported for conv" - # construct conv input - if activation_int8_quantized: - qconv_op = get_qconv_op(self.conv, self.relu_node is not None) - conv_input = load_arg(quantized=torch.quint8)(self.conv_node.args[0]) - - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - - scale, zero_point, _ = get_per_tensor_qparams(activation_post_process) - scale_node, zero_point_node = \ - create_qparam_nodes( - self.conv_node.name, scale, zero_point, modules, - quantized_graph, node_name_to_scope) - qconv_args = (conv_input, packed_weight, scale_node, zero_point_node) - kwargs = load_arg(quantized=torch.float)(self.conv_node.kwargs) - op = create_node_from_old_node_preserve_meta( - quantized_graph, - ('call_function', qconv_op, qconv_args, kwargs), - self.conv_node) - # Store the name of the fused op to get the path of node after fusion as well. - # TODO: may need to change the key to Node regenerate the map in each transformation, - # since we might not be able to rely on the name - node_name_to_scope[op.name] = node_name_to_scope[self.conv_node.name] - return op - else: - # conv2d_dyanmic branch - raise Exception("Only static quant is supported for conv") - -@register_quant_pattern(torch.nn.Linear) -@register_quant_pattern(torch.nn.functional.linear) -@register_quant_pattern(torch.nn.qat.Linear) -@register_quant_pattern(torch.nn.intrinsic.LinearReLU) -@register_quant_pattern(torch.nn.intrinsic.qat.LinearReLU) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.functional.linear)) -@register_quant_pattern((torch.nn.ReLU, torch.nn.functional.linear)) -# for error checks -@register_quant_pattern((torch.nn.ReLU, torch.nn.Linear)) -@register_quant_pattern((torch.nn.functional.relu, torch.nn.Linear)) +# TODO: remove this class class LinearReLUQuantizeHandler(QuantizeHandler): - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - self.relu_node = None - if (node.op == 'call_function' and node.target is torch.nn.functional.relu) or \ - (node.op == 'call_module' and isinstance(modules[str(node.target)], torch.nn.ReLU)): - self.relu_node = node - node = node.args[0] # type: ignore[assignment] - self.linear_node = node - if node.op == 'call_module': - self.linear = modules[str(self.linear_node.target)] - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - # Supported combinations are: - # quant_type | activation (compute_type) | weight - # static quint8 qint8 - # dynamic float32 (quint8) qint8 - # weight_only float32 float16 - # tuple (activation_dtype, weight_dtype, compute_dtype) - supported_dtypes = [ - (torch.quint8, torch.qint8, None), - (torch.float32, torch.qint8, torch.quint8), - 
(torch.float32, torch.float16, None), - # static float16 quantization - (torch.float16, torch.float16, None), - ] - dtypes = get_qconfig_dtypes(qconfig) - # leave the op unquantized if the dtype combination is not supported - if not is_reference and dtypes not in supported_dtypes: - warnings.warn( - "dtype combination: {} is not " - "supported by Linear " - "supported dtype combinations are: {}".format(dtypes, supported_dtypes)) - if self.relu_node: - op_out = quantized_graph.node_copy(self.linear_node, load_arg(quantized=torch.float)) - relu_args = [op_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) - else: - return quantized_graph.node_copy(node, load_arg(quantized=None)) - - activation_int8_quantized = activation_is_int8_quantized(qconfig) - activation_statically_quantized = activation_is_statically_quantized(qconfig) - weight_dtype = dtypes[1] - if self.linear_node.op == 'call_module': - - output_activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - - # note that relu should already be fused into linear modul in the fusion step - assert self.relu_node is None, 'linear module and relu fusion is not executed, ' \ - 'please make sure to run fusion before prepare' - # we'll always produce reference pattern for the following modules - # will remove the else branch after we migrated all use cases - module_allowlist = [ - torch.nn.Linear, - torch.nn.qat.Linear, - torch.nn.intrinsic.modules.fused.LinearReLU, - torch.nn.intrinsic.qat.modules.linear_relu.LinearReLU - ] - if is_reference or type(self.linear) in module_allowlist and dtypes in [(torch.quint8, torch.qint8, None)]: - # produce dequant - float_op - quant pattern - dtype = torch.float - if activation_int8_quantized: - dtype = activation_dtype(qconfig) - activation = load_arg(quantized=dtype)(self.linear_node.args[0]) - args = load_arg(quantized=torch.float)(self.linear_node.args) + pass - # Get the float linear and attach qscheme and qparams the the module - float_linear = self.linear - fused_linear = None - if isinstance(float_linear, (torch.nn.qat.Linear, torch.nn.intrinsic.qat.LinearReLU)): - float_linear = float_linear.to_float() - # change qat linear to linear - parent_name, name = _parent_name(self.linear_node.target) - setattr(modules[parent_name], name, float_linear) - # Attach weight fake quant to the linear module - if isinstance(float_linear, torch.nn.intrinsic.LinearReLU): - fused_linear = float_linear - float_linear = float_linear[0] - weight_post_process = self.linear.weight_fake_quant - else: - if isinstance(float_linear, torch.nn.intrinsic.LinearReLU): - fused_linear = float_linear - float_linear = self.linear[0] # type: ignore[index] - # Attach the weight observer to the module - weight_post_process = qconfig.weight() # type: ignore[union-attr] - - # Run weight observer - # TODO: This is currently a hack for QAT to get the right shapes for scale and zero point. 
- # In the future, we should require the user to calibrate the model after calling prepare - weight_post_process(float_linear.weight) # type: ignore[operator] - - weight_qparams = get_qparam_dict(weight_post_process) - # TODO: include the configuration in backend_config_dict - # we can have a map from module to reference module - # and allow user to register new ones - qlinear_cls = get_static_quant_module_class( - type(float_linear), is_reference=True) - ref_linear = qlinear_cls.from_float(float_linear, weight_qparams) - - # if the parent is a fused linear (Sequential), we can replace the first - # item to ref linear, otherwise we can update - # the linear instance in the module tree - if fused_linear is not None: - fused_linear[0] = ref_linear - else: - parent_name, name = _parent_name(self.linear_node.target) - setattr(modules[parent_name], name, ref_linear) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ('call_module', self.linear_node.target, args, {}), - self.linear_node) - if output_activation_post_process: - op_out = quantize_node( - op_out, - output_activation_post_process, - node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False) - return op_out - # non-reference option - else: - # 1. attach output activation post process to linear module - if output_activation_post_process: - self.linear.activation_post_process = output_activation_post_process - - # 2. select corresponding quantized linear class for the float linear class - if activation_int8_quantized: - additional_static_quant_mapping = convert_custom_config_dict.get("static", {}) - qlinear = get_static_quant_module_class( - type(self.linear), additional_static_quant_mapping) - else: - assert dtypes in [ - (torch.float32, torch.qint8, torch.quint8), - (torch.float32, torch.float16, None), - ], f"dtype {dtypes} not supported yet" - additional_dynamic_quant_mapping = convert_custom_config_dict.get("dynamic", {}) - qlinear = get_dynamic_quant_module_class(type(self.linear), additional_dynamic_quant_mapping) - - quantized = qlinear.from_float(self.linear) - parent_name, name = _parent_name(self.linear_node.target) - setattr(modules[parent_name], name, quantized) - # activation needs to be quantized for static quantization - dtype = torch.float - if activation_int8_quantized: - dtype = activation_dtype(qconfig) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ( - 'call_module', - self.linear_node.target, - (load_arg(quantized=dtype)(self.linear_node.args[0]),), {}, - ), - self.linear_node) - else: # call_function - assert self.linear_node.op == 'call_function' - if is_reference or self.linear_node.target == torch.nn.functional.linear and\ - dtypes in [(torch.quint8, torch.qint8, None)]: - quantized_input_dtypes = [torch.float, torch.float] - if activation_int8_quantized: - quantized_input_dtypes[0] = torch.quint8 - if weight_is_statically_quantized(qconfig): - quantized_input_dtypes[1] = torch.qint8 - args = load_arg(quantized=quantized_input_dtypes)(self.linear_node.args) - args = load_arg(quantized=torch.float)(self.linear_node.args) - kwargs = load_arg(quantized=torch.float)(self.linear_node.kwargs) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.linear, args, kwargs), - self.linear_node) - if self.relu_node: - relu_args = [op_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - op_out = 
create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) - - if activation_statically_quantized: - # quantize output for statically quantized linear op - root_module = modules[''] - act_post_process_name = self.relu_node.name if self.relu_node else self.linear_node.name - act_post_process_node = self.relu_node if self.relu_node else self.linear_node - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - return quantize_node( - op_out, - activation_post_process, - act_post_process_node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False, - output_prefix="") - else: - # output for dynamically quantized linear op is not quantized - return op_out - else: # non-reference option - # prepacking weights for static int8 quant and dynamic quant - if dtypes != (torch.float16, torch.float16, None): - # linear args - # (x, weight, bias, ...) - # TODO: the name should be weight is int8 quantized - weight_quantized = weight_is_statically_quantized(qconfig) - dtype = weight_dtype if weight_quantized else torch.float - linear_weight = load_arg(quantized=dtype)(self.linear_node.args[1]) - - # get other arguments - kwargs = {**load_arg(quantized=torch.float)(self.linear_node.kwargs)} - # all args after bias, including bias - other_args = load_arg(quantized=torch.float)(self.linear_node.args[2:]) - # bias might be either positional, or a keyword argument - if len(self.linear_node.args) > 2: - bias = load_arg(quantized=torch.float)(self.linear_node.args[2]) - other_args = other_args[1:] # remove the bias argument - else: - bias = kwargs.pop('bias', None) - - prepack_args = (linear_weight, bias) - prepack_op = get_linear_prepack_op_for_dtype(weight_dtype) - packed_weight = quantized_graph.create_node( - 'call_function', prepack_op, prepack_args, {}) - # construct linear input - if activation_int8_quantized: - qlinear_op = torch.ops.quantized.linear_relu if self.relu_node else torch.ops.quantized.linear - linear_input = load_arg(quantized=torch.quint8)(self.linear_node.args[0]) - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - scale, zero_point, _ = get_per_tensor_qparams(activation_post_process) - scale_node, zero_point_node = \ - create_qparam_nodes( - self.linear_node.name, scale, zero_point, modules, - quantized_graph, node_name_to_scope) - - qlinear_args = (linear_input, packed_weight, scale_node, zero_point_node) - op = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", qlinear_op, qlinear_args, kwargs), - self.linear_node) - # Store the name of the fused op to get the path of node after fusion as well. 
- # TODO: may need to change the key to Node regenerate the map in each transformation, - # since we might not be able to rely on the name - node_name_to_scope[op.name] = node_name_to_scope[self.linear_node.name] - return op - elif dtypes in [(torch.float32, torch.qint8, torch.quint8), - (torch.float32, torch.float16, None)]: - # choose linear dynamic or linear dynamic fp16 op based on weight dtype - if weight_dtype == torch.qint8: - if self.relu_node: - qlinear_op = torch.ops.quantized.linear_relu_dynamic - else: - qlinear_op = torch.ops.quantized.linear_dynamic - else: - if self.relu_node: - qlinear_op = torch.ops.quantized.linear_relu_dynamic_fp16 - else: - qlinear_op = torch.ops.quantized.linear_dynamic_fp16 - - linear_input = load_arg(quantized=torch.float)(self.linear_node.args[0]) - qlinear_args = (linear_input, packed_weight) # type: ignore[assignment] - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", qlinear_op, qlinear_args, kwargs), - self.linear_node) - # Store the name of the dynamic op to get the path of node after replacement as well. - # TODO: may need to change the key to Node regenerate the map in each transformation, - # since we might not be able to rely on the name - node_name_to_scope[op_out.name] = node_name_to_scope[self.linear_node.name] - return op_out - else: - assert dtypes == (torch.float16, torch.float16, None) - # TODO (refactor) this is duplicated, maybe have a helper function - if self.relu_node: - op_out = quantized_graph.node_copy(self.linear_node, load_arg(quantized=torch.float)) - relu_args = [op_out] - relu_args.extend(load_arg(quantized=torch.float)(self.relu_node.args[1:])) - relu_kwargs = load_arg(quantized=torch.float)(self.relu_node.kwargs) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_function", torch.nn.functional.relu, tuple(relu_args), relu_kwargs), - self.relu_node) - else: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantized_graph.create_node( - "call_method", "to", (op_out, torch.float16), {}) - -@register_quant_pattern(torch.nn.BatchNorm2d) -@register_quant_pattern(torch.nn.BatchNorm3d) -@register_quant_pattern(torch.nn.intrinsic.BNReLU2d) -@register_quant_pattern(torch.nn.intrinsic.BNReLU3d) +# TODO: remove this class class BatchNormQuantizeHandler(QuantizeHandler): - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - assert node.op == 'call_module' - self.bn_node = node - self.bn = modules[str(self.bn_node.target)] + pass - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - additional_static_quant_mapping = convert_custom_config_dict.get("static", {}) - # 1. 
attach activation post process to module - output_activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert output_activation_post_process is not None - if is_reference: - # produce dequant - float_op - quant pattern - dtype = activation_dtype(qconfig) - activation = load_arg(quantized=dtype)(self.bn_node.args[0]) - args = load_arg(quantized=torch.float)(self.bn_node.args) - op_out = create_node_from_old_node_preserve_meta( - quantized_graph, - ("call_module", self.bn_node.target, args, {}), - self.bn_node) - if output_activation_post_process: - op_out = quantize_node( - op_out, - output_activation_post_process, - node, - modules, - quantized_graph, - node_name_to_scope, - is_input=False) - return op_out - else: - self.bn.activation_post_process = output_activation_post_process - qbn_cls = get_static_quant_module_class(type(self.bn), additional_static_quant_mapping) - quantized = qbn_cls.from_float(self.bn) - parent_name, name = _parent_name(self.bn_node.target) - setattr(modules[parent_name], name, quantized) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ( - 'call_module', - self.bn_node.target, - load_arg(quantized=[0])(self.bn_node.args), - load_arg(quantized=torch.float)(self.bn_node.kwargs), - ), - self.bn_node) - -@register_quant_pattern(torch.nn.qat.Embedding) -@register_quant_pattern(torch.nn.qat.EmbeddingBag) -@register_quant_pattern(torch.nn.Embedding) -@register_quant_pattern(torch.nn.EmbeddingBag) +# TODO: remove this class class EmbeddingQuantizeHandler(QuantizeHandler): - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - - def input_output_observed(self) -> bool: - return False + pass - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - # Supported combinations are: - # quant_type | activation | weight | activation_compute_type - # weight_only | float32 | quint8 | None - # weight_only | float32 | quint4x2 | None - # tuple (activation_dtype, weight_dtype, compute_dtype) - supported_dtypes = [ - (torch.float32, torch.quint8, None), - (torch.float32, torch.quint4x2, None), - ] - assert node.op == 'call_module' - emb_node = node - dtypes = get_qconfig_dtypes(qconfig) - # leave the op unquantized if the dtype combination is not supported - if dtypes not in supported_dtypes: - warnings.warn( - "dtype combination: {} is not " - "supported by Embedding/EmbeddingBag, " - "supported dtype combinations are: {}".format(dtypes, supported_dtypes)) - return quantized_graph.node_copy(node, load_arg(quantized=None)) - - emb = modules[str(emb_node.target)] - qemb = get_static_quant_module_class(type(emb)) - quantized = qemb.from_float(emb) - parent_name, name = _parent_name(emb_node.target) - setattr(modules[parent_name], name, quantized) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ( - 'call_module', - emb_node.target, - load_arg(quantized=torch.float)(emb_node.args), - load_arg(quantized=torch.float)(emb_node.kwargs), - ), - emb_node) - -# TODO (maybe): merge with embedding quantize handler -@register_quant_pattern(torch.nn.GRUCell) -@register_quant_pattern(torch.nn.LSTMCell) -@register_quant_pattern(torch.nn.RNNCell) -@register_quant_pattern(torch.nn.LSTM) +# TODO: remove this class class 
RNNDynamicQuantizeHandler(QuantizeHandler): - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - - def input_output_observed(self) -> bool: - return False - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - # Supported combinations are: - # quant_type | activation | weight | activation_compute_type - # dynamic | float32 | qint8 | quint8 - # dynamic | float32 | float16 | None - # tuple (activation_dtype, weight_dtype, compute_dtype) - supported_dtypes = [ - (torch.float32, torch.qint8, torch.quint8), - (torch.float32, torch.float16, None), - ] - assert node.op == 'call_module' - dtypes = get_qconfig_dtypes(qconfig) - # leave the op unquantized if the dtype combination is not supported - if dtypes not in supported_dtypes: - warnings.warn( - "dtype combination: {} is not " - "supported by Embedding/EmbeddingBag, " - "supported dtype combinations are: {}".format(dtypes, supported_dtypes)) - return quantized_graph.node_copy(node, load_arg(quantized=None)) + pass - act_dtype, weight_dtype, compute_dtype = dtypes - activation = load_arg(quantized=act_dtype)(node.args[0]) - module = modules[str(node.target)] - qmodule_cls = get_dynamic_quant_module_class(type(module)) - qmodule = qmodule_cls.from_float(module) - parent_name, name = _parent_name(node.target) - setattr(modules[parent_name], name, qmodule) - return create_node_from_old_node_preserve_meta( - quantized_graph, - ( - 'call_module', - node.target, - load_arg(quantized=torch.float)(node.args), - load_arg(quantized=torch.float)(node.kwargs), - ), - node) - -ARGS_TO_SKIP = { - torch._ops.ops.quantized.hardswish: ['inplace'], - torch._ops.ops.quantized.elu: ['inplace'], - torch._ops.ops.quantized.dropout: ['inplace'], - torch._ops.ops.quantized.instance_norm: - ['running_mean', 'running_var', 'use_input_stats', 'momentum'], -} -@register_quant_pattern(torch.nn.ConvTranspose1d) -@register_quant_pattern(torch.nn.ConvTranspose2d) -@register_quant_pattern(torch.nn.ELU) -@register_quant_pattern(torch.nn.LeakyReLU) -@register_quant_pattern(torch.nn.Hardswish) -@register_quant_pattern(torch.nn.InstanceNorm1d) -@register_quant_pattern(torch.nn.InstanceNorm2d) -@register_quant_pattern(torch.nn.InstanceNorm3d) -@register_quant_pattern(torch.nn.LayerNorm) -@register_quant_pattern(torch.nn.SiLU) -@register_quant_pattern(torch.nn.Mish) -@register_quant_pattern(torch.nn.Dropout) -# we currently only support reference patterns for these ops so they have been removed -# until they receive a proper fp16 kernel. To use the reference pattern, use a custom qconfig -# @register_quant_pattern(torch.nn.GELU) -# @register_quant_pattern(torch.nn.Softmax) -@register_quant_pattern(torch.nn.functional.elu) -@register_quant_pattern(torch.nn.functional.hardswish) -@register_quant_pattern(torch.nn.functional.instance_norm) -@register_quant_pattern(torch.nn.functional.layer_norm) -@register_quant_pattern(torch.nn.functional.leaky_relu) -@register_quant_pattern(torch.nn.functional.silu) -@register_quant_pattern(torch.nn.functional.mish) -@register_quant_pattern(torch.nn.functional.dropout) -# we currently only support reference patterns for these ops so they have been removed -# until they receive a proper fp16 kernel. 
To use the reference pattern, use a custom qconfig -# @register_quant_pattern(torch.nn.functional.gelu) -# @register_quant_pattern(torch.nn.functional.softmax) -@register_quant_pattern(torch.sum) +# TODO: remove this class class DefaultNodeQuantizeHandler(QuantizeHandler): """ Common quantized op, first input and first output will be quantized """ - def __init__( - self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - if node.op == "call_function" or node.op == "call_method": - self.op = node.target - elif node.op == "call_module": - self.op = type(modules[str(node.target)]) - - def is_output_quantized(self, qconfig): - dtypes = get_qconfig_dtypes(qconfig) - return self.op in default_op_supported_dtypes and \ - dtypes in default_op_supported_dtypes[self.op] - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - if not self.all_node_args_are_tensors: - return NotImplemented - assert node.op in ['call_module', 'call_function'], 'Only call_module and ' + \ - 'call_function are handled in DefaultNode' - if convert_custom_config_dict is None: - convert_custom_config_dict = {} - additional_static_quant_mapping = convert_custom_config_dict.get("static", {}) - - dtypes = get_qconfig_dtypes(qconfig) - if not is_reference and dtypes not in default_op_supported_dtypes[self.op]: - warnings.warn( - "dtype combination: {} is not " - "supported by {} " - "supported dtype combinations are: {}".format(dtypes, self.op, default_op_supported_dtypes[self.op])) - return quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - - # We can produce reference for a dtypes including - # (torch.quint8, torch.qint8, torch.qint32, torch.float16) - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - # make sure the input is quantized to act_dtype - load_arg(quantized={0: act_dtype})(node.args) - args = load_arg(quantized=torch.float)(node.args) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - # swap float module to reference module (ConvTranspose) - float_module = modules[str(node.target)] if node.op == "call_module" else None - if type(float_module) in [torch.nn.ConvTranspose1d, torch.nn.ConvTranspose2d]: - ref_module_cls = get_static_quant_module_class(type(float_module), is_reference=True) + pass - weight_post_process = qconfig.weight() # type: ignore[union-attr] - weight_post_process(float_module.weight) # type: ignore[union-attr] - weight_qparams = get_qparam_dict(weight_post_process) - ref_module = ref_module_cls.from_float(float_module, weight_qparams) # type: ignore[attr-defined] - parent_name, name = _parent_name(node.target) - setattr(modules[parent_name], name, ref_module) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) - -@register_quant_pattern(torch.nn.Hardsigmoid, default_affine_fixed_qparams_observer) -@register_quant_pattern(torch.nn.functional.hardsigmoid, default_affine_fixed_qparams_observer) 
-@register_quant_pattern('hardsigmoid', default_affine_fixed_qparams_observer) -@register_quant_pattern('hardsigmoid_', default_affine_fixed_qparams_observer) -@register_quant_pattern(torch.nn.Sigmoid, default_affine_fixed_qparams_observer) -@register_quant_pattern(torch.sigmoid, default_affine_fixed_qparams_observer) -@register_quant_pattern('sigmoid', default_affine_fixed_qparams_observer) -@register_quant_pattern('sigmoid_', default_affine_fixed_qparams_observer) -@register_quant_pattern(torch.nn.Tanh, default_symmetric_fixed_qparams_observer) -@register_quant_pattern(torch.tanh, default_symmetric_fixed_qparams_observer) -@register_quant_pattern('tanh', default_symmetric_fixed_qparams_observer) -@register_quant_pattern('tanh_', default_symmetric_fixed_qparams_observer) +# TODO: remove this class class FixedQParamsOpQuantizeHandler(QuantizeHandler): - def __init__(self, - node: Node, - modules: Dict[str, torch.nn.Module]): - super().__init__(node, modules) - self.node = node - - def should_mark_output_quantized_from_input_quantized_status( - self, - qconfig: QConfigAny - ) -> bool: - # FixQParamOps are the same as CopyNode in int8 quantization - return activation_dtype(qconfig) in [torch.quint8, torch.qint8] - - # some qhandlers override the activations constructor - def get_activation_ctr(self, qconfig, pattern, is_training) -> Optional[Callable]: - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.quint8: - return get_default_output_activation_post_process_map(is_training).get( - pattern, qconfig.activation) - else: - return qconfig.activation + pass - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - # make sure the input is quantized to act_dtype - load_arg(quantized={0: act_dtype})(node.args) - args = load_arg(quantized=torch.float)(node.args) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) - -@register_quant_pattern(torch.nn.AdaptiveAvgPool1d) -@register_quant_pattern(torch.nn.AdaptiveAvgPool2d) -@register_quant_pattern(torch.nn.AdaptiveAvgPool3d) -@register_quant_pattern(torch.nn.AvgPool1d) -@register_quant_pattern(torch.nn.AvgPool2d) -@register_quant_pattern(torch.nn.AvgPool3d) -@register_quant_pattern(torch.nn.Hardtanh) -@register_quant_pattern(torch.nn.MaxPool1d) -@register_quant_pattern(torch.nn.MaxPool2d) -@register_quant_pattern(torch.nn.MaxPool3d) -@register_quant_pattern(torch.nn.ReLU) -@register_quant_pattern(torch.nn.ReLU6) -@register_quant_pattern(torch.adaptive_avg_pool1d) -@register_quant_pattern(torch.nn.functional.adaptive_avg_pool2d) -@register_quant_pattern(torch.nn.functional.adaptive_avg_pool3d) -@register_quant_pattern(torch.nn.functional.hardtanh) -@register_quant_pattern(torch.nn.functional.hardtanh_) -@register_quant_pattern(torch.nn.functional.interpolate) -@register_quant_pattern(torch.nn.functional.max_pool1d) 
-@register_quant_pattern(torch.nn.functional.max_pool2d) -@register_quant_pattern(torch.nn.functional.max_pool3d) -@register_quant_pattern(torch.nn.functional.relu) -@register_quant_pattern(torch.nn.functional.relu6) -@register_quant_pattern(torch.avg_pool1d) -@register_quant_pattern(torch._C._nn.avg_pool2d) -@register_quant_pattern(torch._C._nn.avg_pool3d) -@register_quant_pattern(torch.clamp) -@register_quant_pattern(torch.flatten) -@register_quant_pattern(torch.mean) -@register_quant_pattern(operator.floordiv) -@register_quant_pattern('clamp') -@register_quant_pattern('mean') -@register_quant_pattern('relu') -@register_quant_pattern('relu_') +# TODO: remove class CopyNodeQuantizeHandler(QuantizeHandler): - """ Operators that works on both float and quantized input - if input is quantized, the output Tensor shares - the same quantization parameter with input. - These ops will do computation on the input Tensor, e.g. average pool, so we will - insert extra observer/fake_quant for the output of these operators. - TODO: maybe rename this to TensorValueOpQuantizeHandler - """ - def should_mark_output_quantized_from_input_quantized_status( - self, - qconfig: QConfigAny - ) -> bool: - return True - - def is_general_tensor_value_op(self) -> bool: - return True - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: + pass - # when activation dtype is torch.float, the node does not require - # observation - # e.g. dynamic quantization or weight_only quantization - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - if activation_post_process is not None: - # make sure the input is quantized to act_dtype - load_arg(quantized={0: act_dtype})(node.args) - args = list(load_arg(quantized=torch.float)(node.args)) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, - activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) - else: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - -class CustomModuleQuantizeHandler(QuantizeHandler): - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - """ Convert a float custom module to quantized custom module - """ - assert node.op == 'call_module' - assert convert_custom_config_dict is not None - custom_module_class_mapping = convert_custom_config_dict.get("observed_to_quantized_custom_module_class", None) - assert custom_module_class_mapping is not None - observed_custom_module = modules[str(node.target)] - if activation_is_statically_quantized(qconfig): - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - assert activation_post_process is not None - observed_custom_module.activation_post_process = activation_post_process - quantized_custom_module_class = get_swapped_custom_module_class( 
- observed_custom_module, custom_module_class_mapping, qconfig) - quantized_custom_module = \ - quantized_custom_module_class.from_observed(observed_custom_module) - parent_name, name = _parent_name(node.target) - setattr(modules[parent_name], name, quantized_custom_module) - # hardcoded the quntized input to be None (take whatever is in the environemnt), - # we can extend this - # if there is a need, e.g. get the indexes of quantized inputs from some - # module attribute like module._QUANTIZED_INPUT_INDEXES - return quantized_graph.node_copy(node, load_arg(quantized=None)) - -@register_quant_pattern(torch.nn.Identity) -@register_quant_pattern(torch.transpose) -@register_quant_pattern(torch.repeat_interleave) -@register_quant_pattern(torch.squeeze) -@register_quant_pattern(torch.stack) -@register_quant_pattern(torch.unsqueeze) -@register_quant_pattern('contiguous') -@register_quant_pattern('detach') -@register_quant_pattern('detach_') -@register_quant_pattern('permute') -@register_quant_pattern('repeat') -@register_quant_pattern('repeat_interleave') -@register_quant_pattern('reshape') -@register_quant_pattern('resize_') -@register_quant_pattern('shape') -@register_quant_pattern('size') -@register_quant_pattern('squeeze') -@register_quant_pattern('squeeze_') -@register_quant_pattern('transpose') -@register_quant_pattern('unsqueeze') -@register_quant_pattern('unsqueeze_') -@register_quant_pattern('view') +# TODO: remove class GeneralTensorShapeOpQuantizeHandler(QuantizeHandler): - """ Operators that works on both float and quantized input - if input is quantized, the output Tensor shares - the same quantization parameter with input. - These ops only do rearrangement of Tensor values, for - example reshape, or just query the information about Tensor - e.g. size, and we do not insert extra observer/fake_quant - for the output of the operator. - """ - def is_general_tensor_shape_op(self) -> bool: - return True + pass - def should_mark_output_quantized_from_input_quantized_status( - self, - qconfig: QConfigAny - ) -> bool: - return True - - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - # when activation dtype is torch.float, the node does not require - # observation - # e.g. 
dynamic quantization or weight_only quantization - act_dtype = activation_dtype(qconfig) - if act_dtype == torch.float: - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return op_out - else: - activation_post_process = \ - self._maybe_get_last_node_only_observer(modules) - if activation_post_process is not None: - args = list(load_arg(quantized=torch.float)(node.args)) - kwargs = load_arg(quantized=torch.float)(node.kwargs) - op_out = quantized_graph.node_copy(node, load_arg(quantized=torch.float)) - return quantize_node( - op_out, - activation_post_process, - node, modules, quantized_graph, node_name_to_scope, is_input=False) - else: - return quantized_graph.node_copy(node, load_arg(quantized=torch.float)) +# TODO: not used, can be removed after torch.quantization namespace is deprecated +class CustomModuleQuantizeHandler(QuantizeHandler): + pass +# TODO: not used, can be removed after torch.quantization namespace is deprecated class StandaloneModuleQuantizeHandler(QuantizeHandler): - """ Converts an observed standalone module to quantized standalone module - by calling convert_fx on the observed standalone module. - """ - def convert(self, - node: Node, - qconfig: QConfigAny, - modules: Dict[str, torch.nn.Module], - quantized_graph: Graph, - node_name_to_scope: Dict[str, Tuple[str, type]], - load_arg: Callable, - is_reference: bool = False, - convert_custom_config_dict: Dict[str, Any] = None) -> Node: - assert node.op == 'call_module' - convert = torch.ao.quantization.quantize_fx._convert_standalone_module_fx # type: ignore[attr-defined] - # We know that observed standalone module is a GraphModule since - # it's produced by us - observed_standalone_module : GraphModule = modules[str(node.target)] # type: ignore[assignment] - input_quantized_idxs = observed_standalone_module._standalone_module_input_quantized_idxs.tolist() # type: ignore[operator] - quantized_standalone_module = convert(observed_standalone_module, is_reference=is_reference) - parent_name, name = _parent_name(node.target) - # update the modules dict - setattr(modules[parent_name], name, quantized_standalone_module) - modules[str(node.target)] = quantized_standalone_module - return quantized_graph.node_copy(node, load_arg(quantized=input_quantized_idxs)) + pass diff --git a/torch/ao/quantization/fx/quantized_fusion_patterns_and_replacements.py b/torch/ao/quantization/fx/quantized_fusion_patterns_and_replacements.py deleted file mode 100644 index ce23f17db71d8f..00000000000000 --- a/torch/ao/quantization/fx/quantized_fusion_patterns_and_replacements.py +++ /dev/null @@ -1,152 +0,0 @@ -import torch - -def relu_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.relu(x, inplace=True) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def relu_non_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.relu(x, inplace=False) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def relu_replacement(x, scale, zero_point): - x = torch.nn.functional.relu(x) - return x - -def relu_method_pattern(x, scale, zero_point): - x = x.dequantize() - x = x.relu() - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def relu_method_replacement(x, scale, zero_point): - x = x.relu() - return x - -def relu_inplace_method_pattern(x, scale, zero_point): - x = x.dequantize() - x = x.relu_() - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - 
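As a usage sketch (illustrative only; it assumes the public torch.fx.subgraph_rewriter API and a hypothetical module M, and is not part of the deleted file), one of the (pattern, replacement) pairs defined above could be applied to a traced module like this:

import torch
from torch.fx import symbolic_trace, subgraph_rewriter

class M(torch.nn.Module):
    def forward(self, x, scale, zero_point):
        y = x.dequantize()
        y = torch.nn.functional.relu(y, inplace=True)
        return torch.quantize_per_tensor(y, scale, zero_point, torch.quint8)

traced = symbolic_trace(M())
# rewrites the dequantize -> relu -> quantize_per_tensor chain so that relu is
# applied directly to the quantized input tensor
subgraph_rewriter.replace_pattern(traced, relu_inplace_pattern, relu_replacement)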
-def relu_inplace_method_replacement(x, scale, zero_point): - x = x.relu_() - return x - -def relu6_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.relu6(x, inplace=True) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def relu6_non_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.relu6(x, inplace=False) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def relu6_replacement(x, scale, zero_point): - x = torch.nn.functional.relu6(x) - return x - - -def hardtanh_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.hardtanh(x, inplace=True) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def hardtanh_non_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.hardtanh(x, inplace=False) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def hardtanh_replacement(x, scale, zero_point): - x = torch.nn.functional.hardtanh(x) - return x - -def hardtanh_inplace_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.nn.functional.hardtanh_(x) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def hardtanh_inplace_replacement(x, scale, zero_point): - x = torch.nn.functional.hardtanh_(x) - return x - -def min_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.min(x) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def min_replacement(x, scale, zero_point): - x = torch.min(x) - return x - -def max_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.max(x) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def max_replacement(x, scale, zero_point): - x = torch.max(x) - return x - -def mean_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.mean(x) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def mean_replacement(x, scale, zero_point): - x = torch.mean(x) - return x - -def mean_method_pattern(x, scale, zero_point): - x = x.dequantize() - x = x.mean() - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def mean_method_replacement(x, scale, zero_point): - x = x.mean() - return x - -def flatten_pattern(x, scale, zero_point): - x = x.dequantize() - x = torch.flatten(x) - x = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8) - return x - -def flatten_replacement(x, scale, zero_point): - x = torch.flatten(x) - return x - -def _get_all_patterns_and_replacements(): - return [ - (relu_inplace_pattern, relu_replacement), - (relu_non_inplace_pattern, relu_replacement), - (relu_method_pattern, relu_method_replacement), - (relu_inplace_method_pattern, relu_inplace_method_replacement), - (relu6_inplace_pattern, relu6_replacement), - (relu6_non_inplace_pattern, relu6_replacement), - (hardtanh_pattern, hardtanh_replacement), - (hardtanh_non_inplace_pattern, hardtanh_replacement), - (hardtanh_inplace_pattern, hardtanh_inplace_replacement), - (mean_pattern, mean_replacement), - (mean_method_pattern, mean_method_replacement), - ] - - -def get_fbgemm_patterns_and_replacements(): - return _get_all_patterns_and_replacements() - -def get_qnnpack_patterns_and_replacements(): - return _get_all_patterns_and_replacements() diff --git a/torch/ao/quantization/fx/subgraph_rewriter_FORKED_DO_NOT_USE.py 
b/torch/ao/quantization/fx/subgraph_rewriter_FORKED_DO_NOT_USE.py deleted file mode 100644 index a64b537173a90f..00000000000000 --- a/torch/ao/quantization/fx/subgraph_rewriter_FORKED_DO_NOT_USE.py +++ /dev/null @@ -1,445 +0,0 @@ -from torch.fx.graph_module import GraphModule -from torch.fx.graph import Graph -from torch.fx.node import Node -from torch.fx._symbolic_trace import symbolic_trace -from torch.fx._compatibility import compatibility - -import copy -from typing import Callable, Dict, List, NamedTuple, Optional, Set -import torch - -@compatibility(is_backward_compatible=True) -class Match(NamedTuple): - # Node from which the match was found - anchor: Node - # Maps nodes in the pattern subgraph to nodes in the larger graph - nodes_map: Dict[Node, Node] - -class _SubgraphMatcher: - def __init__(self, pattern: Graph) -> None: - self.pattern = pattern - if len(pattern.nodes) == 0: - raise ValueError("_SubgraphMatcher cannot be initialized with an " - "empty pattern") - # `self.pattern_anchor` is the output Node in `pattern` - self.pattern_anchor = next(iter(reversed(pattern.nodes))) - # Ensure that there is only a single output value in the pattern - # since we don't support multiple outputs - assert len(self.pattern_anchor.all_input_nodes) == 1, \ - "Pattern matching on multiple outputs is not supported" - # Maps nodes in the pattern subgraph to nodes in the larger graph - self.nodes_map: Dict[Node, Node] = {} - - def matches_subgraph_from_anchor(self, anchor: Node) -> bool: - """ - Checks if the whole pattern can be matched starting from - ``anchor`` in the larger graph. - - Pattern matching is done by recursively comparing the pattern - node's use-def relationships against the graph node's. - """ - self.nodes_map = {} - return self._match_nodes(self.pattern_anchor, anchor) - - # Compare the pattern node `pn` against the graph node `gn` - def _match_nodes(self, pn: Node, gn: Node) -> bool: - - # Check if we've already matched these nodes in the current - # traversal - if pn in self.nodes_map: - return self.nodes_map[pn] == gn - - def attributes_are_equal(pn: Node, gn: Node) -> bool: - # Use placeholder and output nodes as wildcards. 
The - # only exception is that an output node can't match - # a placeholder - if (pn.op == "placeholder" - or (pn.op == "output" and gn.op != "placeholder")): - return True - return pn.op == gn.op and pn.target == gn.target - - # Terminate early if the node attributes are not equal - if not attributes_are_equal(pn, gn): - return False - - # Optimistically mark `pn` as a match for `gn` - self.nodes_map[pn] = gn - - # Traverse the use-def relationships to ensure that `pn` is a true - # match for `gn` - if pn.op == "placeholder": - return True - if (pn.op != "output" - and len(pn.all_input_nodes) != len(gn.all_input_nodes)): - return False - if pn.op == "output": - match_found = any(self._match_nodes(pn.all_input_nodes[0], gn_) - for gn_ in gn.all_input_nodes) - else: - match_found = (len(pn.all_input_nodes) == len(gn.all_input_nodes) - and all(self._match_nodes(pn_, gn_) for pn_, gn_ - in zip(pn.all_input_nodes, gn.all_input_nodes))) - if not match_found: - self.nodes_map.pop(pn) - return False - - return True - - -def _replace_submodules(gm: GraphModule, replacement: torch.nn.Module) -> None: - gm.delete_all_unused_submodules() - - if isinstance(replacement, GraphModule): - replacement.graph.lint() - - def try_get_submodule(mod: torch.nn.Module, target: str) -> Optional[torch.nn.Module]: - try: - mod_match = mod.get_submodule(target) - return mod_match - except AttributeError: - return None - - for node in gm.graph.nodes: - if node.op == "call_module" or node.op == "get_attr": - - gm_submod = try_get_submodule(gm, node.target) - - replacement_submod = try_get_submodule(replacement, node.target) - - # CASE 1: This target already exists as a submodule in our - # result GraphModule. Whether or not it exists in - # `replacement`, the existing submodule takes precedence. - if gm_submod is not None: - continue - - # CASE 2: The target exists as a submodule in `replacement` - # only, so we need to copy it over. - elif replacement_submod is not None: - new_submod = copy.deepcopy(getattr(replacement, node.target)) - gm.add_submodule(node.target, new_submod) - - # CASE 3: The target doesn't exist as a submodule in `gm` - # or `replacement` - else: - raise RuntimeError("Attempted to create a \"", node.op, - "\" node during subgraph rewriting " - f"with target {node.target}, but " - "the referenced submodule does not " - "exist in either the original " - "GraphModule `gm` or the replacement" - " GraphModule `replacement`") - - gm.graph.lint() - -@compatibility(is_backward_compatible=True) -def replace_pattern(gm: GraphModule, pattern: Callable, replacement: Callable) -> List[Match]: - """ - Matches all possible non-overlapping sets of operators and their - data dependencies (``pattern``) in the Graph of a GraphModule - (``gm``), then replaces each of these matched subgraphs with another - subgraph (``replacement``). - - Args: - ``gm``: The GraphModule that wraps the Graph to operate on - ``pattern``: The subgraph to match in ``gm`` for replacement - ``replacement``: The subgraph to replace ``pattern`` with - - Returns: - List[Match]: A list of ``Match`` objects representing the places - in the original graph that ``pattern`` was matched to. The list - is empty if there are no matches. ``Match`` is defined as: - - .. code-block:: python - - class Match(NamedTuple): - # Node from which the match was found - anchor: Node - # Maps nodes in the pattern subgraph to nodes in the larger graph - nodes_map: Dict[Node, Node] - - Examples: - - .. 
code-block:: python - - import torch - from torch.fx import symbolic_trace, subgraph_rewriter - - class M(torch.nn.Module): - def __init__(self): - super().__init__() - - def forward(self, x, w1, w2): - m1 = torch.cat([w1, w2]).sum() - m2 = torch.cat([w1, w2]).sum() - return x + torch.max(m1) + torch.max(m2) - - def pattern(w1, w2): - return torch.cat([w1, w2]).sum() - - def replacement(w1, w2): - return torch.stack([w1, w2]) - - traced_module = symbolic_trace(M()) - - subgraph_rewriter.replace_pattern(traced_module, pattern, replacement) - - The above code will first match ``pattern`` in the ``forward`` - method of ``traced_module``. Pattern-matching is done based on - use-def relationships, not node names. For example, if you had - ``p = torch.cat([a, b])`` in ``pattern``, you could match - ``m = torch.cat([a, b])`` in the original ``forward`` function, - despite the variable names being different (``p`` vs ``m``). - - The ``return`` statement in ``pattern`` is matched based on its - value only; it may or may not match to the ``return`` statement in - the larger graph. In other words, the pattern doesn't have to extend - to the end of the larger graph. - - When the pattern is matched, it will be removed from the larger - function and replaced by ``replacement``. If there are multiple - matches for ``pattern`` in the larger function, each non-overlapping - match will be replaced. In the case of a match overlap, the first - found match in the set of overlapping matches will be replaced. - ("First" here being defined as the first in a topological ordering - of the Nodes' use-def relationships. In most cases, the first Node - is the parameter that appears directly after ``self``, while the - last Node is whatever the function returns.) - - One important thing to note is that the parameters of the - ``pattern`` Callable must be used in the Callable itself, - and the parameters of the ``replacement`` Callable must match - the pattern. The first rule is why, in the above code block, the - ``forward`` function has parameters ``x, w1, w2``, but the - ``pattern`` function only has parameters ``w1, w2``. ``pattern`` - doesn't use ``x``, so it shouldn't specify ``x`` as a parameter. - As an example of the second rule, consider replacing - - .. code-block:: python - - def pattern(x, y): - return torch.neg(x) + torch.relu(y) - - with - - .. code-block:: python - - def replacement(x, y): - return torch.relu(x) - - In this case, ``replacement`` needs the same number of parameters - as ``pattern`` (both ``x`` and ``y``), even though the parameter - ``y`` isn't used in ``replacement``. - - After calling ``subgraph_rewriter.replace_pattern``, the generated - Python code looks like this: - - .. code-block:: python - - def forward(self, x, w1, w2): - stack_1 = torch.stack([w1, w2]) - sum_1 = stack_1.sum() - stack_2 = torch.stack([w1, w2]) - sum_2 = stack_2.sum() - max_1 = torch.max(sum_1) - add_1 = x + max_1 - max_2 = torch.max(sum_2) - add_2 = add_1 + max_2 - return add_2 - """ - # Get the graphs for `gm`, `pattern`, `replacement` - original_graph = gm.graph - pattern_graph = symbolic_trace(pattern).graph - replacement_graph = symbolic_trace(replacement).graph - - # Find all possible pattern matches in original_graph. Note that - # pattern matches may overlap with each other. 
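    # (Illustrative note, not in the original source: when two candidate
    #  matches share a graph node, both cannot be replaced;
    #  `overlaps_with_prev_match` further below skips any later match that
    #  reuses a node already claimed by an earlier, accepted match.)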
- matcher = _SubgraphMatcher(pattern_graph) - matches: List[Match] = [] - - # Consider each node as an "anchor" (deepest matching graph node) - for anchor in original_graph.nodes: - - if matcher.matches_subgraph_from_anchor(anchor): - - def pattern_is_contained(nodes_map: Dict[Node, Node]) -> bool: - # `lookup` represents all the nodes in `original_graph` - # that are part of `pattern` - lookup: Dict[Node, Node] = {v: k for k, v in nodes_map.items()} - for n in lookup.keys(): - - # Nodes that can "leak"... - - # Placeholders (by definition) - if n.op == "placeholder": - continue - # Pattern output (acts as a container) - if lookup[n].op == "output": - continue - # Result contained by pattern output (what we'll - # hook in to the new Graph, thus what we'll - # potentially use in other areas of the Graph as - # an input Node) - if (len(lookup[n].users) == 1 - and list(lookup[n].users.keys())[0].op == "output"): - continue - - for user in n.users: - # If this node has users that were not in - # `lookup`, then it must leak out of the - # pattern subgraph - if user not in lookup: - return False - return True - - # It's not a match if the pattern leaks out into the rest - # of the graph - if pattern_is_contained(matcher.nodes_map): - # Shallow copy nodes_map - matches.append(Match(anchor=anchor, - nodes_map=copy.copy({ - key: value - for key, value in matcher.nodes_map.items() - }))) - - # The set of all nodes in `original_graph` that we've seen thus far - # as part of a pattern match - replaced_nodes: Set[Node] = set() - # As we progressively replace nodes, we'll need to keep track of how the match results should change - match_changed_node: Dict[Node, Node] = dict() - - # Return True if one of the nodes in the current match has already - # been used as part of another match - def overlaps_with_prev_match(match: Match) -> bool: - for pn, gn in match.nodes_map.items(): - if pn.op in ["placeholder", "output"]: - continue - if gn in replaced_nodes and gn.op != "placeholder": - return True - return False - - for match in matches: - # Skip overlapping matches - if overlaps_with_prev_match(match): - continue - - # Map replacement graph nodes to their copy in `original_graph` - val_map: Dict[Node, Node] = {} - - pattern_placeholders = [n for n in pattern_graph.nodes - if n.op == "placeholder"] - assert len(pattern_placeholders) > 0 - replacement_placeholders = [n for n in replacement_graph.nodes - if n.op == "placeholder"] - assert len(pattern_placeholders) == len(replacement_placeholders) - placeholder_map = {r: p for r, p - in zip(replacement_placeholders, pattern_placeholders)} - - # node from `original_graph` that matched with the output node - # in `pattern` - subgraph_output: Node = match.anchor - - def mark_node_as_replaced(n: Node) -> None: - if n not in match.nodes_map.values(): - return - for n_ in n.all_input_nodes: - mark_node_as_replaced(n_) - replaced_nodes.add(n) - - for input_node in subgraph_output.all_input_nodes: - mark_node_as_replaced(input_node) - - # Initialize `val_map` with mappings from placeholder nodes in - # `replacement` to their corresponding node in `original_graph` - for replacement_node in replacement_placeholders: - # Get the `original_graph` placeholder node - # corresponding to the current `replacement_node` - pattern_node = placeholder_map[replacement_node] - original_graph_node = match_changed_node.get(match.nodes_map[pattern_node], match.nodes_map[pattern_node]) - - # Populate `val_map` - val_map[replacement_node] = original_graph_node - - # Copy the stack trace 
from the original graph to the replacement graph. - # Currently this is using a naive strategy: - # 1. find the first node with non-null stack trace in the original graph - # 2. if found, copy this stack trace to every node in the replacement graph - first_stack_trace = None - for pn, gn in match.nodes_map.items(): - if gn.stack_trace is not None: - first_stack_trace = gn.stack_trace - break - if first_stack_trace is not None: - for node in replacement_graph.nodes: - node.stack_trace = first_stack_trace - - # Copy the replacement graph over - with original_graph.inserting_before(subgraph_output): - copied_output = original_graph.graph_copy(replacement_graph, - val_map) - - # Clear out stack traces to prevent interference with next match - for node in replacement_graph.nodes: - node.stack_trace = None - - # Hook the output Node of the replacement subgraph in to the - # original Graph at the correct location - - # CASE 1: We need to hook the replacement subgraph in somewhere - # in the middle of the graph. We replace the Node in the - # original graph that corresponds to the end of the pattern - # subgraph - if subgraph_output.op != "output": - pattern_outputs = [n for n in pattern_graph.nodes - if n.op == "output"] - assert len(pattern_outputs) > 0 - replacement_outputs = [n for n in replacement_graph.nodes - if n.op == "output"] - assert len(replacement_outputs) == len(pattern_outputs) - outputs_map = {p: r for r, p - in zip(replacement_outputs, pattern_outputs)} - - for pn, gn in match.nodes_map.items(): - if gn.op == "placeholder": - continue - - # Search for the node corresponding to the output of the pattern - if pn.op != "output": - continue - assert subgraph_output == gn - - # Update all anchor inputs to the new nodes - rn = outputs_map[pn] - for pn_input, rn_input in zip(pn.all_input_nodes, rn.all_input_nodes): - gn_input = match.nodes_map[pn_input] - rn_input_in_original_graph = val_map[rn_input] - gn_input.replace_all_uses_with(rn_input_in_original_graph) - # We store the updated node point in case other nodes want to use it - match_changed_node[gn_input] = rn_input_in_original_graph - - assert subgraph_output.op != "output" - # CASE 2: The pattern subgraph match extends to the end of the - # original graph, so we need to change the current graph's - # output Node to reflect the insertion of the replacement graph. 
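The rewiring above leans on ``Node.replace_all_uses_with``. A standalone sketch of that primitive on a toy graph (illustrative only, not the rewriter itself):

.. code-block:: python

    import torch
    from torch.fx import symbolic_trace

    def f(x):
        return torch.relu(x) + 1

    gm = symbolic_trace(f)
    graph = gm.graph
    relu_node = next(n for n in graph.nodes if n.target is torch.relu)

    # Insert a replacement node and reroute every user of the old node to it,
    # mirroring how the rewriter splices `replacement` into `original_graph`.
    with graph.inserting_after(relu_node):
        new_node = graph.call_function(torch.sigmoid, args=relu_node.args)
    relu_node.replace_all_uses_with(new_node)
    graph.erase_node(relu_node)
    gm.recompile()   # forward is now sigmoid(x) + 1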
- # We'll keep the current output Node, but update its args and - # `_input_nodes` as necessary - else: - subgraph_output.args = ((copied_output,)) - if isinstance(copied_output, Node): - subgraph_output._input_nodes = {copied_output: None} - - assert isinstance(copied_output, Node) - # Erase the `pattern` nodes - for node in reversed(original_graph.nodes): - if len(node.users) == 0 and node.op != "output": - original_graph.erase_node(node) - - # Update the passed-in GraphModule to reflect the new state of - # `original_graph` - gm.recompile() - - # If `replacement` was an nn.Module, we'll need to make sure that - # all the submodules have been copied over correctly - if isinstance(replacement, torch.nn.Module): - _replace_submodules(gm, replacement) - - return matches diff --git a/torch/ao/quantization/fx/utils.py b/torch/ao/quantization/fx/utils.py index cbb56d405353e8..70b852395ca905 100644 --- a/torch/ao/quantization/fx/utils.py +++ b/torch/ao/quantization/fx/utils.py @@ -12,7 +12,9 @@ ) from typing import Callable, Optional, List, Dict, Any, Set, Tuple, Union, Type +from collections import namedtuple import operator +import warnings # A dictionary for querying the weight index for a given op WEIGHT_INDEX_DICT = { @@ -111,7 +113,7 @@ def get_per_tensor_qparams(activation_post_process): dtype = activation_post_process.dtype return scale, zero_point, dtype -def get_quantize_node_info(activation_post_process: Callable) -> Tuple[str, Union[Callable, str], Dict[str, Any]]: +def get_quantize_node_info(activation_post_process: Callable) -> Optional[Tuple[str, Union[Callable, str], Dict[str, Any]]]: ''' Given an activation_post_process module, return node_type(e.g. call_function), quantize op(e.g. quantize_per_tensor) and a dictionary of extracted qparams from the module @@ -137,14 +139,17 @@ def get_quantize_node_info(activation_post_process: Callable) -> Tuple[str, Unio node_type = "call_method" quantize_op = "to" qparams = {"_dtype_": dtype} - elif dtype == torch.float32 and compute_dtype in [torch.quint8, torch.qint8]: + elif dtype == torch.float32 and compute_dtype in [torch.quint8, torch.qint8, torch.float16]: + # dynamic quantization node_type = "call_function" quantize_op = torch.quantize_per_tensor_dynamic + # TODO: get reduce range from observer + # reduce_range = activation_post_process.reduce_range reduce_range = torch.backends.quantized.engine == "fbgemm" qparams = {"_dtype_": compute_dtype, "_reduce_range_": reduce_range} else: - raise Exception("Unsupported dtype in get_quantize_node_info:" + str(dtype)) - assert quantize_op is not None + warnings.warn(f"Unsupported activation_post_process in get_quantize_node_info: {activation_post_process}") + return None return node_type, quantize_op, qparams def quantize_node( @@ -193,7 +198,10 @@ def quantize_node( module_path = "" root_module = modules[''] graph = quantized_graph - node_type, quantize_op, qparams = get_quantize_node_info(obs_module) + maybe_quantize_node_info = get_quantize_node_info(obs_module) + assert maybe_quantize_node_info is not None, \ + f"Expecting quantize node info not to be None, observer: {obs_module}" + node_type, quantize_op, qparams = maybe_quantize_node_info inputs = [in_node] for key, value in qparams.items(): @@ -464,6 +472,74 @@ def all_node_args_have_no_tensors(node: Node, modules: Dict[str, torch.nn.Module cache[node] = result return result +def all_node_args_except_first(node: Node) -> List[int]: + """ + Returns all node arg indices after first + """ + return list(range(1, len(node.args))) + +def 
return_arg_list(arg_indices: List[int]) -> Callable[[Node], List[int]]: + """ + Constructs a function that takes a node as arg and returns the arg_indices + that are valid for node.args + """ + def arg_indices_func(node: Node) -> List[int]: + return [i for i in arg_indices if i < len(node.args)] + return arg_indices_func + +NodeInfo = namedtuple("NodeInfo", "op target") + +# this dict identifies which indices of a node are non tensors +# so that they can be propagated correctly since inserting observers +# for them would cause errors + +NON_OBSERVABLE_ARG_DICT: Dict[NodeInfo, Dict[Union[type, torch.dtype], Callable[[Node], List[int]]]] = { + NodeInfo("call_method", "masked_fill") : { + torch.bool: return_arg_list([1]), + float: return_arg_list([2]) + }, + NodeInfo("call_method", "permute") : { + int: all_node_args_except_first + }, + NodeInfo("call_method", "repeat") : { + int: all_node_args_except_first + }, + NodeInfo("call_method", "reshape") : { + int: all_node_args_except_first + }, + NodeInfo("call_method", "size") : { + int: return_arg_list([1]) + }, + NodeInfo("call_method", "transpose") : { + int: all_node_args_except_first + }, + NodeInfo("call_method", torch.transpose) : { + int: all_node_args_except_first + }, + NodeInfo("call_method", "unsqueeze") : { + int: return_arg_list([1]) + }, + NodeInfo("call_method", "unsqueeze_") : { + int: return_arg_list([1]) + }, + NodeInfo("call_method", torch.unsqueeze) : { + int: return_arg_list([1]) + }, + NodeInfo("call_method", "view") : { + int: all_node_args_except_first + }, +} + +EMPTY_ARG_DICT: Dict[Union[type, torch.dtype], Callable[[Node], List[int]]] = {} + +def get_non_observable_arg_indexes_and_types(node: Node) -> Dict[Union[type, torch.dtype], Callable[[Node], List[int]]]: + """ + Returns a dict with of non float tensor types as keys and values which correspond to a + function to retrieve the list (which takes the node as an argument) + """ + info = NodeInfo(node.op, node.target) + + return NON_OBSERVABLE_ARG_DICT.get(info, EMPTY_ARG_DICT) def node_return_type_is_int(node: Node) -> bool: """ @@ -472,13 +548,6 @@ def node_return_type_is_int(node: Node) -> bool: """ return node.op == 'call_method' and node.target == 'size' -def node_bool_tensor_arg_indexes(node: Node) -> List[int]: - """ - Returns indexes of boolean Tensor args - """ - if node.op == "call_method" and node.target == "masked_fill": - return [1] - return [] def is_get_tensor_info_node(node: Node) -> bool: """ Returns True if this node is a node that takes a Tensor as input and output some diff --git a/torch/ao/quantization/observer.py b/torch/ao/quantization/observer.py index 73f911a68f7b71..1bdc603213aa2c 100644 --- a/torch/ao/quantization/observer.py +++ b/torch/ao/quantization/observer.py @@ -128,6 +128,7 @@ class _ObserverBase(ObserverBase): This is sometimes required to avoid instruction overflow. quant_min: Minimum quantization value. If unspecified, it will follow the 8-bit setup. quant_max: Maximum quantization value. If unspecified, it will follow the 8-bit setup. + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. .. 
warning:: @@ -169,6 +170,7 @@ def __init__( quant_min=None, quant_max=None, factory_kwargs=None, + eps=torch.finfo(torch.float32).eps, ) -> None: factory_kwargs = torch.nn.factory_kwargs(factory_kwargs) super(_ObserverBase, self).__init__(dtype=dtype) @@ -180,7 +182,7 @@ def __init__( ) self.reduce_range = reduce_range self.register_buffer( - "eps", torch.tensor([torch.finfo(torch.float32).eps], **factory_kwargs) + "eps", torch.tensor([eps], **factory_kwargs) ) assert self.qscheme in ( torch.per_tensor_affine, @@ -346,8 +348,7 @@ class MinMaxObserver(_ObserverBase): reduce_range: Reduces the range of the quantized data type by 1 bit quant_min: Minimum quantization value. If unspecified, it will follow the 8-bit setup. quant_max: Maximum quantization value. If unspecified, it will follow the 8-bit setup. - memoryless: Boolean that controls whether observer removes old data when a new input is seen. - This is most useful for simulating dynamic quantization, especially during QAT. + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. Given running min/max as :math:`x_\text{min}` and :math:`x_\text{max}`, scale :math:`s` and zero point :math:`z` are computed as: @@ -406,7 +407,7 @@ def __init__( quant_min=None, quant_max=None, factory_kwargs=None, - memoryless=False, + eps=torch.finfo(torch.float32).eps, ) -> None: # For x86 quantized kernels, we need to ensure that the vpmaddubsw @@ -422,8 +423,8 @@ def __init__( quant_min=quant_min, quant_max=quant_max, factory_kwargs=factory_kwargs, + eps=eps, ) - self.memoryless = memoryless factory_kwargs = torch.nn.factory_kwargs(factory_kwargs) self.register_buffer("min_val", torch.tensor(float("inf"), **factory_kwargs)) self.register_buffer("max_val", torch.tensor(float("-inf"), **factory_kwargs)) @@ -441,8 +442,6 @@ def forward(self, x_orig): r"""Records the running minimum and maximum of ``x``.""" if x_orig.numel() == 0: return x_orig - elif self.memoryless: - self.reset_min_max_vals() x = x_orig.detach() # avoid keeping autograd tape x = x.to(self.min_val.dtype) min_val_cur, max_val_cur = torch.aminmax(x) @@ -483,6 +482,7 @@ class MovingAverageMinMaxObserver(MinMaxObserver): reduce_range: Reduces the range of the quantized data type by 1 bit quant_min: Minimum quantization value. If unspecified, it will follow the 8-bit setup. quant_max: Maximum quantization value. If unspecified, it will follow the 8-bit setup. + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. The moving average min/max is computed as follows @@ -519,6 +519,7 @@ def __init__( reduce_range=False, quant_min=None, quant_max=None, + eps=torch.finfo(torch.float32).eps, **kwargs ) -> None: self.averaging_constant = averaging_constant @@ -528,6 +529,7 @@ def __init__( reduce_range=reduce_range, quant_min=quant_min, quant_max=quant_max, + eps=eps, **kwargs ) @@ -565,8 +567,7 @@ class PerChannelMinMaxObserver(_ObserverBase): reduce_range: Reduces the range of the quantized data type by 1 bit quant_min: Minimum quantization value. If unspecified, it will follow the 8-bit setup. quant_max: Maximum quantization value. If unspecified, it will follow the 8-bit setup. - memoryless: Boolean that controls whether observer removes old data when a new input is seen. - This is most useful for simulating dynamic quantization, especially during QAT. + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. 
The quantization parameters are computed the same way as in :class:`~torch.ao.quantization.observer.MinMaxObserver`, with the difference @@ -588,7 +589,7 @@ def __init__( quant_min=None, quant_max=None, factory_kwargs=None, - memoryless=False, + eps=torch.finfo(torch.float32).eps, ) -> None: super(PerChannelMinMaxObserver, self).__init__( dtype=dtype, @@ -597,8 +598,8 @@ def __init__( quant_min=quant_min, quant_max=quant_max, factory_kwargs=factory_kwargs, + eps=eps, ) - self.memoryless = memoryless factory_kwargs = torch.nn.factory_kwargs(factory_kwargs) self.ch_axis = ch_axis self.register_buffer("min_val", torch.tensor([], **factory_kwargs)) @@ -631,7 +632,7 @@ def _forward(self, x_orig): # are done in place and types need to match for comparisons y = y.to(self.min_val.dtype) y = torch.flatten(y, start_dim=1) - if min_val.numel() == 0 or max_val.numel() == 0 or self.memoryless: + if min_val.numel() == 0 or max_val.numel() == 0: min_val, max_val = torch.aminmax(y, dim=1) else: min_val_cur, max_val_cur = torch.aminmax(y, dim=1) @@ -751,6 +752,7 @@ class MovingAveragePerChannelMinMaxObserver(PerChannelMinMaxObserver): reduce_range: Reduces the range of the quantized data type by 1 bit quant_min: Minimum quantization value. If unspecified, it will follow the 8-bit setup. quant_max: Maximum quantization value. If unspecified, it will follow the 8-bit setup. + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. The quantization parameters are computed the same way as in :class:`~torch.ao.quantization.observer.MovingAverageMinMaxObserver`, with the @@ -770,6 +772,7 @@ def __init__( reduce_range=False, quant_min=None, quant_max=None, + eps=torch.finfo(torch.float32).eps, **kwargs ) -> None: super(MovingAveragePerChannelMinMaxObserver, self).__init__( @@ -779,6 +782,7 @@ def __init__( reduce_range=reduce_range, quant_min=quant_min, quant_max=quant_max, + eps=eps, **kwargs ) self.averaging_constant = averaging_constant @@ -822,6 +826,7 @@ class HistogramObserver(_ObserverBase): dtype: Quantized data type qscheme: Quantization scheme to be used reduce_range: Reduces the range of the quantized data type by 1 bit + eps: Epsilon value for float32, Defaults to `torch.finfo(torch.float32).eps`. The scale and zero point are computed as follows: @@ -848,6 +853,7 @@ def __init__( quant_min=None, quant_max=None, factory_kwargs=None, + eps=torch.finfo(torch.float32).eps, ) -> None: # bins: The number of bins used for histogram calculation. super(HistogramObserver, self).__init__( @@ -857,6 +863,7 @@ def __init__( quant_min=quant_min, quant_max=quant_max, factory_kwargs=factory_kwargs, + eps=eps, ) factory_kwargs = torch.nn.factory_kwargs(factory_kwargs) self.bins = bins @@ -1435,6 +1442,13 @@ def load_observer_state_dict(mod, obs_dict): Default weight observer. """ +weight_observer_range_neg_127_to_127 = MinMaxObserver.with_args( + dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, + quant_min=-127, quant_max=127, eps=2 ** -12) +""" +Symmetric weight observer with the 8-bit values restricted to [-127, +127], excluding -128. +""" + default_histogram_observer = HistogramObserver.with_args(quant_min=0, quant_max=127) """ Default histogram observer, usually used for PTQ. @@ -1448,6 +1462,13 @@ def load_observer_state_dict(mod, obs_dict): weight quantization is supported, such as `fbgemm`. 
""" +per_channel_weight_observer_range_neg_127_to_127 = MinMaxObserver.with_args( + dtype=torch.qint8, qscheme=torch.per_channel_symmetric, + quant_min=-127, quant_max=127, eps=2 ** -12) +""" +Per-channel, symmetric weight observer with the 8-bit values restricted to [-127, +127], excluding -128. +""" + default_dynamic_quant_observer = PlaceholderObserver.with_args( dtype=torch.float, compute_dtype=torch.quint8 ) diff --git a/torch/ao/quantization/qconfig.py b/torch/ao/quantization/qconfig.py index c35739ab9b82ed..94e9646d84522a 100644 --- a/torch/ao/quantization/qconfig.py +++ b/torch/ao/quantization/qconfig.py @@ -16,6 +16,8 @@ default_fused_per_channel_wt_fake_quant, default_embedding_fake_quant, default_embedding_fake_quant_4bit, + fused_wt_fake_quant_range_neg_127_to_127, + fused_per_channel_wt_fake_quant_range_neg_127_to_127, ) from .observer import ( @@ -32,6 +34,8 @@ default_per_channel_weight_observer, default_placeholder_observer, default_weight_observer, + weight_observer_range_neg_127_to_127, + per_channel_weight_observer_range_neg_127_to_127, default_reuse_input_observer, ) import warnings @@ -113,7 +117,7 @@ def __new__(cls, activation=torch.nn.Identity, weight=torch.nn.Identity): Default dynamic qconfig. """ -float16_dynamic_qconfig = QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float32), +float16_dynamic_qconfig = QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float32, compute_dtype=torch.float16), weight=PlaceholderObserver.with_args(dtype=torch.float16)) """ Dynamic qconfig with weights quantized to `torch.float16`. @@ -184,8 +188,8 @@ def get_default_qconfig(backend='fbgemm', version=0): Returns the default PTQ qconfig for the specified backend. Args: - * `backend`: a string representing the target backend. Currently supports `fbgemm` - and `qnnpack`. + * `backend`: a string representing the target backend. Currently supports `fbgemm`, + `qnnpack` and `onednn`. Return: qconfig @@ -197,6 +201,9 @@ def get_default_qconfig(backend='fbgemm', version=0): elif backend == 'qnnpack': qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False), weight=default_weight_observer) + elif backend == 'onednn': + qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False), + weight=default_per_channel_weight_observer) else: qconfig = default_qconfig else: @@ -205,6 +212,42 @@ def get_default_qconfig(backend='fbgemm', version=0): return qconfig +""" +Default, symmetric PTQ qconfig for the specified backend. And a per_channel +variant of the same. + +Symmetric here applies to signed weights with zero point = 0, and additional +value restrictions. The activations are also signed 8-bit integers with this +qconfig. + + * Once this change is merged [as of 3/17/22], with backend or qengine = + 'qnnpack', some quantized operators with this symmetric qconfig may use + operators from xnnpack library. + + ** Support to use xnnpack ops with `qnnpack` backed for asymmetric + qconfig (returned by get_default_qconfig()) is not available yet. + + * This qconfig uses signed activations and weights. Weights have added + restrictions such as zero point is forced to be 0, making the weights + symmetric, hence the name. And the 8-bit quantized values are + restricting to to [-127, +127], excluding -128. + + * xnnpack has a requantization scale value restriction, 0x1p-32 <= + requantization_scale < 256.0 where, `requantization_scale = (input_scale + * kernel_scale) / (output_scale)`. 
Using this eps (w/ assumed max value + of 256) is to prevent requantization_scale to go below xnnpack lower + threshold. +""" +default_symmetric_qnnpack_qconfig = QConfig(activation=HistogramObserver.with_args(dtype=torch.qint8, + reduce_range=False, + eps=2 ** -12), + weight=weight_observer_range_neg_127_to_127) + +default_per_channel_symmetric_qnnpack_qconfig = QConfig(activation=HistogramObserver.with_args(dtype=torch.qint8, + reduce_range=False, + eps=2 ** -12), + weight=per_channel_weight_observer_range_neg_127_to_127) + default_embedding_qat_qconfig = QConfig(activation=NoopObserver.with_args(dtype=torch.float32), weight=default_embedding_fake_quant) @@ -216,8 +259,8 @@ def get_default_qat_qconfig(backend='fbgemm', version=1): Returns the default QAT qconfig for the specified backend. Args: - * `backend`: a string representing the target backend. Currently supports `fbgemm` - and `qnnpack`. + * `backend`: a string representing the target backend. Currently supports `fbgemm`, + `qnnpack` and `onednn`. * `version`: version, for backwards compatibility. Can be `None` or `1`. Return: @@ -237,6 +280,11 @@ def get_default_qat_qconfig(backend='fbgemm', version=1): quant_max=255, reduce_range=False), weight=default_weight_fake_quant) + elif backend == 'onednn': + qconfig = QConfig(activation=FakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=0, + quant_max=255), + weight=default_per_channel_weight_fake_quant) else: qconfig = default_qat_qconfig # Use the fused observe + fake_quant modules for doing QAT. @@ -253,6 +301,11 @@ def get_default_qat_qconfig(backend='fbgemm', version=1): quant_max=255, reduce_range=False), weight=default_fused_wt_fake_quant) + elif backend == 'onednn': + qconfig = QConfig(activation=FusedMovingAvgObsFakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=0, + quant_max=255), + weight=default_fused_per_channel_wt_fake_quant) else: qconfig = default_qat_qconfig_v2 else: @@ -261,6 +314,27 @@ def get_default_qat_qconfig(backend='fbgemm', version=1): return qconfig +""" +Default symmetric QAT qconfig for qnnpack. And its per channel weight variant. +""" +default_symmetric_qnnpack_qat_qconfig = QConfig( + activation=FusedMovingAvgObsFakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=-128, + quant_max=127, + dtype=torch.qint8, + reduce_range=False, + eps=2 ** -12), + weight=fused_wt_fake_quant_range_neg_127_to_127) + +default_per_channel_symmetric_qnnpack_qat_qconfig = QConfig( + activation=FusedMovingAvgObsFakeQuantize.with_args(observer=MovingAverageMinMaxObserver, + quant_min=-128, + quant_max=127, + dtype=torch.qint8, + reduce_range=False, + eps=2 ** -12), + weight=fused_per_channel_wt_fake_quant_range_neg_127_to_127) + def _get_default_qconfig_dict_helper(qconfig, qconfig_transpose): return { "": qconfig, @@ -404,9 +478,10 @@ def partial_equals(p1, p2): def activation_is_memoryless(qconfig: QConfig): """ Return whether the observer for activations defined in the given QConfig is memoryless. + This means a MovingAverage observer with averaging constant equal to 1. 
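A quick numeric check of the xnnpack constraint described above, under the stated assumptions that the input and kernel scales are floored at eps = 2 ** -12 and the output scale is at most 256:

.. code-block:: python

    eps = 2 ** -12
    # Worst case under the assumptions above: smallest allowed input/kernel
    # scales divided by the largest assumed output scale.
    requantization_scale = (eps * eps) / 256.0
    assert requantization_scale == 2 ** -32      # exactly the 0x1p-32 lower bound
    assert 2 ** -32 <= requantization_scale < 256.0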
""" def _is_memoryless(observer): - return hasattr(observer, "memoryless") and observer.memoryless + return hasattr(observer, "averaging_constant") and observer.averaging_constant == 1 act = qconfig.activation() if isinstance(act, FakeQuantizeBase) and hasattr(act, "activation_post_process"): return _is_memoryless(act.activation_post_process) diff --git a/torch/ao/quantization/quantization_mappings.py b/torch/ao/quantization/quantization_mappings.py index d561f42ad44722..88016f06cda057 100644 --- a/torch/ao/quantization/quantization_mappings.py +++ b/torch/ao/quantization/quantization_mappings.py @@ -23,9 +23,12 @@ default_symmetric_fixed_qparams_fake_quant, ) from torch.ao.quantization.utils import get_combined_dict +from torch.nn.utils.parametrize import type_before_parametrizations # Default map for swapping float module to reference quantized modules DEFAULT_REFERENCE_STATIC_QUANT_MODULE_MAPPINGS : Dict[Callable, Any] = { + QuantStub: nnq.Quantize, + DeQuantStub: nnq.DeQuantize, nn.Linear: nnqr.Linear, nn.Conv1d: nnqr.Conv1d, nn.Conv2d: nnqr.Conv2d, @@ -33,6 +36,12 @@ nn.ConvTranspose1d: nnqr.ConvTranspose1d, nn.ConvTranspose2d: nnqr.ConvTranspose2d, nn.ConvTranspose3d: nnqr.ConvTranspose3d, + nn.Embedding: nnqr.Embedding, + nn.EmbeddingBag: nnqr.EmbeddingBag, + nn.GRUCell: nnqr.GRUCell, + nn.LSTMCell: nnqr.LSTMCell, + nn.RNNCell: nnqr.RNNCell, + nn.LSTM: nnqr.LSTM, } # Default map for swapping float module to quantized ones @@ -175,6 +184,11 @@ def get_default_static_quant_module_mappings() -> Dict[Callable, Any]: ''' return copy.deepcopy(DEFAULT_STATIC_QUANT_MODULE_MAPPINGS) +def get_default_static_quant_reference_module_mappings() -> Dict[Callable, Any]: + ''' Get reference module mapping for post training static quantization + ''' + return copy.deepcopy(DEFAULT_REFERENCE_STATIC_QUANT_MODULE_MAPPINGS) + def get_embedding_static_quant_module_mappings() -> Dict[Callable, Any]: ''' Get module mapping, including mapping for embedding QAT ''' @@ -293,7 +307,7 @@ def _get_special_act_post_process(module: torch.nn.Module) -> Optional[Callable] input: torch.nn.Sigmoid output: default_affine_fixed_qparam_fake_quant """ - return DEFAULT_MODULE_TO_ACT_POST_PROCESS.get(type(module), None) + return DEFAULT_MODULE_TO_ACT_POST_PROCESS.get(type_before_parametrizations(module), None) def _has_special_act_post_process(module: torch.nn.Module) -> bool: return module.training and type(module) in DEFAULT_MODULE_TO_ACT_POST_PROCESS diff --git a/torch/ao/quantization/quantize.py b/torch/ao/quantization/quantize.py index fad2b8abe6eabc..f5aa195c94dd9e 100644 --- a/torch/ao/quantization/quantize.py +++ b/torch/ao/quantization/quantize.py @@ -10,13 +10,14 @@ from torch.ao.quantization.quantization_mappings import ( get_default_dynamic_quant_module_mappings, get_default_static_quant_module_mappings, + get_default_static_quant_reference_module_mappings, get_default_qat_module_mappings, get_default_qconfig_propagation_list, no_observer_set, _has_special_act_post_process, _get_special_act_post_process, ) -from .utils import get_qparam_dict +from .utils import get_qparam_dict, has_no_children_ignoring_parametrizations from torch.ao.quantization.stubs import DeQuantStub, QuantWrapper from torch.ao.quantization.qconfig import ( add_module_to_qconfig_obs_ctr, @@ -25,6 +26,7 @@ float_qparams_weight_only_qconfig, float_qparams_weight_only_qconfig_4bit, activation_is_memoryless) +from torch.nn.utils.parametrize import type_before_parametrizations def is_activation_post_process(module): return (isinstance(module, 
torch.ao.quantization.ObserverBase) or @@ -32,7 +34,7 @@ def is_activation_post_process(module): def _propagate_qconfig_helper(module, qconfig_dict, - qconfig_parent=None, prefix=''): + qconfig_parent=None, prefix='', prepare_custom_config_dict=None): r"""This is a helper function for `propagate_qconfig_` Args: @@ -44,12 +46,14 @@ def _propagate_qconfig_helper(module, qconfig_dict, module prefix: corresponding prefix of the current module, used as key in qconfig_dict + prepare_custom_config_dict: dictionary for custom handling of modules + see docs for :func:`~torch.ao.quantization.prepare_fx` Return: None, module is modified inplace with qconfig attached """ - module_qconfig = qconfig_dict.get(type(module), qconfig_parent) + module_qconfig = qconfig_dict.get(type_before_parametrizations(module), qconfig_parent) module_qconfig = qconfig_dict.get(prefix, module_qconfig) module_qconfig = getattr(module, 'qconfig', module_qconfig) @@ -60,10 +64,16 @@ def _propagate_qconfig_helper(module, qconfig_dict, for name, child in module.named_children(): module_prefix = prefix + '.' + name if prefix else name - _propagate_qconfig_helper(child, qconfig_dict, - qconfig_with_device_check, module_prefix) + # do no not propagate qconfig to child if child is non traceable + if prepare_custom_config_dict is None or not ( + name in prepare_custom_config_dict.get("non_traceable_module_name", []) + or type(child) in prepare_custom_config_dict.get("non_traceable_module_class", []) + ): + _propagate_qconfig_helper( + child, qconfig_dict, qconfig_with_device_check, module_prefix + ) -def propagate_qconfig_(module, qconfig_dict=None): +def propagate_qconfig_(module, qconfig_dict=None, prepare_custom_config_dict=None): r"""Propagate qconfig through the module hierarchy and assign `qconfig` attribute on each leaf module @@ -73,13 +83,17 @@ def propagate_qconfig_(module, qconfig_dict=None): quantization configuration, qconfig applies to all submodules of a given module unless qconfig for the submodules are specified (when the submodule already has qconfig attribute) + prepare_custom_config_dict: dictionary for custom handling of modules + see docs for :func:`~torch.ao.quantization.prepare_fx` Return: None, module is modified inplace with qconfig attached """ if qconfig_dict is None: qconfig_dict = {} - _propagate_qconfig_helper(module, qconfig_dict) + if prepare_custom_config_dict is None: + prepare_custom_config_dict = {} + _propagate_qconfig_helper(module, qconfig_dict, prepare_custom_config_dict=prepare_custom_config_dict) def _observer_forward_hook(self, input, output): r"""Forward hook that calls observer on the output @@ -157,9 +171,9 @@ def insert_activation_post_process(m, special_act_post_process=None): for name, child in module.named_children(): # TODO remove Dropout special after codebase stable - if type(child) in [nn.Dropout]: + if type_before_parametrizations(child) in [nn.Dropout]: continue - elif type(child) in [nnq.FloatFunctional, nnq.QFunctional]: + elif type_before_parametrizations(child) in [nnq.FloatFunctional, nnq.QFunctional]: if needs_observation(child): child.activation_post_process = get_activation_post_process(child.qconfig, device) elif isinstance(child, _FusedModule): @@ -169,23 +183,23 @@ def insert_activation_post_process(m, special_act_post_process=None): elif _has_special_act_post_process(child): special_act_post_process = _get_special_act_post_process(child) insert_activation_post_process(child, special_act_post_process) - elif non_leaf_module_list is not None and type(child) in 
non_leaf_module_list: + elif non_leaf_module_list is not None and type_before_parametrizations(child) in non_leaf_module_list: if needs_observation(child): insert_activation_post_process(child) - elif needs_observation(child) and type(child) in custom_module_class_mapping: - observed_child = custom_module_class_mapping[type(child)].from_float(child) + elif needs_observation(child) and type_before_parametrizations(child) in custom_module_class_mapping: + observed_child = custom_module_class_mapping[type_before_parametrizations(child)].from_float(child) setattr(module, name, observed_child) # TODO: These are the modules that cannot be observed # Once there are more, we should move them to a separate list - if custom_module_class_mapping[type(child)] not in no_observer_set(): + if custom_module_class_mapping[type_before_parametrizations(child)] not in no_observer_set(): insert_activation_post_process(observed_child) else: add_observer_(child, qconfig_propagation_list, non_leaf_module_list, device, custom_module_class_mapping) # Insert observers only for leaf nodes, note that this observer is for # the output of the module, for input QuantStub will observe them - if len(module._modules) == 0 and not isinstance(module, torch.nn.Sequential) \ - and type(module) in qconfig_propagation_list: + if has_no_children_ignoring_parametrizations(module) and not isinstance(module, torch.nn.Sequential) \ + and type_before_parametrizations(module) in qconfig_propagation_list: insert_activation_post_process(module) def get_unique_devices_(module): @@ -207,7 +221,7 @@ def add_quant_dequant(module): wraps the input module, the latter case only happens when the input module is a leaf module and we want to quantize it. """ - if len(module._modules) == 0 and hasattr(module, 'qconfig') and module.qconfig: + if has_no_children_ignoring_parametrizations(module) and hasattr(module, 'qconfig') and module.qconfig: return QuantWrapper(module) for name, child in module.named_children(): @@ -472,7 +486,7 @@ def quantize_qat(model, run_fn, run_args, inplace=False): def convert( module, mapping=None, inplace=False, remove_qconfig=True, - convert_custom_config_dict=None): + is_reference=False, convert_custom_config_dict=None): r"""Converts submodules in input module to a different module according to `mapping` by calling `from_float` method on the target module class. And remove qconfig at the end if remove_qconfig is set to True. 
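A hedged eager-mode sketch of the new ``is_reference`` path added to ``convert`` above; the toy module and calibration data are made up for illustration:

.. code-block:: python

    import torch
    import torch.ao.quantization as tq

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()
            self.linear = torch.nn.Linear(4, 4)
            self.dequant = tq.DeQuantStub()

        def forward(self, x):
            return self.dequant(self.linear(self.quant(x)))

    m = M().eval()
    m.qconfig = tq.get_default_qconfig("fbgemm")
    prepared = tq.prepare(m)
    prepared(torch.randn(2, 4))            # calibration pass
    # With is_reference=True, reference quantized modules (e.g. nnqr.Linear)
    # are swapped in instead of the default quantized kernels.
    reference_model = tq.convert(prepared, is_reference=True)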
@@ -503,7 +517,7 @@ def convert( if not inplace: module = copy.deepcopy(module) _convert( - module, mapping, inplace=True, + module, mapping, inplace=True, is_reference=is_reference, convert_custom_config_dict=convert_custom_config_dict) if remove_qconfig: _remove_qconfig(module) @@ -511,7 +525,7 @@ def convert( def _convert( module, mapping=None, inplace=False, - convert_custom_config_dict=None): + is_reference=False, convert_custom_config_dict=None): r"""Converts submodules in input module to a different module according to `mapping` by calling `from_float` method on the target module class @@ -522,10 +536,12 @@ def _convert( Modules inplace: carry out model transformations in-place, the original module is mutated + is_reference: a flag to enable quantized reference module """ if mapping is None: - mapping = get_default_static_quant_module_mappings() + mapping = get_default_static_quant_reference_module_mappings() if is_reference \ + else get_default_static_quant_module_mappings() if convert_custom_config_dict is None: convert_custom_config_dict = {} custom_module_class_mapping = convert_custom_config_dict.get("observed_to_quantized_custom_module_class", {}) @@ -537,9 +553,9 @@ def _convert( # both fused modules and observed custom modules are # swapped as one unit if not isinstance(mod, _FusedModule) and \ - type(mod) not in custom_module_class_mapping: + type_before_parametrizations(mod) not in custom_module_class_mapping: _convert(mod, mapping, True, # inplace - convert_custom_config_dict) + is_reference, convert_custom_config_dict) reassign[name] = swap_module(mod, mapping, custom_module_class_mapping) for key, value in reassign.items(): @@ -561,11 +577,11 @@ def swap_module(mod, mapping, custom_module_class_mapping): new_mod = mod if hasattr(mod, 'qconfig') and mod.qconfig is not None: swapped = False - if type(mod) in custom_module_class_mapping: - new_mod = custom_module_class_mapping[type(mod)].from_observed(mod) + if type_before_parametrizations(mod) in custom_module_class_mapping: + new_mod = custom_module_class_mapping[type_before_parametrizations(mod)].from_observed(mod) swapped = True - elif type(mod) in mapping: - qmod = mapping[type(mod)] + elif type_before_parametrizations(mod) in mapping: + qmod = mapping[type_before_parametrizations(mod)] if hasattr(qmod, '_IS_REFERENCE') and qmod._IS_REFERENCE: assert mod.qconfig is not None weight_post_process = mod.qconfig.weight() diff --git a/torch/ao/quantization/quantize_fx.py b/torch/ao/quantization/quantize_fx.py index 1eb71c1ca20d04..c5929304c5a1b2 100644 --- a/torch/ao/quantization/quantize_fx.py +++ b/torch/ao/quantization/quantize_fx.py @@ -6,7 +6,8 @@ from torch.fx.node import Target, Node, Argument from torch.nn.intrinsic import _FusedModule from .fx import fuse # noqa: F401 -from .fx import prepare, convert # noqa: F401 +from .fx import prepare # noqa: F401 +from .fx.convert import convert from .fx import get_tensorrt_backend_config_dict # noqa: F401 from .fx.graph_module import ObservedGraphModule from .fx.qconfig_utils import ( @@ -309,10 +310,6 @@ def fuse_fx( * `fuse_custom_config_dict`: Dictionary for custom configurations for fuse_fx, e.g.:: fuse_custom_config_dict = { - "additional_fuser_method_mapping": { - (Module1, Module2): fuse_module1_module2 - } - # Attributes that are not used in forward function will # be removed when constructing GraphModule, this is a list of attributes # to preserve as an attribute of the GraphModule even when they are @@ -328,7 +325,6 @@ def fuse_fx( """ 
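A minimal usage sketch for ``fuse_fx`` on a toy model (illustrative only):

.. code-block:: python

    import torch
    from torch.ao.quantization.quantize_fx import fuse_fx

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 3, 1),
        torch.nn.BatchNorm2d(3),
        torch.nn.ReLU(),
    ).eval()

    # Conv2d + BatchNorm2d + ReLU should appear as a single fused module
    # in the returned GraphModule.
    fused = fuse_fx(model)
    print(fused.graph)

The model is kept in eval mode so that the batch-norm statistics being folded into the convolution are frozen before fusion.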
torch._C._log_api_usage_once("quantization_api.quantize_fx.fuse_fx") - assert not model.training, "fuse_fx only works on models in eval mode" check_is_valid_fuse_custom_config_dict(fuse_custom_config_dict) graph_module = torch.fx.symbolic_trace(model) preserved_attributes: Set[str] = set() @@ -439,27 +435,6 @@ def prepare_fx( NonTraceableModule ], - # Additional fuser_method mapping - "additional_fuser_method_mapping": { - (torch.nn.Conv2d, torch.nn.BatchNorm2d): fuse_conv_bn - }, - - # Additioanl module mapping for qat - "additional_qat_module_mapping": { - torch.nn.intrinsic.ConvBn2d: torch.nn.qat.ConvBn2d - }, - - # Additional fusion patterns - "additional_fusion_pattern": { - (torch.nn.BatchNorm2d, torch.nn.Conv2d): ConvReluFusionhandler - }, - - # Additional quantization patterns - "additional_quant_pattern": { - torch.nn.Conv2d: ConvReluQuantizeHandler, - (torch.nn.ReLU, torch.nn.Conv2d): ConvReluQuantizeHandler, - } - # By default, inputs and outputs of the graph are assumed to be in # fp32. Providing `input_quantized_idxs` will set the inputs with the # corresponding indices to be quantized. Providing @@ -511,7 +486,6 @@ def calibrate(model, data_loader): """ torch._C._log_api_usage_once("quantization_api.quantize_fx.prepare_fx") - assert not model.training, "prepare_fx only works for models in " + "eval mode" return _prepare_fx( model, qconfig_dict, @@ -560,7 +534,6 @@ def train_loop(model, train_data): """ torch._C._log_api_usage_once("quantization_api.quantize_fx.prepare_qat_fx") - assert model.training, "prepare_qat_fx only works for models in " + "train mode" return _prepare_fx( model, qconfig_dict, @@ -577,6 +550,7 @@ def _convert_fx( is_standalone_module: bool = False, _remove_qconfig: bool = True, qconfig_dict: Dict[str, Any] = None, + backend_config_dict: Dict[str, Any] = None, ) -> torch.nn.Module: """ `is_standalone_module`: see docs in :func:`~torch.ao.quantization.prepare_standalone_module_fx` """ @@ -593,6 +567,7 @@ def _convert_fx( is_standalone_module, _remove_qconfig_flag=_remove_qconfig, convert_qconfig_dict=qconfig_dict, + backend_config_dict=backend_config_dict, ) preserved_attributes = convert_custom_config_dict.get("preserved_attributes", []) @@ -607,6 +582,7 @@ def convert_fx( convert_custom_config_dict: Optional[Dict[str, Any]] = None, _remove_qconfig: bool = True, qconfig_dict: Dict[str, Any] = None, + backend_config_dict: Dict[str, Any] = None, ) -> torch.nn.Module: r""" Convert a calibrated or trained model to a quantized model @@ -618,20 +594,6 @@ def convert_fx( * `convert_custom_config_dict`: dictionary for custom configurations for convert function:: convert_custom_config_dict = { - - # additional object (module/operator) mappings that will overwrite the default - # module mappinng - "additional_object_mapping": { - "static": { - FloatModule: QuantizedModule, - float_op: quantized_op - }, - "dynamic": { - FloatModule: DynamicallyQuantizedModule, - float_op: dynamically_quantized_op - }, - }, - # user will manually define the corresponding quantized # module class which has a from_observed class method that converts # observed custom module to quantized custom module @@ -677,6 +639,11 @@ def convert_fx( ], } + * `backend_config_dict`: A configuration for the backend which describes how + operators should be quantized in the backend, this includes quantization + mode support (static/dynamic/weight_only), dtype support (quint8/qint8 etc.), + observer placement for each operators and fused operators. 
Detailed + documentation can be found in torch/ao/quantization/fx/backend_config/README.md Return: A quantized model (GraphModule) @@ -694,6 +661,7 @@ def convert_fx( convert_custom_config_dict, _remove_qconfig=_remove_qconfig, qconfig_dict=qconfig_dict, + backend_config_dict=backend_config_dict, ) diff --git a/torch/ao/quantization/utils.py b/torch/ao/quantization/utils.py index 0533119703bcb1..f42b5c1ce723f0 100644 --- a/torch/ao/quantization/utils.py +++ b/torch/ao/quantization/utils.py @@ -6,6 +6,7 @@ import torch from torch.ao.quantization.quant_type import QuantType, quant_type_to_str from typing import Tuple, Any, Union, Callable +from torch.nn.utils.parametrize import is_parametrized # Type for fusion patterns, it can be more complicated than the following actually, # see pattern.md for docs @@ -184,6 +185,16 @@ def activation_is_statically_quantized(qconfig): """ return activation_dtype(qconfig) in [torch.quint8, torch.qint8, torch.float16] +def activation_is_dynamically_quantized(qconfig): + """ Given a qconfig, decide if the activation needs to be + dynamically quantized or not, this includes dynamically quantizing to + quint8, qint8 and float16 + """ + activation_dtype, _, activation_compute_dtype = \ + get_qconfig_dtypes(qconfig) + return activation_dtype == torch.float and \ + activation_compute_dtype in [torch.quint8, torch.qint8, torch.float16] + def activation_is_int8_quantized(qconfig): """ Given a qconfig, decide if the activation needs to be quantized to int8 or not, this includes quantizing to quint8, qint8 @@ -200,7 +211,7 @@ def weight_is_quantized(qconfig): """ Given a qconfig, decide if the weight needs to be quantized or not """ - return weight_dtype(qconfig) in [torch.quint8, torch.qint8, torch.float16] + return weight_dtype(qconfig) in [torch.quint8, torch.qint8, torch.float16, torch.quint4x2] def weight_is_statically_quantized(qconfig): """ Given a qconfig, decide if the weight needs to be statically @@ -235,7 +246,7 @@ def get_quant_type(qconfig): assert qconfig is not None activation = qconfig.activation() weight = qconfig.weight() - static_dtypes = [torch.quint8, torch.qint8] + static_dtypes = [torch.quint8, torch.qint8, torch.quint4x2] if weight.dtype in static_dtypes: if activation.dtype in static_dtypes: return QuantType.STATIC @@ -289,6 +300,7 @@ def calculate_qmin_qmax(quant_min: int, quant_max: int, has_customized_qrange: b r"""Calculates actual qmin and qmax based on the quantization range, observer datatype and if range is reduced. """ + # TODO(jerryzh): Figure out why custom quant_min/quant_max are still adjusted. if has_customized_qrange: # This initialization here is to be resolve TorchScript compilation issues and allow # using of refinement to decouple initial_qmin and initial_qmax from quantization range. @@ -315,10 +327,6 @@ def calculate_qmin_qmax(quant_min: int, quant_max: int, has_customized_qrange: b assert ( 0 < qrange_len <= 2**31 ), "quantization range should be positive and not exceed the maximum bit range (=4294967296)." 
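A small check of the new ``activation_is_dynamically_quantized`` helper added above; the expected output assumes the default dynamic qconfig keeps activations in fp32 with a quint8 compute dtype, as shown earlier in this diff:

.. code-block:: python

    from torch.ao.quantization import default_dynamic_qconfig
    from torch.ao.quantization.utils import activation_is_dynamically_quantized

    # fp32 activation dtype + quint8 compute dtype -> dynamically quantized
    print(activation_is_dynamically_quantized(default_dynamic_qconfig))  # expected: True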
- if dtype == torch.qint8: - quant_min, quant_max = -qrange_len // 2, qrange_len // 2 - 1 - else: - quant_min, quant_max = 0, qrange_len - 1 if reduce_range: quant_min, quant_max = quant_min // 2, quant_max // 2 else: @@ -349,3 +357,16 @@ def _parent_name(target): return '', r[0] else: return r[0], r[1] + +def has_no_children_ignoring_parametrizations(module): + """ + Checks if module._modules is empty or + if module is a parametrization, checks that module._modules only has + the 'parametrizations' module + """ + if len(module._modules) == 0: + return True + elif is_parametrized(module): + return len(module._modules) == 1 and 'parametrizations' in module._modules + else: + return False diff --git a/torch/autograd/__init__.py b/torch/autograd/__init__.py index 28eb729ffcbae0..7c1188da10b47a 100644 --- a/torch/autograd/__init__.py +++ b/torch/autograd/__init__.py @@ -309,7 +309,7 @@ def variable(*args, **kwargs): _supported_activities, _add_metadata_json, SavedTensor, _push_saved_tensors_default_hooks, _pop_saved_tensors_default_hooks) -from torch._C._autograd import (_ProfilerResult, _KinetoEvent, +from torch._C._autograd import (_ProfilerResult, _KinetoEvent, _kineto_step, _prepare_profiler, _enable_profiler, _disable_profiler) from . import profiler diff --git a/torch/autograd/functional.py b/torch/autograd/functional.py index 6fe0b5ee09f354..d94407e30833c1 100644 --- a/torch/autograd/functional.py +++ b/torch/autograd/functional.py @@ -416,11 +416,12 @@ def _construct_standard_basis_for(tensors: Tuple[torch.Tensor, ...], tensor_nume assert len(tensors) == len(tensor_numels) assert len(tensors) > 0 total_numel = sum(tensor_numels) - diag_start_indices = (0, *torch.tensor(tensor_numels).cumsum(dim=0)[:-1].neg().unbind()) chunks = tuple(tensor.new_zeros(total_numel, tensor_numel) for tensor, tensor_numel in zip(tensors, tensor_numels)) - for chunk, diag_start_idx in zip(chunks, diag_start_indices): + diag_start_idx = 0 + for chunk, numel in zip(chunks, tensor_numels): chunk.diagonal(diag_start_idx).fill_(1) + diag_start_idx -= numel return chunks diff --git a/torch/autograd/grad_mode.py b/torch/autograd/grad_mode.py index c57a16f80d76be..331327e26737a7 100644 --- a/torch/autograd/grad_mode.py +++ b/torch/autograd/grad_mode.py @@ -111,7 +111,7 @@ class no_grad(_DecoratorContextManager): Example:: - >>> x = torch.tensor([1], requires_grad=True) + >>> x = torch.tensor([1.], requires_grad=True) >>> with torch.no_grad(): ... y = x * 2 >>> y.requires_grad @@ -206,7 +206,7 @@ class set_grad_enabled(_DecoratorContextManager): Example:: - >>> x = torch.tensor([1], requires_grad=True) + >>> x = torch.tensor([1.], requires_grad=True) >>> is_train = False >>> with torch.set_grad_enabled(is_train): ... y = x * 2 diff --git a/torch/autograd/gradcheck.py b/torch/autograd/gradcheck.py index fd6e7651999362..0ec2c2d1ef9066 100644 --- a/torch/autograd/gradcheck.py +++ b/torch/autograd/gradcheck.py @@ -504,7 +504,7 @@ def _stack_and_check_tensors(list_of_list_of_tensors, inputs, If the test - manually invokes gradcheck/gradgradcheck, then call gradcheck/gradgradcheck with `nondet_tol=` as a keyword argument. -- is OpInfo-based (e.g., in test_ops.py), then modify the OpInfo for the test +- is OpInfo-based (e.g., in test_ops_gradients.py), then modify the OpInfo for the test to have `gradcheck_nondet_tol=`. 
- is a Module test (e.g., in common_nn.py), then modify the corresponding module_test entry to have `gradcheck_nondet_tol=` @@ -717,7 +717,7 @@ def _check_no_differentiable_outputs_fast(func, func_out, all_inputs, inputs_ind If the test - manually invokes gradcheck/gradgradcheck, then call gradcheck/gradgradcheck with `check_batched_grad=False` as a keyword argument. -- is OpInfo-based (e.g., in test_ops.py), then modify the OpInfo for the test +- is OpInfo-based (e.g., in test_ops_gradients.py), then modify the OpInfo for the test to have `check_batched_grad=False` and/or `check_batched_gradgrad=False`. If you're modifying an existing operator that supports batched grad computation, @@ -743,7 +743,7 @@ def _check_no_differentiable_outputs_fast(func, func_out, all_inputs, inputs_ind If the test - manually invokes gradcheck/gradgradcheck, then call gradcheck/gradgradcheck with `check_batched_forward_grad=False` as a keyword argument. -- is OpInfo-based (e.g., in test_ops.py), then modify the OpInfo for the test +- is OpInfo-based (e.g., in test_ops_gradients.py), then modify the OpInfo for the test to have `check_batched_forward_grad=False` """ @@ -1196,7 +1196,7 @@ def _adjusted_atol(atol, u, v): If the test - manually invokes gradcheck/gradgradcheck, then call gradcheck/gradgradcheck with `fast_mode=False` as a keyword argument. -- is OpInfo-based (e.g., in test_ops.py), then modify the OpInfo for the test +- is OpInfo-based (e.g., in test_ops_gradients.py), then modify the OpInfo for the test to have `gradcheck_fast_mode=False` - is a Module test (e.g., in common_nn.py), then modify the corresponding module_test entry to have `gradcheck_fast_mode=False` diff --git a/torch/autograd/profiler.py b/torch/autograd/profiler.py index 91c8d40c0cd1c1..af410570d9071c 100644 --- a/torch/autograd/profiler.py +++ b/torch/autograd/profiler.py @@ -6,7 +6,7 @@ from torch.autograd import ( DeviceType, ProfilerActivity, ProfilerConfig, ProfilerState, kineto_available, _ProfilerResult, _disable_profiler, _enable_profiler, - _prepare_profiler, _supported_activities + _prepare_profiler, _supported_activities, _kineto_step, ) import torch import torch.cuda @@ -428,17 +428,20 @@ def __init__(self, name: str, args: Optional[str] = None): self.args: Optional[str] = args # Whether or not we should run record function's end callbacks when exiting. self.run_callbacks_on_exit: bool = True - # Stores underlying RecordFunction as a tensor. TODO: move to custom - # class (https://github.com/pytorch/pytorch/issues/35026). 
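For the ``_construct_standard_basis_for`` rewrite above, a standalone sketch showing how filling successive negative diagonals reproduces an identity matrix laid out across per-tensor chunks (sizes chosen arbitrarily):

.. code-block:: python

    import torch

    tensor_numels = (2, 3)
    total_numel = sum(tensor_numels)
    chunks = tuple(torch.zeros(total_numel, numel) for numel in tensor_numels)

    diag_start_idx = 0
    for chunk, numel in zip(chunks, tensor_numels):
        chunk.diagonal(diag_start_idx).fill_(1)
        diag_start_idx -= numel

    # Concatenated along dim=1, the chunks form the 5x5 identity matrix.
    assert torch.equal(torch.cat(chunks, dim=1), torch.eye(total_numel))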
- self.handle: torch.Tensor = torch.zeros(1) + # TODO: TorchScript ignores standard type annotation here + # self.record: Optional["torch.classes.profiler._RecordFunction"] = None + self.record = torch.jit.annotate(Optional["torch.classes.profiler._RecordFunction"], None) def __enter__(self): - self.handle = torch.ops.profiler._record_function_enter(self.name, self.args) + self.record = torch.ops.profiler._record_function_enter_new(self.name, self.args) return self def __exit__(self, exc_type: Any, exc_value: Any, traceback: Any): if self.run_callbacks_on_exit: - torch.ops.profiler._record_function_exit(self.handle) + # Local variable is needed by TorchScript to refine Optional[T] to T + record = self.record + assert record is not None + torch.ops.profiler._record_function_exit(record) def _call_end_callbacks_on_future(self, fut: Future[Any]) -> Future[Any]: """ @@ -465,7 +468,11 @@ def _call_end_callbacks_on_future(self, fut: Future[Any]) -> Future[Any]: # We are scheduling to run this RecordFunction's end callbacks when the # passed in future completes, so don't run end callbacks on exit. self.run_callbacks_on_exit = False - profiled_future = torch.ops.profiler._call_end_callbacks_on_jit_fut(self.handle, fut) + + # Local variable is needed by TorchScript to refine Optional[T] to T + record = self.record + assert record is not None + profiled_future = torch.ops.profiler._call_end_callbacks_on_jit_fut(record, fut) return profiled_future @@ -664,3 +671,10 @@ def parse_nvprof_trace(path): functions.sort(key=lambda evt: evt.time_range.start) return functions + + +def kineto_step(): + """ Notify kineto so it is aware of iteration boundaries for asynchronous + trace requests. + """ + _kineto_step() diff --git a/torch/autograd/profiler_util.py b/torch/autograd/profiler_util.py index 6062c097b25319..dc505fbc210aac 100644 --- a/torch/autograd/profiler_util.py +++ b/torch/autograd/profiler_util.py @@ -642,6 +642,7 @@ def _filter_name(name): filtered_out_names = [ MEMORY_EVENT_NAME, # used only for the top-level memory events "profiler::_record_function_enter", + "profiler::_record_function_enter_new", "profiler::_record_function_exit", "aten::is_leaf", "aten::output_nr", diff --git a/torch/backends/_coreml/preprocess.py b/torch/backends/_coreml/preprocess.py index 7f27e60e5acb44..3884058cd0ecf0 100644 --- a/torch/backends/_coreml/preprocess.py +++ b/torch/backends/_coreml/preprocess.py @@ -1,7 +1,6 @@ import hashlib import json -from dataclasses import dataclass, astuple, field -from typing import Dict, Tuple, List +from typing import Dict, Tuple import coremltools as ct # type: ignore[import] import torch @@ -35,86 +34,56 @@ class CoreMLComputeUnit: ALL = "all" -@dataclass -class _TensorSpec: - shape: List[int] = field(default_factory=List[int]) - dtype: int = ScalarType.Float - - -def TensorSpec(*args, **kwargs): - """ - TensorSpec specifies the tensor information. The default dtype is float32 - Example: - ts = TensorSpec( - shape = [1, 3, 224, 224], - dtype = ScalarType.Float - ) - """ - return astuple(_TensorSpec(*args, **kwargs)) - - -@dataclass -class _CompileSpec: - inputs: Tuple[_TensorSpec] = () # type: ignore[assignment] - outputs: Tuple[_TensorSpec] = () # type: ignore[assignment] - backend: str = CoreMLComputeUnit.CPU - allow_low_precision: bool = True - - -def CompileSpec(*args, **kwargs): - """ - CompileSpec specifies the model information. 
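The ``record_function`` changes above are internal; user-facing usage stays the same. A short, hedged profiling sketch:

.. code-block:: python

    import torch
    from torch.profiler import profile, record_function

    with profile() as prof:
        with record_function("my_matmul"):
            torch.mm(torch.randn(64, 64), torch.randn(64, 64))

    # The labelled region shows up alongside the aten ops it wraps.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))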
- Example: - cs = CompileSpec( - inputs=( - TensorSpec( - shape=[1, 3, 224, 224], - ), - ), - outputs=( - TensorSpec( - shape=[1, 1000], - ), - ), - backend=CoreMLComputeUnit.CPU, - allow_low_precision=True, - ), - """ - return astuple(_CompileSpec(*args, **kwargs)) - - -def _convert_to_mil_type(spec: _TensorSpec, name: str): - ml_type = TensorType(shape=spec.shape, dtype=torch_to_mil_types[spec.dtype]) +def TensorSpec(shape, dtype=ScalarType.Float): + return (shape, dtype) + + +def CompileSpec(inputs, outputs, backend=CoreMLComputeUnit.CPU, allow_low_precision=True): + return (inputs, outputs, backend, allow_low_precision) + + +def _check_enumerated_shape(shape): + for s in shape: + if not isinstance(s, (list, tuple)): + return False + return True + + +def _convert_to_mil_type(shape, dtype, name: str): + mil_shape = shape + if _check_enumerated_shape(shape): + mil_shape = ct.EnumeratedShapes(shape) + ml_type = TensorType(shape=mil_shape, dtype=torch_to_mil_types[dtype]) ml_type.name = name return ml_type def preprocess(script_module: torch._C.ScriptObject, compile_spec: Dict[str, Tuple]): spec = compile_spec["forward"] - forward_spec = _CompileSpec(*spec) + input_specs, output_specs, backend, allow_low_precision = spec mil_inputs = [] inputs = [] - for index, input_spec in enumerate(forward_spec.inputs): - input_spec = _TensorSpec(*input_spec) # type: ignore[misc] + for index, input in enumerate(input_specs): + shape, dtype = input name = "input_" + str(index) - inputs.append([name, str(input_spec.dtype), str(input_spec.shape)]) - ml_type = _convert_to_mil_type(input_spec, name) + inputs.append([name, str(dtype), str(shape)]) + ml_type = _convert_to_mil_type(shape, dtype, name) mil_inputs.append(ml_type) model = torch.jit.RecursiveScriptModule._construct(script_module, lambda x: None) mlmodel = ct.convert(model, inputs=mil_inputs) spec = mlmodel.get_spec() - output_specs = forward_spec.outputs assert len(spec.description.output) == len(output_specs) # type: ignore[attr-defined] outputs = [] - for index, output_spec in enumerate(output_specs): - output_spec = _TensorSpec(*output_spec) # type: ignore[misc] + for index, output in enumerate(output_specs): + shape, dtype = output name = spec.description.output[index].name # type: ignore[attr-defined] - outputs.append([name, str(output_spec.dtype), str(output_spec.shape)]) + outputs.append([name, str(dtype), str(shape)]) mlmodel = ct.models.model.MLModel(spec) + print(mlmodel) config = { "spec_ver": str(spec.specificationVersion), # type: ignore[attr-defined] - "backend": forward_spec.backend, - "allow_low_precision": str(forward_spec.allow_low_precision), + "backend": backend, + "allow_low_precision": str(allow_low_precision), } metadata = { "coremltool_ver": mlmodel.user_defined_metadata[CT_METADATA_VERSION], diff --git a/torch/backends/_nnapi/serializer.py b/torch/backends/_nnapi/serializer.py index d29b5987295c74..4bbf9b5e85308a 100644 --- a/torch/backends/_nnapi/serializer.py +++ b/torch/backends/_nnapi/serializer.py @@ -1549,11 +1549,28 @@ def add_adaptive_avg_pool2d(self, node): self.add_operation(NNAPI_OperationCode.AVERAGE_POOL_2D, inputs, outputs) def add_upsample_nearest2d(self, node): - assert node.inputsSize() == 3 + assert node.inputsSize() == 3 or node.inputsSize() == 4 assert node.outputsSize() == 1 - image, size_jit, scale_jit = node.inputs() + if node.inputsSize() == 3: + image, size_jit, scale_jit = node.inputs() + else: + image, size_jit, scale_h_jit, scale_w_jit = node.inputs() size_ctype, size_arg = 
self.get_constant_value(size_jit) - scale_ctype, scale_arg = self.get_constant_value(scale_jit) + + if node.inputsSize() == 3: + scale_ctype, scale_arg = self.get_constant_value(scale_jit) + else: + scale_h_ctype, scale_h_arg = self.get_constant_value(scale_h_jit) + scale_w_ctype, scale_w_arg = self.get_constant_value(scale_w_jit) + + # The only way for the 4-argument overload of upsample_nearest2d to + # have been added to the graph without error is if the scale_h and + # scale_w arguments are None + assert scale_h_ctype.kind() == "NoneType" + assert scale_w_ctype.kind() == "NoneType" + + scale_ctype = scale_h_ctype + scale_arg = scale_h_arg image_id, image_oper = self.get_tensor_operand_by_jitval(image) assert len(image_oper.shape) == 4 diff --git a/torch/backends/quantized/__init__.py b/torch/backends/quantized/__init__.py index a24d88bcc6e6d4..6f7d479e90c4a4 100644 --- a/torch/backends/quantized/__init__.py +++ b/torch/backends/quantized/__init__.py @@ -11,6 +11,8 @@ def _get_qengine_id(qengine: str) -> int: ret = 1 elif qengine == 'qnnpack': ret = 2 + elif qengine == 'onednn': + ret = 3 else: ret = -1 raise RuntimeError("{} is not a valid value for quantized engine".format(qengine)) @@ -18,7 +20,7 @@ def _get_qengine_id(qengine: str) -> int: # This function should correspond to the enums present in c10/core/QEngine.h def _get_qengine_str(qengine: int) -> str: - all_engines = {0 : 'none', 1 : 'fbgemm', 2 : 'qnnpack'} + all_engines = {0 : 'none', 1 : 'fbgemm', 2 : 'qnnpack', 3 : 'onednn'} return all_engines.get(qengine, '*undefined') class _QEngineProp(object): diff --git a/torch/cpu/amp/autocast_mode.py b/torch/cpu/amp/autocast_mode.py index 49ffb5c11b4257..03cbcdcda0fc61 100644 --- a/torch/cpu/amp/autocast_mode.py +++ b/torch/cpu/amp/autocast_mode.py @@ -1,7 +1,7 @@ import torch from typing import Any -class autocast(torch.autocast_mode.autocast): +class autocast(torch.amp.autocast_mode.autocast): r""" See :class:`torch.autocast`. 
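A minimal usage sketch of the CPU autocast context this class provides (the equivalence with ``torch.autocast("cpu", ...)`` is stated just below):

```python
import torch

x = torch.randn(8, 8)
w = torch.randn(8, 8)

# torch.cpu.amp.autocast is a thin subclass of torch.amp.autocast_mode.autocast,
# so the two context managers below behave identically on CPU.
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y1 = x @ w

with torch.autocast("cpu", dtype=torch.bfloat16):
    y2 = x @ w

assert y1.dtype == y2.dtype == torch.bfloat16
```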
``torch.cpu.amp.autocast(args...)`` is equivalent to ``torch.autocast("cpu", args...)`` diff --git a/torch/csrc/api/include/torch/fft.h b/torch/csrc/api/include/torch/fft.h index 23ecbf1be0c697..71a3146c990f18 100644 --- a/torch/csrc/api/include/torch/fft.h +++ b/torch/csrc/api/include/torch/fft.h @@ -44,7 +44,7 @@ inline Tensor ifft(const Tensor& self, /// torch::fft::fft2(t); /// ``` inline Tensor fft2(const Tensor& self, - c10::optional s=c10::nullopt, + OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_fft2(self, s, dim, norm); @@ -59,7 +59,7 @@ inline Tensor fft2(const Tensor& self, /// torch::fft::ifft2(t); /// ``` inline Tensor ifft2(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_ifft2(self, s, dim, norm); @@ -74,8 +74,8 @@ inline Tensor ifft2(const Tensor& self, /// torch::fft::fftn(t); /// ``` inline Tensor fftn(const Tensor& self, - c10::optional s=c10::nullopt, - c10::optional dim=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, + at::OptionalIntArrayRef dim=c10::nullopt, c10::optional norm=c10::nullopt) { return torch::fft_fftn(self, s, dim, norm); } @@ -89,8 +89,8 @@ inline Tensor fftn(const Tensor& self, /// torch::fft::ifftn(t); /// ``` inline Tensor ifftn(const Tensor& self, - c10::optional s=c10::nullopt, - c10::optional dim=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, + at::OptionalIntArrayRef dim=c10::nullopt, c10::optional norm=c10::nullopt) { return torch::fft_ifftn(self, s, dim, norm); } @@ -138,7 +138,7 @@ inline Tensor irfft(const Tensor& self, /// torch::fft::rfft2(t); /// ``` inline Tensor rfft2(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_rfft2(self, s, dim, norm); @@ -153,7 +153,7 @@ inline Tensor rfft2(const Tensor& self, /// torch::fft::irfft2(t); /// ``` inline Tensor irfft2(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_irfft2(self, s, dim, norm); @@ -168,8 +168,8 @@ inline Tensor irfft2(const Tensor& self, /// torch::fft::rfftn(t); /// ``` inline Tensor rfftn(const Tensor& self, - c10::optional s=c10::nullopt, - c10::optional dim=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, + at::OptionalIntArrayRef dim=c10::nullopt, c10::optional norm=c10::nullopt) { return torch::fft_rfftn(self, s, dim, norm); } @@ -183,8 +183,8 @@ inline Tensor rfftn(const Tensor& self, /// torch::fft::irfftn(t); /// ``` inline Tensor irfftn(const Tensor& self, - c10::optional s=c10::nullopt, - c10::optional dim=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, + at::OptionalIntArrayRef dim=c10::nullopt, c10::optional norm=c10::nullopt) { return torch::fft_irfftn(self, s, dim, norm); } @@ -238,7 +238,7 @@ inline Tensor ihfft(const Tensor& self, /// assert(T.is_floating_point() && T.numel() == 128 * 128); /// ``` inline Tensor hfft2(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_hfft2(self, s, dim, norm); @@ -256,7 +256,7 @@ inline Tensor hfft2(const Tensor& self, /// assert(t.is_complex() && t.size(1) == 65); /// ``` inline Tensor ihfft2(const Tensor& self, - c10::optional s=c10::nullopt, + 
at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_ihfft2(self, s, dim, norm); @@ -274,7 +274,7 @@ inline Tensor ihfft2(const Tensor& self, /// assert(T.is_floating_point() && T.numel() == 128 * 128); /// ``` inline Tensor hfftn(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_hfftn(self, s, dim, norm); @@ -292,7 +292,7 @@ inline Tensor hfftn(const Tensor& self, /// assert(t.is_complex() && t.size(1) == 65); /// ``` inline Tensor ihfftn(const Tensor& self, - c10::optional s=c10::nullopt, + at::OptionalIntArrayRef s=c10::nullopt, IntArrayRef dim={-2, -1}, c10::optional norm=c10::nullopt) { return torch::fft_ihfftn(self, s, dim, norm); @@ -341,7 +341,7 @@ inline Tensor rfftfreq(int64_t n, const TensorOptions& options) { /// auto x = torch::randn({127, 4}); /// auto centred_fft = torch::fft::fftshift(torch::fft::fftn(x)); /// ``` -inline Tensor fftshift(const Tensor& x, c10::optional dim=c10::nullopt) { +inline Tensor fftshift(const Tensor& x, at::OptionalIntArrayRef dim=c10::nullopt) { return torch::fft_fftshift(x, dim); } @@ -356,7 +356,7 @@ inline Tensor fftshift(const Tensor& x, c10::optional dim=c10::null /// auto unshift = torch::fft::ifftshift(shift); /// assert(torch::allclose(x, unshift)); /// ``` -inline Tensor ifftshift(const Tensor& x, c10::optional dim=c10::nullopt) { +inline Tensor ifftshift(const Tensor& x, at::OptionalIntArrayRef dim=c10::nullopt) { return torch::fft_ifftshift(x, dim); } diff --git a/torch/csrc/api/include/torch/linalg.h b/torch/csrc/api/include/torch/linalg.h index e16c1f61e503b2..705e2e41b73d7a 100644 --- a/torch/csrc/api/include/torch/linalg.h +++ b/torch/csrc/api/include/torch/linalg.h @@ -84,27 +84,27 @@ inline Tensor matrix_exp(const Tensor& self) { return torch::linalg_matrix_exp(self); } -inline Tensor norm(const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor norm(const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return torch::linalg_norm(self, opt_ord, opt_dim, keepdim, opt_dtype); } -inline Tensor norm(const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor norm(const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return torch::linalg_norm(self, ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return torch::linalg_norm_out(result, self, opt_ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& norm_out(Tensor& result, const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& norm_out(Tensor& result, const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return torch::linalg_norm_out(result, self, ord, opt_dim, keepdim, opt_dtype); } -inline Tensor vector_norm(const Tensor& self, Scalar ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor vector_norm(const Tensor& self, Scalar ord, OptionalIntArrayRef opt_dim, bool keepdim, optional 
opt_dtype) { return torch::linalg_vector_norm(self, ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& vector_norm_out(Tensor& result, const Tensor& self, Scalar ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& vector_norm_out(Tensor& result, const Tensor& self, Scalar ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return torch::linalg_vector_norm_out(result, self, ord, opt_dim, keepdim, opt_dtype); } @@ -228,11 +228,11 @@ inline Tensor& tensorinv_out(Tensor& result,const Tensor& self, int64_t ind) { return torch::linalg_tensorinv_out(result, self, ind); } -inline Tensor tensorsolve(const Tensor& self, const Tensor& other, optional dims) { +inline Tensor tensorsolve(const Tensor& self, const Tensor& other, OptionalIntArrayRef dims) { return torch::linalg_tensorsolve(self, other, dims); } -inline Tensor& tensorsolve_out(Tensor& result, const Tensor& self, const Tensor& other, optional dims) { +inline Tensor& tensorsolve_out(Tensor& result, const Tensor& self, const Tensor& other, OptionalIntArrayRef dims) { return torch::linalg_tensorsolve_out(result, self, other, dims); } @@ -354,22 +354,22 @@ inline Tensor matrix_exp(const Tensor& input) { } // C10_DEPRECATED_MESSAGE("linalg_norm is deprecated, use norm instead.") -inline Tensor linalg_norm(const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor linalg_norm(const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm(self, opt_ord, opt_dim, keepdim, opt_dtype); } // C10_DEPRECATED_MESSAGE("linalg_norm is deprecated, use norm instead.") -inline Tensor linalg_norm(const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor linalg_norm(const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm(self, ord, opt_dim, keepdim, opt_dtype); } // C10_DEPRECATED_MESSAGE("linalg_norm_out is deprecated, use norm_out instead.") -inline Tensor& linalg_norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& linalg_norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm_out(result, self, opt_ord, opt_dim, keepdim, opt_dtype); } // C10_DEPRECATED_MESSAGE("linalg_norm_out is deprecated, use norm_out instead.") -inline Tensor& linalg_norm_out(Tensor& result, const Tensor& self, c10::string_view ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& linalg_norm_out(Tensor& result, const Tensor& self, c10::string_view ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm_out(result, self, ord, opt_dim, keepdim, opt_dtype); } @@ -384,28 +384,28 @@ inline std::tuple lu_factor_out(Tensor& LU, Tensor& pivots, co return detail::lu_factor_out(LU, pivots, self, pivot); } -inline Tensor norm(const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor norm(const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm(self, opt_ord, opt_dim, keepdim, opt_dtype); } -inline Tensor norm(const Tensor& self, std::string ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor norm(const Tensor& self, std::string 
ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm(self, ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& norm_out(Tensor& result, const Tensor& self, const optional& opt_ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm_out(result, self, opt_ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& norm_out(Tensor& result, const Tensor& self, std::string ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& norm_out(Tensor& result, const Tensor& self, std::string ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::norm_out(result, self, ord, opt_dim, keepdim, opt_dtype); } /// See https://pytorch.org/docs/master/linalg.html#torch.linalg.vector_norm -inline Tensor vector_norm(const Tensor& self, Scalar ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor vector_norm(const Tensor& self, Scalar ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::vector_norm(self, ord, opt_dim, keepdim, opt_dtype); } -inline Tensor& vector_norm_out(Tensor& result, const Tensor& self, Scalar ord, optional opt_dim, bool keepdim, optional opt_dtype) { +inline Tensor& vector_norm_out(Tensor& result, const Tensor& self, Scalar ord, OptionalIntArrayRef opt_dim, bool keepdim, optional opt_dtype) { return detail::vector_norm_out(result, self, ord, opt_dim, keepdim, opt_dtype); } @@ -574,11 +574,11 @@ inline Tensor& tensorinv_out(Tensor& result, const Tensor& self, int64_t ind) { /// auto b = torch::randn(2*3, 4); /// auto x = torch::linalg::tensorsolve(a, b); /// ``` -inline Tensor tensorsolve(const Tensor& input, const Tensor& other, optional dims) { +inline Tensor tensorsolve(const Tensor& input, const Tensor& other, OptionalIntArrayRef dims) { return detail::tensorsolve(input, other, dims); } -inline Tensor& tensorsolve_out(Tensor& result, const Tensor& input, const Tensor& other, optional dims) { +inline Tensor& tensorsolve_out(Tensor& result, const Tensor& input, const Tensor& other, OptionalIntArrayRef dims) { return detail::tensorsolve_out(result, input, other, dims); } diff --git a/torch/csrc/api/include/torch/nn/functional/padding.h b/torch/csrc/api/include/torch/nn/functional/padding.h index 611f407d9b7a77..1b2f77626cdbbd 100644 --- a/torch/csrc/api/include/torch/nn/functional/padding.h +++ b/torch/csrc/api/include/torch/nn/functional/padding.h @@ -1,83 +1,36 @@ #pragma once #include +#include namespace torch { namespace nn { namespace functional { -inline Tensor _narrow_with_range(const Tensor& input, int64_t dim, int64_t start, int64_t end) { - return input.narrow(dim, start, end - start); -} - -inline Tensor _pad_circular(Tensor input, IntArrayRef padding) { - int padding_size = padding.size(); - input = torch::cat({input, _narrow_with_range(input, 2, 0, padding[-1 + padding_size])}, /*dim=*/2); - input = torch::cat({_narrow_with_range(input, 2, -(padding[-1 + padding_size] + padding[-2 + padding_size]), -padding[-1 + padding_size]), input}, /*dim=*/2); - - if (padding_size > 2) { - input = torch::cat({input, _narrow_with_range(input, 3, 0, padding[-3 + padding_size])}, /*dim=*/3); - input = torch::cat({_narrow_with_range(input, 3, -(padding[-3 + padding_size] + padding[-4 + padding_size]), -padding[-3 + padding_size]), input}, /*dim=*/3); - } - - if (padding_size > 4) { - input = 
torch::cat({input, _narrow_with_range(input, 4, 0, padding[-5 + padding_size])}, /*dim=*/4); - input = torch::cat({_narrow_with_range(input, 4, -(padding[-5 + padding_size] + padding[-6 + padding_size]), -padding[-5 + padding_size]), input}, /*dim=*/4); - } - - return input; -} - #ifndef DOXYGEN_SHOULD_SKIP_THIS namespace detail { inline Tensor pad(const Tensor& input, IntArrayRef pad, PadFuncOptions::mode_t mode, double value) { - TORCH_CHECK(pad.size() % 2 == 0, "Padding length must be divisible by 2"); - TORCH_CHECK(((int64_t)(pad.size() / 2)) <= input.dim(), "Padding length too large"); - if (c10::get_if(&mode)) { - return torch::constant_pad_nd(input, pad, value); - } else { - TORCH_CHECK( - value == 0, - "Padding mode \"", - torch::enumtype::get_enum_name(mode), - "\" doesn't take in value argument"); - if (pad.size() == 2 && (input.dim() == 2 || input.dim() == 3)) { - if (c10::get_if(&mode)) { - return torch::reflection_pad1d(input, pad); - } else if (c10::get_if(&mode)) { - return torch::replication_pad1d(input, pad); - } else if (c10::get_if(&mode)) { - return _pad_circular(input, pad); - } else { - TORCH_CHECK(false, "NotImplementedError"); - } - } else if(pad.size() == 4 && (input.dim() == 3 || input.dim() == 4)) { - if (c10::get_if(&mode)) { - return torch::reflection_pad2d(input, pad); - } else if (c10::get_if(&mode)) { - return torch::replication_pad2d(input, pad); - } else if (c10::get_if(&mode)) { - return _pad_circular(input, pad); - } else { - TORCH_CHECK(false, "NotImplementedError"); - } - } else if (pad.size() == 6 && (input.dim() == 4 || input.dim() == 5)) { - if (c10::get_if(&mode)) { - return torch::reflection_pad3d(input, pad); - } else if (c10::get_if(&mode)) { - return torch::replication_pad3d(input, pad); - } else if (c10::get_if(&mode)) { - return _pad_circular(input, pad); - } else { - TORCH_CHECK(false, "NotImplementedError"); - } - } else { - TORCH_CHECK(false, "Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now"); + const auto mode_enum = [&] { + if (c10::get_if(&mode)) { + return at::padding_mode::constant; + } else if (c10::get_if(&mode)) { + return at::padding_mode::reflect; + } else if (c10::get_if(&mode)) { + return at::padding_mode::replicate; + } else if (c10::get_if(&mode)) { + return at::padding_mode::circular; } + TORCH_CHECK(false, "Unrecognised padding mode"); + }(); + + c10::optional fill_value; + if (value != 0.0) { + fill_value = value; } + return at::_pad_enum(input, pad, static_cast(mode_enum), fill_value); } } // namespace detail #endif /* DOXYGEN_SHOULD_SKIP_THIS */ diff --git a/torch/csrc/api/include/torch/special.h b/torch/csrc/api/include/torch/special.h index 6e0ecc0fbcadac..d667e094f99353 100644 --- a/torch/csrc/api/include/torch/special.h +++ b/torch/csrc/api/include/torch/special.h @@ -215,6 +215,15 @@ inline Tensor& logsumexp_out(Tensor& result, const Tensor& self, IntArrayRef dim return torch::special_logsumexp_out(result, self, dims, keepdim); } +/// Computes the argument, x, for which the area under the Gaussian probability density +/// function (integrated from minus infinity to x) is equal to input, elementwise. 
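The same operator is exposed in Python as ``torch.special.ndtri``; a quick sketch of its relationship with ``torch.special.ndtr`` (both are in the public Python API):

```python
import torch

p = torch.tensor([0.025, 0.5, 0.975], dtype=torch.float64)
x = torch.special.ndtri(p)   # standard-normal quantile for each probability
# ndtr integrates the standard-normal density from -inf up to x, recovering p
assert torch.allclose(torch.special.ndtr(x), p)
```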
+/// See https://pytorch.org/docs/master/special.html#torch.special.ndtri +/// +/// Example: +/// ``` +/// auto t = torch::rand(128, dtype=kDouble); +/// torch::special::ndtri(t); +/// ``` inline Tensor ndtri(const Tensor& self) { return torch::special_ndtri(self); } @@ -223,6 +232,23 @@ inline Tensor& ndtri_out(Tensor& result, const Tensor& self) { return torch::special_ndtri_out(result, self); } +/// Computes the log of area under the standard Gaussian probability density function, +/// integrated from minus infinity to :attr:`input`, elementwise +/// See https://pytorch.org/docs/master/special.html#torch.special.log_ndtr +/// +/// Example: +/// ``` +/// auto t = torch::randn(128, dtype=kDouble); +/// torch::special::log_ndtr(t); +/// ``` +inline Tensor log_ndtr(const Tensor& self) { + return torch::special_log_ndtr(self); +} + +inline Tensor& log_ndtr_out(Tensor& result, const Tensor& self) { + return torch::special_log_ndtr_out(result, self); +} + /// Computes the logit of input, elementwise. /// See https://pytorch.org/docs/master/special.html#torch.special.logit. /// diff --git a/torch/csrc/autograd/FunctionsManual.cpp b/torch/csrc/autograd/FunctionsManual.cpp index c91d82d9263586..162fe0e9fe61a4 100644 --- a/torch/csrc/autograd/FunctionsManual.cpp +++ b/torch/csrc/autograd/FunctionsManual.cpp @@ -232,7 +232,7 @@ Tensor norm_backward(Tensor grad, const Tensor& self, const optional & p return self_scaled * scale_v; } -Tensor linalg_vector_norm_backward(Tensor grad, const Tensor& self, const Scalar& scalar_ord, Tensor norm, const optional& opt_dim, bool keepdim) { +Tensor linalg_vector_norm_backward(Tensor grad, const Tensor& self, const Scalar& scalar_ord, Tensor norm, const at::OptionalIntArrayRef& opt_dim, bool keepdim) { auto dim = opt_dim.value_or(IntArrayRef({})); return norm_backward(grad, self, scalar_ord, norm, dim, keepdim); } @@ -717,6 +717,22 @@ std::tuple clamp_backward_min_max( return ret; } +at::Tensor clamp_jvp( + const Tensor& self_p, const Tensor& self_t, + const Tensor& min_p, const Tensor& min_t, + const Tensor& max_p, const Tensor& max_t +) { + if (min_p.defined() && max_p.defined()) { + return where(min_p > max_p, max_t, where(self_p < min_p, min_t, where(self_p > max_p, max_t, self_t))); + } else if (min_p.defined()) { + return where(self_p > min_p, self_t, min_t); + } else if (max_p.defined()) { + return where(self_p < max_p, self_t, max_t); + } else { + return self_t; + } +} + Tensor convolution_jvp( const Tensor& input_p, const Tensor& input_t, const Tensor& weight_p, const Tensor& weight_t, @@ -764,7 +780,7 @@ Tensor convolution_backward_jvp_grad_bias( } else { TORCH_INTERNAL_ASSERT( false, - "convolution_backward_jvp_grad_bias expected dim of grad_out_t to be 3, 4, or 4, but got: ", + "convolution_backward_jvp_grad_bias expected dim of grad_out_t to be 3, 4, or 5, but got: ", grad_out_t.dim()); } } @@ -1050,7 +1066,7 @@ static Tensor var_backward(const Tensor & grad, const Tensor & self, int64_t cor return (2.0 / (self.numel() - correction)) * grad * (self - self.mean()); } -Tensor var_backward(Tensor grad, const Tensor& self, c10::optional dim_opt, +Tensor var_backward(Tensor grad, const Tensor& self, at::OptionalIntArrayRef dim_opt, c10::optional correction_opt, bool keepdim) { auto correction = correction_opt.value_or(1); if (self.dim() == 0 || !dim_opt.has_value()) { @@ -1065,7 +1081,7 @@ Tensor var_backward(Tensor grad, const Tensor& self, c10::optional return (2.0 / dof) * grad * (self - self.mean(dim, /*keepdim=*/true)); } -Tensor var_jvp(const 
Tensor& self_t, const Tensor& self_p, const Tensor& result, c10::optional dim_opt, +Tensor var_jvp(const Tensor& self_t, const Tensor& self_p, const Tensor& result, at::OptionalIntArrayRef dim_opt, c10::optional correction_opt, bool keepdim) { auto correction = correction_opt.value_or(1); if (self_p.dim() == 0 || !dim_opt.has_value()) { @@ -1078,7 +1094,7 @@ Tensor var_jvp(const Tensor& self_t, const Tensor& self_p, const Tensor& result, Tensor std_backward( const Tensor& result, const Tensor& grad, const Tensor& self, - c10::optional dim, c10::optional correction, bool keepdim) { + at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim) { auto grad_var = (grad / (result * 2)).masked_fill_(result == 0, 0); return var_backward(grad_var, self, dim, correction, keepdim); } @@ -1093,7 +1109,7 @@ Tensor mean_backward(Tensor grad, const IntArrayRef sizes, int64_t numel) { static Tensor mean_backward( const Tensor& grad, const IntArrayRef sizes, int64_t numel, - c10::optional dim, bool keepdim) { + at::OptionalIntArrayRef dim, bool keepdim) { if (dim.has_value()) { return mean_backward(grad, sizes, *dim, keepdim); } else { @@ -1103,7 +1119,7 @@ static Tensor mean_backward( Tensor var_std_mean_backward( const variable_list& grads, const Tensor& self, const Tensor& r1, - const Tensor& r2, c10::optional dim, + const Tensor& r2, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim, bool is_std) { Tensor grad; if (grads[0].defined()) { @@ -1176,19 +1192,35 @@ Tensor cholesky_inverse_backward(Tensor grad, Tensor L, bool upper, Tensor inver at::NoTF32Guard disable_tf32; Tensor grad_L; if (grad.defined()) { - Tensor common_term = grad + grad.mT(); + Tensor common_term = grad + grad.mH(); common_term = at::matmul(inverse, at::matmul(common_term, inverse)); if (upper) { grad_L = -at::matmul(L, common_term); } else { grad_L = -at::matmul(common_term, L); } - } else { - grad_L = at::zeros({1}, L.options()).expand_as(L); } + return grad_L; } +// If X = (L L^H)^{-1} with L lower-triangular with a real positive diagonal, +// then dX = K^H + K, where +// K = L^{-H} dL^{-1} [dL^{-1} = -L^{-1} dL L^{-1}] +// = -L^{-H} L^{-1} dL L^{-1} [L^{-H} L^{-1} = X] +// = -X dL L^{-1} [X = X^H = L^{-H} L^{-1} = L^{-1} L^{-H}] +// = -X dL X L^{H}. +// If X = (U^H U)^{-1} with U upper-triangular with a real positive diagonal, +// then K becomes +// K = -X dU^H X U +Tensor cholesky_inverse_jvp(const Tensor& F, const Tensor& dF, const Tensor& X, bool upper) { + at::NoTF32Guard disable_tf32; + const auto CF = upper ? F : F.mH(); + const auto dCF = upper ? dF.mH() : dF; + const auto partial_dX = -X.matmul(dCF).matmul(X).matmul(CF); + return partial_dX + partial_dX.mH(); +} + // The formula for forward AD is adapted from // // Golub, Gene H., and Victor Pereyra. 
"The Differentiation of Pseudo-Inverses and Nonlinear @@ -5200,6 +5232,25 @@ Tensor lu_factor_ex_jvp( } } +Tensor logsumexp_jvp(const Tensor& self_p, const Tensor& self_t, IntArrayRef dim, bool keepdim) { + // NB: for simplicitly, we recompute some values that can be reused from forward + auto self_p_exp = (self_p - at::amax(self_p, dim, true)).exp(); // Use the exp-normalize trick + auto sumexp_p = self_p_exp.sum(dim, keepdim); + + // NB: it's OK for logsumexp_jvp to be reused for formulas like softmax/log_softmax + // that only have one differentiable input, because that means self_t are never zerotensors + TORCH_INTERNAL_ASSERT(!self_t._is_zerotensor()) + if (areAnyTensorSubclassLike({self_p, self_t})) { + auto result = (self_p_exp * self_t).sum(dim, keepdim); + result /= sumexp_p; + return result; + } else { + self_p_exp *= self_t; + auto sumexp_t = self_p_exp.sum(dim, keepdim); + return sumexp_t /= sumexp_p; + } +} + Tensor warn_backwards(const Tensor &grad_output) { TORCH_WARN("Warn from backward"); return grad_output; @@ -5224,41 +5275,53 @@ std::tuple _cudnn_convolution_backward( return result; } -Tensor scatter_reduce_backward(const Tensor & grad, - const Tensor& input, - int dim, - const Tensor & index, - c10::string_view reduce, - const Tensor & result){ - Tensor grad_input; - +std::tuple scatter_reduce_backward( + const Tensor& grad, + const Tensor& self, + int dim, + const Tensor& index, + const Tensor& src, + c10::string_view reduce, + bool include_self, + const Tensor& result) { + Tensor grad_self, grad_src; + + // FIXME: complex gradients not handled correctly + // For now this is ok as scatter_reduce isn't added to the whitelist + // in tools/autograd/gen_variable_type.py - // TODO: gather doesn't support broadcasting of input and index - // currently this works because scatter_reduce doesn't support broadcasting yet but - // this needs to be fixed when scatter_reduce is upgraded to support broadcasting - // by broadcasting index here too. + if (!grad.defined()) { + return std::make_tuple(grad_self, grad_src); + } if (reduce == "sum") { - grad_input = grad.gather(dim, index); + grad_self = grad; + grad_src = grad.gather(dim, index); } else if (reduce == "prod") { - grad_input = (grad * result).gather(dim, index) / input; - // handle nans in above computation when input = 0, we know result = 0 (0 / 0 -> nan) - // so just replace with 0 - grad_input.masked_fill_(input == 0, 0); + grad_self = (grad * result) / self; + grad_self.masked_fill_(self == 0, 0); + grad_src = (grad * result).gather(dim, index) / src; + grad_src.masked_fill_(src == 0, 0); } else if (reduce == "mean") { - Tensor N = zeros_like(grad); - N.scatter_add_(dim, index, ones_like(input)); - Tensor N_input = N.gather(dim, index); - grad_input = grad.gather(dim, index) / N_input; - grad_input.masked_fill_(N_input == 0, 0); + Tensor N = include_self ? 
ones_like(grad) : zeros_like(grad); + N = N.scatter_add(dim, index, ones_like(src)); + N.masked_fill_(N == 0, 1); + grad_self = grad / N; + Tensor N_src = N.gather(dim, index); + grad_src = grad.gather(dim, index) / N_src; } else if (reduce == "amax" || reduce == "amin") { + grad_self = (self == result) * grad; Tensor value = result.gather(dim, index); - grad_input = (input == value) * grad.gather(dim, index); + grad_src = (src == value) * grad.gather(dim, index); } else { AT_ERROR("Expected 'reduce' to be one of 'sum', 'prod', 'mean', 'amax', 'amin' but got ", reduce, "."); } - return grad_input; + if (!include_self) { + grad_self = grad_self.scatter(dim, index, 0); + } + + return std::make_tuple(grad_self, grad_src); } diff --git a/torch/csrc/autograd/FunctionsManual.h b/torch/csrc/autograd/FunctionsManual.h index 9451f5f49d20a4..c9c245b3cd1c69 100644 --- a/torch/csrc/autograd/FunctionsManual.h +++ b/torch/csrc/autograd/FunctionsManual.h @@ -49,7 +49,7 @@ Tensor restore_reduced_dims(const Tensor &output, IntArrayRef dims, bool keepdim Tensor scale_grad_by_count(const Tensor &grad, const Tensor &mask, IntArrayRef dims); at::Tensor norm_backward(const at::Tensor & grad, const at::Tensor & self, const optional & p_, const at::Tensor & norm); at::Tensor norm_backward(at::Tensor grad, const at::Tensor & self, const optional & p_, at::Tensor norm, at::IntArrayRef dim, bool keepdim); -at::Tensor linalg_vector_norm_backward(at::Tensor grad, const at::Tensor & self, const at::Scalar & ord, at::Tensor norm, const c10::optional & opt_dim, bool keepdim); +at::Tensor linalg_vector_norm_backward(at::Tensor grad, const at::Tensor & self, const at::Scalar & ord, at::Tensor norm, const at::OptionalIntArrayRef & opt_dim, bool keepdim); at::Tensor pow_backward(at::Tensor grad, const at::Tensor & self, const at::Scalar & exponent_); at::Tensor pow_backward_self(at::Tensor grad, const at::Tensor & self, const at::Tensor & exponent); at::Tensor pow_backward_exponent(at::Tensor grad, const at::Tensor& self, const at::Tensor& exponent, at::Tensor result); @@ -77,6 +77,7 @@ at::Tensor solve_backward_self(const at::Tensor & grad, const at::Tensor & self, at::Tensor solve_backward_A(const at::Tensor & grad, const at::Tensor & self, const at::Tensor & A, const at::Tensor & solution); at::Tensor cumsum_backward(const at::Tensor & grad, int64_t dim); at::Tensor logsumexp_backward(at::Tensor grad, const at::Tensor & self, at::Tensor result, at::IntArrayRef dim, bool keepdim); +at::Tensor logsumexp_jvp(const at::Tensor& self_p, const at::Tensor& self_t, IntArrayRef dim, bool keepdim); at::Tensor logcumsumexp_backward(at::Tensor grad, const at::Tensor & self, at::Tensor result, int64_t dim); at::Tensor unbind_backward(const variable_list& grads, int64_t dim); at::Tensor unsqueeze_to(const at::Tensor & self, at::IntArrayRef sizes); @@ -85,6 +86,11 @@ std::vector cat_tensors_backward(const at::Tensor & grad, const std: at::Tensor clamp_backward(const at::Tensor & grad, const at::Tensor &self, const optional& min, const optional& max); at::Tensor clamp_backward(const at::Tensor & grad, const at::Tensor &self, const at::Tensor& min, const at::Tensor& max); std::tuple clamp_backward_min_max(const at::Tensor& grad, const at::Tensor& self, const at::Tensor& min, const at::Tensor& max, const std::array&); +at::Tensor clamp_jvp( + const Tensor& self_p, const Tensor& self_t, + const Tensor& min_p, const Tensor& min_t, + const Tensor& max_p, const Tensor& max_t +); at::IntArrayRef strides_or_error(const Tensor & input, 
c10::string_view const & input_name); at::Tensor mm_mat1_backward(const Tensor & grad, const Tensor & mat2, at::IntArrayRef mat1_sizes, at::IntArrayRef mat1_strides, const Scalar & alpha); at::Tensor mm_mat2_backward(const at::Tensor & grad, const at::Tensor & mat1, at::IntArrayRef sizes, at::IntArrayRef strides, const at::Scalar & alpha); @@ -97,16 +103,17 @@ at::Tensor infinitely_differentiable_native_dropout_backward(const at::Tensor& g at::Tensor native_dropout_double_backward(const at::Tensor& ggI, const at::Tensor& grad, const at::Tensor& mask, double scale); at::Tensor evenly_distribute_backward(at::Tensor grad, const at::Tensor & input, const at::Tensor & value); at::Tensor sgn_backward(Tensor result, Tensor grad, Tensor self); -at::Tensor var_backward(at::Tensor grad, const at::Tensor& self, c10::optional dim, c10::optional correction, bool keepdim); -at::Tensor var_jvp(const at::Tensor& self_t, const at::Tensor& self_p, const at::Tensor& result, c10::optional dim_opt, c10::optional correction_opt, bool keepdim); -at::Tensor std_backward(const at::Tensor& result, const at::Tensor& grad, const at::Tensor& self, c10::optional dim, c10::optional correction, bool keepdim); +at::Tensor var_backward(at::Tensor grad, const at::Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim); +at::Tensor var_jvp(const at::Tensor& self_t, const at::Tensor& self_p, const at::Tensor& result, at::OptionalIntArrayRef dim_opt, c10::optional correction_opt, bool keepdim); +at::Tensor std_backward(const at::Tensor& result, const at::Tensor& grad, const at::Tensor& self, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim); at::Tensor mean_backward(at::Tensor grad, const at::IntArrayRef sizes, at::IntArrayRef dim, bool keepdim); at::Tensor mean_backward(at::Tensor grad, const at::IntArrayRef sizes, int64_t numel); -at::Tensor var_std_mean_backward(const variable_list& grads, const at::Tensor& self, const at::Tensor& r1, const at::Tensor& r2, c10::optional dim, c10::optional correction, bool keepdim, bool is_std); +at::Tensor var_std_mean_backward(const variable_list& grads, const at::Tensor& self, const at::Tensor& r1, const at::Tensor& r2, at::OptionalIntArrayRef dim, c10::optional correction, bool keepdim, bool is_std); at::Tensor masked_scatter_backward(const at::Tensor & grad, const at::Tensor & mask, at::IntArrayRef sizes); at::Tensor cholesky_backward(at::Tensor grad, bool upper, at::Tensor L); at::Tensor cholesky_jvp(const at::Tensor& input_tangent, const at::Tensor& L, bool upper); at::Tensor cholesky_inverse_backward(at::Tensor grad, at::Tensor L, bool upper, at::Tensor inverse); +at::Tensor cholesky_inverse_jvp(const at::Tensor& F, const at::Tensor& dF, const at::Tensor& X, bool upper); Tensor pinv_jvp( const Tensor& A, const Tensor& pinvA, @@ -465,12 +472,14 @@ std::tuple _cudnn_convolution_backward( at::IntArrayRef output_padding, at::IntArrayRef stride, at::IntArrayRef dilation, bool transposed, int64_t groups, ::std::array output_mask); -Tensor scatter_reduce_backward( +std::tuple scatter_reduce_backward( const Tensor& grad, - const Tensor& input, + const Tensor& self, int dim, const Tensor& index, + const Tensor& src, c10::string_view reduce, + bool include_self, const Tensor& result ); diff --git a/torch/csrc/autograd/TraceTypeManual.cpp b/torch/csrc/autograd/TraceTypeManual.cpp index 031b50215d8caf..a96fa42abd172a 100644 --- a/torch/csrc/autograd/TraceTypeManual.cpp +++ b/torch/csrc/autograd/TraceTypeManual.cpp @@ -283,7 +283,9 @@ void 
general_trace_function( AT_ASSERT(iter->isObject()); tracer::addOutput(node, iter->toObject()); } else { - throw std::runtime_error("unsupported output type: " + type->str()); + throw std::runtime_error( + "unsupported output type: " + type->str() + + ", from operator: " + toString(op.operator_name())); } } } diff --git a/torch/csrc/autograd/function.h b/torch/csrc/autograd/function.h index cc5fa59e9ed6a2..e258cbf4b6588d 100644 --- a/torch/csrc/autograd/function.h +++ b/torch/csrc/autograd/function.h @@ -151,6 +151,9 @@ struct TORCH_API Node : std::enable_shared_from_this { // probably operate with names. at::NoNamesGuard no_names_guard; + // Keep track of backward pass for rocblas. + at::BackwardPassGuard in_backward; + bool pre_sampled = false; if (at::shouldRunRecordFunction(&pre_sampled)) { // Using RecordFunction to trigger observers in the backward pass diff --git a/torch/csrc/autograd/init.cpp b/torch/csrc/autograd/init.cpp index 8499fd90314978..36b7b185b596d8 100644 --- a/torch/csrc/autograd/init.cpp +++ b/torch/csrc/autograd/init.cpp @@ -9,7 +9,6 @@ #include #include #include -#include #include #include #include @@ -21,8 +20,10 @@ #include #include #include +#include #include #include +#include #include #include @@ -233,6 +234,7 @@ PyObject* THPAutograd_initExtension(PyObject* _unused, PyObject *unused) { m.def("_disable_profiler", disableProfiler); m.def("_prepare_profiler", prepareProfiler); m.def("_add_metadata_json", addMetadataJson); // Only if `USE_KINETO` is set + m.def("_kineto_step", profilerStep); // Only if `USE_KINETO` is set m.def("kineto_available", []() { return torch::profiler::kKinetoAvailable; }); // NOTICE: These record functions are not torch operators and may not show up @@ -241,7 +243,9 @@ PyObject* THPAutograd_initExtension(PyObject* _unused, PyObject *unused) { // Creates a new profiling scope using RecordFunction and invokes its starting // callbacks. m.def("_record_function_with_args_enter", [](const std::string& name, py::args args) { - auto rec = std::make_unique(at::RecordScope::USER_SCOPE); + using torch::autograd::profiler::PythonRecordFunction; + auto python_rec = c10::make_intrusive(at::RecordScope::USER_SCOPE); + auto *rec = &python_rec->record; if (rec->isActive()) { if (rec->needsInputs()) { auto iv_inputs = std::vector(); @@ -253,16 +257,19 @@ PyObject* THPAutograd_initExtension(PyObject* _unused, PyObject *unused) { rec->before(name); } } - return at::cpp_custom_type_hack::create(std::move(rec), at::TensorOptions()); + return torch::jit::toPyObject(std::move(python_rec)); }); // Ends the profiling scope created with record_function_with_param_enter. - m.def("_record_function_with_args_exit", [](const at::Tensor& handle) { - // We don't actually need to do anything with handle just need to persist the - // lifetime until now. - auto& rec = at::cpp_custom_type_hack::cast(handle); - rec.end(); - }); + m.def("_record_function_with_args_exit", + [](const py::object &obj) { + using torch::autograd::profiler::PythonRecordFunction; + auto python_record = torch::jit::toCustomClass(obj); + + // We don't actually need to do anything with handle just need to persist the + // lifetime until now. 
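These bindings are normally driven through the Python ``torch.autograd.profiler.record_function`` context manager rather than called directly; a minimal usage sketch (the scope name is arbitrary):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.autograd.profiler.record_function("my_block"):
        torch.mm(torch.randn(64, 64), torch.randn(64, 64))

# "my_block" appears as a user-defined scope wrapping the aten::mm call
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```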
+ python_record->record.end(); + }); m.def("_supported_activities", []() { std::set activities {ActivityType::CPU}; @@ -554,6 +561,31 @@ static PyObject * exit_python_mode(PyObject* _unused, PyObject* arg) { END_HANDLE_TH_ERRORS } +static PyObject * set_torch_function_mode(PyObject* _unused, PyObject* arg) { + HANDLE_TH_ERRORS + if (arg == Py_None) { + at::impl::PythonTorchFunctionTLS::set_mode(nullptr); + } else { + Py_INCREF(arg); + at::impl::PythonTorchFunctionTLS::set_mode(std::make_shared(arg, getPyInterpreter())); + } + Py_RETURN_NONE; + END_HANDLE_TH_ERRORS +} + +static PyObject * get_torch_function_mode(PyObject* _unused, PyObject* _unused2) { + HANDLE_TH_ERRORS + const auto& mode = at::impl::PythonTorchFunctionTLS::get_mode(); + if (!mode) { + Py_RETURN_NONE; + } else { + auto* r = mode->ptr(getPyInterpreter()); + Py_INCREF(r); + return r; + } + END_HANDLE_TH_ERRORS +} + // autograd methods on torch._C static PyMethodDef methods[] = { // NOLINT {"_set_grad_enabled", set_grad_enabled, METH_O, nullptr}, @@ -578,6 +610,8 @@ static PyMethodDef methods[] = { // NOLINT {"_exit_dual_level", castPyCFunctionWithKeywords(python_exit_dual_level), METH_VARARGS | METH_KEYWORDS, nullptr}, {"_enter_python_mode", enter_python_mode, METH_O, nullptr}, {"_exit_python_mode", exit_python_mode, METH_NOARGS, nullptr}, + {"_set_torch_function_mode", set_torch_function_mode, METH_O, nullptr}, + {"_get_torch_function_mode", get_torch_function_mode, METH_NOARGS, nullptr}, {nullptr, nullptr, 0, nullptr} }; diff --git a/torch/csrc/autograd/profiler_kineto.cpp b/torch/csrc/autograd/profiler_kineto.cpp index 1ce7d85887be08..58ebb4ea119686 100644 --- a/torch/csrc/autograd/profiler_kineto.cpp +++ b/torch/csrc/autograd/profiler_kineto.cpp @@ -4,11 +4,12 @@ #include #include #include +#include +#include -#include -#include -#include #include +#include +#include #include #include @@ -117,46 +118,8 @@ namespace { using torch::profiler::impl::ProfilerThreadLocalStateBase; using torch::profiler::impl::ActiveProfilerType; -// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) -struct OpEventData { - // POD members - int64_t start_us_; - int64_t end_us_; - uint64_t correlation_id_; - uint64_t start_thread_id_; - uint64_t end_thread_id_; - int64_t sequence_number_; - uint64_t forward_thread_id_; - uint8_t record_function_scope_; - bool is_async_; - int64_t debug_handle_; - torch::profiler::impl::kineto::DeviceAndResource kineto_info_; - - std::string name_; - - // report_input_shapes - std::vector> shapes_; - std::vector dtypes_; - - // with_stack - std::vector stack_; - - // with_modules - c10::optional> module_hierarchy_; - - // with_flops - std::unordered_map extra_args_; - - // reportBackendEventToActiveKinetoProfiler - c10::optional backend_; - - // ProfilerState::KINETO_GPU_FALLBACK - torch::profiler::impl::CUDAEventStub cuda_event_start_ = nullptr; - torch::profiler::impl::CUDAEventStub cuda_event_end_ = nullptr; -}; - struct MemoryEventData { - int64_t start_time; + torch::profiler::impl::approx_time_t start_time; void* ptr; int64_t alloc_size; int64_t total_allocated; @@ -174,11 +137,6 @@ static inline uint64_t getForwardThreadKey(uint64_t tid, uint64_t seqNr) { return (((tid) << 48) | ((seqNr) & (((uint64_t)1 << 48) - 1))); } -struct KinetoObserverContext : public at::ObserverContext { - explicit KinetoObserverContext(OpEventData* data) : data_(data) {} - OpEventData* data_; -}; - struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { explicit KinetoThreadLocalState( const ProfilerConfig& 
config, @@ -186,6 +144,7 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { : ProfilerThreadLocalStateBase(config), start_time_(getTimeUs()), activities_(std::move(activities)), + record_queue_(config), cpu_trace_(start_time_, "PyTorch Profiler") {} ~KinetoThreadLocalState() override = default; @@ -204,12 +163,6 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { return config().with_stack && activities_.count(ActivityType::CPU); } - std::unique_ptr newOpEvent() { - std::lock_guard guard(state_mutex_); - op_events_.emplace_back(); - return std::make_unique(&op_events_.back()); - } - void reportMemoryUsage( void* ptr, int64_t alloc_size, @@ -217,16 +170,17 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { int64_t total_reserved, c10::Device device) override { if (config_.profile_memory && config_.state != ProfilerState::Disabled) { - memory_events_.push_back( - {getTimeUs(), - ptr, - alloc_size, - total_allocated, - total_reserved, - at::RecordFunction::currentThreadId(), - torch::profiler::impl::kineto::kineto_ids(), - device.type(), - device.index()}); + std::lock_guard guard(state_mutex_); + memory_events_.emplace_back( + torch::profiler::impl::getApproximateTime(), + ptr, + alloc_size, + total_allocated, + total_reserved, + at::RecordFunction::currentThreadId(), + torch::profiler::impl::kineto::kineto_ids(), + device.type(), + device.index()); } } @@ -264,84 +218,103 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { void materializeOpEvents() { std::lock_guard guard(state_mutex_); + auto converter = clock_converter_.makeConverter(); for (const auto& e : memory_events_) { - cpu_trace_.addMemoryUsageActivity( - kMemoryEventName, - e.kineto_info, - e.start_time, - c10::Device(e.device_type, e.device_index), - e.ptr, - e.alloc_size, - e.total_allocated, - e.total_reserved); + auto start_time_us = converter(e.start_time) / 1000; + cpu_trace_.addMemoryUsageActivity( + kMemoryEventName, + e.kineto_info, + start_time_us, + c10::Device(e.device_type, e.device_index), + e.ptr, + e.alloc_size, + e.total_allocated, + e.total_reserved); kineto_events_.emplace_back(); auto& evt = kineto_events_.back(); evt.name(kMemoryEventName) - .startUs(e.start_time) + .startUs(start_time_us) .deviceIndex(e.device_index) .deviceType(e.device_type) .nBytes(e.alloc_size) .startThreadId(e.threadID); } + memory_events_.clear(); + + for (const auto& e : record_queue_.getRecords(converter)) { + // `take_data` handles time conversion. + int64_t start_us = e.start_time_us_; + int64_t end_us = e.end_time_us_; - for (const auto& e : op_events_) { - if (e.end_us_ < e.start_us_) { + if (end_us < start_us) { // We initialize end_us_ to the smallest int64_t, so this means that // the op did not finish before we stopped profiling. 
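On the user side, the memory events and shape/dtype records handled here are only collected when the profiler is asked for them; a small sketch, assuming the public ``torch.profiler`` API:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True,    # enables the memory-event path above
             record_shapes=True) as prof:  # fills the shape/dtype fields
    x = torch.randn(1024, 1024)
    y = x @ x

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```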
continue; } cpu_trace_.addCPUActivity( - e.name_, + e.name(), e.kineto_info_, - e.correlation_id_, - e.start_us_, - e.end_us_); + e.correlation_id(), + start_us, + end_us); kineto_events_.emplace_back(); kineto_events_.back() - .name(e.name_) - .startUs(e.start_us_) - .durationUs(e.end_us_ - e.start_us_) - .correlationId(e.correlation_id_) + .name(e.name()) + .startUs(start_us) + .durationUs(end_us - start_us) + .correlationId(e.correlation_id()) .deviceType(c10::DeviceType::CPU) - .startThreadId(e.start_thread_id_) - .endThreadId(e.end_thread_id_) - .sequenceNr(e.sequence_number_) - .fwdThreadId(e.forward_thread_id_) - .scope(e.record_function_scope_) - .setAsync(e.is_async_) - .debugHandle(e.debug_handle_); - - if (!e.shapes_.empty()) { - kineto_events_.back().shapes(e.shapes_); + .startThreadId(e.start_tid_); + + c10::visit( + c10::overloaded( + [&](const torch::profiler::impl::OpEvent& op_event) { + kineto_events_.back() + .endThreadId(op_event.end_thread_id_) + .sequenceNr(op_event.sequence_number_) + .fwdThreadId(op_event.forward_thread_id_) + .scope(op_event.record_function_scope_) + .setAsync(op_event.is_async_) + .debugHandle(op_event.debug_handle_); + }, + [&](const torch::profiler::impl::BackendEvent& backend_event) { + kineto_events_.back() + .endThreadId(e.start_tid_) + .scope(backend_event.record_function_scope_) + .debugHandle(backend_event.debug_handle_) + .backend(backend_event.backend_); + }), + e.event_); + + if (!e.inputs_.shapes_.empty()) { + kineto_events_.back().shapes(e.inputs_.shapes_); } - if (!e.dtypes_.empty()) { - kineto_events_.back().dtypes(e.dtypes_); + if (!e.inputs_.dtypes_.empty()) { + kineto_events_.back().dtypes(e.inputs_.dtypes_); } - if (!e.stack_.empty()) { - kineto_events_.back().stack(e.stack_); + if (!e.jit_stack_.empty()) { + kineto_events_.back().stack(e.jit_stack_); } - if (e.module_hierarchy_) { - kineto_events_.back().moduleHierarchy(*e.module_hierarchy_); + if (!e.jit_modules_.empty()) { + kineto_events_.back().moduleHierarchy(e.jit_modules_); } if (!e.extra_args_.empty()) { kineto_events_.back().flops( - computeFlops(std::string(e.name_), e.extra_args_)); + computeFlops(e.name(), e.extra_args_)); } - if (e.backend_) { - kineto_events_.back().backend(*e.backend_); - } - kineto_events_.back().cuda_event_start_ = e.cuda_event_start_; - kineto_events_.back().cuda_event_end_ = e.cuda_event_end_; + kineto_events_.back().cuda_event_start_ = + e.gpu_fallback_.cuda_event_start_; + kineto_events_.back().cuda_event_end_ = + e.gpu_fallback_.cuda_event_end_; } - op_events_.clear(); } void finalizeCPUTrace(std::unique_ptr& cpu_trace) { @@ -549,12 +522,7 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { auto iter = tidSeq2activity.find(key); if (iter != tidSeq2activity.end()) { libkineto::GenericTraceActivity* fwd = iter->second; -#ifdef USE_KINETO_UPDATED fwd->flow.start = true; -#else - activity.flow.linkedActivity = fwd; // Only destination side set this, - // to distinguish with start side. 
-#endif activity.flow.id = fwd->flow.id = fwd_bwd_link_id; activity.flow.type = fwd->flow.type = libkineto::kLinkFwdBwd; ++fwd_bwd_link_id; @@ -586,6 +554,9 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { #ifdef USE_KINETO const auto& events = *(trace.get()->activities()); for (const auto& ev_ptr : events) { + if (ev_ptr == nullptr) { + continue; + } const auto& activity = *ev_ptr; // These events are already processed if (activity.type() != libkineto::ActivityType::CPU_OP && @@ -611,9 +582,10 @@ struct KinetoThreadLocalState : public ProfilerThreadLocalStateBase { } uint64_t start_time_; + torch::profiler::impl::ApproximateClockToUnixTimeConverter clock_converter_; std::set activities_; - std::deque op_events_; - std::deque memory_events_; + torch::profiler::impl::RecordQueue record_queue_; + torch::profiler::impl::AppendOnlyList memory_events_; torch::profiler::impl::kineto::TraceWrapper cpu_trace_; std::vector kineto_events_; // Optional, if event post-processing is enabled. @@ -634,51 +606,7 @@ void pushProfilingCallbacks(const std::unordered_set& scopes) { const auto& config = state_ptr->config(); auto corr_id = next_correlation_id(); torch::profiler::impl::kineto::pushCorrelationId(corr_id); - - auto ctx_ptr = state_ptr->newOpEvent(); - auto data_ptr = ctx_ptr->data_; - - data_ptr->end_us_ = std::numeric_limits::min(); - data_ptr->correlation_id_ = corr_id; - data_ptr->start_thread_id_ = fn.threadId(); - data_ptr->sequence_number_ = fn.seqNr(); - data_ptr->forward_thread_id_ = fn.forwardThreadId(); - data_ptr->record_function_scope_ = (uint8_t)fn.scope(); - data_ptr->is_async_ = fn.isAsync(); - data_ptr->debug_handle_ = fn.debugHandle(); - data_ptr->kineto_info_ = torch::profiler::impl::kineto::kineto_ids(); - data_ptr->name_ = fn.name(); - if (config.report_input_shapes) { - data_ptr->shapes_ = torch::profiler::impl::inputSizes(fn); - data_ptr->dtypes_ = torch::profiler::impl::inputTypes(fn); - } -#if !defined BUILD_LITE_INTERPRETER && !defined C10_MOBILE - // backward nodes source range corresponds to the forward node - // TODO: consider using C++ stack trace - if (config.with_stack && - fn.scope() != at::RecordScope::BACKWARD_FUNCTION) { - auto cs = torch::profiler::impl::prepareCallstack(jit::currentCallstack()); - data_ptr->stack_ = callstackStr(cs); - } - if (config.with_modules && - fn.scope() != at::RecordScope::BACKWARD_FUNCTION) { - data_ptr->module_hierarchy_ = jit::currentModuleHierarchy(); - } -#endif - if (config.with_flops) { - data_ptr->extra_args_ = torch::profiler::impl::saveExtraArgs(fn); - } - data_ptr->start_us_ = getTimeUs(); - - if (config.state == ProfilerState::KINETO_GPU_FALLBACK) { - try { - torch::profiler::impl::cudaStubs()->record( - nullptr, &data_ptr->cuda_event_start_, nullptr); - } catch (const std::exception& e) { - LOG(WARNING) << "Failed to record CUDA event. 
" << e.what(); - } - } - return ctx_ptr; + return state_ptr->record_queue_.getSubqueue()->begin_op(fn, corr_id); }, [](const at::RecordFunction& fn, at::ObserverContext* ctx_ptr) { auto state_ptr = KinetoThreadLocalState::getTLS(); @@ -687,23 +615,22 @@ void pushProfilingCallbacks(const std::unordered_set& scopes) { } const auto& config = state_ptr->config(); auto* kineto_ctx_ptr = - static_cast(ctx_ptr); + static_cast(ctx_ptr); TORCH_INTERNAL_ASSERT(kineto_ctx_ptr != nullptr); - auto data_ptr = kineto_ctx_ptr->data_; - data_ptr->end_us_ = getTimeUs(); - data_ptr->end_thread_id_ = at::RecordFunction::currentThreadId(); - + kineto_ctx_ptr->event_->end_time_ = torch::profiler::impl::getApproximateTime(); + kineto_ctx_ptr->event_->end_thread_id_ = at::RecordFunction::currentThreadId(); if (config.state == ProfilerState::KINETO_GPU_FALLBACK) { try { + auto fallback = kineto_ctx_ptr->fallback_; + TORCH_INTERNAL_ASSERT(fallback != nullptr); torch::profiler::impl::cudaStubs()->record( - nullptr, &data_ptr->cuda_event_end_, nullptr); + nullptr, &fallback->cuda_event_end_, nullptr); } catch (const std::exception& e) { LOG(WARNING) << "Failed to record CUDA event. " << e.what(); } } torch::profiler::impl::kineto::popCorrelationId(); - torch::profiler::impl::kineto::recordThreadInfo(); }) .needsInputs(registration_state_ptr->config().report_input_shapes) .scopes(scopes)); @@ -724,21 +651,14 @@ void reportBackendEventToActiveKinetoProfiler( return; } - auto ctx_ptr = state_ptr->newOpEvent(); - auto data_ptr = ctx_ptr->data_; - data_ptr->start_us_ = start_time_us; - data_ptr->end_us_ = end_time_us; - data_ptr->correlation_id_ = std::numeric_limits::max(); - data_ptr->start_thread_id_ = at::RecordFunction::currentThreadId(); - data_ptr->end_thread_id_ = data_ptr->start_thread_id_; - data_ptr->sequence_number_ = -1; - data_ptr->forward_thread_id_ = data_ptr->start_thread_id_; - data_ptr->record_function_scope_ = (uint8_t)scope; - data_ptr->is_async_ = false; - data_ptr->debug_handle_ = debug_handle; - data_ptr->kineto_info_ = torch::profiler::impl::kineto::kineto_ids(); - data_ptr->name_ = event_name; - data_ptr->backend_ = backend_name; + state_ptr->record_queue_.getSubqueue()->emplace_backend_event( + torch::profiler::impl::BackendEvent { + start_time_us, + end_time_us, + (uint8_t)scope, + debug_handle, + event_name, + backend_name}); /* no support for input shapes now? if (config.report_input_shapes) { @@ -746,8 +666,6 @@ void reportBackendEventToActiveKinetoProfiler( ctx_ptr->dtypes = inputTypes(fn); } */ - - torch::profiler::impl::kineto::recordThreadInfo(); } void prepareProfiler( diff --git a/torch/csrc/autograd/python_function.cpp b/torch/csrc/autograd/python_function.cpp index 9a6221130ed0ca..43911fe18b993f 100644 --- a/torch/csrc/autograd/python_function.cpp +++ b/torch/csrc/autograd/python_function.cpp @@ -167,10 +167,16 @@ auto PyNode::is_traceable() -> bool { } auto PyNode::release_variables() -> void { - pybind11::gil_scoped_acquire gil; - auto f = (THPFunction*) obj; - f->saved_variables.clear(); - f->has_freed_buffers = 1; + // This function is called as part of the Node destructor! + // Since this object might be kept alive by C++, it is possible + // that the python interpreter is already dead here. In that case + // we just leak the saved objects. 
+ if (Py_IsInitialized()) { + pybind11::gil_scoped_acquire gil; + auto f = (THPFunction*) obj; + f->saved_variables.clear(); + f->has_freed_buffers = 1; + } } auto PyNode::name() const -> std::string { @@ -564,6 +570,11 @@ static void _trace_post_record( } node->i_(jit::attr::inplace, is_inplace); + if (PyObject* module_name = PyDict_GetItemString(((PyTypeObject*)op_obj)->tp_dict, "__module__")) { + if (auto ptr = PyUnicode_AsUTF8(module_name)) { + node->s_(jit::attr::module, std::string(ptr)); + } + } // Isolate C variable ptrs in a vector int num_outputs = PyTuple_GET_SIZE(output_objects); @@ -671,10 +682,19 @@ PyObject* THPFunction_name(PyObject *self, PyObject* noargs) { PyObject *THPFunction_apply(PyObject *cls, PyObject *inputs) { HANDLE_TH_ERRORS + + // save a local copy of seq_id before it gets incremented + int seq_id = at::sequence_number::peek(); + auto info_pair = unpack_input(inputs); + UnpackedInput& unpacked_input = info_pair.first; + InputFlags& input_info = info_pair.second; + + // Call record function after all the inputs have been decoded, but + // before context has been allocated. RECORD_FUNCTION( ((PyTypeObject*)cls)->tp_name, - std::vector(), - at::sequence_number::peek()); + std::vector(unpacked_input.input_vars.begin(), unpacked_input.input_vars.end()), + seq_id); // Temporary hack to improve functorch UX. We'll find a better solution. const auto& functorch_tls = at::functorch::functorchTLSAccessor(); @@ -691,11 +711,6 @@ PyObject *THPFunction_apply(PyObject *cls, PyObject *inputs) auto cdata = std::shared_ptr(new PyNode(std::move(ctx_obj)), deleteNode); ctx->cdata = cdata; - // Prepare inputs and allocate context (grad fn) - auto info_pair = unpack_input(inputs); - UnpackedInput& unpacked_input = info_pair.first; - InputFlags& input_info = info_pair.second; - // Record input nodes if tracing auto* node = _trace_pre_record(cls, inputs, unpacked_input.input_vars); @@ -705,6 +720,7 @@ PyObject *THPFunction_apply(PyObject *cls, PyObject *inputs) ctx->needs_input_grad = input_info.needs_input_grad.release(); ctx->is_variable_input = std::move(input_info.is_variable_input); + // Prepend ctx to input_tuple, in preparation for static method call auto num_args = PyTuple_GET_SIZE(inputs); THPObjectPtr ctx_input_tuple(PyTuple_New(num_args + 1)); diff --git a/torch/csrc/autograd/python_mode.cpp b/torch/csrc/autograd/python_mode.cpp index cda38bdb7dff3e..7e49d29d824368 100644 --- a/torch/csrc/autograd/python_mode.cpp +++ b/torch/csrc/autograd/python_mode.cpp @@ -1,8 +1,9 @@ -#include -#include -#include #include +#include #include +#include +#include +#include namespace torch { namespace autograd { @@ -13,10 +14,10 @@ void PythonMode::enter(PyObject* type) { "python mode has already been set. We do not yet support nested python ", "mode. Please file us an issue and reset it before setting it again.") } - // TorchDispatchTypeObject steals a reference, See NOTE [What is TorchDispatchTypeObject?] + // SafePyObject steals a reference, See NOTE [What is SafePyObject?] 
Py_INCREF(type); - auto state = std::make_shared(type, getPyInterpreter()); - at::impl::PythonModeTLS::set_state(state); + at::impl::PythonModeTLS::set_state( + std::make_shared(type, getPyInterpreter())); } void PythonMode::exit() { diff --git a/torch/csrc/autograd/python_variable.cpp b/torch/csrc/autograd/python_variable.cpp index f960d8287c24e4..e3d828a699346b 100644 --- a/torch/csrc/autograd/python_variable.cpp +++ b/torch/csrc/autograd/python_variable.cpp @@ -1,36 +1,34 @@ -#include - -#include +#include +#include +#include +#include +#include +#include +#include #include #include -#include #include +#include #include #include #include +#include +#include +#include #include #include #include -#include -#include -#include -#include #include #include +#include +#include #include -#include #include #include #include -#include #include +#include #include -#include -#include -#include -#include -#include - #include #include @@ -104,7 +102,7 @@ void concrete_dispatch_fn( const c10::impl::PyInterpreter*, const c10::OperatorHandle& op, torch::jit::Stack* stack, - const std::shared_ptr& type); + const std::shared_ptr& type); class PyInterpreterHolder { public: @@ -901,6 +899,16 @@ PyObject *THPVariable_is_cuda(THPVariable *self, void *unused) END_HANDLE_TH_ERRORS } +PyObject* THPVariable_is_ipu(THPVariable* self, void* unused) { + HANDLE_TH_ERRORS + if (check_has_torch_function((PyObject*)self)) { + return handle_torch_function_getter(self, "is_ipu"); + } + auto& self_ = THPVariable_Unpack(self); + return torch::autograd::utils::wrap(self_.is_ipu()); + END_HANDLE_TH_ERRORS +} + PyObject* THPVariable_is_xpu(THPVariable* self, void* unused) { HANDLE_TH_ERRORS if (check_has_torch_function((PyObject*)self)) { @@ -1010,6 +1018,17 @@ PyObject *THPVariable_is_complex(THPVariable *self, void *unused) END_HANDLE_TH_ERRORS } +PyObject *THPVariable_is_nested(THPVariable *self, void *unused) +{ + HANDLE_TH_ERRORS + if (check_has_torch_function((PyObject *)self)) { + return handle_torch_function_getter(self, "is_nested"); + } + auto& self_ = THPVariable_Unpack(self); + return torch::autograd::utils::wrap(self_.is_nested()); + END_HANDLE_TH_ERRORS +} + static PyObject *THPVariable_dtype(THPVariable *self, void *unused) { HANDLE_TH_ERRORS @@ -1064,28 +1083,28 @@ PyObject *THPVariable_get_imag(THPVariable* self, void *unused) END_HANDLE_TH_ERRORS } -int THPVariable_set_real(THPVariable *self, THPVariable *real, void *unused) +int THPVariable_set_real(PyObject* self, PyObject* real, void *unused) { HANDLE_TH_ERRORS auto& self_ = THPVariable_Unpack(self); - auto& real_ = THPVariable_Unpack(real); + auto self_real = at::real(self_); + auto real_ = valueToTensor(self_real.options(), real, self_real.device()); { pybind11::gil_scoped_release no_gil; - auto self_real = at::real(self_); self_real.copy_(real_); return 0; } END_HANDLE_TH_ERRORS_RET(-1) } -int THPVariable_set_imag(THPVariable* self, THPVariable *imag, void *unused) +int THPVariable_set_imag(PyObject* self, PyObject* imag, void *unused) { HANDLE_TH_ERRORS auto& self_ = THPVariable_Unpack(self); - auto& imag_ = THPVariable_Unpack(imag); + auto self_imag = at::imag(self_); + auto imag_ = valueToTensor(self_imag.options(), imag, self_imag.device()); { pybind11::gil_scoped_release no_gil; - auto self_imag = at::imag(self_); self_imag.copy_(imag_); return 0; } @@ -1119,6 +1138,7 @@ static struct PyGetSetDef THPVariable_properties[] = { {"shape", (getter)THPVariable_get_shape, nullptr, nullptr, nullptr}, {"is_cuda", (getter)THPVariable_is_cuda, nullptr, 
nullptr, nullptr}, {"is_xpu", (getter)THPVariable_is_xpu, nullptr, nullptr, nullptr}, + {"is_ipu", (getter)THPVariable_is_ipu, nullptr, nullptr, nullptr}, {"is_sparse", (getter)THPVariable_is_sparse, nullptr, nullptr, nullptr}, {"is_sparse_csr", (getter)THPVariable_is_sparse_csr, nullptr, nullptr, nullptr}, {"is_mkldnn", (getter)THPVariable_is_mkldnn, nullptr, nullptr, nullptr}, @@ -1128,6 +1148,7 @@ static struct PyGetSetDef THPVariable_properties[] = { {"is_complex", (getter)THPVariable_is_complex, nullptr, nullptr, nullptr}, {"is_quantized", (getter)THPVariable_is_quantized, nullptr, nullptr, nullptr}, {"is_meta", (getter)THPVariable_is_meta, nullptr, nullptr, nullptr}, + {"is_nested", (getter)THPVariable_is_nested, nullptr, nullptr, nullptr}, {"dtype", (getter)THPVariable_dtype, nullptr, nullptr, nullptr}, {"layout", (getter)THPVariable_layout, nullptr, nullptr, nullptr}, {"device", (getter)THPVariable_device, nullptr, nullptr, nullptr}, @@ -1267,7 +1288,7 @@ PyObject *THPVariable_pynew(PyTypeObject *type, PyObject *args, PyObject *kwargs HANDLE_TH_ERRORS TORCH_CHECK(type != &THPVariableType, "Cannot directly construct _TensorBase; subclass it and then construct that"); jit::tracer::warn("torch.Tensor", jit::tracer::WARN_CONSTRUCTOR); - auto tensor = torch::utils::legacy_tensor_ctor(torch::tensors::get_default_dispatch_key(), torch::tensors::get_default_scalar_type(), args, kwargs); + auto tensor = torch::utils::base_tensor_ctor(args, kwargs); // WARNING: tensor is NOT guaranteed to be a fresh tensor; e.g., if it was // given a raw pointer that will refcount bump return THPVariable_NewWithVar( @@ -1674,7 +1695,7 @@ void concrete_dispatch_fn( const c10::impl::PyInterpreter*, const c10::OperatorHandle& op, torch::jit::Stack* stack, - const std::shared_ptr& type) { + const std::shared_ptr& type) { const auto& schema = op.schema(); const auto num_returns = schema.returns().size(); @@ -1684,6 +1705,7 @@ void concrete_dispatch_fn( // Parse the name into namespace and name (no overload_name) // TODO: put this into the library const auto& qualified_name = op.operator_name().name; + const auto& overload_name = schema.overload_name(); auto pos = qualified_name.find("::"); TORCH_INTERNAL_ASSERT(pos != std::string::npos, qualified_name); // Make me some null terminated strings @@ -1704,6 +1726,12 @@ void concrete_dispatch_fn( // overload resolution but is more complicated (need to expose separate // functions per overload) py::handle torch_api_function = py::module::import("torch").attr("ops").attr(ns).attr(func_name); + py::handle torch_api_function_overload; + if (overload_name == "") { + torch_api_function_overload = torch_api_function.attr("default"); + } else { + torch_api_function_overload = torch_api_function.attr(overload_name.c_str()); + } std::string module_name_str = "torch.ops." 
+ ns_str; // About all the pointers: @@ -1752,7 +1780,7 @@ void concrete_dispatch_fn( py::dict kwargs; if (type) { - append_overloaded_type(&overloaded_args, type->ptr()); + append_overloaded_type(&overloaded_args, type->ptr(getPyInterpreter())); } // Find overloaded tensors @@ -1790,15 +1818,15 @@ void concrete_dispatch_fn( kwargs[py::cast(arg.name())] = torch::jit::toPyObject(std::move(arguments[idx])); } - auto out = py::reinterpret_steal(handle_torch_function_no_python_arg_parser( - overloaded_args, - args.ptr(), - kwargs.ptr(), - func_name, - torch_api_function.ptr(), - module_name_str.c_str(), - "__torch_dispatch__" - )); + auto out = py::reinterpret_steal( + handle_torch_function_no_python_arg_parser( + overloaded_args, + args.ptr(), + kwargs.ptr(), + func_name, + torch_api_function_overload.ptr(), + module_name_str.c_str(), + TorchFunctionName::TorchDispatch)); if (num_returns == 0) { // Check that we got a None return from Python. Anything else is an error. @@ -1830,15 +1858,20 @@ c10::intrusive_ptr concrete_detach_fn(const c10::impl::PyInterpreter py::dict kwargs; - auto out = py::reinterpret_steal(handle_torch_function_no_python_arg_parser( - overloaded_args, - args.ptr(), - kwargs.ptr(), - "detach", - py::module::import("torch").attr("ops").attr("aten").attr("detach").ptr(), - "torch.ops.aten", - "__torch_dispatch__" - )); + auto out = py::reinterpret_steal( + handle_torch_function_no_python_arg_parser( + overloaded_args, + args.ptr(), + kwargs.ptr(), + "detach", + py::module::import("torch") + .attr("ops") + .attr("aten") + .attr("detach") + .attr("default") + .ptr(), + "torch.ops.aten", + TorchFunctionName::TorchDispatch)); TORCH_CHECK(THPVariable_Check(out.ptr()), "detach returned invalid type ", py::detail::get_fully_qualified_tp_name(Py_TYPE(out.ptr())), ", expected Tensor"); const Tensor& res_t = THPVariable_Unpack(out.ptr()); diff --git a/torch/csrc/autograd/python_variable_indexing.cpp b/torch/csrc/autograd/python_variable_indexing.cpp index 8faa07066ead73..6b7b7b6ef29f3f 100644 --- a/torch/csrc/autograd/python_variable_indexing.cpp +++ b/torch/csrc/autograd/python_variable_indexing.cpp @@ -4,7 +4,6 @@ #include #include #include -#include #include #include #include @@ -88,7 +87,7 @@ static inline Variable sequenceToVariable(c10::TensorOptions options, PyObject* return torch::utils::indexing_tensor_from_data(options, kLong, c10::nullopt, seq); } -static inline Variable valueToTensor(c10::TensorOptions options, PyObject* value, const at::Device& device) { +inline Variable valueToTensor(c10::TensorOptions options, PyObject* value, const at::Device& device) { if (THPVariable_Check(value)) { return THPVariable_Unpack(value); } diff --git a/torch/csrc/autograd/python_variable_indexing.h b/torch/csrc/autograd/python_variable_indexing.h index 398b77293810d2..027bffb6dc8a04 100644 --- a/torch/csrc/autograd/python_variable_indexing.h +++ b/torch/csrc/autograd/python_variable_indexing.h @@ -1,6 +1,7 @@ #pragma once #include +#include namespace torch { namespace autograd { @@ -8,4 +9,6 @@ Py_ssize_t THPVariable_length(PyObject* self); PyObject* THPVariable_getitem(PyObject* self, PyObject* index); int THPVariable_setitem(PyObject* self, PyObject* index, PyObject* value); +Variable valueToTensor(c10::TensorOptions options, PyObject* value, const at::Device& device); + }} // namespace torch::autograd diff --git a/torch/csrc/autograd/record_function_ops.cpp b/torch/csrc/autograd/record_function_ops.cpp index 2cf427e04f6091..ad8bf336ee1507 100644 --- 
a/torch/csrc/autograd/record_function_ops.cpp +++ b/torch/csrc/autograd/record_function_ops.cpp @@ -1,8 +1,10 @@ +#include #include #include #include -#include +#include +#include namespace caffe2 { // Required for cpp_custom_type_hack to work @@ -16,47 +18,68 @@ namespace profiler { // Creates a new profiling scope using RecordFunction and invokes its starting // callbacks. -at::Tensor record_function_enter( +void record_function_enter( const std::string& name, - const c10::optional& args) { - auto rec = std::make_unique(at::RecordScope::USER_SCOPE); - if (rec->isActive()) { - if (rec->needsInputs() && args.has_value()) { - rec->before(name, std::vector{c10::IValue{args.value()}}); + const c10::optional& args, + at::RecordFunction &rec) { + if (rec.isActive()) { + if (rec.needsInputs() && args.has_value()) { + rec.before(name, std::vector{c10::IValue{args.value()}}); } else { - rec->before(name); + rec.before(name); } } +} + +// Legacy signature using cpp_custom_type_hack +at::Tensor record_function_enter_legacy( + const std::string& name, + const c10::optional& args) { + auto rec = std::make_unique(at::RecordScope::USER_SCOPE); + record_function_enter(name, args, *rec); return at::cpp_custom_type_hack::create(std::move(rec), at::TensorOptions()); } +// New signature using custom_class +c10::intrusive_ptr record_function_enter_new( + const std::string &name, const c10::optional &args) { + auto rec = c10::make_intrusive(at::RecordScope::USER_SCOPE); + record_function_enter(name, args, rec->record); + return rec; +} + at::RecordFunction& getRecordFunctionFromTensor(const at::Tensor& handle) { auto& rec = at::cpp_custom_type_hack::cast(handle); return rec; } // Ends the profiling scope created with record_function_enter. -void record_function_exit(const at::Tensor& handle) { +void record_function_exit(at::RecordFunction &rec) { + rec.end(); +} + +// Legacy signature using cpp_custom_type_hack +void record_function_exit_legacy(const at::Tensor &handle) { // We don't actually need to do anything with handle just need to persist the // lifetime until now. auto& rec = getRecordFunctionFromTensor(handle); - rec.end(); + record_function_exit(rec); +} + +// New signature using custom_class +void record_function_exit_new(const c10::intrusive_ptr &record) { + record_function_exit(record->record); } +template c10::intrusive_ptr _call_end_callbacks_on_fut( - const at::Tensor& handle, + Func get_record, const c10::intrusive_ptr& fut) { // Profiling callback that ends the associated record_function // and returns the value of the passed in future. std::function futureProfilingFunc = - [handle](c10::ivalue::Future& fut) { - TORCH_INTERNAL_ASSERT( - handle.defined(), - "Undefined RecordFunction handle. This can happen if the handle is " - "not correctly persisted and is destroyed before the future is " - "realized."); - - auto& rec = getRecordFunctionFromTensor(handle); + [get_record = std::move(get_record)](c10::ivalue::Future& fut) { + auto& rec = get_record(); rec.end(); // Note: this future is returned to the user to ensure that a call to wait() // ensures that profiling callbacks have ran. To ensure that this is @@ -67,36 +90,74 @@ c10::intrusive_ptr _call_end_callbacks_on_fut( }; // Define a future that completes after the profiling callbacks are run. 
auto profiledFut = fut->then(at::wrapPropagateTLSState( - futureProfilingFunc), + std::move(futureProfilingFunc)), fut->elementType() ); return profiledFut; } -// Internal only, do not use directly, use Python's record_function() -TORCH_LIBRARY_FRAGMENT(profiler, m) { - m.def("_record_function_enter(str name, str? args=None) -> Tensor", &record_function_enter); - m.def("_record_function_exit", &record_function_exit); +// Legacy signature using cpp_custom_type_hack +c10::intrusive_ptr _call_end_callbacks_on_fut_legacy( + const at::Tensor &handle, + const c10::intrusive_ptr& fut) { + return _call_end_callbacks_on_fut( + [handle] () -> at::RecordFunction& { + TORCH_INTERNAL_ASSERT( + handle.defined(), + "Undefined RecordFunction handle. This can happen if the handle is " + "not correctly persisted and is destroyed before the future is " + "realized."); + + return getRecordFunctionFromTensor(handle); + }, + fut + ); } -// Needed to register JIT operator in operator registry below -c10::AliasAnalysisKind aliasAnalysisFromSchema() { - return c10::AliasAnalysisKind::FROM_SCHEMA; +// New signature using custom_class +c10::intrusive_ptr _call_end_callbacks_on_fut_new( + const c10::intrusive_ptr &record, + const c10::intrusive_ptr& fut) { + return _call_end_callbacks_on_fut( + [record] () -> at::RecordFunction& { return record->record; }, fut); } -jit::RegisterOperators reg_fut_ops({ - jit::Operator( +// Internal only, do not use directly, use Python's record_function() +TORCH_LIBRARY_FRAGMENT(profiler, m) { + m.class_("_RecordFunction"); + + m.def("_record_function_enter(str name, str? args=None) -> Tensor", + &record_function_enter_legacy); + m.def("_record_function_enter_new(str name, str? args=None) -> " + "__torch__.torch.classes.profiler._RecordFunction", + &record_function_enter_new); + m.def("_record_function_exit", &record_function_exit_legacy); + m.def("_record_function_exit._RecordFunction", &record_function_exit_new); + + torch::jit::registerOperator(torch::jit::Operator( "profiler::_call_end_callbacks_on_jit_fut(Tensor x, Future(t) y) -> Future(t)", [](jit::Stack& stack) { // Pop inputs, which should be a future and a tensor auto fut = jit::pop(stack).toFuture(); auto tensor = jit::pop(stack).toTensor(); - auto profiledFut = _call_end_callbacks_on_fut(tensor, fut); + auto profiledFut = _call_end_callbacks_on_fut_legacy(tensor, fut); // return future that completes when profiling callbacks have run. jit::push(stack, std::move(profiledFut)); }, - aliasAnalysisFromSchema()), -}); + c10::AliasAnalysisKind::FROM_SCHEMA)); + torch::jit::registerOperator(torch::jit::Operator( + "profiler::_call_end_callbacks_on_jit_fut._RecordFunction(" + "__torch__.torch.classes.profiler._RecordFunction x, Future(t) y) -> Future(t)", + [](c10::Stack &stack) { + // Pop inputs, which should be a future and a PythonRecordFunction + auto fut = torch::jit::pop(stack).toFuture(); + auto tensor = torch::jit::pop(stack).toCustomClass(); + auto profiledFut = _call_end_callbacks_on_fut_new(tensor, fut); + // return future that completes when profiling callbacks have run. 
+ torch::jit::push(stack, std::move(profiledFut)); + }, + c10::AliasAnalysisKind::FROM_SCHEMA)); +} } // namespace profiler } // namespace autograd diff --git a/torch/csrc/autograd/record_function_ops.h b/torch/csrc/autograd/record_function_ops.h index 9042537aeabccb..81cc584381d42d 100644 --- a/torch/csrc/autograd/record_function_ops.h +++ b/torch/csrc/autograd/record_function_ops.h @@ -1,17 +1,30 @@ #pragma once #include #include +#include namespace torch { namespace autograd { namespace profiler { + +struct PythonRecordFunction: public torch::CustomClassHolder { + at::RecordFunction record; + + PythonRecordFunction( + at::RecordScope scope = at::RecordScope::FUNCTION, + bool pre_sampled = false) + : record(scope, pre_sampled) + {} +}; + // Creates a new profiling scope using RecordFunction and invokes its starting // callbacks. -TORCH_API at::Tensor record_function_enter(const std::string& name, const c10::optional& args = c10::nullopt); +TORCH_API c10::intrusive_ptr record_function_enter_new( + const std::string &name, const c10::optional &args = c10::nullopt); // Schedules RecordFunction's end callbacks to be run on completion of a future. -TORCH_API c10::intrusive_ptr _call_end_callbacks_on_fut( - const at::Tensor& handle, +TORCH_API c10::intrusive_ptr _call_end_callbacks_on_fut_new( + const c10::intrusive_ptr &record, const c10::intrusive_ptr& fut); } // namespace profiler diff --git a/torch/csrc/autograd/utils/wrap_outputs.h b/torch/csrc/autograd/utils/wrap_outputs.h index 10439553fcc571..114b53487368c7 100644 --- a/torch/csrc/autograd/utils/wrap_outputs.h +++ b/torch/csrc/autograd/utils/wrap_outputs.h @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -77,117 +78,6 @@ inline PyObject* wrap(at::QScheme qscheme) { return thp_qscheme; } -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(2)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::get<0>(tensors))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::get<1>(tensors))); - return r.release(); -} - -inline PyObject* wrap(PyTypeObject *type, std::tuple tensors) { - auto r = THPObjectPtr{PyStructSequence_New(type)}; - if (!r) throw python_error(); - PyStructSequence_SET_ITEM(r.get(), 0, wrap(std::get<0>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 1, wrap(std::get<1>(tensors))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(3)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - return r.release(); -} - -inline PyObject* wrap(PyTypeObject *type, std::tuple tensors) { - auto r = THPObjectPtr{PyStructSequence_New(type)}; - if (!r) throw python_error(); - PyStructSequence_SET_ITEM(r.get(), 0, wrap(std::get<0>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 1, wrap(std::get<1>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 2, wrap(std::get<2>(tensors))); - return r.release(); -} - -inline PyObject* wrap(PyTypeObject *type, std::tuple tensors) { - auto r = THPObjectPtr{PyStructSequence_New(type)}; - if (!r) throw python_error(); - PyStructSequence_SET_ITEM(r.get(), 0, wrap(std::get<0>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 1, wrap(std::get<1>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 2, wrap(std::get<2>(tensors))); - PyStructSequence_SET_ITEM(r.get(), 3, wrap(std::get<3>(tensors))); - return 
r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(4)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - PyTuple_SET_ITEM(r.get(), 3, wrap(std::get<3>(tensors))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(4)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 3, wrap(std::move(std::get<3>(tensors)))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(5)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - PyTuple_SET_ITEM(r.get(), 3, wrap(std::move(std::get<3>(tensors)))); - PyTuple_SET_ITEM(r.get(), 4, wrap(std::get<4>(tensors))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(5)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - PyTuple_SET_ITEM(r.get(), 3, wrap(std::move(std::get<3>(tensors)))); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 4, wrap(std::move(std::get<4>(tensors)))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(4)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - PyTuple_SET_ITEM(r.get(), 3, wrap(std::move(std::get<3>(tensors)))); - return r.release(); -} - -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(5)}; - if (!r) throw python_error(); - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); - PyTuple_SET_ITEM(r.get(), 2, wrap(std::move(std::get<2>(tensors)))); - PyTuple_SET_ITEM(r.get(), 3, wrap(std::move(std::get<3>(tensors)))); - PyTuple_SET_ITEM(r.get(), 4, wrap(std::move(std::get<4>(tensors)))); - return r.release(); -} - inline PyObject* wrap(at::TensorList tl) { auto r = THPObjectPtr{PyTuple_New(tl.size())}; if (!r) throw python_error(); @@ -206,13 +96,38 @@ inline PyObject* wrap(at::IntArrayRef list) { return r.release(); } -inline PyObject* wrap(std::tuple tensors) { - auto r = THPObjectPtr{PyTuple_New(2)}; +namespace detail { +template +void apply_with_idx_impl(const F &f, Tuple &t, std::index_sequence /*indices*/) { + (void)std::initializer_list { + (f(std::get(t), Is), 0)... 
+ }; +} + +// For tuple(a, b, c), calls f(a, 0), f(b, 1), f(c, 2) +template +void apply_with_idx(const F & f, std::tuple &t) { + apply_with_idx_impl(f, t, std::index_sequence_for{}); +} +} // namespace detail + +template +PyObject* wrap(std::tuple values) { + auto r = THPObjectPtr{PyTuple_New(sizeof...(Ts))}; + if (!r) throw python_error(); + detail::apply_with_idx([&](auto &value, size_t idx) { + PyTuple_SET_ITEM(r.get(), idx, wrap(std::move(value))); + }, values); + return r.release(); +} + +template +PyObject* wrap(PyTypeObject *type, std::tuple values) { + auto r = THPObjectPtr{PyStructSequence_New(type)}; if (!r) throw python_error(); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 0, wrap(std::move(std::get<0>(tensors)))); - // NOLINTNEXTLINE(performance-move-const-arg) - PyTuple_SET_ITEM(r.get(), 1, wrap(std::move(std::get<1>(tensors)))); + detail::apply_with_idx([&](auto &value, size_t idx) { + PyStructSequence_SET_ITEM(r.get(), idx, wrap(std::move(value))); + }, values); return r.release(); } diff --git a/torch/csrc/cuda/Event.cpp b/torch/csrc/cuda/Event.cpp index 20821636a7744b..4312b3aaf7b0c0 100644 --- a/torch/csrc/cuda/Event.cpp +++ b/torch/csrc/cuda/Event.cpp @@ -119,7 +119,7 @@ static PyObject * THCPEvent_wait(PyObject *_self, PyObject *_stream) { { auto self = (THCPEvent*)_self; auto stream = (THCPStream*)_stream; - pybind11::gil_scoped_release no_gil; + pybind11::gil_scoped_release no_gil{}; self->cuda_event.block(stream->cuda_stream); } Py_RETURN_NONE; @@ -145,7 +145,7 @@ static PyObject * THCPEvent_synchronize(PyObject *_self, PyObject *noargs) { HANDLE_TH_ERRORS { auto self = (THCPEvent*)_self; - pybind11::gil_scoped_release no_gil; + pybind11::gil_scoped_release no_gil{}; self->cuda_event.synchronize(); } Py_RETURN_NONE; diff --git a/torch/csrc/cuda/shared/cudart.cpp b/torch/csrc/cuda/shared/cudart.cpp index b93d921a16a946..b0af4c0884e91b 100644 --- a/torch/csrc/cuda/shared/cudart.cpp +++ b/torch/csrc/cuda/shared/cudart.cpp @@ -49,8 +49,8 @@ void initCudartBindings(PyObject* module) { #endif cudart.def("cuda" "MemGetInfo", [](int device) -> std::pair { c10::cuda::CUDAGuard guard(device); - size_t device_free; - size_t device_total; + size_t device_free = 0; + size_t device_total = 0; cudaMemGetInfo(&device_free, &device_total); return {device_free, device_total}; }); diff --git a/torch/csrc/deploy/CMakeLists.txt b/torch/csrc/deploy/CMakeLists.txt index f8aa997eb10922..ec1dd3fef75a9d 100644 --- a/torch/csrc/deploy/CMakeLists.txt +++ b/torch/csrc/deploy/CMakeLists.txt @@ -33,10 +33,23 @@ caffe2_interface_library(torch_deploy_internal torch_deploy) set(INTERPRETER_TEST_SOURCES ${DEPLOY_DIR}/test_deploy.cpp ) +set(INTERPRETER_TEST_SOURCES_GPU + ${DEPLOY_DIR}/test_deploy_gpu.cpp +) + add_executable(test_deploy ${INTERPRETER_TEST_SOURCES}) target_compile_definitions(test_deploy PUBLIC TEST_CUSTOM_LIBRARY) target_include_directories(test_deploy PRIVATE ${PYTORCH_ROOT}/torch) -target_link_libraries(test_deploy PUBLIC "-Wl,--no-as-needed" gtest dl torch_deploy) +target_link_libraries(test_deploy + PUBLIC "-Wl,--no-as-needed -rdynamic" gtest dl torch_deploy +) + +add_executable(test_deploy_gpu ${INTERPRETER_TEST_SOURCES_GPU}) +target_compile_definitions(test_deploy_gpu PUBLIC TEST_CUSTOM_LIBRARY) +target_include_directories(test_deploy_gpu PRIVATE ${PYTORCH_ROOT}/torch) +target_link_libraries(test_deploy_gpu + PUBLIC "-Wl,--no-as-needed -rdynamic" gtest dl torch_deploy +) add_library(test_deploy_lib SHARED test_deploy_lib.cpp) 
add_dependencies(test_deploy_lib cpython) @@ -45,14 +58,19 @@ target_link_libraries(test_deploy_lib PRIVATE pybind::pybind11) add_executable(deploy_benchmark ${DEPLOY_DIR}/example/benchmark.cpp) target_include_directories(deploy_benchmark PRIVATE ${PYTORCH_ROOT}/torch) -target_link_libraries(deploy_benchmark PUBLIC "-Wl,--no-as-needed" torch_deploy) +target_link_libraries(deploy_benchmark + PUBLIC "-Wl,--no-as-needed -rdynamic" torch_deploy +) add_executable(interactive_embedded_interpreter ${DEPLOY_DIR}/interactive_embedded_interpreter.cpp) target_include_directories(interactive_embedded_interpreter PRIVATE ${PYTORCH_ROOT}/torch) -target_link_libraries(interactive_embedded_interpreter PUBLIC "-Wl,--no-as-needed" torch_deploy) +target_link_libraries(interactive_embedded_interpreter + PUBLIC "-Wl,--no-as-needed -rdynamic" torch_deploy +) if(INSTALL_TEST) install(TARGETS test_deploy DESTINATION bin) + install(TARGETS test_deploy_gpu DESTINATION bin) endif() install(TARGETS torch_deploy DESTINATION lib) diff --git a/torch/csrc/deploy/Exception.h b/torch/csrc/deploy/Exception.h new file mode 100644 index 00000000000000..f4311debeebc45 --- /dev/null +++ b/torch/csrc/deploy/Exception.h @@ -0,0 +1,47 @@ +#ifndef MULTIPY_EXCEPTION_H +#define MULTIPY_EXCEPTION_H + +#include + +#define MULTIPY_INTERNAL_ASSERT_WITH_MESSAGE(condition, message) \ + if (!(condition)) { \ + throw std::runtime_error( \ + "Internal Assertion failed: (" + std::string(#condition) + "), " + \ + "function " + __FUNCTION__ + ", file " + __FILE__ + ", line " + \ + std::to_string(__LINE__) + ".\n" + "Please report bug to Pytorch.\n" + \ + message + "\n"); \ + } + +#define MULTIPY_INTERNAL_ASSERT_NO_MESSAGE(condition) \ + MULTIPY_INTERNAL_ASSERT_WITH_MESSAGE(#condition, "") + +#define MULTIPY_INTERNAL_ASSERT_(x, condition, message, FUNC, ...) FUNC + +#define MULTIPY_INTERNAL_ASSERT(...) \ + MULTIPY_INTERNAL_ASSERT_( \ + , \ + ##__VA_ARGS__, \ + MULTIPY_INTERNAL_ASSERT_WITH_MESSAGE(__VA_ARGS__), \ + MULTIPY_INTERNAL_ASSERT_NO_MESSAGE(__VA_ARGS__)); + +#define MULTIPY_CHECK_WITH_MESSAGE(condition, message) \ + if (!(condition)) { \ + throw std::runtime_error( \ + "Check failed: (" + std::string(#condition) + "), " + "function " + \ + __FUNCTION__ + ", file " + __FILE__ + ", line " + \ + std::to_string(__LINE__) + ".\n" + message + "\n"); \ + } + +#define MULTIPY_CHECK_NO_MESSAGE(condition) \ + MULTIPY_CHECK_WITH_MESSAGE(#condition, "") + +#define MULTIPY_CHECK_(x, condition, message, FUNC, ...) FUNC + +#define MULTIPY_CHECK(...) 
\ + MULTIPY_CHECK_( \ + , \ + ##__VA_ARGS__, \ + MULTIPY_CHECK_WITH_MESSAGE(__VA_ARGS__), \ + MULTIPY_CHECK_NO_MESSAGE(__VA_ARGS__)); + +#endif // MULTIPY_EXCEPTION_H diff --git a/torch/csrc/deploy/benchmark.cpp b/torch/csrc/deploy/benchmark.cpp new file mode 100644 index 00000000000000..82296a5e1a1da2 --- /dev/null +++ b/torch/csrc/deploy/benchmark.cpp @@ -0,0 +1,336 @@ +#include + +#include +#include +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +typedef void (*function_type)(const char*); + +bool cuda = false; + +constexpr auto latency_p = { + 25., + 50., + 95.}; //{1., 5., 25., 50., 75., 90., 95., 99., 99.25, 99.5, 99.75, 99.9}; + +// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) +struct Report { + std::string benchmark; + std::string strategy; + size_t n_threads; + size_t items_completed; + double work_items_per_second; + std::vector latencies; + static void report_header(std::ostream& out) { + out << "benchmark, strategy, n_threads, work_items_completed, work_items_per_second"; + for (double l : latency_p) { + out << ", p" << l << "_latency"; + } + out << ", device\n"; + } + void report(std::ostream& out) { + out << benchmark << ", " << strategy << ", " << n_threads << ", " + << items_completed << ", " << work_items_per_second; + for (double l : latencies) { + out << ", " << l; + } + out << ", " << (cuda ? "cuda" : "cpu") << "\n"; + } +}; + +const int min_items_to_complete = 1; + +struct RunPython { + static torch::deploy::ReplicatedObj load_and_wrap( + torch::deploy::Package& package) { + auto I = package.acquireSession(); + auto obj = I.self.attr("load_pickle")({"model", "model.pkl"}); + if (cuda) { + obj = I.global("gpu_wrapper", "GPUWrapper")({obj}); + } + return I.createMovable(obj); + } + // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) + RunPython( + torch::deploy::Package& package, + std::vector eg, + const torch::deploy::Interpreter* interps) + : obj_(load_and_wrap(package)), eg_(std::move(eg)), interps_(interps) {} + void operator()(int i) { + auto I = obj_.acquireSession(); + if (cuda) { + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + std::vector eg2 = {i}; + eg2.insert(eg2.end(), eg_.begin(), eg_.end()); + I.self(eg2); + } else { + I.self(eg_); + } + } + torch::deploy::ReplicatedObj obj_; + std::vector eg_; + const torch::deploy::Interpreter* interps_; +}; + +// def to_device(i, d): +// if isinstance(i, torch.Tensor): +// return i.to(device=d) +// elif isinstance(i, (tuple, list)): +// return tuple(to_device(e, d) for e in i) +// else: +// raise RuntimeError('inputs are weird') + +static torch::IValue to_device(const torch::IValue& v, torch::Device to); + +static std::vector to_device_vec( + at::ArrayRef vs, + torch::Device to) { + std::vector results; + for (const torch::IValue& v : vs) { + results.push_back(to_device(v, to)); + } + return results; +} + +static torch::IValue to_device(const torch::IValue& v, torch::Device to) { + if (v.isTensor()) { + return v.toTensor().to(to); + } else if (v.isTuple()) { + auto tup = v.toTuple(); + return c10::ivalue::Tuple::create(to_device_vec(tup->elements(), to)); + } else if (v.isList()) { + auto converted = to_device_vec(v.toListRef(), to); + torch::List result(v.toList().elementType()); + for (const torch::IValue& v : converted) { + result.push_back(v); + } + return result; + } else { + MULTIPY_INTERNAL_ASSERT(false, "cannot to_device"); + } +} + +static bool exists(const std::string& fname) { + std::fstream jit_file(fname); + return 
jit_file.good(); +} + +struct RunJIT { + RunJIT(const std::string& file_to_run, std::vector eg) + : eg_(std::move(eg)) { + if (!cuda) { + models_.push_back(torch::jit::load(file_to_run + "_jit")); + } else { + for (const auto i : c10::irange(2)) { + auto d = torch::Device(torch::DeviceType::CUDA, i); + std::stringstream qualified; + qualified << file_to_run << "_jit_" << i; + auto loaded = exists(qualified.str()) + ? torch::jit::load(qualified.str(), d) + : torch::jit::load(file_to_run + "_jit", d); + loaded.to(d); + models_.push_back(loaded); + } + } + } + void operator()(int i) { + if (cuda) { + const auto device_id = i % models_.size(); + auto d = torch::Device(torch::DeviceType::CUDA, device_id); + to_device( + models_[device_id].forward(to_device_vec(eg_, d)), + torch::DeviceType::CPU); + } else { + models_[0].forward(eg_); + } + } + std::vector eg_; + std::vector models_; +}; + +struct Benchmark { + // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) + Benchmark( + torch::deploy::InterpreterManager& manager, + size_t n_threads, + std::string strategy, + // NOLINTNEXTLINE(modernize-pass-by-value) + std::string file_to_run, + size_t n_seconds = 5) + : manager_(manager), + n_threads_(n_threads), + strategy_(strategy), + file_to_run_(file_to_run), + n_seconds_(n_seconds), + should_run_(true), + items_completed_(0), + reached_min_items_completed_(0) { + // NOLINTNEXTLINE(bugprone-branch-clone) + if (strategy == "one_python") { + manager.debugLimitInterpreters(1); + } else if (strategy == "multi_python") { + manager.debugLimitInterpreters(n_threads_); + } + } + + Report run() { + pthread_barrier_init(&first_run_, nullptr, n_threads_ + 1); + + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + torch::deploy::Package package = manager_.loadPackage(file_to_run_); + + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + std::vector eg; + { + auto I = package.acquireSession(); + + eg = I.global("builtins", "tuple")( + I.self.attr("load_pickle")({"model", "example.pkl"})) + .toIValue() + .toTupleRef() + .elements(); + } + + // NOLINTNEXTLINE(bugprone-branch-clone) + if (strategy_ == "jit") { + run_one_work_item = RunJIT(file_to_run_, std::move(eg)); + } else { + run_one_work_item = + RunPython(package, std::move(eg), manager_.allInstances().data()); + } + + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + std::vector> latencies(n_threads_); + + for (const auto i : c10::irange(n_threads_)) { + threads_.emplace_back([this, &latencies, i] { + torch::NoGradGuard guard; + // do initial work + run_one_work_item(i); + + pthread_barrier_wait(&first_run_); + size_t local_items_completed = 0; + while (should_run_) { + auto begin = std::chrono::steady_clock::now(); + run_one_work_item(i); + auto end = std::chrono::steady_clock::now(); + double work_seconds = + std::chrono::duration(end - begin).count(); + latencies[i].push_back(work_seconds); + local_items_completed++; + if (local_items_completed == min_items_to_complete) { + reached_min_items_completed_++; + } + } + items_completed_ += local_items_completed; + }); + } + + pthread_barrier_wait(&first_run_); + auto begin = std::chrono::steady_clock::now(); + auto try_stop_at = begin + std::chrono::seconds(n_seconds_); + std::this_thread::sleep_until(try_stop_at); + for (int i = 0; reached_min_items_completed_ < n_threads_; ++i) { + std::this_thread::sleep_until( + begin + (i + 2) * std::chrono::seconds(n_seconds_)); + } + should_run_ = false; + for (std::thread& thread : threads_) { + thread.join(); + } + auto end = 
std::chrono::steady_clock::now(); + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + double total_seconds = std::chrono::duration(end - begin).count(); + Report report; + report.benchmark = file_to_run_; + report.strategy = strategy_; + report.n_threads = n_threads_; + report.items_completed = items_completed_; + report.work_items_per_second = items_completed_ / total_seconds; + reportLatencies(report.latencies, latencies); + run_one_work_item = nullptr; + return report; + } + + private: + void reportLatencies( + std::vector& results, + const std::vector>& latencies) { + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + std::vector flat_latencies; + for (const auto& elem : latencies) { + flat_latencies.insert(flat_latencies.end(), elem.begin(), elem.end()); + } + std::sort(flat_latencies.begin(), flat_latencies.end()); + for (double target : latency_p) { + size_t idx = size_t(flat_latencies.size() * target / 100.0); + double time = flat_latencies.size() == 0 + ? 0 + : flat_latencies.at(std::min(flat_latencies.size() - 1, idx)); + results.push_back(time); + } + } + torch::deploy::InterpreterManager& manager_; + size_t n_threads_; + std::string strategy_; + std::string file_to_run_; + size_t n_seconds_; + pthread_barrier_t first_run_; + std::atomic should_run_; + std::atomic items_completed_; + std::atomic reached_min_items_completed_; + std::vector threads_; + std::function run_one_work_item; +}; + +// NOLINTNEXTLINE(bugprone-exception-escape) +int main(int argc, char* argv[]) { + int max_thread = atoi(argv[1]); + cuda = std::string(argv[2]) == "cuda"; + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + bool jit_enable = std::string(argv[3]) == "jit"; + Report::report_header(std::cout); + torch::deploy::InterpreterManager manager(max_thread); + + // make sure gpu_wrapper.py is in the import path + for (auto& interp : manager.allInstances()) { + auto I = interp.acquireSession(); + I.global("sys", "path").attr("append")({"torch/csrc/deploy/example"}); + } + + auto n_threads = {1, 2, 4, 8, 16, 32, 40}; + for (const auto i : c10::irange(4, argc)) { + std::string model_file = argv[i]; + for (int n_thread : n_threads) { + if (n_thread > max_thread) { + continue; + } + for (std::string strategy : {"one_python", "multi_python", "jit"}) { + if (strategy == "jit") { + if (!jit_enable) { + continue; + } + if (!exists(model_file + "_jit")) { + continue; + } + } + Benchmark b(manager, n_thread, strategy, model_file); + Report r = b.run(); + r.report(std::cout); + } + } + } + return 0; +} diff --git a/torch/csrc/deploy/deploy.cpp b/torch/csrc/deploy/deploy.cpp index 647c9a4e810bd0..47a6936c72025b 100644 --- a/torch/csrc/deploy/deploy.cpp +++ b/torch/csrc/deploy/deploy.cpp @@ -1,6 +1,8 @@ -#include +#include #include #include +#include + #include #include @@ -54,12 +56,13 @@ static bool writeDeployInterpreter(FILE* dst) { std::ifstream("/proc/self/cmdline") >> exePath; ElfFile elfFile(exePath.c_str()); for (const auto& s : pythonInterpreterSection) { - at::optional
payloadSection = elfFile.findSection(s.sectionName); - if (payloadSection != at::nullopt) { + multipy::optional<Section>
payloadSection = + elfFile.findSection(s.sectionName); + if (payloadSection != multipy::nullopt) { payloadStart = payloadSection->start; customLoader = s.customLoader; size = payloadSection->len; - TORCH_CHECK(payloadSection.has_value(), "Missing the payload section"); + MULTIPY_CHECK(payloadSection.has_value(), "Missing the payload section"); break; } } @@ -74,10 +77,10 @@ static bool writeDeployInterpreter(FILE* dst) { break; } } - TORCH_CHECK( + MULTIPY_CHECK( libStart != nullptr && libEnd != nullptr, - "torch::deploy requires a build-time dependency on embedded_interpreter or embedded_interpreter_cuda, neither of which were found. torch::cuda::is_available()=", - torch::cuda::is_available()); + "torch::deploy requires a build-time dependency on embedded_interpreter or embedded_interpreter_cuda, neither of which were found. torch::cuda::is_available()=" + + std::to_string(torch::cuda::is_available())); size = libEnd - libStart; payloadStart = libStart; @@ -99,12 +102,12 @@ InterpreterManager::InterpreterManager( // can be used for balancing work across GPUs I.global("torch", "version").attr("__setattr__")({"interp", int(i)}); instances_.back().pImpl_->setFindModule( - [this](const std::string& name) -> at::optional { + [this](const std::string& name) -> multipy::optional { auto it = registeredModuleSource_.find(name); if (it != registeredModuleSource_.end()) { return it->second; } else { - return at::nullopt; + return multipy::nullopt; } }); } @@ -189,11 +192,11 @@ void ReplicatedObj::unload(const Interpreter* onThisInterpreter) { ReplicatedObj InterpreterSession::createMovable(Obj obj) { TORCH_DEPLOY_TRY - TORCH_CHECK( + MULTIPY_CHECK( manager_, "Can only create a movable object when the session was created from an interpreter that is part of a InterpreterManager"); - TORCH_CHECK( + MULTIPY_CHECK( impl_->isOwner(obj), "Cannot create movable from an object that lives in different session"); @@ -214,6 +217,11 @@ using dlopen_t = void* (*)(const char*, int); // function. static dlopen_t find_real_dlopen() { void* libc = dlopen("libdl.so.2", RTLD_NOLOAD | RTLD_LAZY | RTLD_LOCAL); + // libdl is gone on some newer systems. + if (!libc) { + // libc.so won't open with dlopen because it's a linker script. 
+ libc = dlopen("libc.so.6", RTLD_NOLOAD | RTLD_LAZY | RTLD_LOCAL); + } TORCH_INTERNAL_ASSERT(libc); auto dlopen_ = (dlopen_t)dlsym(libc, "dlopen"); TORCH_INTERNAL_ASSERT(dlopen_); diff --git a/torch/csrc/deploy/deploy.h b/torch/csrc/deploy/deploy.h index c6a4794a932d02..b986093ed020ad 100644 --- a/torch/csrc/deploy/deploy.h +++ b/torch/csrc/deploy/deploy.h @@ -1,7 +1,7 @@ #pragma once -#include #include #include +#include #include #include #include @@ -95,7 +95,7 @@ struct TORCH_API LoadBalancer { } void setResourceLimit(size_t n) { TORCH_DEPLOY_TRY - TORCH_INTERNAL_ASSERT(n <= allocated_); + MULTIPY_INTERNAL_ASSERT(n <= allocated_); n_ = n; TORCH_DEPLOY_SAFE_CATCH_RETHROW } diff --git a/torch/csrc/deploy/elf_file.cpp b/torch/csrc/deploy/elf_file.cpp index 85eaaa19cc26ee..ca1e749868e51d 100644 --- a/torch/csrc/deploy/elf_file.cpp +++ b/torch/csrc/deploy/elf_file.cpp @@ -1,5 +1,7 @@ #include +#include #include +#include namespace torch { namespace deploy { @@ -13,7 +15,7 @@ ElfFile::ElfFile(const char* filename) : memFile_(filename) { shdrList_ = (Elf64_Shdr*)(fileData + ehdr_->e_shoff); auto strtabSecNo = ehdr_->e_shstrndx; - TORCH_CHECK( + MULTIPY_CHECK( strtabSecNo >= 0 && strtabSecNo < numSections_, "e_shstrndx out of range"); @@ -25,9 +27,9 @@ ElfFile::ElfFile(const char* filename) : memFile_(filename) { } } -at::optional
ElfFile::findSection(const char* name) const { - TORCH_CHECK(name != nullptr, "Null name"); - at::optional<Section>
found = at::nullopt; +multipy::optional<Section>
ElfFile::findSection(const char* name) const { + MULTIPY_CHECK(name != nullptr, "Null name"); + multipy::optional<Section>
found = multipy::nullopt; for (const auto& section : sections_) { if (strcmp(name, section.name) == 0) { found = section; @@ -40,13 +42,13 @@ at::optional<Section>
ElfFile::findSection(const char* name) const { void ElfFile::checkFormat() const { // check the magic numbers - TORCH_CHECK( + MULTIPY_CHECK( (ehdr_->e_ident[EI_MAG0] == ELFMAG0) && (ehdr_->e_ident[EI_MAG1] == ELFMAG1) && (ehdr_->e_ident[EI_MAG2] == ELFMAG2) && (ehdr_->e_ident[EI_MAG3] == ELFMAG3), "Unexpected magic numbers"); - TORCH_CHECK( + MULTIPY_CHECK( ehdr_->e_ident[EI_CLASS] == ELFCLASS64, "Only support 64bit ELF file"); } diff --git a/torch/csrc/deploy/elf_file.h b/torch/csrc/deploy/elf_file.h index e27750c01139e0..31ea7976af88c5 100644 --- a/torch/csrc/deploy/elf_file.h +++ b/torch/csrc/deploy/elf_file.h @@ -1,7 +1,8 @@ #pragma once -#include #include +#include +#include #include #include @@ -30,7 +31,7 @@ struct Section { class ElfFile { public: explicit ElfFile(const char* filename); - at::optional
findSection(const char* name) const; + multipy::optional<Section>
findSection(const char* name) const; private: Section toSection(Elf64_Shdr* shdr) { @@ -40,7 +41,7 @@ class ElfFile { const char* name = ""; if (strtabSection_) { - TORCH_CHECK(nameOff >= 0 && nameOff < strtabSection_.len); + MULTIPY_CHECK(nameOff >= 0 && nameOff < strtabSection_.len); name = strtabSection_.start + nameOff; } const char* start = memFile_.data() + shOff; @@ -48,7 +49,7 @@ class ElfFile { } [[nodiscard]] const char* str(size_t off) const { - TORCH_CHECK(off < strtabSection_.len, "String table index out of range"); + MULTIPY_CHECK(off < strtabSection_.len, "String table index out of range"); return strtabSection_.start + off; } void checkFormat() const; diff --git a/torch/csrc/deploy/environment.h b/torch/csrc/deploy/environment.h index 4485a4e1d031a4..433ce6bcb3f660 100644 --- a/torch/csrc/deploy/environment.h +++ b/torch/csrc/deploy/environment.h @@ -1,5 +1,6 @@ #pragma once #include +#include #include #include #include @@ -27,7 +28,7 @@ class Environment { // load the zipped torch modules constexpr const char* ZIPPED_TORCH_NAME = ".torch_python_modules"; auto zippedTorchSection = elfFile.findSection(ZIPPED_TORCH_NAME); - TORCH_CHECK( + MULTIPY_CHECK( zippedTorchSection.has_value(), "Missing the zipped torch section"); const char* zippedTorchStart = zippedTorchSection->start; auto zippedTorchSize = zippedTorchSection->len; @@ -35,7 +36,7 @@ class Environment { std::string zipArchive = std::string(pythonAppDir) + "/torch_python_modules.zip"; auto zippedFile = fopen(zipArchive.c_str(), "wb"); - TORCH_CHECK( + MULTIPY_CHECK( zippedFile != nullptr, "Fail to create file: ", strerror(errno)); fwrite(zippedTorchStart, 1, zippedTorchSize, zippedFile); fclose(zippedFile); diff --git a/torch/csrc/deploy/example/examples.py b/torch/csrc/deploy/example/examples.py index 25bb54a0c606e7..73eeb2149b545f 100644 --- a/torch/csrc/deploy/example/examples.py +++ b/torch/csrc/deploy/example/examples.py @@ -146,8 +146,7 @@ class MultiReturn(torch.nn.Module): def __init__(self): super(MultiReturn, self).__init__() - def forward(self, t): - # type: (Tuple[Tensor, Tensor]) -> Tuple[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor]] + def forward(self, t: Tuple[Tensor, Tensor]) -> Tuple[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor]]: a, b = t result = ((a.masked_fill_(b, 0.1), b), (torch.ones_like(a), b)) return result diff --git a/torch/csrc/deploy/interpreter/Optional.hpp b/torch/csrc/deploy/interpreter/Optional.hpp new file mode 100644 index 00000000000000..92b73d7f6fbba4 --- /dev/null +++ b/torch/csrc/deploy/interpreter/Optional.hpp @@ -0,0 +1,1107 @@ +// Copyright (C) 2011 - 2012 Andrzej Krzemienski. +// +// Use, modification, and distribution is subject to the Boost Software +// License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at +// http://www.boost.org/LICENSE_1_0.txt) +// +// The idea and interface is based on Boost.Optional library +// authored by Fernando Luis Cacciola Carballal +// +// Source: https://github.com/akrzemi1/Optional + +#ifndef ___OPTIONAL_HPP___ +#define ___OPTIONAL_HPP___ + +#include +#include +#include +#include +#include +#include +#include + +#define TR2_OPTIONAL_REQUIRES(...) 
\ + typename std::enable_if<__VA_ARGS__::value, bool>::type = false + +#if defined __GNUC__ // NOTE: GNUC is also defined for Clang +#if (__GNUC__ == 4) && (__GNUC_MINOR__ >= 8) +#define TR2_OPTIONAL_GCC_4_8_AND_HIGHER___ +#elif (__GNUC__ > 4) +#define TR2_OPTIONAL_GCC_4_8_AND_HIGHER___ +#endif + +#if (__GNUC__ == 4) && (__GNUC_MINOR__ >= 7) +#define TR2_OPTIONAL_GCC_4_7_AND_HIGHER___ +#elif (__GNUC__ > 4) +#define TR2_OPTIONAL_GCC_4_7_AND_HIGHER___ +#endif + +#if (__GNUC__ == 4) && (__GNUC_MINOR__ == 8) && (__GNUC_PATCHLEVEL__ >= 1) +#define TR2_OPTIONAL_GCC_4_8_1_AND_HIGHER___ +#elif (__GNUC__ == 4) && (__GNUC_MINOR__ >= 9) +#define TR2_OPTIONAL_GCC_4_8_1_AND_HIGHER___ +#elif (__GNUC__ > 4) +#define TR2_OPTIONAL_GCC_4_8_1_AND_HIGHER___ +#endif +#endif + +#if defined __clang_major__ +#if (__clang_major__ == 3 && __clang_minor__ >= 5) +#define TR2_OPTIONAL_CLANG_3_5_AND_HIGHTER_ +#elif (__clang_major__ > 3) +#define TR2_OPTIONAL_CLANG_3_5_AND_HIGHTER_ +#endif +#if defined TR2_OPTIONAL_CLANG_3_5_AND_HIGHTER_ +#define TR2_OPTIONAL_CLANG_3_4_2_AND_HIGHER_ +#elif ( \ + __clang_major__ == 3 && __clang_minor__ == 4 && __clang_patchlevel__ >= 2) +#define TR2_OPTIONAL_CLANG_3_4_2_AND_HIGHER_ +#endif +#endif + +#if defined _MSC_VER +#if (_MSC_VER >= 1900) +#define TR2_OPTIONAL_MSVC_2015_AND_HIGHER___ +#endif +#endif + +#if defined __clang__ +#if (__clang_major__ > 2) || (__clang_major__ == 2) && (__clang_minor__ >= 9) +#define OPTIONAL_HAS_THIS_RVALUE_REFS 1 +#else +#define OPTIONAL_HAS_THIS_RVALUE_REFS 0 +#endif +#elif defined TR2_OPTIONAL_GCC_4_8_1_AND_HIGHER___ +#define OPTIONAL_HAS_THIS_RVALUE_REFS 1 +#elif defined TR2_OPTIONAL_MSVC_2015_AND_HIGHER___ +#define OPTIONAL_HAS_THIS_RVALUE_REFS 1 +#else +#define OPTIONAL_HAS_THIS_RVALUE_REFS 0 +#endif + +#if defined TR2_OPTIONAL_GCC_4_8_1_AND_HIGHER___ +#define OPTIONAL_HAS_CONSTEXPR_INIT_LIST 1 +#define OPTIONAL_CONSTEXPR_INIT_LIST constexpr +#else +#define OPTIONAL_HAS_CONSTEXPR_INIT_LIST 0 +#define OPTIONAL_CONSTEXPR_INIT_LIST +#endif + +#if defined TR2_OPTIONAL_CLANG_3_5_AND_HIGHTER_ && (defined __cplusplus) && \ + (__cplusplus != 201103L) +#define OPTIONAL_HAS_MOVE_ACCESSORS 1 +#else +#define OPTIONAL_HAS_MOVE_ACCESSORS 0 +#endif + +// In C++11 constexpr implies const, so we need to make non-const members also +// non-constexpr +#if (defined __cplusplus) && (__cplusplus == 201103L) +#define OPTIONAL_MUTABLE_CONSTEXPR +#else +#define OPTIONAL_MUTABLE_CONSTEXPR constexpr +#endif + +namespace multipy { + +// BEGIN workaround for missing std::is_trivially_destructible +#if defined TR2_OPTIONAL_GCC_4_8_AND_HIGHER___ +// leave it: it is already there +#elif defined TR2_OPTIONAL_CLANG_3_4_2_AND_HIGHER_ +// leave it: it is already there +#elif defined TR2_OPTIONAL_MSVC_2015_AND_HIGHER___ +// leave it: it is already there +#elif defined TR2_OPTIONAL_DISABLE_EMULATION_OF_TYPE_TRAITS +// leave it: the user doesn't want it +#else +template +using std::is_trivially_destructible = std::has_trivial_destructor; +#endif +// END workaround for missing std::is_trivially_destructible + +#if (defined TR2_OPTIONAL_GCC_4_7_AND_HIGHER___) +// leave it; our metafunctions are already defined. +#elif defined TR2_OPTIONAL_CLANG_3_4_2_AND_HIGHER_ +// leave it; our metafunctions are already defined. 
+#elif defined TR2_OPTIONAL_MSVC_2015_AND_HIGHER___ +// leave it: it is already there +#elif defined TR2_OPTIONAL_DISABLE_EMULATION_OF_TYPE_TRAITS +// leave it: the user doesn't want it +#else + +// workaround for missing traits in GCC and CLANG +template +struct std::is_nothrow_move_constructible { + constexpr static bool value = std::is_nothrow_constructible::value; +}; + +template +struct is_assignable { + template + constexpr static bool has_assign(...) { + return false; + } + + template < + class X, + class Y, + size_t S = sizeof((std::declval() = std::declval(), true))> + // the comma operator is necessary for the cases where operator= returns void + constexpr static bool has_assign(bool) { + return true; + } + + constexpr static bool value = has_assign(true); +}; + +template +struct std::is_nothrow_move_assignable { + template + struct has_nothrow_move_assign { + constexpr static bool value = false; + }; + + template + struct has_nothrow_move_assign { + constexpr static bool value = + noexcept(std::declval() = std::declval()); + }; + + constexpr static bool value = + has_nothrow_move_assign::value>::value; +}; +// end workaround + +#endif + +// 20.5.4, optional for object types +template +class optional; + +// 20.5.5, optional for lvalue reference types +template +class optional; + +// workaround: std utility functions aren't constexpr yet +template +inline constexpr T&& constexpr_forward( + typename std::remove_reference::type& t) noexcept { + return static_cast(t); +} + +template +inline constexpr T&& constexpr_forward( + typename std::remove_reference::type&& t) noexcept { + static_assert(!std::is_lvalue_reference::value, "!!"); + return static_cast(t); +} + +template +inline constexpr typename std::remove_reference::type&& constexpr_move( + T&& t) noexcept { + return static_cast::type&&>(t); +} + +#if defined NDEBUG +#define TR2_OPTIONAL_ASSERTED_EXPRESSION(CHECK, EXPR) (EXPR) +#else +#define TR2_OPTIONAL_ASSERTED_EXPRESSION(CHECK, EXPR) \ + ((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR))) +#endif + +namespace detail_ { + +// static_addressof: a constexpr version of addressof +template +struct has_overloaded_addressof { + template + constexpr static bool has_overload(...) 
{ + return false; + } + + template ().operator&())> + constexpr static bool has_overload(bool) { + return true; + } + + constexpr static bool value = has_overload(true); +}; + +template )> +constexpr T* static_addressof(T& ref) { + return &ref; +} + +template )> +T* static_addressof(T& ref) { + return std::addressof(ref); +} + +// the call to convert(b) has return type A and converts b to type A iff b +// decltype(b) is implicitly convertible to A +template +constexpr U convert(U v) { + return v; +} + +namespace swap_ns { +using std::swap; + +template +void adl_swap(T& t, T& u) noexcept(noexcept(swap(t, u))) { + swap(t, u); +} + +} // namespace swap_ns + +} // namespace detail_ + +constexpr struct trivial_init_t { +} trivial_init{}; + +// 20.5.6, In-place construction +constexpr struct in_place_t { +} in_place{}; + +// 20.5.7, Disengaged state indicator +struct nullopt_t { + struct init {}; + constexpr explicit nullopt_t(init) {} +}; +constexpr nullopt_t nullopt{nullopt_t::init()}; + +// 20.5.8, class bad_optional_access +class bad_optional_access : public std::logic_error { + public: + explicit bad_optional_access(const std::string& what_arg) + : std::logic_error{what_arg} {} + explicit bad_optional_access(const char* what_arg) + : std::logic_error{what_arg} {} +}; + +template +union storage_t { + unsigned char dummy_; + T value_; + + constexpr storage_t(trivial_init_t) noexcept : dummy_(){}; + + template + constexpr storage_t(Args&&... args) + : value_(constexpr_forward(args)...) {} + + ~storage_t() {} +}; + +template +union constexpr_storage_t { + unsigned char dummy_; + T value_; + + constexpr constexpr_storage_t(trivial_init_t) noexcept : dummy_(){}; + + template + constexpr constexpr_storage_t(Args&&... args) + : value_(constexpr_forward(args)...) {} + + ~constexpr_storage_t() = default; +}; + +template +struct optional_base { + bool init_; + storage_t storage_; + + constexpr optional_base() noexcept : init_(false), storage_(trivial_init){}; + + explicit constexpr optional_base(const T& v) : init_(true), storage_(v) {} + + explicit constexpr optional_base(T&& v) + : init_(true), storage_(constexpr_move(v)) {} + + template + explicit optional_base(in_place_t, Args&&... args) + : init_(true), storage_(constexpr_forward(args)...) {} + + template < + class U, + class... Args, + TR2_OPTIONAL_REQUIRES(std::is_constructible>)> + explicit optional_base( + in_place_t, + std::initializer_list il, + Args&&... args) + : init_(true), storage_(il, std::forward(args)...) {} + + ~optional_base() { + if (init_) + storage_.value_.T::~T(); + } +}; + +template +struct constexpr_optional_base { + bool init_; + constexpr_storage_t storage_; + + constexpr constexpr_optional_base() noexcept + : init_(false), storage_(trivial_init){}; + + explicit constexpr constexpr_optional_base(const T& v) + : init_(true), storage_(v) {} + + explicit constexpr constexpr_optional_base(T&& v) + : init_(true), storage_(constexpr_move(v)) {} + + template + explicit constexpr constexpr_optional_base(in_place_t, Args&&... args) + : init_(true), storage_(constexpr_forward(args)...) {} + + template < + class U, + class... Args, + TR2_OPTIONAL_REQUIRES(std::is_constructible>)> + OPTIONAL_CONSTEXPR_INIT_LIST explicit constexpr_optional_base( + in_place_t, + std::initializer_list il, + Args&&... args) + : init_(true), storage_(il, std::forward(args)...) 
{} + + ~constexpr_optional_base() = default; +}; + +template +using OptionalBase = typename std::conditional< + std::is_trivially_destructible::value, // if possible + constexpr_optional_base::type>, // use base with trivial destructor + optional_base::type>>::type; + +template +class optional : private OptionalBase { + static_assert( + !std::is_same::type, nullopt_t>::value, + "bad T"); + static_assert( + !std::is_same::type, in_place_t>::value, + "bad T"); + + constexpr bool initialized() const noexcept { + return OptionalBase::init_; + } + typename std::remove_const::type* dataptr() { + return std::addressof(OptionalBase::storage_.value_); + } + constexpr const T* dataptr() const { + return detail_::static_addressof(OptionalBase::storage_.value_); + } + +#if OPTIONAL_HAS_THIS_RVALUE_REFS == 1 + constexpr const T& contained_val() const& { + return OptionalBase::storage_.value_; + } +#if OPTIONAL_HAS_MOVE_ACCESSORS == 1 + OPTIONAL_MUTABLE_CONSTEXPR T&& contained_val() && { + return std::move(OptionalBase::storage_.value_); + } + OPTIONAL_MUTABLE_CONSTEXPR T& contained_val() & { + return OptionalBase::storage_.value_; + } +#else + T& contained_val() & { + return OptionalBase::storage_.value_; + } + T&& contained_val() && { + return std::move(OptionalBase::storage_.value_); + } +#endif +#else + constexpr const T& contained_val() const { + return OptionalBase::storage_.value_; + } + T& contained_val() { + return OptionalBase::storage_.value_; + } +#endif + + void clear() noexcept { + if (initialized()) + dataptr()->T::~T(); + OptionalBase::init_ = false; + } + + template + void initialize(Args&&... args) noexcept( + noexcept(T(std::forward(args)...))) { + assert(!OptionalBase::init_); + ::new (static_cast(dataptr())) T(std::forward(args)...); + OptionalBase::init_ = true; + } + + template + void initialize(std::initializer_list il, Args&&... args) noexcept( + noexcept(T(il, std::forward(args)...))) { + assert(!OptionalBase::init_); + ::new (static_cast(dataptr())) T(il, std::forward(args)...); + OptionalBase::init_ = true; + } + + public: + typedef T value_type; + + // 20.5.5.1, constructors + constexpr optional() noexcept : OptionalBase(){}; + constexpr optional(nullopt_t) noexcept : OptionalBase(){}; + + optional(const optional& rhs) : OptionalBase() { + if (rhs.initialized()) { + ::new (static_cast(dataptr())) T(*rhs); + OptionalBase::init_ = true; + } + } + + optional(optional&& rhs) noexcept( + std::is_nothrow_move_constructible::value) + : OptionalBase() { + if (rhs.initialized()) { + ::new (static_cast(dataptr())) T(std::move(*rhs)); + OptionalBase::init_ = true; + } + } + + constexpr optional(const T& v) : OptionalBase(v) {} + + constexpr optional(T&& v) : OptionalBase(constexpr_move(v)) {} + + template + explicit constexpr optional(in_place_t, Args&&... args) + : OptionalBase(in_place_t{}, constexpr_forward(args)...) {} + + template < + class U, + class... Args, + TR2_OPTIONAL_REQUIRES(std::is_constructible>)> + OPTIONAL_CONSTEXPR_INIT_LIST explicit optional( + in_place_t, + std::initializer_list il, + Args&&... args) + : OptionalBase(in_place_t{}, il, constexpr_forward(args)...) 
{} + + // 20.5.4.2, Destructor + ~optional() = default; + + // 20.5.4.3, assignment + optional& operator=(nullopt_t) noexcept { + clear(); + return *this; + } + + optional& operator=(const optional& rhs) { + if (initialized() == true && rhs.initialized() == false) + clear(); + else if (initialized() == false && rhs.initialized() == true) + initialize(*rhs); + else if (initialized() == true && rhs.initialized() == true) + contained_val() = *rhs; + return *this; + } + + optional& operator=(optional&& rhs) noexcept( + std::is_nothrow_move_assignable::value&& + std::is_nothrow_move_constructible::value) { + if (initialized() == true && rhs.initialized() == false) + clear(); + else if (initialized() == false && rhs.initialized() == true) + initialize(std::move(*rhs)); + else if (initialized() == true && rhs.initialized() == true) + contained_val() = std::move(*rhs); + return *this; + } + + template + auto operator=(U&& v) -> typename std::enable_if< + std::is_same::type, T>::value, + optional&>::type { + if (initialized()) { + contained_val() = std::forward(v); + } else { + initialize(std::forward(v)); + } + return *this; + } + + template + void emplace(Args&&... args) { + clear(); + initialize(std::forward(args)...); + } + + template + void emplace(std::initializer_list il, Args&&... args) { + clear(); + initialize(il, std::forward(args)...); + } + + // 20.5.4.4, Swap + void swap(optional& rhs) noexcept( + std::is_nothrow_move_constructible::value&& noexcept( + detail_::swap_ns::adl_swap(std::declval(), std::declval()))) { + if (initialized() == true && rhs.initialized() == false) { + rhs.initialize(std::move(**this)); + clear(); + } else if (initialized() == false && rhs.initialized() == true) { + initialize(std::move(*rhs)); + rhs.clear(); + } else if (initialized() == true && rhs.initialized() == true) { + using std::swap; + swap(**this, *rhs); + } + } + + // 20.5.4.5, Observers + + explicit constexpr operator bool() const noexcept { + return initialized(); + } + constexpr bool has_value() const noexcept { + return initialized(); + } + + constexpr T const* operator->() const { + return TR2_OPTIONAL_ASSERTED_EXPRESSION(initialized(), dataptr()); + } + +#if OPTIONAL_HAS_MOVE_ACCESSORS == 1 + + OPTIONAL_MUTABLE_CONSTEXPR T* operator->() { + assert(initialized()); + return dataptr(); + } + + constexpr T const& operator*() const& { + return TR2_OPTIONAL_ASSERTED_EXPRESSION(initialized(), contained_val()); + } + + OPTIONAL_MUTABLE_CONSTEXPR T& operator*() & { + assert(initialized()); + return contained_val(); + } + + OPTIONAL_MUTABLE_CONSTEXPR T&& operator*() && { + assert(initialized()); + return constexpr_move(contained_val()); + } + + constexpr T const& value() const& { + return initialized() + ? contained_val() + : (throw bad_optional_access("bad optional access"), contained_val()); + } + + OPTIONAL_MUTABLE_CONSTEXPR T& value() & { + return initialized() + ? contained_val() + : (throw bad_optional_access("bad optional access"), contained_val()); + } + + OPTIONAL_MUTABLE_CONSTEXPR T&& value() && { + if (!initialized()) + throw bad_optional_access("bad optional access"); + return std::move(contained_val()); + } + +#else + + T* operator->() { + assert(initialized()); + return dataptr(); + } + + constexpr T const& operator*() const { + return TR2_OPTIONAL_ASSERTED_EXPRESSION(initialized(), contained_val()); + } + + T& operator*() { + assert(initialized()); + return contained_val(); + } + + constexpr T const& value() const { + return initialized() + ? 
contained_val() + : (throw bad_optional_access("bad optional access"), contained_val()); + } + + T& value() { + return initialized() + ? contained_val() + : (throw bad_optional_access("bad optional access"), contained_val()); + } + +#endif + +#if OPTIONAL_HAS_THIS_RVALUE_REFS == 1 + + template + constexpr T value_or(V&& v) const& { + return *this ? **this : detail_::convert(constexpr_forward(v)); + } + +#if OPTIONAL_HAS_MOVE_ACCESSORS == 1 + + template + OPTIONAL_MUTABLE_CONSTEXPR T value_or(V&& v) && { + return *this + ? constexpr_move(const_cast&>(*this).contained_val()) + : detail_::convert(constexpr_forward(v)); + } + +#else + + template + T value_or(V&& v) && { + return *this + ? constexpr_move(const_cast&>(*this).contained_val()) + : detail_::convert(constexpr_forward(v)); + } + +#endif + +#else + + template + constexpr T value_or(V&& v) const { + return *this ? **this : detail_::convert(constexpr_forward(v)); + } + +#endif + + // 20.6.3.6, modifiers + void reset() noexcept { + clear(); + } +}; + +template +class optional { + static_assert(!std::is_same::value, "bad T"); + static_assert(!std::is_same::value, "bad T"); + T* ref; + + public: + // 20.5.5.1, construction/destruction + constexpr optional() noexcept : ref(nullptr) {} + + constexpr optional(nullopt_t) noexcept : ref(nullptr) {} + + constexpr optional(T& v) noexcept : ref(detail_::static_addressof(v)) {} + + optional(T&&) = delete; + + constexpr optional(const optional& rhs) noexcept : ref(rhs.ref) {} + + explicit constexpr optional(in_place_t, T& v) noexcept + : ref(detail_::static_addressof(v)) {} + + explicit optional(in_place_t, T&&) = delete; + + ~optional() = default; + + // 20.5.5.2, mutation + optional& operator=(nullopt_t) noexcept { + ref = nullptr; + return *this; + } + + // optional& operator=(const optional& rhs) noexcept { + // ref = rhs.ref; + // return *this; + // } + + // optional& operator=(optional&& rhs) noexcept { + // ref = rhs.ref; + // return *this; + // } + + template + auto operator=(U&& rhs) noexcept -> typename std::enable_if< + std::is_same::type, optional>::value, + optional&>::type { + ref = rhs.ref; + return *this; + } + + template + auto operator=(U&& rhs) noexcept -> typename std::enable_if< + !std::is_same::type, optional>::value, + optional&>::type = delete; + + void emplace(T& v) noexcept { + ref = detail_::static_addressof(v); + } + + void emplace(T&&) = delete; + + void swap(optional& rhs) noexcept { + std::swap(ref, rhs.ref); + } + + // 20.5.5.3, observers + constexpr T* operator->() const { + return TR2_OPTIONAL_ASSERTED_EXPRESSION(ref, ref); + } + + constexpr T& operator*() const { + return TR2_OPTIONAL_ASSERTED_EXPRESSION(ref, *ref); + } + + constexpr T& value() const { + return ref ? *ref + : (throw bad_optional_access("bad optional access"), *ref); + } + + explicit constexpr operator bool() const noexcept { + return ref != nullptr; + } + + constexpr bool has_value() const noexcept { + return ref != nullptr; + } + + template + constexpr typename std::decay::type value_or(V&& v) const { + return *this ? **this + : detail_::convert::type>( + constexpr_forward(v)); + } + + // x.x.x.x, modifiers + void reset() noexcept { + ref = nullptr; + } +}; + +template +class optional { + static_assert(sizeof(T) == 0, "optional rvalue references disallowed"); +}; + +// 20.5.8, Relational operators +template +constexpr bool operator==(const optional& x, const optional& y) { + return bool(x) != bool(y) ? false : bool(x) == false ? 
true : *x == *y; +} + +template +constexpr bool operator!=(const optional& x, const optional& y) { + return !(x == y); +} + +template +constexpr bool operator<(const optional& x, const optional& y) { + return (!y) ? false : (!x) ? true : *x < *y; +} + +template +constexpr bool operator>(const optional& x, const optional& y) { + return (y < x); +} + +template +constexpr bool operator<=(const optional& x, const optional& y) { + return !(y < x); +} + +template +constexpr bool operator>=(const optional& x, const optional& y) { + return !(x < y); +} + +// 20.5.9, Comparison with nullopt +template +constexpr bool operator==(const optional& x, nullopt_t) noexcept { + return (!x); +} + +template +constexpr bool operator==(nullopt_t, const optional& x) noexcept { + return (!x); +} + +template +constexpr bool operator!=(const optional& x, nullopt_t) noexcept { + return bool(x); +} + +template +constexpr bool operator!=(nullopt_t, const optional& x) noexcept { + return bool(x); +} + +template +constexpr bool operator<(const optional&, nullopt_t) noexcept { + return false; +} + +template +constexpr bool operator<(nullopt_t, const optional& x) noexcept { + return bool(x); +} + +template +constexpr bool operator<=(const optional& x, nullopt_t) noexcept { + return (!x); +} + +template +constexpr bool operator<=(nullopt_t, const optional&) noexcept { + return true; +} + +template +constexpr bool operator>(const optional& x, nullopt_t) noexcept { + return bool(x); +} + +template +constexpr bool operator>(nullopt_t, const optional&) noexcept { + return false; +} + +template +constexpr bool operator>=(const optional&, nullopt_t) noexcept { + return true; +} + +template +constexpr bool operator>=(nullopt_t, const optional& x) noexcept { + return (!x); +} + +// 20.5.10, Comparison with T +template +constexpr bool operator==(const optional& x, const T& v) { + return bool(x) ? *x == v : false; +} + +template +constexpr bool operator==(const T& v, const optional& x) { + return bool(x) ? v == *x : false; +} + +template +constexpr bool operator!=(const optional& x, const T& v) { + return bool(x) ? *x != v : true; +} + +template +constexpr bool operator!=(const T& v, const optional& x) { + return bool(x) ? v != *x : true; +} + +template +constexpr bool operator<(const optional& x, const T& v) { + return bool(x) ? *x < v : true; +} + +template +constexpr bool operator>(const T& v, const optional& x) { + return bool(x) ? v > *x : true; +} + +template +constexpr bool operator>(const optional& x, const T& v) { + return bool(x) ? *x > v : false; +} + +template +constexpr bool operator<(const T& v, const optional& x) { + return bool(x) ? v < *x : false; +} + +template +constexpr bool operator>=(const optional& x, const T& v) { + return bool(x) ? *x >= v : false; +} + +template +constexpr bool operator<=(const T& v, const optional& x) { + return bool(x) ? v <= *x : false; +} + +template +constexpr bool operator<=(const optional& x, const T& v) { + return bool(x) ? *x <= v : true; +} + +template +constexpr bool operator>=(const T& v, const optional& x) { + return bool(x) ? v >= *x : true; +} + +// Comparison of optional with T +template +constexpr bool operator==(const optional& x, const T& v) { + return bool(x) ? *x == v : false; +} + +template +constexpr bool operator==(const T& v, const optional& x) { + return bool(x) ? v == *x : false; +} + +template +constexpr bool operator!=(const optional& x, const T& v) { + return bool(x) ? 
*x != v : true; +} + +template +constexpr bool operator!=(const T& v, const optional& x) { + return bool(x) ? v != *x : true; +} + +template +constexpr bool operator<(const optional& x, const T& v) { + return bool(x) ? *x < v : true; +} + +template +constexpr bool operator>(const T& v, const optional& x) { + return bool(x) ? v > *x : true; +} + +template +constexpr bool operator>(const optional& x, const T& v) { + return bool(x) ? *x > v : false; +} + +template +constexpr bool operator<(const T& v, const optional& x) { + return bool(x) ? v < *x : false; +} + +template +constexpr bool operator>=(const optional& x, const T& v) { + return bool(x) ? *x >= v : false; +} + +template +constexpr bool operator<=(const T& v, const optional& x) { + return bool(x) ? v <= *x : false; +} + +template +constexpr bool operator<=(const optional& x, const T& v) { + return bool(x) ? *x <= v : true; +} + +template +constexpr bool operator>=(const T& v, const optional& x) { + return bool(x) ? v >= *x : true; +} + +// Comparison of optional with T +template +constexpr bool operator==(const optional& x, const T& v) { + return bool(x) ? *x == v : false; +} + +template +constexpr bool operator==(const T& v, const optional& x) { + return bool(x) ? v == *x : false; +} + +template +constexpr bool operator!=(const optional& x, const T& v) { + return bool(x) ? *x != v : true; +} + +template +constexpr bool operator!=(const T& v, const optional& x) { + return bool(x) ? v != *x : true; +} + +template +constexpr bool operator<(const optional& x, const T& v) { + return bool(x) ? *x < v : true; +} + +template +constexpr bool operator>(const T& v, const optional& x) { + return bool(x) ? v > *x : true; +} + +template +constexpr bool operator>(const optional& x, const T& v) { + return bool(x) ? *x > v : false; +} + +template +constexpr bool operator<(const T& v, const optional& x) { + return bool(x) ? v < *x : false; +} + +template +constexpr bool operator>=(const optional& x, const T& v) { + return bool(x) ? *x >= v : false; +} + +template +constexpr bool operator<=(const T& v, const optional& x) { + return bool(x) ? v <= *x : false; +} + +template +constexpr bool operator<=(const optional& x, const T& v) { + return bool(x) ? *x <= v : true; +} + +template +constexpr bool operator>=(const T& v, const optional& x) { + return bool(x) ? v >= *x : true; +} + +// 20.5.12, Specialized algorithms +template +void swap(optional& x, optional& y) noexcept(noexcept(x.swap(y))) { + x.swap(y); +} + +template +constexpr optional::type> make_optional(T&& v) { + return optional::type>(constexpr_forward(v)); +} + +template +constexpr optional make_optional(std::reference_wrapper v) { + return optional(v.get()); +} + +} // namespace multipy + +namespace std { +template +struct hash> { + typedef typename hash::result_type result_type; + typedef multipy::optional argument_type; + + constexpr result_type operator()(argument_type const& arg) const { + return arg ? std::hash{}(*arg) : result_type{}; + } +}; + +template +struct hash> { + typedef typename hash::result_type result_type; + typedef multipy::optional argument_type; + + constexpr result_type operator()(argument_type const& arg) const { + return arg ? 
std::hash{}(*arg) : result_type{}; + } +}; +} // namespace std + +#undef TR2_OPTIONAL_REQUIRES +#undef TR2_OPTIONAL_ASSERTED_EXPRESSION + +#endif //___OPTIONAL_HPP___ diff --git a/torch/csrc/deploy/interpreter/builtin_registry.cpp b/torch/csrc/deploy/interpreter/builtin_registry.cpp index a34768c2a009bf..6bcabd969ec521 100644 --- a/torch/csrc/deploy/interpreter/builtin_registry.cpp +++ b/torch/csrc/deploy/interpreter/builtin_registry.cpp @@ -1,6 +1,7 @@ #include #include #include +#include #include namespace torch { @@ -44,7 +45,7 @@ BuiltinRegistryItem::BuiltinRegistryItem( fprintf( stderr, - "torch::deploy builtin %s contains %d modules\n", + "torch::deploy builtin %s contains %u modules\n", name, numModules); } @@ -109,8 +110,8 @@ BuiltinRegistryItem* BuiltinRegistry::getItem(const std::string& name) { : get()->items_[itr->second].get(); } -int BuiltinRegistry::totalNumModules() { - int tot = 0; +unsigned BuiltinRegistry::totalNumModules() { + unsigned tot = 0; for (const auto& itemptr : get()->items_) { tot += itemptr->numModules; } @@ -119,7 +120,7 @@ int BuiltinRegistry::totalNumModules() { struct _frozen* BuiltinRegistry::getAllFrozenModules() { /* Allocate new memory for the combined table */ - int totNumModules = totalNumModules(); + size_t totNumModules = totalNumModules(); struct _frozen* p = nullptr; if (totNumModules > 0 && totNumModules <= SIZE_MAX / sizeof(struct _frozen) - 1) { @@ -134,7 +135,7 @@ struct _frozen* BuiltinRegistry::getAllFrozenModules() { memset(&p[0], 0, sizeof(p[0])); /* Copy the tables into the new memory */ - int off = 0; + unsigned off = 0; for (const auto& itemptr : items()) { if (itemptr->numModules > 0) { memcpy( diff --git a/torch/csrc/deploy/interpreter/builtin_registry.h b/torch/csrc/deploy/interpreter/builtin_registry.h index da7eb372de84f1..533adc2100b3d1 100644 --- a/torch/csrc/deploy/interpreter/builtin_registry.h +++ b/torch/csrc/deploy/interpreter/builtin_registry.h @@ -49,7 +49,7 @@ struct BuiltinRegistryItem { std::vector>&& _builtinModules); const char* name; const struct _frozen* frozenModules; - int numModules; + unsigned numModules; std::vector> builtinModules; }; @@ -77,7 +77,7 @@ class BuiltinRegistry { static const std::vector>& items() { return get()->items_; } - static int totalNumModules(); + static unsigned totalNumModules(); static BuiltinRegistry* get(); static BuiltinRegistryItem* getItem(const std::string& name); static std::vector> getAllBuiltinModules(); diff --git a/torch/csrc/deploy/interpreter/import_find_sharedfuncptr.cpp b/torch/csrc/deploy/interpreter/import_find_sharedfuncptr.cpp index b8af5de3db205e..2a89a96c623d71 100644 --- a/torch/csrc/deploy/interpreter/import_find_sharedfuncptr.cpp +++ b/torch/csrc/deploy/interpreter/import_find_sharedfuncptr.cpp @@ -1,4 +1,5 @@ #include +#include #include using torch::deploy::CustomLibrary; diff --git a/torch/csrc/deploy/interpreter/interpreter_impl.cpp b/torch/csrc/deploy/interpreter/interpreter_impl.cpp index 1ff30f0afbb04b..2af33582aa6dfd 100644 --- a/torch/csrc/deploy/interpreter/interpreter_impl.cpp +++ b/torch/csrc/deploy/interpreter/interpreter_impl.cpp @@ -9,6 +9,7 @@ #include #include #include +#include #include #include @@ -219,8 +220,8 @@ struct __attribute__((visibility("hidden"))) ConcreteInterpreterImpl } void setFindModule( - std::function(const std::string&)> find_module) - override { + std::function(const std::string&)> + find_module) override { std::function wrapped_find_module = [=](const std::string& name) -> py::object { auto r = find_module(name); diff 
--git a/torch/csrc/deploy/interpreter/interpreter_impl.h b/torch/csrc/deploy/interpreter/interpreter_impl.h index 10a1489740ec27..a2dd57e9beeba6 100644 --- a/torch/csrc/deploy/interpreter/interpreter_impl.h +++ b/torch/csrc/deploy/interpreter/interpreter_impl.h @@ -3,6 +3,7 @@ #include #include #include +#include /* Torch Deploy intentionally embeds multiple copies of c++ libraries providing python bindings necessary for torch::deploy users in the same @@ -15,8 +16,8 @@ the client application. It is safe to throw exception types that are defined once in - the context of the client application, such as c10::Error, which is defined - in libtorch, which isn't duplicated in torch::deploy interpreters. + the context of the client application, such as std::runtime_error, + which isn't duplicated in torch::deploy interpreters. ==> Use TORCH_DEPLOY_TRY, _SAFE_CATCH_RETHROW around _ALL_ torch::deploy APIs @@ -30,20 +31,17 @@ */ #define TORCH_DEPLOY_TRY try { -#define TORCH_DEPLOY_SAFE_CATCH_RETHROW \ - } \ - catch (std::exception & err) { \ - throw c10::Error( \ - std::string( \ - "Exception Caught inside torch::deploy embedded library: \n") + \ - err.what(), \ - ""); \ - } \ - catch (...) { \ - throw c10::Error( \ - std::string( \ - "Unknown Exception Caught inside torch::deploy embedded library"), \ - ""); \ +#define TORCH_DEPLOY_SAFE_CATCH_RETHROW \ + } \ + catch (std::exception & err) { \ + throw std::runtime_error( \ + std::string( \ + "Exception Caught inside torch::deploy embedded library: \n") + \ + err.what()); \ + } \ + catch (...) { \ + throw std::runtime_error(std::string( \ + "Unknown Exception Caught inside torch::deploy embedded library")); \ } namespace torch { namespace deploy { @@ -132,7 +130,7 @@ struct InterpreterSessionImpl { struct InterpreterImpl { virtual InterpreterSessionImpl* acquireSession() = 0; virtual void setFindModule( - std::function(const std::string&)> + std::function(const std::string&)> find_module) = 0; virtual ~InterpreterImpl() = default; // this will uninitialize python }; diff --git a/torch/csrc/deploy/loader.cpp b/torch/csrc/deploy/loader.cpp index f03a2d299a5510..ab4d0c7c329e5e 100644 --- a/torch/csrc/deploy/loader.cpp +++ b/torch/csrc/deploy/loader.cpp @@ -53,8 +53,8 @@ // Get PAGE_SIZE and PAGE_MASK. #include -#include #include +#include #include #include @@ -300,15 +300,15 @@ struct __attribute__((visibility("hidden"))) SystemLibraryImpl SystemLibraryImpl(void* handle, bool steal) : handle_(handle), own_handle_(steal && handle != RTLD_DEFAULT) {} - at::optional sym(const char* name) const override { + multipy::optional sym(const char* name) const override { void* r = dlsym(handle_, name); if (!r) { - return at::nullopt; + return multipy::nullopt; } return (Elf64_Addr)r; } - at::optional tls_sym(const char* name) const override; + multipy::optional tls_sym(const char* name) const override; ~SystemLibraryImpl() override { if (own_handle_) { @@ -534,11 +534,11 @@ struct ElfDynamicInfo { } } - at::optional sym( + multipy::optional sym( const char* name, GnuHash* precomputed_hash = nullptr) const { if (!gnu_bucket_) { - return at::nullopt; // no hashtable was loaded + return multipy::nullopt; // no hashtable was loaded } GnuHash hash_obj = precomputed_hash ? 
*precomputed_hash : GnuHash(name); auto hash = hash_obj.hash; @@ -551,12 +551,12 @@ struct ElfDynamicInfo { const uint32_t h2 = (hash >> gnu_shift2_) % kBloomMaskBits; if ((1 & (bloom_word >> h1) & (bloom_word >> h2)) != 1) { - return at::nullopt; + return multipy::nullopt; } uint32_t sym_idx = gnu_bucket_[hash % gnu_nbucket_]; if (sym_idx == 0) { - return at::nullopt; + return multipy::nullopt; } uint32_t chain_value = 0; @@ -574,12 +574,12 @@ struct ElfDynamicInfo { ((ELF64_ST_TYPE(sym->st_info) == STT_TLS) ? 0 : load_bias_); } // symbol isn't defined - return at::nullopt; + return multipy::nullopt; } } ++sym_idx; } while ((chain_value & 1) == 0); - return at::nullopt; + return multipy::nullopt; } }; @@ -613,7 +613,7 @@ struct AlreadyLoadedSymTable { dyninfo_.initialize_from_dynamic_section(name, dynamic, load_bias, true); } - at::optional sym(const char* name) { + multipy::optional sym(const char* name) { return dyninfo_.sym(name); } }; @@ -626,8 +626,8 @@ static int iterate_cb(struct dl_phdr_info* info, size_t size, void* data) { // with a normal dlsym call. Instead we iterate through all loaded libraries and // check their symbol tables for the symbol. The value of the symbol is the TLS // offset. When we find the library we also get the module id. -at::optional slow_find_tls_symbol_offset(const char* sym_name) { - at::optional result = at::nullopt; +multipy::optional slow_find_tls_symbol_offset(const char* sym_name) { + multipy::optional result = multipy::nullopt; std::function cb = [&](struct dl_phdr_info* info, size_t size) { // std::cout << "SEARCHING .. " << info->dlpi_name << "\n"; @@ -650,10 +650,11 @@ at::optional slow_find_tls_symbol_offset(const char* sym_name) { return result; } -at::optional SystemLibraryImpl::tls_sym(const char* name) const { +multipy::optional SystemLibraryImpl::tls_sym(const char* name) const { if (!sym(name)) { - return at::nullopt; // before we do a bunch of slow lookups to find the - // module_id, check that this even defines the symbol + return multipy::nullopt; // before we do a bunch of slow lookups to find the + // module_id, check that this even defines the + // symbol } if (handle_ == RTLD_DEFAULT) { return slow_find_tls_symbol_offset(name); @@ -675,7 +676,7 @@ at::optional SystemLibraryImpl::tls_sym(const char* name) const { "failed to query dlinfo for module_id"); return TLSIndex{module_id, *r}; } - return at::nullopt; + return multipy::nullopt; } // dlopen does not accept additional search paths as an argument. 
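Aside: the comment above explains that a TLS symbol's value cannot be obtained from a plain dlsym(RTLD_DEFAULT, ...) call, so the loader walks every already-loaded library, probes its symbol table, and records the module id of the library that defines the symbol. The sketch below is a simplified, self-contained illustration of that iterate-and-probe shape only; it is not the code in this patch (which parses the ELF dynamic section and GNU hash table directly to recover the TLS offset). It assumes Linux/glibc with _GNU_SOURCE available, and "some_tls_symbol" is a placeholder name.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <cstdio>

struct SearchState {
  const char* sym_name;   // symbol we are looking for
  const char* library;    // filled in: object that defines it
  size_t tls_module_id;   // filled in: dlpi_tls_modid of that object
  bool found;
};

static int probe_object(struct dl_phdr_info* info, size_t, void* data) {
  SearchState* st = static_cast<SearchState*>(data);
  if (info->dlpi_name == nullptr || info->dlpi_name[0] == '\0') {
    return 0; // skip entries without a path (main executable, vdso)
  }
  // RTLD_NOLOAD: only take a handle if the object is already mapped.
  void* handle = dlopen(info->dlpi_name, RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr) {
    return 0;
  }
  if (dlsym(handle, st->sym_name) != nullptr) {
    st->library = info->dlpi_name;
    st->tls_module_id = info->dlpi_tls_modid;
    st->found = true;
  }
  dlclose(handle);
  return st->found ? 1 : 0; // a non-zero return stops the iteration
}

int main() {
  SearchState st{"some_tls_symbol", nullptr, 0, false}; // placeholder symbol
  dl_iterate_phdr(probe_object, &st);
  if (st.found) {
    std::printf("%s defines it (TLS module id %zu)\n", st.library, st.tls_module_id);
  }
  return 0;
}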
@@ -966,7 +967,7 @@ struct __attribute__((visibility("hidden"))) CustomLibraryImpl dyninfo_.needed_); } - at::optional lookup_symbol(Elf64_Xword r_info) { + multipy::optional lookup_symbol(Elf64_Xword r_info) { const uint32_t r_type = ELF64_R_TYPE(r_info); const uint32_t r_sym = ELF64_R_SYM(r_info); @@ -999,10 +1000,10 @@ struct __attribute__((visibility("hidden"))) CustomLibraryImpl name_.c_str(), sym_name); } - return at::nullopt; + return multipy::nullopt; } - at::optional tls_lookup_symbol(Elf64_Xword r_info) { + multipy::optional tls_lookup_symbol(Elf64_Xword r_info) { const uint32_t r_sym = ELF64_R_SYM(r_info); if (r_sym == 0) { @@ -1030,7 +1031,7 @@ struct __attribute__((visibility("hidden"))) CustomLibraryImpl name_.c_str(), sym_name); } - return at::nullopt; + return multipy::nullopt; } void relocate_one(const Elf64_Rela& reloc) { @@ -1177,16 +1178,16 @@ struct __attribute__((visibility("hidden"))) CustomLibraryImpl f(argc_, argv_, environ); } - at::optional sym(const char* name) const override { + multipy::optional sym(const char* name) const override { return dyninfo_.sym(name); } - at::optional tls_sym(const char* name) const override { + multipy::optional tls_sym(const char* name) const override { auto r = dyninfo_.sym(name); if (r) { return TLSIndex{module_id(), *r}; } - return at::nullopt; + return multipy::nullopt; } void* tls_addr(size_t offset) { diff --git a/torch/csrc/deploy/loader.h b/torch/csrc/deploy/loader.h index eeff1a30174ee9..9e5a7fd4571de8 100644 --- a/torch/csrc/deploy/loader.h +++ b/torch/csrc/deploy/loader.h @@ -1,7 +1,7 @@ #pragma once -#include #include #include +#include #include namespace torch { @@ -19,8 +19,8 @@ struct TLSIndex { struct SymbolProvider { SymbolProvider() = default; - virtual at::optional sym(const char* name) const = 0; - virtual at::optional tls_sym(const char* name) const = 0; + virtual multipy::optional sym(const char* name) const = 0; + virtual multipy::optional tls_sym(const char* name) const = 0; SymbolProvider(const SymbolProvider&) = delete; SymbolProvider& operator=(const SymbolProvider&) = delete; virtual ~SymbolProvider() = default; diff --git a/torch/csrc/deploy/mem_file.h b/torch/csrc/deploy/mem_file.h index c50889f8353bb3..df4fe941ca58c0 100644 --- a/torch/csrc/deploy/mem_file.h +++ b/torch/csrc/deploy/mem_file.h @@ -1,9 +1,9 @@ #pragma once -#include #include #include #include +#include #include #include #include @@ -20,18 +20,21 @@ namespace deploy { struct MemFile { explicit MemFile(const char* filename_) : fd_(0), mem_(nullptr), n_bytes_(0) { fd_ = open(filename_, O_RDONLY); - TORCH_CHECK(fd_ != -1, "failed to open {}: {}", filename_, strerror(errno)); + MULTIPY_CHECK( + fd_ != -1, "failed to open {}: {}" + filename_ + strerror(errno)); // NOLINTNEXTLINE struct stat s; if (-1 == fstat(fd_, &s)) { close(fd_); // destructors don't run during exceptions - TORCH_CHECK(false, "failed to stat {}: {}", filename_, strerror(errno)); + MULTIPY_CHECK( + false, "failed to stat {}: {}" + filename_ + strerror(errno)); } n_bytes_ = s.st_size; mem_ = mmap(nullptr, n_bytes_, PROT_READ, MAP_SHARED, fd_, 0); if (MAP_FAILED == mem_) { close(fd_); - TORCH_CHECK(false, "failed to mmap {}: {}", filename_, strerror(errno)); + MULTIPY_CHECK( + false, "failed to mmap {}: {}" + filename_ + strerror(errno)); } } MemFile(const MemFile&) = delete; diff --git a/torch/csrc/deploy/test_deploy.cpp b/torch/csrc/deploy/test_deploy.cpp index 840720cc01f895..973fbff0fa4f26 100644 --- a/torch/csrc/deploy/test_deploy.cpp +++ 
b/torch/csrc/deploy/test_deploy.cpp @@ -182,13 +182,14 @@ TEST(TorchpyTest, ErrorsReplicatingObj) { auto obj = session1.fromMovable(replicatedObj); // should throw an error when trying to access obj from different session // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - EXPECT_THROW(session2.createMovable(obj), c10::Error); + EXPECT_THROW(session2.createMovable(obj), std::runtime_error); try { session2.createMovable(obj); - } catch (c10::Error& error) { + } catch (std::runtime_error& error) { EXPECT_TRUE( - error.msg().find( - "Cannot create movable from an object that lives in different session") != + std::string(error.what()) + .find( + "Cannot create movable from an object that lives in different session") != std::string::npos); } } @@ -197,15 +198,15 @@ TEST(TorchpyTest, ThrowsSafely) { // See explanation in deploy.h torch::deploy::InterpreterManager manager(3); // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - EXPECT_THROW(manager.loadPackage("some garbage path"), c10::Error); + EXPECT_THROW(manager.loadPackage("some garbage path"), std::runtime_error); torch::deploy::Package p = manager.loadPackage(path("SIMPLE", simple)); // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - EXPECT_THROW(p.loadPickle("some other", "garbage path"), c10::Error); + EXPECT_THROW(p.loadPickle("some other", "garbage path"), std::runtime_error); auto model = p.loadPickle("model", "model.pkl"); // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - EXPECT_THROW(model(at::IValue("unexpected input")), c10::Error); + EXPECT_THROW(model(at::IValue("unexpected input")), std::runtime_error); } TEST(TorchpyTest, AcquireMultipleSessionsInTheSamePackage) { @@ -238,7 +239,7 @@ TEST(TorchpyTest, TensorSharingNotAllowed) { auto t = obj.toIValue().toTensor(); // try to feed it to the other interpreter, should error // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - ASSERT_THROW(I1.global("torch", "sigmoid")({t}), c10::Error); + ASSERT_THROW(I1.global("torch", "sigmoid")({t}), std::runtime_error); } TEST(TorchpyTest, TaggingRace) { @@ -259,7 +260,7 @@ TEST(TorchpyTest, TaggingRace) { try { I.fromIValue(t); success++; - } catch (const c10::Error& e) { + } catch (const std::runtime_error& e) { failed++; } } @@ -279,7 +280,7 @@ TEST(TorchpyTest, DisarmHook) { torch::deploy::InterpreterManager m(1); auto I = m.acquireOne(); // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - ASSERT_THROW(I.fromIValue(t), c10::Error); // NOT a segfault + ASSERT_THROW(I.fromIValue(t), std::runtime_error); // NOT a segfault } TEST(TorchpyTest, RegisterModule) { @@ -291,6 +292,7 @@ TEST(TorchpyTest, RegisterModule) { } } +#ifdef FBCODE_CAFFE2 TEST(TorchpyTest, FxModule) { size_t nthreads = 3; torch::deploy::InterpreterManager manager(nthreads); @@ -317,6 +319,7 @@ TEST(TorchpyTest, FxModule) { ASSERT_TRUE(ref_output.equal(outputs[i])); } } +#endif // Moving a tensor between interpreters should share the underlying storage. 
TEST(TorchpyTest, TensorSerializationSharing) { @@ -479,6 +482,42 @@ TEST(TorchpyTest, TestPyYAML) { } #endif +TEST(TorchpyTest, PrintInstruction) { + const auto jit_script_with_print = R"JIT( + def forward(self, a): + print(a) + return a + a + )JIT"; + + auto input = torch::autograd::make_variable(at::randn({2, 3})); + auto expected_forward = input + input; + + auto module = std::make_shared( + "Module", std::make_shared()); + module->define(jit_script_with_print); + + std::vector inputs{at::IValue(input)}; + + // Checking that a module containing prim::Print() works fine. + auto result1 = (*module)(inputs); + EXPECT_TRUE(result1.toTensor().equal(expected_forward)); + + { + auto interpreterManager = + std::make_shared(1); + + // Checking that a module containing prim::Print() still works fine + // after Python environment was created. + auto result2 = (*module)(inputs); + EXPECT_TRUE(result2.toTensor().equal(expected_forward)); + } + + // Checking that a module containing prim::Print() still works fine + // after Python environment was created and then destroyed. + auto result3 = (*module)(inputs); + EXPECT_TRUE(result3.toTensor().equal(expected_forward)); +} + int main(int argc, char* argv[]) { ::testing::InitGoogleTest(&argc, argv); int rc = RUN_ALL_TESTS(); diff --git a/torch/csrc/deploy/test_deploy_gpu.cpp b/torch/csrc/deploy/test_deploy_gpu.cpp index 8fa154b8070953..48660c79fefa3e 100644 --- a/torch/csrc/deploy/test_deploy_gpu.cpp +++ b/torch/csrc/deploy/test_deploy_gpu.cpp @@ -67,6 +67,7 @@ TEST(TorchDeployGPUTest, UsesDistributed) { } } +#ifdef FBCODE_CAFFE2 TEST(TorchDeployGPUTest, TensorRT) { if (!torch::cuda::is_available()) { GTEST_SKIP(); @@ -85,6 +86,7 @@ TEST(TorchDeployGPUTest, TensorRT) { output.allclose(model(at::IValue{input}).toIValue().toTensor())); } } +#endif // OSS build does not have bultin numpy support yet. Use this flag to guard the // test case. diff --git a/torch/csrc/deploy/test_deploy_missing_interpreter.cpp b/torch/csrc/deploy/test_deploy_missing_interpreter.cpp index 8ac602a3f2fc5e..b47f4556ad781e 100644 --- a/torch/csrc/deploy/test_deploy_missing_interpreter.cpp +++ b/torch/csrc/deploy/test_deploy_missing_interpreter.cpp @@ -10,5 +10,5 @@ int main(int argc, char* argv[]) { TEST(TorchDeployMissingInterpreter, Throws) { // NOLINTNEXTLINE(hicpp-avoid-goto,cppcoreguidelines-avoid-goto) - EXPECT_THROW(torch::deploy::InterpreterManager(1), c10::Error); + EXPECT_THROW(torch::deploy::InterpreterManager(1), std::runtime_error); } diff --git a/torch/csrc/deploy/unity/xar_environment.cpp b/torch/csrc/deploy/unity/xar_environment.cpp index 3ff233b0c420cc..4bb764374525ec 100644 --- a/torch/csrc/deploy/unity/xar_environment.cpp +++ b/torch/csrc/deploy/unity/xar_environment.cpp @@ -2,6 +2,7 @@ #include #include #include +#include #include #include @@ -59,7 +60,7 @@ bool _fileExists(const std::string& filePath) { } void XarEnvironment::setupPythonApp() { - TORCH_CHECK( + MULTIPY_CHECK( !alreadySetupPythonApp_, "Already setup the python application. 
It should only been done once!"); @@ -67,7 +68,8 @@ void XarEnvironment::setupPythonApp() { constexpr const char* SECTION_NAME = ".torch_deploy_payload.unity"; ElfFile elfFile(exePath_.c_str()); auto payloadSection = elfFile.findSection(SECTION_NAME); - TORCH_CHECK(payloadSection != at::nullopt, "Missing the payload section"); + MULTIPY_CHECK( + payloadSection != multipy::nullopt, "Missing the payload section"); const char* pythonAppPkgStart = payloadSection->start; auto pythonAppPkgSize = payloadSection->len; LOG(INFO) << "Embedded binary size " << pythonAppPkgSize; @@ -107,23 +109,26 @@ void XarEnvironment::setupPythonApp() { * past runs. It should be pretty safe to discard them. */ std::string rmCmd = fmt::format("rm -rf {}", pythonAppDir_); - TORCH_CHECK(system(rmCmd.c_str()) == 0, "Fail to remove the directory."); + MULTIPY_CHECK(system(rmCmd.c_str()) == 0, "Fail to remove the directory."); // recreate the directory auto r = mkdir(pythonAppDir_.c_str(), 0777); - TORCH_CHECK(r == 0, "Failed to create directory: ", strerror(errno)); + MULTIPY_CHECK(r == 0, "Failed to create directory: " + strerror(errno)); std::string pythonAppArchive = std::string(pythonAppDir_) + "/python_app.xar"; auto fp = fopen(pythonAppArchive.c_str(), "wb"); - TORCH_CHECK(fp != nullptr, "Fail to create file: ", strerror(errno)); + MULTIPY_CHECK(fp != nullptr, "Fail to create file: " + strerror(errno)); auto written = fwrite(pythonAppPkgStart, 1, pythonAppPkgSize, fp); - TORCH_CHECK(written == pythonAppPkgSize, "Expected written == size"); + MULTIPY_CHECK(written == pythonAppPkgSize, "Expected written == size"); fclose(fp); std::string extractCommand = fmt::format( "unsquashfs -o 4096 -d {} {}", pythonAppRoot_, pythonAppArchive); r = system(extractCommand.c_str()); - TORCH_CHECK(r == 0, "Fail to extract the python package"); + MULTIPY_CHECK( + r == 0, + "Fail to extract the python package" + std::to_string(r) + + extractCommand.c_str()); alreadySetupPythonApp_ = true; } @@ -143,12 +148,9 @@ void XarEnvironment::preloadSharedLibraries() { << " does not exist in the python app root, skip loading it"; continue; } - TORCH_CHECK( + MULTIPY_CHECK( dlopen(preloadList[i], RTLD_GLOBAL | RTLD_LAZY) != nullptr, - "Fail to open the shared library ", - preloadList[i], - ": ", - dlerror()); + "Fail to open the shared library " + preloadList[i] + ": " + dlerror()); } } diff --git a/torch/csrc/distributed/c10d/NCCLUtils.hpp b/torch/csrc/distributed/c10d/NCCLUtils.hpp index 9dabc0c8c3fc35..7ca54d167eadc5 100644 --- a/torch/csrc/distributed/c10d/NCCLUtils.hpp +++ b/torch/csrc/distributed/c10d/NCCLUtils.hpp @@ -25,7 +25,8 @@ const inline char* getNcclErrorDetailStr(ncclResult_t error, c10::optional #include +#include namespace c10d { @@ -180,4 +181,8 @@ ProcessGroup::ProcessGroup(int rank, int size) ProcessGroup::~ProcessGroup() {} +void ProcessGroup::init() { + C10_LOG_API_USAGE_ONCE(fmt::format("c10d.process_group_{}", getBackendName())); +} + } // namespace c10d diff --git a/torch/csrc/distributed/c10d/ProcessGroup.hpp b/torch/csrc/distributed/c10d/ProcessGroup.hpp index f66919a63b4401..f2418eb4bb9ac7 100644 --- a/torch/csrc/distributed/c10d/ProcessGroup.hpp +++ b/torch/csrc/distributed/c10d/ProcessGroup.hpp @@ -427,6 +427,10 @@ class TORCH_API ProcessGroup : public torch::CustomClassHolder { } protected: + // Implementations of this interface need to call this to setup + // appropriate logging etc. + void init(); + const int rank_; const int size_; // Optional sequence number structure for matching collectives. 
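Aside: the new protected ProcessGroup::init() hook above exists so that every backend reports itself once through the API-usage logger, and the Gloo, MPI, and NCCL constructors in the hunks below are updated to call it as their last construction step. The snippet below is a minimal, self-contained analogue of that convention, not the real c10d classes: ProcessGroupBase and ProcessGroupExample are invented names, and std::call_once stands in for C10_LOG_API_USAGE_ONCE at a single call site (at most one log per process).

#include <iostream>
#include <mutex>
#include <string>

class ProcessGroupBase {
 public:
  virtual ~ProcessGroupBase() = default;
  virtual const std::string getBackendName() const = 0;

 protected:
  // Derived backends call this once their own setup is complete.
  void init() {
    static std::once_flag logged;
    std::call_once(logged, [this] {
      std::cout << "c10d.process_group_" << getBackendName() << "\n";
    });
  }
};

class ProcessGroupExample : public ProcessGroupBase {
 public:
  ProcessGroupExample() {
    // ... backend-specific setup (threads, stores, communicators) ...
    init(); // last step of the constructor, as in the Gloo/MPI/NCCL changes below
  }
  const std::string getBackendName() const override {
    return "example";
  }
};

int main() {
  ProcessGroupExample pg;
  (void)pg;
}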
diff --git a/torch/csrc/distributed/c10d/ProcessGroupGloo.cpp b/torch/csrc/distributed/c10d/ProcessGroupGloo.cpp index 1297af592d98e8..0a33784ecaed80 100644 --- a/torch/csrc/distributed/c10d/ProcessGroupGloo.cpp +++ b/torch/csrc/distributed/c10d/ProcessGroupGloo.cpp @@ -763,6 +763,8 @@ ProcessGroupGloo::ProcessGroupGloo( for(const auto i : c10::irange(threads_.size())) { threads_[i] = std::thread(&ProcessGroupGloo::runLoop, this, i); } + + init(); } ProcessGroupGloo::~ProcessGroupGloo() { diff --git a/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp b/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp index 714f3a84deb61f..55d7d7c50441ee 100644 --- a/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp +++ b/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp @@ -310,6 +310,8 @@ ProcessGroupMPI::ProcessGroupMPI(int rank, int size, MPI_Comm pgComm) // Start the worker thread accepting MPI calls workerThread_ = std::thread(&ProcessGroupMPI::runLoop, this); + + init(); } ProcessGroupMPI::~ProcessGroupMPI() { diff --git a/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp index c397937ab79cb3..86d7897f558b14 100644 --- a/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp +++ b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp @@ -1,4 +1,5 @@ #include +#include #include #ifdef USE_C10D_NCCL @@ -282,7 +283,8 @@ ProcessGroupNCCL::WorkNCCL::WorkNCCL( OpType opType, uint64_t seq, const char* profilingTitle, - const c10::optional>& inputs) + const c10::optional>& inputs, + bool desyncDebug) : Work(rank, opType, profilingTitle, inputs), devices_(devices), workStartTime_(std::chrono::steady_clock::now()), @@ -290,8 +292,10 @@ ProcessGroupNCCL::WorkNCCL::WorkNCCL( // Creates the CUDA event wrappers // Note: The actual events are lazily created when first recorded to with // DEFAULT_FLAGS = cudaEventDisableTiming. 
- ncclStartEvents_ = - std::make_shared>(devices.size()); + if (desyncDebug) { + ncclStartEvents_ = + std::make_shared>(devices.size()); + } ncclEndEvents_ = std::make_shared>(devices.size()); ncclComms_.resize(devices.size()); @@ -373,11 +377,20 @@ bool ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const { } bool ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const { - for (const auto i : c10::irange(devices_.size())) { - // Checking the work's corresponding CUDA events' status - if (!(*ncclEndEvents_)[i].query()) { - return false; + try { + for (const auto i : c10::irange(devices_.size())) { + // Checking the work's corresponding CUDA events' status + if (!(*ncclEndEvents_)[i].query()) { + return false; + } + } + } catch (const std::exception& e) { + if (std::string(e.what()).find("driver shutting down") == std::string::npos) { + throw; } + LOG(INFO) << "[Rank " << rank_ + << "] Event query failed with exception: " + << e.what(); } return true; } @@ -537,9 +550,7 @@ ProcessGroupNCCL::ProcessGroupNCCL( "ProcessGroupNCCL is only supported with GPUs, no GPUs found!"); blockingWait_ = parseEnvVarFlag(NCCL_BLOCKING_WAIT); asyncErrorHandling_ = parseEnvVarFlag(NCCL_ASYNC_ERROR_HANDLING); - // Infer desync debug from whether TORCH_DISTRIBUTED_DEBUG >= INFO - // Provide backward support of NCCL_DESYNC_DEBUG - desyncDebug_ = dist_debug_level_ >= DebugLevel::Info || parseEnvVarFlag(NCCL_DESYNC_DEBUG); + desyncDebug_ = parseEnvVarFlag(NCCL_DESYNC_DEBUG); if (blockingWait_) { if (asyncErrorHandling_ || desyncDebug_) { @@ -578,20 +589,25 @@ ProcessGroupNCCL::ProcessGroupNCCL( workCleanupThread_ = std::thread(&ProcessGroupNCCL::workCleanupLoop, this); } - const char* ncclDebugLevel = std::getenv("NCCL_DEBUG"); - - if (!ncclDebugLevel) { - ncclDebugLevel = "UNSET"; - } - + init(); LOG(INFO) << "[Rank " << rank_ << "] ProcessGroupNCCL initialized with following options:" << "\nNCCL_ASYNC_ERROR_HANDLING: " << asyncErrorHandling_ + << "\nNCCL_DESYNC_DEBUG: " << desyncDebug_ << "\nNCCL_BLOCKING_WAIT: " << blockingWait_ << "\nTIMEOUT(ms): " << options_->timeout.count() << "\nUSE_HIGH_PRIORITY_STREAM: " - << options_->is_high_priority_stream - << "\nNCCL_DEBUG: " << ncclDebugLevel; + << options_->is_high_priority_stream; + +#ifdef USE_NCCL_WITH_UCC + static std::once_flag initialize_ucc_lib_flag; + std::call_once(initialize_ucc_lib_flag, [&]{ + uccLib_ = loadTorchUCC(); + if (uccLib_ != nullptr) { + LOG(INFO) << "[Rank " << rank_ << "] torch_ucc.so loaded"; + } + }); +#endif } void ProcessGroupNCCL::runHealthCheck() { @@ -1166,6 +1182,12 @@ std::vector>& ProcessGroupNCCL::getNCCLComm( // [Note 2 ] C10D_NCCL_CHECK(ncclGroupEnd(), c10::nullopt); + // At this point NCCL should have been initialized, hence we can accurately get + // the env value even if NCCL sets it by reading from nccl.conf file + if (getRank() == 0) { + LOG(INFO) << "NCCL_DEBUG: " << parse_env("NCCL_DEBUG"); + } + // See [Group Start/End Note] for (const auto i : c10::irange(ncclActiveGroupCounter_)) { (void)i; @@ -1338,7 +1360,8 @@ c10::intrusive_ptr ProcessGroupNCCL::initWork( opType, seq_, profilingTitle, - inputs); + inputs, + desyncDebug_); } std::vector ProcessGroupNCCL::WorkNCCL::result() { @@ -2259,6 +2282,9 @@ c10::intrusive_ptr ProcessGroupNCCL::gather( invalidArgument("requires empty output on non-root"); } outputs = {}; + // append a empty tensor to the list, we don't use it but + // collective function requires it to invoke its macros + outputs.emplace_back(); } return collective( @@ -2405,6 +2431,10 @@ 
c10::intrusive_ptr ProcessGroupNCCL::_allgather_base( "nccl:_all_gather_base"); } +#ifdef USE_NCCL_WITH_UCC +std::shared_ptr ProcessGroupNCCL::uccLib_ = nullptr; +#endif + } // namespace c10d #endif // USE_C10D_NCCL diff --git a/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp index d13b683a2a33c0..89f0b4b813d859 100644 --- a/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp +++ b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp @@ -12,7 +12,9 @@ #include #include #include +#include +#include #include #include #include @@ -89,7 +91,8 @@ class TORCH_API ProcessGroupNCCL : public ProcessGroup { OpType opType, uint64_t seq, const char* profilingTitle = nullptr, - const c10::optional>& inputs = c10::nullopt); + const c10::optional>& inputs = c10::nullopt, + bool desyncDebug = false); // Copy constructor doing partial copy without outputs_. Cleanup thread // monitors and removes finished works. However it will deadlock when // destructs outputs_ tensors who are view tensors in autograd graph. @@ -622,6 +625,11 @@ class TORCH_API ProcessGroupNCCL : public ProcessGroup { // Counting for the sequential number of NCCL collective call. uint64_t seq_{0}; + +#ifdef USE_NCCL_WITH_UCC + // ProcessGroupUCC shared library handle + static std::shared_ptr uccLib_; +#endif }; } // namespace c10d diff --git a/torch/csrc/distributed/c10d/UCCForNCCL.hpp b/torch/csrc/distributed/c10d/UCCForNCCL.hpp new file mode 100644 index 00000000000000..ce38894faebc13 --- /dev/null +++ b/torch/csrc/distributed/c10d/UCCForNCCL.hpp @@ -0,0 +1,25 @@ +#pragma once + +#include +#include +#include +#include + +#include + +namespace c10d { + +inline std::shared_ptr loadTorchUCC() { + const char *path = std::getenv("TORCH_UCC_LIBRARY_PATH"); + if (path != nullptr) { + try { + return std::make_shared(path); + } catch (const c10::DynamicLibraryError &e) { + TORCH_WARN("TORCH_UCC_LIBRARY_PATH is set, " + "but the loading of torch_ucc.so failed with:", e.msg()); + } + } + return nullptr; +} + +} // namespace c10d diff --git a/torch/csrc/distributed/c10d/Utils.hpp b/torch/csrc/distributed/c10d/Utils.hpp index efa0a7e7ff687b..501993a728b7e1 100644 --- a/torch/csrc/distributed/c10d/Utils.hpp +++ b/torch/csrc/distributed/c10d/Utils.hpp @@ -407,7 +407,7 @@ inline void checkSplitSizes( "Tensor's dim 0 does not divide equally across group size"); } else { TORCH_CHECK( - split_sizes.size() == group_size, + split_sizes.size() == static_cast(group_size), "Number of tensor splits not equal to group size"); const auto sum = c10::sum_integers(split_sizes); TORCH_CHECK( diff --git a/torch/csrc/distributed/c10d/debug.h b/torch/csrc/distributed/c10d/debug.h index 7c326b2380eff4..ecfb4944829570 100644 --- a/torch/csrc/distributed/c10d/debug.h +++ b/torch/csrc/distributed/c10d/debug.h @@ -11,9 +11,9 @@ namespace c10d { enum class DebugLevel { - Off = 0, - Info = 1, - Detail = 2 + Off, + Info, + Detail }; TORCH_API void setDebugLevel(DebugLevel level); diff --git a/torch/csrc/distributed/c10d/reducer.cpp b/torch/csrc/distributed/c10d/reducer.cpp index 815e36cfa263eb..777abfaf2f0f0e 100644 --- a/torch/csrc/distributed/c10d/reducer.cpp +++ b/torch/csrc/distributed/c10d/reducer.cpp @@ -2052,6 +2052,47 @@ void verify_params_across_processes( const c10::intrusive_ptr& process_group, const std::vector& params, const c10::optional>& logger) { + + // First verify number of parameters to avoid inconsistent inputs into + // broadcast which can cause a crash. 
+ // See https://github.com/pytorch/pytorch/issues/73547 + + at::TensorOptions param_size_options; + param_size_options = param_size_options.dtype(at::kLong); + param_size_options = param_size_options.device(params[0].device()); + // Note: Not using tensor building API because of + // https://github.com/pytorch/pytorch/issues/74114 + at::Tensor param_size_tensor = at::tensor( + {static_cast(params.size())}, param_size_options); + + // Allgather and verify parameter size. + std::vector> param_size_output_tensors; + param_size_output_tensors.emplace_back(std::vector{}); + auto world_size = process_group->getSize(); + for (size_t i = 0 ; i < world_size ; ++i) { + param_size_output_tensors.front().emplace_back( + at::empty_like(param_size_tensor) + ); + } + + std::vector param_size_vec{param_size_tensor}; + process_group->allgather(param_size_output_tensors, param_size_vec)->wait(); + auto result_size_tensors = param_size_output_tensors.front(); + for (size_t i = 0; i < world_size ; ++i ) { + auto param_size_for_rank = result_size_tensors[i][0].item(); + TORCH_CHECK( + param_size_for_rank == params.size(), + c10::str( + "DDP expects same model across all ranks, but Rank ", + process_group->getRank(), + " has ", params.size(), " params, while rank ", i, + " has inconsistent ", param_size_for_rank, + " params." + ) + ); + } + + // Continue with parameter shape verification. size_t i = 0; for (const auto& t : params) { i += 2 * t.dim(); @@ -2085,10 +2126,9 @@ void verify_params_across_processes( i = 0; for (const auto p : c10::irange(params.size())) { const auto& t = params[p]; - // I'd like to include which process we are in the message, - // but ProcessGroup::getRank is not public! for (const auto& sz : t.sizes()) { - auto msg = c10::str("params[", p, "] in this process", + auto msg = c10::str("[", process_group->getRank(), + "]: params[", p, "] in this process", " with sizes ", t.sizes(), " appears not to match sizes of the same param in process 0."); diff --git a/torch/csrc/distributed/c10d/reducer.hpp b/torch/csrc/distributed/c10d/reducer.hpp index ecf92a1893ff2f..adc021ca381439 100644 --- a/torch/csrc/distributed/c10d/reducer.hpp +++ b/torch/csrc/distributed/c10d/reducer.hpp @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -28,77 +29,10 @@ constexpr int kDefaultFirstBucketBytes = int(1024 * 1024); constexpr int kDefaultBucketBytesCap = int(25 * 1024 * 1024); // Collect runtime stats once for every kDDPRuntimeLoggingSampleRate iterations. constexpr int kDDPRuntimeLoggingSampleRate = 100; -constexpr int kUnsetTime = -1; - -inline int64_t current_time_in_nanos() { - return torch::profiler::impl::getTime(); -} // Forward declaration class Logger; -class TORCH_API Timer { - private: - // The timestamp of forward call start time in each iteration. - int64_t forward_start_time = kUnsetTime; - // The timestamp of backward computation start and end time in each - // iteration. - int64_t backward_compute_start_time = kUnsetTime; - int64_t backward_compute_end_time = kUnsetTime; - // The timestamp of first communication call start time in each iteration. - int64_t backward_comm_start_time = kUnsetTime; - // The timestamp of last communication call end time in each iteration. - int64_t backward_comm_end_time = kUnsetTime; - public: - enum class Event { - kForwardStart, - kBackwardComputeStart, - kBackwardComputeEnd, - kBackwardCommStart, - kBackwardCommEnd, - }; - - // Record the current event, i.e., mark it as having occurred now. Default - // CPU implementation. 
- virtual void record(Event event) { - getTimeRef(event) = current_time_in_nanos(); - } - - // Return the difference between when two events occurred, in nanoseconds. - // Or nullopt if one of them hasn't been recorded. - virtual c10::optional measureDifference(Event start, Event end) = 0; - - virtual ~Timer() = default; - - // Return host-side timestamp, or nullopt if it has not yet been recorded. - c10::optional getTimestamp(Event event) { - auto time = getTimeRef(event); - if (time == kUnsetTime) { - return c10::nullopt; - } else { - return time; - } - } - - // Return host-side time member variable corresponding to the given event. - int64_t& getTimeRef(Event event) { - switch (event) { - case Event::kForwardStart: - return forward_start_time; - case Event::kBackwardComputeStart: - return backward_compute_start_time; - case Event::kBackwardComputeEnd: - return backward_compute_end_time; - case Event::kBackwardCommStart: - return backward_comm_start_time; - case Event::kBackwardCommEnd: - return backward_comm_end_time; - default: - TORCH_INTERNAL_ASSERT(false); - } - } -}; - // Local accumulator type for a single bucket. struct BucketAccumulator { std::vector indices; @@ -106,8 +40,6 @@ struct BucketAccumulator { size_t size_limit = 0; }; -C10_DECLARE_TYPED_REGISTRY(TimerRegistry, c10::DeviceType, Timer, std::unique_ptr, c10::Device); - class TORCH_API Reducer { public: // The constructor takes a list of variables (i.e. parameters) for this diff --git a/torch/csrc/distributed/c10d/reducer_cuda.cpp b/torch/csrc/distributed/c10d/reducer_cuda.cpp index b836cddd8017c9..a1c570da5d59ab 100644 --- a/torch/csrc/distributed/c10d/reducer_cuda.cpp +++ b/torch/csrc/distributed/c10d/reducer_cuda.cpp @@ -1,4 +1,4 @@ -#include +#include #include #include diff --git a/torch/csrc/distributed/c10d/reducer_timer.hpp b/torch/csrc/distributed/c10d/reducer_timer.hpp new file mode 100644 index 00000000000000..ba696383b88e7f --- /dev/null +++ b/torch/csrc/distributed/c10d/reducer_timer.hpp @@ -0,0 +1,75 @@ +#pragma once +#include + +namespace c10d { +constexpr int kUnsetTime = -1; + +inline int64_t current_time_in_nanos() { + return torch::profiler::impl::getTime(); +} + +class TORCH_API Timer { + private: + // The timestamp of forward call start time in each iteration. + int64_t forward_start_time = kUnsetTime; + // The timestamp of backward computation start and end time in each + // iteration. + int64_t backward_compute_start_time = kUnsetTime; + int64_t backward_compute_end_time = kUnsetTime; + // The timestamp of first communication call start time in each iteration. + int64_t backward_comm_start_time = kUnsetTime; + // The timestamp of last communication call end time in each iteration. + int64_t backward_comm_end_time = kUnsetTime; + + public: + enum class Event { + kForwardStart, + kBackwardComputeStart, + kBackwardComputeEnd, + kBackwardCommStart, + kBackwardCommEnd, + }; + + // Record the current event, i.e., mark it as having occurred now. Default + // CPU implementation. + virtual void record(Event event) { + getTimeRef(event) = current_time_in_nanos(); + } + + // Return the difference between when two events occurred, in nanoseconds. + // Or nullopt if one of them hasn't been recorded. + virtual c10::optional measureDifference(Event start, Event end) = 0; + + virtual ~Timer() = default; + + // Return host-side timestamp, or nullopt if it has not yet been recorded. 
+ c10::optional getTimestamp(Event event) { + auto time = getTimeRef(event); + if (time == kUnsetTime) { + return c10::nullopt; + } else { + return time; + } + } + + // Return host-side time member variable corresponding to the given event. + int64_t& getTimeRef(Event event) { + switch (event) { + case Event::kForwardStart: + return forward_start_time; + case Event::kBackwardComputeStart: + return backward_compute_start_time; + case Event::kBackwardComputeEnd: + return backward_compute_end_time; + case Event::kBackwardCommStart: + return backward_comm_start_time; + case Event::kBackwardCommEnd: + return backward_comm_end_time; + default: + TORCH_INTERNAL_ASSERT(false); + } + } +}; + +C10_DECLARE_TYPED_REGISTRY(TimerRegistry, c10::DeviceType, Timer, std::unique_ptr, c10::Device); +} // namespace c10d diff --git a/torch/csrc/distributed/c10d/socket.cpp b/torch/csrc/distributed/c10d/socket.cpp index af09473a36ac82..acd819ab631cda 100644 --- a/torch/csrc/distributed/c10d/socket.cpp +++ b/torch/csrc/distributed/c10d/socket.cpp @@ -613,28 +613,11 @@ std::unique_ptr SocketConnectOp::run() { } bool SocketConnectOp::tryConnect(int family) { - ::addrinfo hints{}, *naked_result = nullptr; - + ::addrinfo hints{}; hints.ai_flags = AI_V4MAPPED | AI_ALL | AI_NUMERICSERV; hints.ai_family = family; hints.ai_socktype = SOCK_STREAM; - int r = ::getaddrinfo(host_, port_.c_str(), &hints, &naked_result); - if (r != 0) { - const char* gai_err = ::gai_strerror(r); - - recordError("The {}network addresses of ({}, {}) cannot be retrieved (gai error: {} - {}).", - family == AF_INET ? "IPv4 " : family == AF_INET6 ? "IPv6 " : "", - host_, - port_, - r, - gai_err); - - return false; - } - - addrinfo_ptr result{naked_result}; - deadline_ = Clock::now() + opts_->connect_timeout(); std::size_t retry_attempt = 1; @@ -645,16 +628,33 @@ bool SocketConnectOp::tryConnect(int family) { errors_.clear(); - for (::addrinfo* addr = naked_result; addr != nullptr; addr = addr->ai_next) { - C10D_TRACE("The client socket is attempting to connect to {}.", *addr); + ::addrinfo *naked_result = nullptr; + // patternlint-disable cpp-dns-deps + int r = ::getaddrinfo(host_, port_.c_str(), &hints, &naked_result); + if (r != 0) { + const char* gai_err = ::gai_strerror(r); + + recordError("The {}network addresses of ({}, {}) cannot be retrieved (gai error: {} - {}).", + family == AF_INET ? "IPv4 " : family == AF_INET6 ? 
"IPv6 " : "", + host_, + port_, + r, + gai_err); + retry = true; + } else { + addrinfo_ptr result{naked_result}; + + for (::addrinfo* addr = naked_result; addr != nullptr; addr = addr->ai_next) { + C10D_TRACE("The client socket is attempting to connect to {}.", *addr); - ConnectResult cr = tryConnect(*addr); - if (cr == ConnectResult::Success) { - return true; - } + ConnectResult cr = tryConnect(*addr); + if (cr == ConnectResult::Success) { + return true; + } - if (cr == ConnectResult::Retry) { - retry = true; + if (cr == ConnectResult::Retry) { + retry = true; + } } } diff --git a/torch/csrc/distributed/rpc/agent_utils.cpp b/torch/csrc/distributed/rpc/agent_utils.cpp index 45ffb2903bb0ca..3aa11961b0d671 100644 --- a/torch/csrc/distributed/rpc/agent_utils.cpp +++ b/torch/csrc/distributed/rpc/agent_utils.cpp @@ -41,6 +41,89 @@ std::unordered_map collectNames( return nameToId; } +std::vector splitString( + const std::string& s, + const std::string& delim) { + std::vector tokens; + size_t start = 0; + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + size_t end; + while ((end = s.find(delim, start)) != std::string::npos) { + tokens.emplace_back(s.substr(start, end - start)); + start = end + delim.length(); + } + tokens.emplace_back(s.substr(start)); + return tokens; +} + +std::unordered_map collectCurrentNames( + ::c10d::PrefixStore store, + const worker_id_t selfId, + const std::string& selfName) { + std::vector selfNameVector( + (uint8_t*)selfName.c_str(), + (uint8_t*)selfName.c_str() + selfName.length()); + + // Check that ID does not already exist and set {ID : NAME} + std::vector resultVector = store.compareSet( + c10::to_string(selfId), std::vector(), selfNameVector); + TORCH_CHECK( + resultVector == selfNameVector, + "RPC worker id ", + selfId, + " is not unique. Worker ", + resultVector, + " and already has ID and ", + selfNameVector, + " cannot be added."); + + store.set(c10::to_string(selfId), selfNameVector); + + std::unordered_map nameToId; + nameToId.emplace(selfName, selfId); + + // Check to see if there is list of worker names in the store + std::string allWorkerInfosKey("AllWorkerInfos"); + bool worker_names_available = + store.check(std::vector{allWorkerInfosKey}); + std::string allWorkerInfos; + if (worker_names_available) { + // Get the current list of workers + std::vector allWorkerInfosKeyVector = store.get(allWorkerInfosKey); + allWorkerInfos = std::string( + (char*)allWorkerInfosKeyVector.data(), allWorkerInfosKeyVector.size()); + // workerInfos are comma separated, (e.g. + // "Name1-Rank1,Name2-Rank2,Name3-Rank2") parse list of workers + for (const std::string& workerInfo : splitString(allWorkerInfos, ",")) { + auto workerInfoVec = splitString(workerInfo, "-"); + std::string workerName = workerInfoVec.at(0); + int workerId = std::stoi(workerInfoVec.at(1)); + + TORCH_CHECK( + nameToId.find(workerName) == nameToId.end(), + "RPC worker name ", + workerName, + " is not unique. 
Workers ", + nameToId.find(workerName)->second, + " and ", + workerId, + " share the same name."); + + nameToId.emplace(workerName, workerId); + } + allWorkerInfos = fmt::format("{},{}-{}", allWorkerInfos, selfName, selfId); + } else { + // Add own name to worker list + allWorkerInfos = fmt::format("{}-{}", selfName, selfId); + } + std::vector allWorkerInfosVector( + (uint8_t*)allWorkerInfos.c_str(), + (uint8_t*)allWorkerInfos.c_str() + allWorkerInfos.length()); + store.set(allWorkerInfosKey, allWorkerInfosVector); + + return nameToId; +} + const string storeKeyBarrierId = "_ID_"; const string storeKeyProcessCount = "PROCESS_COUNT"; const string storeKeyActiveCallCount = "ACTIVE_CALLS"; diff --git a/torch/csrc/distributed/rpc/agent_utils.h b/torch/csrc/distributed/rpc/agent_utils.h index befa26b8603754..d7e63dd033f74e 100644 --- a/torch/csrc/distributed/rpc/agent_utils.h +++ b/torch/csrc/distributed/rpc/agent_utils.h @@ -16,6 +16,16 @@ std::unordered_map collectNames( const std::string& selfName, const int worldSize); +// Ranks in dynamic RPC groups will initially call into this to establish the +// name-to-id mapping for the current peers in the group. The current rank will +// put its own worker info in the store and discover all the ranks that came +// before it. NOTE: This needs to be called with the Dynamic RPC group +// membership management token held. +std::unordered_map collectCurrentNames( + ::c10d::PrefixStore store, + const worker_id_t selfId, + const std::string& selfName); + // This performs a synchronization of all call counts by using store. // All RPC peers wait for others to join to exit at the same time. int syncCallCount( diff --git a/torch/csrc/distributed/rpc/init.cpp b/torch/csrc/distributed/rpc/init.cpp index 8c16d87f9ee9b2..0552c9c641148d 100644 --- a/torch/csrc/distributed/rpc/init.cpp +++ b/torch/csrc/distributed/rpc/init.cpp @@ -576,7 +576,7 @@ PyObject* rpc_init(PyObject* _unused, PyObject* noargs) { [](const c10::intrusive_ptr<::c10d::Store>& store, std::string selfName, worker_id_t selfId, - int worldSize, + optional worldSize, TensorPipeRpcBackendOptions opts, std::unordered_map reverseDeviceMaps, std::vector devices) { diff --git a/torch/csrc/distributed/rpc/tensorpipe_agent.cpp b/torch/csrc/distributed/rpc/tensorpipe_agent.cpp index aaaf3c673f7557..d2f753a9edcbe1 100644 --- a/torch/csrc/distributed/rpc/tensorpipe_agent.cpp +++ b/torch/csrc/distributed/rpc/tensorpipe_agent.cpp @@ -342,9 +342,15 @@ void TensorPipeAgent::removeFromTimeoutMap(uint64_t messageId) { } } -void TensorPipeAgent::prepareNames() { - auto nameToId = collectNames( - rankToNameStore_, workerInfo_.id_, workerInfo_.name_, worldSize_); +void TensorPipeAgent::prepareNames(bool isStaticGroup) { + std::unordered_map nameToId; + if (isStaticGroup) { + nameToId = collectNames( + rankToNameStore_, workerInfo_.id_, workerInfo_.name_, worldSize_); + } else { + nameToId = collectCurrentNames( + rankToNameStore_, workerInfo_.id_, workerInfo_.name_); + } for (const auto& entry : nameToId) { const auto& workerName = entry.first; @@ -354,11 +360,35 @@ void TensorPipeAgent::prepareNames() { } } +void TensorPipeAgent::checkAndSetStaticGroup( + const c10::intrusive_ptr<::c10d::Store>& store) { + std::string isStaticGroupKey("rpcIsStaticGroup"); + + std::string isStaticGroupStr = isStaticGroup_ ? 
"true" : "false"; + std::vector isStaticGroupVec( + (uint8_t*)isStaticGroupStr.c_str(), + (uint8_t*)isStaticGroupStr.c_str() + isStaticGroupStr.length()); + std::vector returnedVec; + returnedVec = store->compareSet( + isStaticGroupKey, std::vector(), isStaticGroupVec); + std::string returnedVal = std::string(returnedVec.begin(), returnedVec.end()); + // In both cases, the returned value should be the value of isStaticGroupStr, + // otherwise there is a discrepency with initialization among one of the + // members + TORCH_CHECK( + returnedVal == isStaticGroupStr, + fmt::format( + "RPC group mixes statically and dynamically initialized members which is not supported. ", + "Static group property is initialized as {} and is trying to be set as {} ", + isStaticGroup_, + returnedVal)); +} + TensorPipeAgent::TensorPipeAgent( const c10::intrusive_ptr<::c10d::Store>& store, std::string selfName, worker_id_t selfId, - int worldSize, + optional worldSize, TensorPipeRpcBackendOptions opts, std::unordered_map reverseDeviceMaps, std::vector devices, @@ -377,9 +407,16 @@ TensorPipeAgent::TensorPipeAgent( rankToNameStore_("names", store), nameToAddressStore_("addrs", store), shutdownStore_("shutdown", store), - worldSize_(worldSize) { + isStaticGroup_(worldSize.has_value()) { + if (isStaticGroup_) { + worldSize_ = worldSize.value(); + } + + // check the static group attribute against store + checkAndSetStaticGroup(store); + // collect worker names - prepareNames(); + prepareNames(isStaticGroup_); // Initialize the time-series metrics tracking map timeSeriesMetrics_.emplace(kGilAverageWaitTime, TimeSeriesMetricsTracker()); diff --git a/torch/csrc/distributed/rpc/tensorpipe_agent.h b/torch/csrc/distributed/rpc/tensorpipe_agent.h index b76e1a099bebd5..4d667f02961fc7 100644 --- a/torch/csrc/distributed/rpc/tensorpipe_agent.h +++ b/torch/csrc/distributed/rpc/tensorpipe_agent.h @@ -165,7 +165,7 @@ class TORCH_API TensorPipeAgent : public RpcAgent { const c10::intrusive_ptr<::c10d::Store>& store, std::string selfName, worker_id_t selfId, - int worldSize, + optional worldSize, TensorPipeRpcBackendOptions opts, std::unordered_map reverseDeviceMaps, std::vector devices, @@ -233,7 +233,10 @@ class TORCH_API TensorPipeAgent : public RpcAgent { void removeFromTimeoutMap(uint64_t messageId); // Populates workerIdToInfo_ and workerNameToInfo_ using addressStore_ - void prepareNames(); + void prepareNames(bool isStaticGroup); + + // Check the static group attribute with the value set in store + void checkAndSetStaticGroup(const c10::intrusive_ptr<::c10d::Store>& store); const std::string& findWorkerURL(const WorkerInfo& worker) const; @@ -331,7 +334,8 @@ class TORCH_API TensorPipeAgent : public RpcAgent { // Store keys that will used to count joined processes and active calls during // the shutdown process ::c10d::PrefixStore shutdownStore_; - const int worldSize_; + int worldSize_ = 0; + const bool isStaticGroup_; std::atomic nextMessageID_{0}; diff --git a/torch/csrc/distributed/rpc/torchscript_functions.cpp b/torch/csrc/distributed/rpc/torchscript_functions.cpp index 464a290de1dc96..8afbc813591442 100644 --- a/torch/csrc/distributed/rpc/torchscript_functions.cpp +++ b/torch/csrc/distributed/rpc/torchscript_functions.cpp @@ -21,10 +21,7 @@ c10::intrusive_ptr rpcTorchscript( std::vector& stack, const float rpcTimeoutSeconds, const bool isAsyncExecution) { - // This dummy tensor holds an at::RecordFunction when profiling is enabled. 
- // This is because at::RecordFunction is not yet registered as a TorchScript - // custom class (https://github.com/pytorch/pytorch/issues/35026) - at::Tensor handle = at::zeros(1); + c10::intrusive_ptr record; auto shouldProfile = torch::autograd::profiler::profilerEnabled() && !torch::distributed::rpc::RemoteProfilerManager::getInstance() .isCurrentKeySet(); @@ -35,7 +32,8 @@ c10::intrusive_ptr rpcTorchscript( .qualifiedName(), /* name of torchscript function being run */ RpcAgent::getCurrentRpcAgent()->getWorkerInfo().name_, dstWorkerName); - handle = torch::autograd::profiler::record_function_enter(rpcAsyncJitKey); + record = + torch::autograd::profiler::record_function_enter_new(rpcAsyncJitKey); auto& remoteProfilerManager = torch::distributed::rpc::RemoteProfilerManager::getInstance(); remoteProfilerManager.setCurrentKey(rpcAsyncJitKey); @@ -75,7 +73,8 @@ c10::intrusive_ptr rpcTorchscript( })); if (shouldProfile) { auto profiledFutPtr = - torch::autograd::profiler::_call_end_callbacks_on_fut(handle, futPtr); + torch::autograd::profiler::_call_end_callbacks_on_fut_new( + record, futPtr); return profiledFutPtr; } return futPtr; diff --git a/torch/csrc/generic/Storage.cpp b/torch/csrc/generic/Storage.cpp index 99499ef9a01947..4743ba1a862787 100644 --- a/torch/csrc/generic/Storage.cpp +++ b/torch/csrc/generic/Storage.cpp @@ -144,7 +144,7 @@ static PyObject * THPStorage_(get)(THPStorage *self, PyObject *index) int64_t nindex = THPUtils_unpackLong(index); if (nindex < 0) nindex += (self->cdata->nbytes() / sizeof(scalar_t)); - if (nindex < 0 || nindex >= (self->cdata->nbytes() / sizeof(scalar_t))) { + if (nindex < 0 || nindex >= static_cast(self->cdata->nbytes() / sizeof(scalar_t))) { PyErr_SetString(PyExc_IndexError, fmt::format( "index {} out of range for storage of size {}", nindex, self->cdata->nbytes() / sizeof(scalar_t))); diff --git a/torch/csrc/generic/StorageSharing.cpp b/torch/csrc/generic/StorageSharing.cpp index 01cd5c49998b1f..701df7daaa0c14 100644 --- a/torch/csrc/generic/StorageSharing.cpp +++ b/torch/csrc/generic/StorageSharing.cpp @@ -282,13 +282,9 @@ static PyObject * THPStorage_(shareCuda)(PyObject *_self, PyObject *noargs) // NOLINTNEXTLINE(cppcoreguidelines-init-variables) cudaIpcEventHandle_t ipc_event_handle; -#if !defined(USE_ROCM) if (sent_data->event_sync_required_) { C10_CUDA_CHECK(cudaIpcGetEventHandle(&ipc_event_handle, sent_data->event_)); } -#else - // ipc_event_handle unused in storage receiver, we can leave it uninitialized. 
-#endif _event_handle = PyBytes_FromStringAndSize((char *)&ipc_event_handle, CUDA_IPC_HANDLE_SIZE); _event_sync_required = PyBool_FromLong(sent_data->event_sync_required_); @@ -400,7 +396,6 @@ static PyObject * THPStorage_(newSharedCuda)(PyObject *_unused, PyObject *args) int64_t device = THPUtils_unpackLong(_device); at::cuda::CUDAGuard device_guard(device); -#if !defined(USE_ROCM) if (PyObject_IsTrue(_event_sync_required)) { // Ensure that producer prepared all tensor's data std::string s_ipc_event_handle = @@ -413,9 +408,6 @@ static PyObject * THPStorage_(newSharedCuda)(PyObject *_unused, PyObject *args) AT_CUDA_CHECK( cudaStreamWaitEvent(c10::cuda::getCurrentCUDAStream(device), event, 0)); } -#else - // Already synchronized inside producer stream -#endif std::string s_handle = THPStorage_(bytesAsHandleString)(_handle); std::shared_ptr basePtr = c10::cuda::CUDACachingAllocator::getIpcDevPtr(s_handle); diff --git a/torch/csrc/init_flatbuffer_module.cpp b/torch/csrc/init_flatbuffer_module.cpp new file mode 100644 index 00000000000000..22ec10b6f29b78 --- /dev/null +++ b/torch/csrc/init_flatbuffer_module.cpp @@ -0,0 +1,97 @@ +#include + +#include +#include + +#include +#include +#include +#include +#include +#include + +#include // NOLINT +#include +#include +#include +#include +#include + +namespace py = pybind11; + +static std::shared_ptr copyStr(const std::string& bytes) { + size_t size = (bytes.size() / FLATBUFFERS_MAX_ALIGNMENT + 1) * + FLATBUFFERS_MAX_ALIGNMENT; +#ifdef _WIN32 + std::shared_ptr bytes_copy( + static_cast(_aligned_malloc(size, FLATBUFFERS_MAX_ALIGNMENT)), + _aligned_free); +#else + std::shared_ptr bytes_copy( + static_cast(aligned_alloc(FLATBUFFERS_MAX_ALIGNMENT, size)), free); +#endif + memcpy(bytes_copy.get(), bytes.data(), bytes.size()); + return bytes_copy; +} + +extern "C" +#ifdef _WIN32 + __declspec(dllexport) +#endif + PyObject* initModuleFlatbuffer() { + using namespace torch::jit; + PyMethodDef m[] = {{nullptr, nullptr, 0, nullptr}}; // NOLINT + static struct PyModuleDef torchmodule = { + PyModuleDef_HEAD_INIT, + "torch._C_flatbuffer", + nullptr, + -1, + m, + }; // NOLINT + PyObject* module = PyModule_Create(&torchmodule); + auto pym = py::handle(module).cast(); + pym.def("_load_mobile_module_from_file", [](const std::string& filename) { + return torch::jit::load_mobile_module_from_file(filename); + }); + pym.def("_load_mobile_module_from_bytes", [](const std::string& bytes) { + auto bytes_copy = copyStr(bytes); + return torch::jit::parse_and_initialize_mobile_module( + bytes_copy, bytes.size()); + }); + pym.def("_load_jit_module_from_file", [](const std::string& filename) { + ExtraFilesMap extra_files = ExtraFilesMap(); + return torch::jit::load_jit_module_from_file(filename, extra_files); + }); + pym.def("_load_jit_module_from_bytes", [](const std::string& bytes) { + auto bytes_copy = copyStr(bytes); + ExtraFilesMap extra_files = ExtraFilesMap(); + return torch::jit::parse_and_initialize_jit_module( + bytes_copy, bytes.size(), extra_files); + }); + pym.def( + "_save_mobile_module", + [](const torch::jit::mobile::Module& module, + const std::string& filename) { + return torch::jit::save_mobile_module(module, filename); + }); + pym.def( + "_save_jit_module", + [](const torch::jit::Module& module, const std::string& filename) { + return torch::jit::save_jit_module(module, filename); + }); + pym.def( + "_save_mobile_module_to_bytes", + [](const torch::jit::mobile::Module& module) { + auto detached_buffer = torch::jit::save_mobile_module_to_bytes(module); + return 
py::bytes( + reinterpret_cast(detached_buffer.data()), + detached_buffer.size()); + }); + pym.def("_save_jit_module_to_bytes", [](const torch::jit::Module& module) { + auto detached_buffer = torch::jit::save_jit_module_to_bytes(module); + return py::bytes( + reinterpret_cast(detached_buffer.data()), + detached_buffer.size()); + }); + return module; +} diff --git a/torch/csrc/jit/api/function_impl.h b/torch/csrc/jit/api/function_impl.h index c92e46a352e363..d97f3a2c862faa 100644 --- a/torch/csrc/jit/api/function_impl.h +++ b/torch/csrc/jit/api/function_impl.h @@ -13,10 +13,14 @@ struct TORCH_API GraphFunction : public Function { GraphFunction( c10::QualifiedName name, std::shared_ptr graph, - std::function function_creator) + std::function function_creator, + c10::optional executor_execution_mode = + c10::nullopt) : name_(std::move(name)), graph_(std::move(graph)), - function_creator_(std::move(function_creator)) {} + function_creator_(std::move(function_creator)) { + executor_execution_mode_ = executor_execution_mode; + } bool isGraphFunction() const override { return true; @@ -53,6 +57,13 @@ struct TORCH_API GraphFunction : public Function { return name_; } + // private/unstable api. sets the initial execution mode + // will not affect executor if there is an existing executor + // created for this function + void _set_initial_executor_execution_mode(ExecutorExecutionMode mode) { + executor_execution_mode_ = mode; + } + // if this isn't yet defined, run its method_creator function void ensure_defined() override; @@ -92,14 +103,20 @@ struct TORCH_API GraphFunction : public Function { return *executor; } check_single_output(); - executor = GraphExecutor(optimized_graph(), name_.name()); + const std::string& name = name_.name(); + std::shared_ptr opt_graph = optimized_graph(); + if (!executor_execution_mode_) { + executor = GraphExecutor(opt_graph, name); + } else { + executor = GraphExecutor(opt_graph, name, *executor_execution_mode_); + } return *executor; } using Function::call; bool call( Stack& stack, - size_t bailOut, + c10::optional bailOut, c10::function_ref f) override { f(get_executor().getPlanFor(stack, bailOut).code); return true; @@ -128,6 +145,10 @@ struct TORCH_API GraphFunction : public Function { // The original, non-optimized graph std::shared_ptr graph_; // for debugging and for inlining + // allows users to specify Simple/Profiling Executor for function + // TODO: add more executors + mutable c10::optional executor_execution_mode_; + // Optimized graph, computed lazily. Used for inlining. 
mutable std::array< c10::optional>, diff --git a/torch/csrc/jit/api/module.h b/torch/csrc/jit/api/module.h index a040b953be1c23..a6aa49278cbec6 100644 --- a/torch/csrc/jit/api/module.h +++ b/torch/csrc/jit/api/module.h @@ -223,12 +223,14 @@ struct TORCH_API Module : public Object { void _save_for_mobile( std::ostream& out, const ExtraFilesMap& extra_files = ExtraFilesMap(), - bool save_mobile_debug_info = false) const; + bool save_mobile_debug_info = false, + bool use_flatbuffer = false) const; void _save_for_mobile( const std::string& filename, const ExtraFilesMap& extra_files = ExtraFilesMap(), - bool save_mobile_debug_info = false) const; + bool save_mobile_debug_info = false, + bool use_flatbuffer = false) const; Module copy() const; @@ -265,6 +267,10 @@ struct TORCH_API Module : public Object { return _ivalue() == y._ivalue(); } + void set_delete_memory(std::shared_ptr delete_mem) { + mem_to_delete_ = delete_mem; + } + private: Module clone_impl( std::unordered_map& type_remap, @@ -286,6 +292,9 @@ struct TORCH_API Module : public Object { const c10::optional& device, const c10::optional& dtype, bool non_blocking); + + // Extra handle for the module to delete when itself is deleted + std::shared_ptr mem_to_delete_; }; // C++ equivalent api of `torch.jit.freeze`. See documentation there for diff --git a/torch/csrc/jit/api/module_save.cpp b/torch/csrc/jit/api/module_save.cpp index c8afa5efaf3529..912c38612c354b 100644 --- a/torch/csrc/jit/api/module_save.cpp +++ b/torch/csrc/jit/api/module_save.cpp @@ -16,25 +16,29 @@ void Module::save(const std::string& filename, const ExtraFilesMap& extra_files) void Module::_save_for_mobile( std::ostream& out, const ExtraFilesMap& extra_files, - bool save_mobile_debug_info) const { + bool save_mobile_debug_info, + bool use_flatbuffer) const { ExportModule( *this, out, extra_files, true /* bytecode_format */, - save_mobile_debug_info); + save_mobile_debug_info, + use_flatbuffer); } void Module::_save_for_mobile( const std::string& filename, const ExtraFilesMap& extra_files, - bool save_mobile_debug_info) const { + bool save_mobile_debug_info, + bool use_flatbuffer) const { ExportModule( *this, filename, extra_files, true /* bytecode_format */, - save_mobile_debug_info); + save_mobile_debug_info, + use_flatbuffer); } } // namespace jit diff --git a/torch/csrc/jit/backends/nnapi/nnapi_backend_lib.cpp b/torch/csrc/jit/backends/nnapi/nnapi_backend_lib.cpp index 7d9dc18c12589f..ba4a2b25c23a78 100644 --- a/torch/csrc/jit/backends/nnapi/nnapi_backend_lib.cpp +++ b/torch/csrc/jit/backends/nnapi/nnapi_backend_lib.cpp @@ -31,7 +31,7 @@ class NnapiBackend : public PyTorchBackendInterface { c10::impl::GenericDict compile( c10::IValue processed, c10::impl::GenericDict method_compile_spec) override { - // Wrap procesed in dictionary: {"forward": processed} + // Wrap processed in dictionary: {"forward": processed} auto dict = processed.toGenericDict(); c10::Dict handles( c10::StringType::get(), c10::AnyType::get()); @@ -64,7 +64,7 @@ class NnapiBackend : public PyTorchBackendInterface { auto inp_mem_fmts = dict.at("inp_mem_fmts").toIntList(); TORCH_CHECK(tensorInp.size() == inp_mem_fmts.size()); std::vector fixed_inputs; - for (int i = 0; i < tensorInp.size(); i++) { + for (auto i = 0U; i < tensorInp.size(); i++) { int fmt = inp_mem_fmts[i]; // These constants match the values in DimOrder in serializer.py // 0: NCHW, 1: NHWC @@ -84,7 +84,7 @@ class NnapiBackend : public PyTorchBackendInterface { // Adjust output memory formats auto out_mem_fmts = 
dict.at("out_mem_fmts").toIntList(); TORCH_CHECK(outputs.size() == out_mem_fmts.size()); - for (int i = 0; i < outputs.size(); i++) { + for (auto i = 0U; i < outputs.size(); i++) { int fmt = out_mem_fmts[i]; // These constants match the values in DimOrder in serializer.py // 0: NCHW, 1: NHWC diff --git a/torch/csrc/jit/backends/nnapi/nnapi_backend_preprocess.cpp b/torch/csrc/jit/backends/nnapi/nnapi_backend_preprocess.cpp index be0dbe18d90d0c..a787ecc6cbfda6 100644 --- a/torch/csrc/jit/backends/nnapi/nnapi_backend_preprocess.cpp +++ b/torch/csrc/jit/backends/nnapi/nnapi_backend_preprocess.cpp @@ -96,7 +96,7 @@ c10::IValue preprocess( // transform Python lists to C++ c10::List c10::List weights( py::cast>(nnapi_processed[2])); - for (int i = 0; i < weights.size(); i++) { + for (auto i = 0U; i < weights.size(); i++) { weights.set(i, weights.get(i).contiguous()); } c10::List inp_mem_fmts( diff --git a/torch/csrc/jit/codegen/cuda/README.md b/torch/csrc/jit/codegen/cuda/README.md new file mode 100644 index 00000000000000..4f50c32aecdb4f --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/README.md @@ -0,0 +1,228 @@ +# NVFuser - A Fusion Code Generator for NVIDIA GPUs +_NVFuser is integrated as a backend for TorchScript's Profiling Graph Executor_ + +## Enabling NVFuser +_NVFuser is not currently the default fuser for NVIDIA GPUs._ + +**Fusions will only show up during the ~3rd iteration of execution, the exact number depends on profiling executor's optimization phases** + +### Enable by Context Manager + +``` +jit_model = torch.jit.script(model) + +with torch.jit.fuser("fuser2") : + for _ in range(5) : + outputs = jit_model(inputs) +``` + +### Enable by Specific Functions + +1. Disable cpu/gpu fusion for native/nnc fuser +``` +torch._C._jit_override_can_fuse_on_cpu(False) +torch._C._jit_override_can_fuse_on_gpu(False) +``` +2. Disable nnc fuser +``` +torch._C._jit_set_texpr_fuser_enabled(False) +``` +3. Enable nvfuser +``` +torch._C._jit_set_nvfuser_enabled(True) +``` + +## Simple knobs to change fusion behavior + +1. Allow single node fusion `torch._C._jit_set_nvfuser_single_node_mode(True)` +Fusion group is only created when two or more compatible ops are grouped together. Turn on single node fusion would allow fusion pass to create fusion group with a single node, this is very handy for testing and could be useful when single node generated kernel out-performs native cuda kernels in framework. + +2. Allow horizontal fusion `torch._C._jit_set_nvfuser_horizontal_mode(True)` +Fusion pass fuses producer to consumer, horizontal mode allows sibling nodes that shared tensor input to be fused together. This could save input memory bandwidth. + +3. Turn off guard for fusion `torch._C._jit_set_nvfuser_guard_mode(False)` +This disables the runtime check on fusion group pre-assumptions (tensor meta information / constant inputs / profiled constants), this really is only used for testing as we want to ensure generated kernels are indeed tested and you should avoid using this in training scripts. + +## Fusion Debugging + +Given the following script as an example + +``` +import torch + +def forward(x): + o = x + 1.0 + o = o.relu() + return o + +shape = (2, 32, 128, 512) +input = torch.rand(*shape).cuda() +t = torch.jit.script(forward) + +with torch.jit.fuser("fuser2"): + for k in range(4): + o = t(input) +``` + +### TorchScript Based Debugging + +#### 1. 
TorchScript IR Graph + +##### Usage + +Two easy ways to checkout fusion for graph: The first one is to print out graph in python script after a few runs (for optimization to kick in). + +`print(t.graph_for(input))` + +The second way is to turn on graph dumping in profiling executor via command line below: + +``` +PYTORCH_JIT_LOG_LEVEL="profiling_graph_executor_impl" python +``` + +##### Example Output + +Graph print out is straight forward and you should look for `prim::CudaFusionGroup_X` for fused kernels. While profiling executor dumps many things, but the most important part is `Optimized Graph`. In this example, it shows a Fusion Group, which is an indication that fusion is happening and you should be expecting fused kernel! + +``` + Optimized Graph: + graph(%x.1 : Tensor): + %12 : bool = prim::CudaFusionGuard[types=[Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)]](%x.1) + %11 : Tensor = prim::If(%12) + block0(): + %o.8 : Tensor = prim::CudaFusionGroup_0[cache_id=0](%x.1) + -> (%o.8) + block1(): + %18 : Function = prim::Constant[name="fallback_function", fallback=1]() + %19 : (Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)) = prim::CallFunction(%18, %x.1) + %20 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = prim::TupleUnpack(%19) + -> (%20) + return (%11) + with prim::CudaFusionGroup_0 = graph(%2 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)): + %4 : int = prim::Constant[value=1]() + %3 : float = prim::Constant[value=1.]() # test.py:6:12 + %o.1 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%2, %3, %4) # test.py:6:8 + %o.5 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.1) # test.py:7:8 + return (%o.5) +``` + +Note that one thing that could prevents fusion when you are running training is autodiff. Fusion pass only runs within `prim::DifferentiableGraph`, so the first thing you should check is to that targetted ops are within differentiable graph subgraphs. +Graph dump could be quite confusing to look at, since it naively dumps all graphs executed by profiling executor and differentiable graphs are executed via a nested graph executor. So for each graph, you might see a few segmented `Optimized Graph` where each corresponds to a differentiable node in the original graph. + +#### 2. Cuda Fusion Graphs + +##### Usage + +Cuda fusion dump gives the input and output graph to fusion pass. This is a good place to check fusion pass logic. + +``` +PYTORCH_JIT_LOG_LEVEL="graph_fuser" python +``` + +##### Example Output + +Running the same script above, in the log, you should be looking for two graphs `Before Fusion` shows the subgraph where fusion pass runs on; `Before Compilation` shows the graph sent to codegen backend, where each `CudaFusionGroup` will trigger codegen runtime system to generate kernel(s) to execute the subgraph. 
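+You can also confirm fusion programmatically rather than by reading dumps. Below is a minimal sketch (assuming the `t` and `input` from the example script above); it only uses the public `graph_for` API and a plain text search over the optimized graph:
+
+```
+graph_str = str(t.graph_for(input))           # optimized graph, available after the profiling runs
+print("prim::CudaFusionGroup" in graph_str)   # True once at least one fused kernel has been created
+```
+
+For the example script, the log produced by the fusion pass looks like this: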
+ +``` + Before Fusion: + graph(%x.1 : Tensor): + %2 : float = prim::Constant[value=1.]() + %1 : int = prim::Constant[value=1]() + %3 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%x.1) + %o.10 : Tensor = aten::add(%3, %2, %1) # test.py:6:8 + %5 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.10) + %o.7 : Tensor = aten::relu(%5) # test.py:7:8 + %7 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.7) + %8 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.7) + return (%7, %8) + + Before Compilation: + graph(%x.1 : Tensor): + %13 : bool = prim::CudaFusionGuard[types=[Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)]](%x.1) + %12 : Tensor = prim::If(%13) + block0(): + %o.11 : Tensor = prim::CudaFusionGroup_0(%x.1) + -> (%o.11) + block1(): + %o.7 : Tensor = prim::FallbackGraph_1(%x.1) + -> (%o.7) + return (%12, %12) + with prim::CudaFusionGroup_0 = graph(%2 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)): + %4 : int = prim::Constant[value=1]() + %3 : float = prim::Constant[value=1.]() + %o.10 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%2, %3, %4) # test.py:6:8 + %o.7 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.10) # test.py:7:8 + return (%o.7) + with prim::FallbackGraph_1 = graph(%x.1 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)): + %1 : int = prim::Constant[value=1]() + %2 : float = prim::Constant[value=1.]() + %o.10 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%x.1, %2, %1) # test.py:6:8 + %o.7 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.10) # test.py:7:8 + return (%o.7) +``` + +### General ideals of debug no-fusion + +Currently there we have a few consumers that utilizes nvfuser via lowering computations to TorchScript and executing that through a ProfilingExecutor. + +Without going into too much details about how the integration is done, a few notes on debugging no-fusion on ProfilingExecutor: + +1. Run TorchScript module multiple times (5 could be a lucky number) to enable fusion. + Because ProfilingExecutor takes the first (few) runs for profiling, later optimization (including the fusion pass the enables nvfuser) relies on profiling information to run, so your initial runs are not going to trigger fused kernels. + Note that the number of profiling runs is dependent on your model. + +2. Fused kernel should show up in TorchScript IR as `prim::CudaFusionGroup`. You can look at your TorchScript optimized graph to see if fusion is happening `jit_model.graph_for(*inputs)`. + +3. If your scripted model has inputs requiring gradient, fusion is only happening for graphs inside `prim::DifferentiableGraph`. + There are many reasons why your graph is not autodiff-able. Take a look at `/torch/csrc/jit/runtime/symbolic_scripts.cpp`, which lists all autodiff-able ops (note that this is a different list from autograd-supported ops). 
There's also a threshold below which tiny autodiff graphs are inlined/reverted; this can be disabled via `torch._C._debug_set_autodiff_subgraph_inlining(False)`.
+
+### General ideas for debugging nvfuser malfunctions
+
+Assuming the ProfilingExecutor side is working properly, that is, you see a region that is supposed to be fused but did not end up in a fused kernel, here are ways to dig deeper:
+
+1. Dump the fusion pass result:
+   `PYTORCH_JIT_LOG_LEVEL=graph_fuser python your_script.py &> log`
+
+   Look for graphs dumped with `Before Fusion` & `Before Compilation`, which show the portion of the graph the fusion pass runs on and the result of fusion (`CudaFusionGroup`).
+
+2. Check which ops are not fused and roughly why:
+   `PYTORCH_JIT_LOG_LEVEL=">partition:graph_fuser" python your_script.py &> log`
+
+   Enabling GRAPH_UPDATE from partition.cpp dumps a log entry whenever a given node is rejected by fusion.
+
+3. Disable the FALLBACK path:
+   If you see a warning that a FALLBACK path has been taken while executing your model with nvfuser enabled, it indicates that either codegen or the fusion pass has failed unexpectedly. This is likely to cause a regression in model performance, even though the result is still functionally correct. We recommend disabling the FALLBACK path, so the error is reported properly and you can open an informative issue.
+
+   `PYTORCH_NVFUSER_DISABLE_FALLBACK=1 python your_script.py &> log`
+
+4. Pinpoint the kernel/fusion pattern that is causing the error:
+   With a larger model that includes multiple fusion patterns, it can be tricky to figure out which exact fusion is causing the FALLBACK and to build a minimal python repro.
+   One quick thing to try is to run the example with a few knobs turned on:
+
+   ```
+   PYTORCH_NVFUSER_DISABLE_FALLBACK=1 \
+   PYTORCH_JIT_LOG_LEVEL=">partition:graph_fuser:>>kernel_cache" \
+   python your_script.py &> log
+   ```
+
+   This logs all TorchScript IR parsed to codegen IR as well as the kernels generated and executed by nvfuser. Since the fallback path is disabled, the last entry in the log is likely to indicate the failing fusion.
+
+   Hint: look for the last `Before Compilation:`, which indicates a parsing failure, or `running GraphCache: xxxxx`, which indicates a jit compilation/execution failure (also search for the GraphCache address, which should have dumped a TorchScript IR earlier).
+
+### Query nvfuser codegen kernels
+
+There are a few debug dumps that can be turned on via environment variables. Look for `PYTORCH_NVFUSER_DUMP` inside `[pytorch_source_path]/torch/csrc/jit/codegen/cuda/utils.cpp`. A few useful ones are:
+1. `dump_eff_bandwidth`: print out the effective bandwidth of each generated kernel. This naively measures the kernel time divided by I/O buffer size and is a good, simple performance metric for bandwidth-bound kernels
+2. `cuda_kernel`: print out generated cuda kernels
+3. `launch_param`: print out the launch config of generated kernels
+4. `print_args`: print out the input/output tensors of executed codegen kernels
+
+### FAQs
+
+1. There's a regression after turning on nvfuser.
+
+First check that the fusion kernels are running properly. Try to run your model with fallback disabled to see if you hit any errors that caused the fallback, via `export PYTORCH_NVFUSER_DISABLE_FALLBACK=1`.
+
+2. I didn't see any speedup with nvfuser.
+
+Check whether there is fusion in your scripted model. Run your script with `PYTORCH_JIT_LOG_LEVEL="graph_fuser"`; you should see some log dump of the before/after graphs from the fusion pass.
If nothing shows up in the log, that means something in TorchScript is not right and fusion pass are not executed. Check [General ideals of debug no-fusion] for more details. diff --git a/torch/csrc/jit/codegen/cuda/arith.cpp b/torch/csrc/jit/codegen/cuda/arith.cpp index d9bf46b51c7837..cbdf83d8ff3f71 100644 --- a/torch/csrc/jit/codegen/cuda/arith.cpp +++ b/torch/csrc/jit/codegen/cuda/arith.cpp @@ -33,6 +33,9 @@ Val* newScalar(ValType vtype, DataType dtype) { case DataType::Int32: case DataType::Int: return IrBuilder::create(); + case DataType::ComplexFloat: + case DataType::ComplexDouble: + return IrBuilder::create(); default: break; } @@ -187,7 +190,7 @@ Val* newValLike(Val* val, DataType dtype) { Val* castOp(DataType dtype, Val* v1) { if (v1->getDataType().value() == dtype) { - return v1; + return set(v1); } if (cast_func_str(std::make_pair(v1->getDataType().value(), dtype)) == @@ -258,12 +261,10 @@ TensorView* unaryOp( NVFUSER_DEFINE_UNARY_OP(set, Set) NVFUSER_DEFINE_UNARY_OP(randlike, RandLike) -NVFUSER_DEFINE_UNARY_OP(abs, Abs) NVFUSER_DEFINE_UNARY_OP(notOp, Not) NVFUSER_DEFINE_UNARY_OP(ceil, Ceil) NVFUSER_DEFINE_UNARY_OP(floor, Floor) NVFUSER_DEFINE_UNARY_OP(frac, Frac) -NVFUSER_DEFINE_UNARY_OP(gelu, Gelu) NVFUSER_DEFINE_UNARY_OP(neg, Neg) NVFUSER_DEFINE_UNARY_OP(relu, Relu) NVFUSER_DEFINE_UNARY_OP(round, Round) @@ -271,6 +272,25 @@ NVFUSER_DEFINE_UNARY_OP(silu, Silu) NVFUSER_DEFINE_UNARY_OP(trunc, Trunc) #undef NVFUSER_DEFINE_UNARY_OP +// The output of abs(complex_tensor) are real numbers +Val* abs(Val* v) { + if (v->getDataType() == DataType::ComplexDouble) { + Val* out = newValLike(v, DataType::Double); + IrBuilder::create(UnaryOpType::Abs, out, v); + return out; + } + if (v->getDataType() == DataType::ComplexFloat) { + Val* out = newValLike(v, DataType::Float); + IrBuilder::create(UnaryOpType::Abs, out, v); + return out; + } + return unaryOp(UnaryOpType::Abs, v); +} + +TensorView* abs(TensorView* tv) { + return abs(tv->as())->as(); +} + // UNARY FLOAT CAST OPERATIONS #define NVFUSER_DEFINE_UNARY_FLOAT_OP(op_name, op_type) \ @@ -652,8 +672,9 @@ TensorView* reductionOp( const auto init_type = init->getDataType().value(); TORCH_CHECK( (isFloatingPointType(out_type) && isFloatingPointType(init_type)) || + (isComplexType(out_type) && isComplexType(init_type)) || (isIntegralType(out_type) && isIntegralType(init_type)) || - (out_type == DataType::Bool && init_type == DataType::Bool), + (isBooleanType(out_type) && isBooleanType(init_type)), "Types should match for reduction ops but received: ", out_type, " and ", @@ -661,7 +682,7 @@ TensorView* reductionOp( IrBuilder::create(reduction_op_type, init, out, tv); if (keep_dim) { - auto tv_root = TensorDomain::noReductions(tv->getRootDomain()); + auto tv_root = TensorDomain::noReductions(tv->getMaybeRFactorDomain()); std::vector is_broadcast(tv_root.size(), false); for (auto axis : uint_axes) { is_broadcast.at(axis) = true; @@ -680,8 +701,13 @@ TensorView* sum( auto dtype = v1->getDataType().value(); if (isFloatingPointType(dtype)) { init = IrBuilder::create(0.0); + } else if (isComplexType(dtype)) { + init = IrBuilder::create(c10::complex(0.0, 0.0)); } else if (isIntegralType(dtype)) { init = FusionGuard::getCurFusion()->zeroVal(); + } else if (isBooleanType(dtype)) { + v1 = castOp(DataType::Int, v1); + init = FusionGuard::getCurFusion()->zeroVal(); } else { TORCH_CHECK( false, @@ -705,7 +731,13 @@ TensorView* max( init = IrBuilder::create(std::numeric_limits::lowest()); break; case (DataType::Int): - init = IrBuilder::create(INT_MIN); + 
init = IrBuilder::create(std::numeric_limits::lowest()); + break; + case (DataType::Int32): + init = IrBuilder::create(std::numeric_limits::lowest()); + break; + case (DataType::Bool): + init = IrBuilder::create(false); break; default: TORCH_CHECK( @@ -730,7 +762,13 @@ TensorView* min( init = IrBuilder::create(FLT_MAX); break; case (DataType::Int): - init = IrBuilder::create(INT_MAX); + init = IrBuilder::create(std::numeric_limits::max()); + break; + case (DataType::Int32): + init = IrBuilder::create(std::numeric_limits::max()); + break; + case (DataType::Bool): + init = IrBuilder::create(true); break; default: TORCH_CHECK( @@ -779,7 +817,12 @@ TensorView* broadcast( ParallelType::Serial, IterType::BroadcastWithoutStride)); } else { - out_domain.push_back(inp_domain[iinp]->clone()); + out_domain.push_back(IrBuilder::create( + inp_domain[iinp]->start(), + inp_domain[iinp]->extent(), + inp_domain[iinp]->stopOffset(), + inp_domain[iinp]->getParallelType(), + inp_domain[iinp]->getIterType())); iinp++; } ibdim++; @@ -856,7 +899,7 @@ WelfordResult Welford( // Create tensor outputs TensorView* out_avg = newForReduction(tv, uint_axes); TensorView* out_var = newForReduction(tv, uint_axes); - TensorView* out_N = newForReduction(tv, uint_axes, DataType::Int); + TensorView* out_N = newForReduction(tv, uint_axes, DataType::Index); IrBuilder::create( out_avg, @@ -889,7 +932,7 @@ WelfordResult WelfordResult::rFactor(const std::vector& axes) { TensorView* transpose( TensorView* inp, const std::unordered_map& old2new) { - auto inp_domain = TensorDomain::noReductions(inp->getRootDomain()); + auto inp_domain = TensorDomain::noReductions(inp->getMaybeRFactorDomain()); std::vector out_domain(inp_domain.size()); auto new2old = ir_utils::normalizeOld2New(old2new, inp_domain.size()); @@ -1109,7 +1152,7 @@ TensorView* clamp(TensorView* in, Val* min_val, Val* max_val) { // sum_to operator TensorView* sum_to(TensorView* in, const std::vector& sum_to_size) { - const auto& root = TensorDomain::noReductions(in->getRootDomain()); + const auto& root = TensorDomain::noReductions(in->getMaybeRFactorDomain()); TORCH_CHECK( root.size() >= sum_to_size.size(), @@ -1155,7 +1198,7 @@ TensorView* sum_to(TensorView* in, const std::vector& sum_to_size) { } TensorView* sum_to(TensorView* in, const std::vector& sum_to_size) { - const auto& root = TensorDomain::noReductions(in->getRootDomain()); + const auto& root = TensorDomain::noReductions(in->getMaybeRFactorDomain()); TORCH_CHECK( root.size() >= sum_to_size.size(), @@ -1380,7 +1423,7 @@ TensorView* gather( const std::vector>& pad_width, const std::vector& strides, bool trim_out_of_bounds) { - auto inp_dom = TensorDomain::noReductions(inp->getRootDomain()); + auto inp_dom = TensorDomain::noReductions(inp->getMaybeRFactorDomain()); const auto ndims = inp_dom.size(); TORCH_CHECK( @@ -1484,6 +1527,135 @@ TensorView* gather( return out_tv; } +namespace { + +//! 
Create new output for mma +static TensorView* newForMma( + TensorView* tv_a, + TensorView* tv_b, + const std::vector& axes, + DataType data_type = DataType::Float) { + auto orig_domain_a = + TensorDomain::noReductions(tv_a->getMaybeRFactorDomain()); + auto orig_domain_b = + TensorDomain::noReductions(tv_b->getMaybeRFactorDomain()); + + TORCH_INTERNAL_ASSERT( + orig_domain_a.size() == orig_domain_b.size(), + "MMA op: need matching dim input"); + + std::set axes_set(axes.begin(), axes.end()); + std::vector new_domain; + + TORCH_INTERNAL_ASSERT( + !axes_set.empty(), + "Asked for ouput of reduction, but no reduction axis provided."); + + TORCH_INTERNAL_ASSERT( + (*(axes_set.rbegin())) < orig_domain_a.size(), + "Error setting up reduction, reduction axis (", + *(axes_set.rbegin()), + ") is outside nDims (", + orig_domain_a.size(), + "). Keep in mind reductions are relative to root domains, not modified views."); + + auto axis_iter = axes_set.begin(); + for (const auto dim : c10::irange(orig_domain_a.size())) { + bool isReduction = false; + if (axis_iter != axes_set.end() && *axis_iter == dim) { + isReduction = true; + axis_iter++; + } + + const IterDomain* id = orig_domain_a[dim]->isBroadcast() + ? orig_domain_b[dim] + : orig_domain_a[dim]; + + TORCH_CHECK( + !(isReduction && id->isBroadcast() && !id->isImplicitBroadcast()), + "Cannot reduce an axis that is marked as broadcasted as it has an undetermined size. Tried to reduce ID = ", + id, + " of tensor ", + tv_a, + "and", + tv_b); + + new_domain.push_back(IrBuilder::create( + id->start(), + id->extent(), + id->stopOffset(), + ParallelType::Serial, + isReduction ? IterType::Reduction : id->getIterType())); + } + + TensorDomain* td = IrBuilder::create( + new_domain, std::vector(new_domain.size(), true)); + + return IrBuilder::create(td, data_type); +} + +} // namespace + +TensorView* fusedMultiplySum( + TensorView* tv_a, + TensorView* tv_b, + const std::vector& axes, + Val* init) { + if (init == nullptr) { + init = IrBuilder::create(0); + } + + // TODO: + // We will want to support initialize and rfactor with + // mma as well, for maybe fusing bias in prolog. + // TODO: check init type if given a tv, + // not supported currently though. + TORCH_CHECK( + init->isConstScalar(), + "Cannot create a reduction operation where the initial value is not a const scalar."); + + // TODO: + // Validate axis relationships between a and b + TORCH_CHECK(tv_a->nDims() > 0, "Tried to reduce a 0-dim tensor"); + + // TODO: + // Add tf32 and other mma data types + // Add fallback path for non-mma data types. + TORCH_CHECK(tv_a->getDataType().value() == DataType::Half); + TORCH_CHECK(tv_b->getDataType().value() == DataType::Half); + + TORCH_CHECK(axes.size() > 0, "No reduction axis specified"); + + // TODO: + // will lift this in a follow up when we have a + // more generic axes matching. 
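+  // Note: with a single reduction axis this instantiates a matmul-style
+  // contraction, out = sum(tv_a * tv_b) over `axes`, so the reduced axis
+  // plays the role of the mma K dimension.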
+ TORCH_CHECK( + axes.size() == 1, "Single axis reduction only for mma op instantiation.") + + std::vector uint_axes; + const int ndims = tv_a->domain()->noReductions().size(); + for (int axis : axes) { + if (axis < 0) { + axis += ndims; + } + + TORCH_CHECK( + axis >= 0 && axis < ndims, + "Reduction on invalid axis, recieved: ", + axis, + " however tensor view only has ", + ndims, + " non-reduction dims."); + + uint_axes.push_back((unsigned int)axis); + } + + TensorView* out = newForMma(tv_a, tv_b, uint_axes); + IrBuilder::create(out, tv_a, tv_b, init); + + return out; +} + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/arith.h b/torch/csrc/jit/codegen/cuda/arith.h index 1f18f65666ad09..f224468c6bed82 100644 --- a/torch/csrc/jit/codegen/cuda/arith.h +++ b/torch/csrc/jit/codegen/cuda/arith.h @@ -161,9 +161,6 @@ TORCH_CUDA_CU_API TensorView* floor(TensorView*); // frac TORCH_CUDA_CU_API Val* frac(Val*); TORCH_CUDA_CU_API TensorView* frac(TensorView*); -// gelu -TORCH_CUDA_CU_API Val* gelu(Val*); -TORCH_CUDA_CU_API TensorView* gelu(TensorView*); // silu TORCH_CUDA_CU_API Val* silu(Val*); TORCH_CUDA_CU_API TensorView* silu(TensorView*); @@ -561,6 +558,28 @@ TORCH_CUDA_CU_API TensorView* gather( const std::vector& strides = {}, bool trim_out_of_bounds = false); +//! A fused pointwise multiply and sum +//! operator that instantiates the following +//! fused pattern: +//! c = mul(tv_a, tv_b); +//! return sum(c, axes) +//! +//! \param tv_a first multiply operand +//! \param tv_b second multiply operand +//! \param axes axes to sum over +//! \param init sum initial value +//! +//! Note & TODO: +//! currently only support lowering to a mma op +//! through this interface and only support fp16 inputs. +//! will support converting back to multiply and reduce in +//! a follow up. +TORCH_CUDA_CU_API TensorView* fusedMultiplySum( + TensorView* tv_a, + TensorView* tv_b, + const std::vector& axes, + Val* init = nullptr); + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/codegen.cpp b/torch/csrc/jit/codegen/cuda/codegen.cpp index 67926e92672644..2287b2835ee603 100644 --- a/torch/csrc/jit/codegen/cuda/codegen.cpp +++ b/torch/csrc/jit/codegen/cuda/codegen.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -20,6 +21,105 @@ namespace codegen { namespace { +std::string ptrType(DataType dt) { + std::stringstream ss; + ss << dt << "*"; + return ss.str(); +} + +std::string refType(DataType dt) { + std::stringstream ss; + ss << dt << "&"; + return ss.str(); +} + +//! Utility class to build an argument list +class ArgumentBuilder { + public: + //! Build an argument list where each argument is separated with a comma + ArgumentBuilder() = default; + + //! Build an argument list where each argument has its own line + ArgumentBuilder(int indent_level, const char* tab) { + std::stringstream ss; + for (const auto i : c10::irange(indent_level)) { + (void)i; // Suppress unused variable warning + ss << tab; + } + sep_ = ",\n" + ss.str(); + } + + //! Add a new argument + template + ArgumentBuilder& arg(const T& x) { + addSeparator(); + return append(x); + } + + //! Append to the last argument + template + ArgumentBuilder& append(const T& arg) { + ss_ << arg; + return *this; + } + + //! 
Get a string of the argument list + std::string str() const { + return ss_.str(); + } + + friend std::ostream& operator<<(std::ostream& os, const ArgumentBuilder& ab) { + return os << ab.str(); + } + + private: + void addSeparator() { + if (ss_.tellp() != 0) { + ss_ << sep_; + } + } + + private: + std::string sep_ = ", "; + std::stringstream ss_; +}; + +//! Append to the last argument +template <> +ArgumentBuilder& ArgumentBuilder::append(const bool& arg) { + ss_ << (arg ? "true" : "false"); + return *this; +} + +//! Returns "template_name" +template +std::string genTemplate( + const TemplateNameT& template_name, + const TemplateArgT& template_arg) { + std::stringstream ss; + ss << template_name << "<" << template_arg << ">"; + return ss.str(); +} + +//! Returns "func_name(func_arg)" +template +std::string genCall(const FuncNameT& func_name, const FuncArgT& func_arg) { + std::stringstream ss; + ss << func_name << "(" << func_arg << ")"; + return ss.str(); +} + +//! Returns "func_name(func_arg)" +template +std::string genCall( + const FuncNameT& func_name, + const TemplateArgT& template_arg, + const FuncArgT& func_arg) { + std::stringstream ss; + ss << func_name << "<" << template_arg << ">(" << func_arg << ")"; + return ss.str(); +} + class CudaKernelGenerator : private OptOutConstDispatch { static constexpr const char* kTab = " "; @@ -46,6 +146,8 @@ class CudaKernelGenerator : private OptOutConstDispatch { code_ << "__global__ void " << kernel_name << "("; + std::unordered_set unique_args; + std::vector params; // Inputs & Outputs @@ -53,27 +155,44 @@ class CudaKernelGenerator : private OptOutConstDispatch { params.push_back(val); } for (auto val : kernel_->outputs()) { + TORCH_INTERNAL_ASSERT( + !val->isScalar(), "No scalar output is allowed: ", val->toString()); params.push_back(val); } // Generate parameter declarations - for (Val* val : params) { - if (const auto tv = dynamic_cast(val)) { + unsigned int duplicate_counter = 0; + for (auto i : c10::irange(params.size())) { + std::stringstream var_name_ss; + if (params[i]->isA()) { + var_name_ss << varName(params[i]->as()); + } else { + var_name_ss << gen(params[i]); + } + + // If value is duplicate in arguments change the name to avoid name + // conflicts in args. 
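+      // e.g. when the same tensor appears twice in the parameter list, the
+      // second occurrence gets "_duplicate_<n>" appended to its variable name
+      // so the generated __global__ signature stays valid C++.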
+ if (!unique_args.emplace(params[i]).second) { + var_name_ss << "_duplicate_" << duplicate_counter++; + } + + if (const auto tv = dynamic_cast(params[i])) { if (tv->isCpuScalar()) { - code_ << " CpuScalarTensor<" << val->dtype() << "> " << varName(tv); + code_ << " CpuScalarTensor<" << params[i]->dtype() << "> " + << var_name_ss.str(); } else { code_ - << "Tensor<" << val->dtype() << ", " + << "Tensor<" << params[i]->dtype() << ", " << TensorDomain::noReductions(tv->getMaybeRFactorDomain()).size() - << "> " << varName(tv); + << "> " << var_name_ss.str(); } } else { - TORCH_INTERNAL_ASSERT(val->isScalar()); // NOLINT (LLVM bug 48525) - TORCH_INTERNAL_ASSERT(val->definition() == nullptr); - code_ << val->dtype() << " " << gen(val); + TORCH_INTERNAL_ASSERT(params[i]->isScalar()); // NOLINT (LLVM bug 48525) + TORCH_INTERNAL_ASSERT(params[i]->definition() == nullptr); + code_ << params[i]->dtype() << " " << var_name_ss.str(); } - if (val != params.back()) { + if (i + 1 != params.size()) { code_ << ", "; } } @@ -211,10 +330,6 @@ class CudaKernelGenerator : private OptOutConstDispatch { std::string gen(const Statement* stmt) { std::stringstream tmp_code; std::swap(tmp_code, code_); - auto replacement = replacement_map_.find(stmt); - if (replacement != replacement_map_.end()) { - stmt = replacement->second; - } OptOutConstDispatch::handle(stmt); std::swap(tmp_code, code_); return tmp_code.str(); @@ -247,7 +362,8 @@ class CudaKernelGenerator : private OptOutConstDispatch { void handle(const Bool* pred) final { const auto def = pred->definition(); - if (print_inline_ && def != nullptr) { + const bool has_alloc = alloc_map_.find(pred) != alloc_map_.end(); + if (def != nullptr && !has_alloc) { code_ << "(" << gen(def) << ")"; } else if (pred->isConst()) { code_ << (*pred->value() ? "true" : "false"); @@ -258,7 +374,8 @@ class CudaKernelGenerator : private OptOutConstDispatch { void handle(const Double* d) final { const auto def = d->definition(); - if (print_inline_ && def != nullptr) { + const bool has_alloc = alloc_map_.find(d) != alloc_map_.end(); + if (def != nullptr && !has_alloc) { code_ << "(" << gen(def) << ")"; } else if (d->isConst()) { const int digits = std::numeric_limits::max_digits10; @@ -270,8 +387,9 @@ class CudaKernelGenerator : private OptOutConstDispatch { void handle(const Int* i) final { const auto def = i->definition(); - if (print_inline_ && def != nullptr) { - code_ << "(" << gen(def) << ")"; + const bool has_alloc = alloc_map_.find(i) != alloc_map_.end(); + if (def != nullptr && !has_alloc) { + code_ << "(" << genInline(def) << ")"; } else if (i->isConst()) { code_ << *i->value(); } else { @@ -279,6 +397,20 @@ class CudaKernelGenerator : private OptOutConstDispatch { } } + void handle(const ComplexDouble* c) final { + const auto def = c->definition(); + const bool has_alloc = alloc_map_.find(c) != alloc_map_.end(); + if (def != nullptr && !has_alloc) { + code_ << "(" << gen(def) << ")"; + } else if (c->isConst()) { + const int digits = std::numeric_limits::max_digits10; + code_ << "std::complex" << std::setprecision(digits) + << *c->value(); + } else { + code_ << varName(c); + } + } + void handle(const NamedScalar* ns) final { // dim3 components are unsigned int. 
Cast to signed integer to // support negative indexing @@ -291,24 +423,27 @@ class CudaKernelGenerator : private OptOutConstDispatch { } void handle(const kir::TensorIndex* ti) final { - code_ << varName(ti->view()) << "["; - bool first = true; + std::stringstream index; for (auto* ind : ti->indices()) { if (!ind->isZeroInt()) { if (!first) { - code_ << " + "; + index << " + "; } - code_ << genInline(ind); + index << genInline(ind); first = false; } } if (first) { - code_ << "0"; + index << "0"; } - - code_ << "]"; + bool is_volatile = ti->view()->getMemoryType() == MemoryType::Global && + kernel_->summary().sync_map.needsRawSync(ti->view()).hasBID(); + if (is_volatile) { + code_ << "*(volatile " << ti->getDataType().value() << "*)&"; + } + code_ << varName(ti->view()) << "[" << index.str() << "]"; } void handle(const IterDomain*) final { @@ -327,6 +462,21 @@ class CudaKernelGenerator : private OptOutConstDispatch { bool is_vector_op = false; size_t vector_word_size = 1; + if (uop->out()->isA()) { + auto out_tv = uop->out()->as()->view(); + if (std::any_of( + out_tv->domain()->domain().begin(), + out_tv->domain()->domain().end(), + [&](IterDomain* id) { return id->isMma(); })) { + auto mma = dynamic_cast( + uop->out()->as()->view()->definition()); + TORCH_INTERNAL_ASSERT( + mma != nullptr, "CodeGen: mma op not in mma loop"); + genMmaInitialization(mma, uop); + return; + } + } + if (vectorize_scope_ && uop->out()->isA()) { auto ti = uop->out()->as(); @@ -370,26 +520,77 @@ class CudaKernelGenerator : private OptOutConstDispatch { uop->out()->dtype() == uop->in()->dtype(), "Vectorized store/load requires input and output datatypes match."); } - } - if (is_vector_op) { - if (uop->in()->isScalar()) { - indent() << "reinterpret_cast<" - << "Array<" << uop->out()->dtype() << ", " << vector_word_size - << ">*>" - << "(&" << gen(uop->out()) << ")->set(" << gen(uop->in()) - << ");\n"; - } else { - indent() << "*reinterpret_cast<" - << "Array<" << uop->out()->dtype() << ", " << vector_word_size - << ">*>" - << "(&" << gen(uop->out()) << ")" - << " = *reinterpret_cast<" - << "Array<" << uop->in()->dtype() << ", " << vector_word_size - << ">*>" - << "(&" << gen(uop->in()) << ");\n"; + if (is_vector_op) { + auto out_tv = uop->out()->as()->view(); + if (uop->in()->isScalar()) { + // Note: + // Double buffered local tensors need indexed initialization, + // so will need to use `arraySet` option. + if (out_tv->getMemoryType() == MemoryType::Local && + !out_tv->isDoubleBuffered()) { + // Vectorized initialization + indent() << varName(out_tv) << ".set(" << gen(uop->in()) << ");\n"; + } else { + // Note: currently arraySet option is not vectorized, so it will + // rely on auto vectorization pass of cuda compiler. 
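+            // The emitted call has the form:
+            //   arraySet<T, vector_word_size>(&out[index], (T)scalar);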
+ indent() << "arraySet<" << out_tv->getDataType().value() << ", " + << vector_word_size << ">(&" << gen(uop->out()) << ", " + << "(" << out_tv->getDataType().value() << ")" + << gen(uop->in()) << ");\n"; + } + } else { + // Vectorized load + TORCH_INTERNAL_ASSERT( + uop->in()->isA(), + "Invalid input to unary op with tensor output, found: ", + uop->in()->toString()); + + auto in_tv = uop->in()->as()->view(); + bool localToGlobal = out_tv->getMemoryType() == MemoryType::Global && + in_tv->getMemoryType() == MemoryType::Local; + + bool globalToLocal = out_tv->getMemoryType() == MemoryType::Local && + in_tv->getMemoryType() == MemoryType::Global; + + bool globalToGlobal = out_tv->getMemoryType() == MemoryType::Global && + in_tv->getMemoryType() == MemoryType::Global; + + bool is_volatile_to = out_tv->getMemoryType() == MemoryType::Global && + kernel_->summary().sync_map.needsRawSync(out_tv).hasBID(); + + bool is_volatile_from = + in_tv->getMemoryType() == MemoryType::Global && + kernel_->summary().sync_map.needsRawSync(in_tv).hasBID(); + + if (localToGlobal) { + indent() << "loadLocalToGlobal<" << uop->out()->dtype() << ", " + << vector_word_size << ", " + << (is_volatile_to ? "true" : "false") << ">("; + code_ << " &" << gen(uop->out()) << ", &" << gen(uop->in()) + << ");\n"; + } else if (globalToLocal) { + indent() << "loadGlobalToLocal<" << uop->out()->dtype() << ", " + << vector_word_size << ", " + << (is_volatile_from ? "true" : "false") << ">(&" + << gen(uop->out()) << ", "; + code_ << " &" << gen(uop->in()) << ");\n"; + } else if (globalToGlobal) { + indent() << "loadGlobalToGlobal<" << uop->out()->dtype() << ", " + << vector_word_size << ", " + << (is_volatile_to ? "true" : "false") << ", " + << (is_volatile_from ? "true" : "false") << ">("; + code_ << " &" << gen(uop->out()) << ", "; + code_ << " &" << gen(uop->in()) << ");\n"; + } else { + indent() << "loadGeneric<" << uop->out()->dtype() << ", " + << vector_word_size << ">("; + code_ << " &" << gen(uop->out()) << ", "; + code_ << " &" << gen(uop->in()) << ");\n"; + } + } + return; } - return; } if (uop->out()->isA()) { @@ -469,6 +670,9 @@ class CudaKernelGenerator : private OptOutConstDispatch { if (integer_op_str(op_type) && isIntegralType(out->dtype())) { auto int_op = integer_op_str(op_type); expr << *int_op; + } else if (bool_op_str(op_type) && isBooleanType(out->dtype())) { + auto bool_op = bool_op_str(op_type); + expr << *bool_op; } else { expr << op_type; if (needFloatSuffix(op_type) && out->dtype() == DataType::Float) { @@ -620,6 +824,10 @@ class CudaKernelGenerator : private OptOutConstDispatch { if (integer_op_str(op_type) && isIntegralType(bop->out()->dtype())) { auto int_op = integer_op_str(op_type); code_ << " = " << *int_op << "(\n"; + } else if ( + bool_op_str(op_type) && isBooleanType(bop->out()->dtype())) { + auto bool_op = bool_op_str(op_type); + code_ << " = " << *bool_op << "(\n"; } else { std::stringstream op_str; op_str << op_type; @@ -667,6 +875,74 @@ class CudaKernelGenerator : private OptOutConstDispatch { } } + std::string genArchString(MmaOptions options) { + std::stringstream ss; + if (isVolta(options.macro)) { + ss << "Volta"; + } else if (isTuring(options.macro)) { + ss << "Turing"; + } else if (isAmpere(options.macro)) { + ss << "Ampere"; + } else { + TORCH_INTERNAL_ASSERT(false, "mma macro unknown arch"); + } + return ss.str(); + } + + std::string genMmaOp(const MmaOp* mma, bool init = false) { + std::stringstream ss; + auto options = mma->options(); + ss << genArchString(options) << "::"; + if 
(init) { + ss << "init"; + } + ss << toString(options.macro) << toString(options.operand_layout); + // TODO: additional parameter could be removed by swizzling iterdomain + auto acc_stride = mma->accStride(); + TORCH_INTERNAL_ASSERT(acc_stride > 0); + ss << "<" << acc_stride << ">"; + return ss.str(); + } + + void genMmaOperands(const MmaOp* mma) { + std::stringstream ss; + auto options = mma->options(); + auto in_a = mma->inA()->as()->view(); + auto dtype = in_a->getDataType().value(); + indent() << kTab << "reinterpret_cast*>(&" + << gen(mma->inA()) << "),\n"; + indent() << kTab << "reinterpret_cast*>(&" + << gen(mma->inB()) << ")"; + } + + void genMmaInitialization(const MmaOp* mma, const UnaryOp* uop) { + auto options = mma->options(); + + indent() << genMmaOp(mma, true) << "(reinterpret_castout()->getDataType().value() << "," + << getOutputRegisterSize(mma->options().macro) << "," + << getOutputRegisterSize(mma->options().macro) << ">*>" + << "(&" << gen(uop->out()) << "));\n"; + } + + void handle(const MmaOp* mma) final { + auto options = mma->options(); + auto in_a = mma->inA()->as(); + auto out = mma->out()->as(); + indent() << genMmaOp(mma) << "(\n"; + indent() << kTab << "reinterpret_castview()->getDataType().value() << "," + << getOutputRegisterSize(options.macro) << "," + << getOutputRegisterSize(options.macro) << ">*>(&" + << gen(mma->out()) << "),\n"; + genMmaOperands(mma); + code_ << ");\n"; + } + std::string genReductionOp(BinaryOpType op_type, Val* out) { std::stringstream lambda; DataType data_type = out->dtype(); @@ -870,7 +1146,7 @@ class CudaKernelGenerator : private OptOutConstDispatch { indent() << data_type << " " << "block_result_var_" << block_reduce_name_ << " = " << gen(wop->initVar()) << ";\n"; - indent() << DataType::Int << " " + indent() << out_N->dtype() << " " << "block_result_n_" << block_reduce_name_ << " = " << gen(wop->initN()) << ";\n"; } @@ -900,7 +1176,7 @@ class CudaKernelGenerator : private OptOutConstDispatch { << "*>(shared_mem_avg),\n"; indent() << kTab << "reinterpret_cast<" << data_type << "*>(shared_mem_var),\n"; - indent() << kTab << "reinterpret_cast<" << DataType::Int + indent() << kTab << "reinterpret_cast<" << out_N->dtype() << "*>(shared_mem_n),\n"; TORCH_INTERNAL_ASSERT(wop->predicate() != nullptr); TORCH_INTERNAL_ASSERT( @@ -921,8 +1197,11 @@ class CudaKernelGenerator : private OptOutConstDispatch { std::string generateGridReduceTemplateFlags( const REDUCTION_OP* rop, const ParallelTypeBitmap& thread_pred) { + TORCH_INTERNAL_ASSERT( + !rop->isFused(), "This is not for the fused reduction kernel\n"); + const auto par_domains = ir_utils::getParallelDomains(rop->outputs()[0]); - std::stringstream flags; + ArgumentBuilder flags; for (const ParallelType pt : kParallelTypeThreads) { const bool parallel_reduction = par_domains.find(pt) != par_domains.end() && @@ -941,10 +1220,7 @@ class CudaKernelGenerator : private OptOutConstDispatch { } else { flag = !pred && !parallel_reduction; } - if (pt != kParallelTypeThreads[0]) { - flags << ", "; - } - flags << (flag ? 
"true" : "false"); + flags.arg(flag); } return flags.str(); } @@ -967,6 +1243,11 @@ class CudaKernelGenerator : private OptOutConstDispatch { grop->reduction_buffer()->buffer()->as(); const auto sync_buffer = grop->sync_buffer()->buffer()->as(); + if (rop->isFused()) { + generateFusedGridReduction(grop); + return; + } + const std::string flags_str = generateGridReduceTemplateFlags(rop, grop->threadPredicate()); @@ -974,33 +1255,108 @@ class CudaKernelGenerator : private OptOutConstDispatch { kernel_->summary().has_cooperative_grid_reduction; // Since block-level reduction is already done, those dimensions - // with tidx/y/z being true do not participate in the grid reduction. - indent() << "reduction::gridReduce<" << flags_str << ", " - << (persistent_sync ? "true" : "false") << ">(\n"; - indent() << kTab << gen(rop->out()) << ",\n"; + // with tidx/y/z being true do not participate in the grid + // reduction. + ArgumentBuilder template_args; + template_args.arg(flags_str).arg(persistent_sync); + + ArgumentBuilder func_args(block_nest_level_ + 1, kTab); + func_args.arg(gen(rop->out())); if (domain->hasBlockReduction()) { - indent() << kTab << "block_result_" << block_reduce_name_ << ",\n"; + func_args.arg("block_result_").append(block_reduce_name_); block_reduce_name_++; } else { - indent() << kTab << gen(rop->in()) << ",\n"; + func_args.arg(gen(rop->in())); } - indent() << kTab << genReductionOp(op_type, out) << ",\n"; - indent() << kTab << "&" << varName(work_buffer) << "[0],\n"; - indent() << kTab << varName(sync_buffer) << ",\n"; - indent() << kTab << "static_cast<" << data_type << "*>(shared_mem),\n"; + func_args.arg(genReductionOp(op_type, out)); + func_args.arg("&").append(varName(work_buffer)).append("[0]"); + func_args.arg(varName(sync_buffer)); + func_args.arg(genCall("static_cast", ptrType(data_type), "shared_mem")); + // read and write predicates TORCH_INTERNAL_ASSERT( grop->predicate() != nullptr && grop->predicate()->hasValue()); - auto read_pred = genInline(grop->predicate()); - indent() << kTab << read_pred << ",\n"; + const auto read_pred = genInline(grop->predicate()); + func_args.arg(read_pred); if (grop->writePredicate() != nullptr) { TORCH_INTERNAL_ASSERT(grop->writePredicate()->hasValue()); - auto write_pred = genInline(grop->writePredicate()); - indent() << kTab << write_pred << ",\n"; + func_args.arg(genInline(grop->writePredicate())); } else { - indent() << kTab << read_pred << ",\n"; + func_args.arg(read_pred); } - indent() << kTab << data_type << "(" - << genInline(grop->reduction_op()->init()) << "));\n"; + // Init val + func_args.arg(genCall(data_type, genInline(grop->reduction_op()->init()))); + + indent() << "reduction::gridReduce<" << template_args << ">(\n"; + indent() << kTab << func_args << ");\n"; + } + + std::string genFusedReductionName(const kir::TensorIndex* reduction_out) { + return varName(reduction_out->view()) + "_reduction"; + } + + void generateFusedGridReduction(const kir::GridReduction* grop) { + const auto rop = grop->reduction_op(); + TORCH_INTERNAL_ASSERT(rop->isFused()); + + const auto out = rop->out()->as(); + const auto domain = out->view()->domain(); + + const auto data_type = rop->out()->dtype(); + const auto op_type = rop->getReductionOpType(); + + const auto work_buffer = + grop->reduction_buffer()->buffer()->as(); + const auto sync_buffer = grop->sync_buffer()->buffer()->as(); + + const auto reduction_name = genFusedReductionName(out); + + // template + // __device__ __inline__ void reduce( + // RefTuple out, + // const LocalTuple& 
inp, + // VolatilePtrTuple global_work_buffer, + // int64_t* global_sync_buffer, // Allocated as product of all + // // non-participating Grid dimension + // PtrTuple shared_buf, + // bool read_pred, // Prevent reading from out of bounds memory + // bool write_pred, // Prevent from writing out of bounds + // const LocalTuple& init_val, + // Func reduction_op); + + indent() << reduction_name << ".reduce(\n"; + + ArgumentBuilder func_args(block_nest_level_ + 1, kTab); + // out + func_args.arg(genCall("RefTuple", data_type, gen(rop->out()))); + // inp + func_args.arg(genCall("ConstRefTuple", data_type, gen(rop->in()))); + // global_work_buffer + func_args.arg(genCall( + "VolatilePtrTuple", data_type, "&" + varName(work_buffer) + "[0]")); + // global_sync_buffer + func_args.arg("&").append(varName(sync_buffer)).append("[0]"); + // shared_buf + func_args.arg(genCall( + "PtrTuple", + data_type, + genCall("static_cast", ptrType(data_type), "shared_mem"))); + // read and write predicates + TORCH_INTERNAL_ASSERT( + grop->predicate() != nullptr && grop->predicate()->hasValue()); + const auto read_pred = genInline(grop->predicate()); + auto write_pred = read_pred; + if (grop->writePredicate() != nullptr) { + TORCH_INTERNAL_ASSERT(grop->writePredicate()->hasValue()); + write_pred = genInline(grop->writePredicate()); + } + func_args.arg(read_pred).arg(write_pred); + // init_val + func_args.arg(genCall( + "LocalTuple", data_type, genInline(grop->reduction_op()->init()))); + // reduction_op + func_args.arg(genReductionOp(op_type, out)); + + indent() << kTab << func_args << ");\n"; } void handle(const kir::GridBroadcast* grop) final { @@ -1066,6 +1422,11 @@ class CudaKernelGenerator : private OptOutConstDispatch { const auto n_buffer = gwop->N_buffer()->buffer()->as(); const auto sync_buffer = gwop->sync_buffer()->buffer()->as(); + if (wop->isFused()) { + generateFusedGridWelford(gwop); + return; + } + const bool persistent_sync = kernel_->summary().has_cooperative_grid_reduction; @@ -1119,76 +1480,188 @@ class CudaKernelGenerator : private OptOutConstDispatch { indent() << kTab << data_type << "(0));\n"; } + void generateFusedGridWelford(const kir::GridWelford* gwop) { + const auto wop = gwop->welford_op(); + TORCH_INTERNAL_ASSERT(wop->isFused()); + + const auto out = wop->out()->as(); + const auto domain = out->view()->domain(); + + const auto data_type = wop->outAvg()->dtype(); + const auto index_type = wop->outN()->dtype(); + TORCH_INTERNAL_ASSERT(wop->outAvg()->dtype() == wop->outVar()->dtype()); + + ArgumentBuilder data_type_args; + data_type_args.arg(data_type).arg(data_type).arg(index_type); + + const auto sync_buffer = gwop->sync_buffer()->buffer()->as(); + + const auto reduction_name = genFusedReductionName(out); + + // template + // __device__ __inline__ void reduce( + // RefTuple out, + // const LocalTuple& inp, + // VolatilePtrTuple global_work_buffer, + // int64_t* global_sync_buffer, // Allocated as product of all + // // non-participating Grid dimension + // PtrTuple shared_buf, + // bool read_pred, // Prevent reading from out of bounds memory + // bool write_pred, // Prevent from writing out of bounds + // const LocalTuple& init_val, + // Func reduction_op); + + ArgumentBuilder out_args; + out_args.arg(gen(wop->outAvg())); + out_args.arg(gen(wop->outVar())); + out_args.arg(gen(wop->outN())); + + ArgumentBuilder in_args; + in_args.arg(gen(wop->inAvg())); + if (wop->inVar() != nullptr) { + in_args.arg(gen(wop->inVar())); + } else { + in_args.arg("(").append(data_type).append(")0"); + } + 
in_args.arg(gen(wop->inN())); + + ArgumentBuilder init_args; + init_args.arg(gen(wop->initAvg())); + init_args.arg(gen(wop->initVar())); + init_args.arg(gen(wop->initN())); + + ArgumentBuilder work_buffer_args; + work_buffer_args.arg("&") + .append(varName(gwop->avg_buffer()->buffer()->as())) + .append("[0]"); + work_buffer_args.arg("&") + .append(varName(gwop->var_buffer()->buffer()->as())) + .append("[0]"); + work_buffer_args.arg("&") + .append(varName(gwop->N_buffer()->buffer()->as())) + .append("[0]"); + + ArgumentBuilder smem_buffer_args; + smem_buffer_args.arg( + genCall("reinterpret_cast", ptrType(data_type), "shared_mem_avg")); + smem_buffer_args.arg( + genCall("reinterpret_cast", ptrType(data_type), "shared_mem_var")); + smem_buffer_args.arg( + genCall("reinterpret_cast", ptrType(index_type), "shared_mem_n")); + + ArgumentBuilder func_args(block_nest_level_ + 1, kTab); + // out + func_args.arg(genCall("RefTuple", data_type_args, out_args)); + // inp + func_args.arg(genCall("ConstRefTuple", data_type_args, in_args)); + // global_work_buffer + func_args.arg( + genCall("VolatilePtrTuple", data_type_args, work_buffer_args)); + // global_sync_buffer + func_args.arg("&").append(varName(sync_buffer)).append("[0]"); + // shared_buf + func_args.arg(genCall("PtrTuple", data_type_args, smem_buffer_args)); + // read and write predicates + TORCH_INTERNAL_ASSERT( + gwop->predicate() != nullptr && gwop->predicate()->hasValue()); + const auto read_pred = genInline(gwop->predicate()); + auto write_pred = read_pred; + if (gwop->writePredicate() != nullptr) { + TORCH_INTERNAL_ASSERT(gwop->writePredicate()->hasValue()); + write_pred = genInline(gwop->writePredicate()); + } + func_args.arg(read_pred).arg(write_pred); + // init_val + func_args.arg(genCall("LocalTuple", data_type_args, init_args)); + // reduction_op + func_args.arg(genTemplate( + "welfordCombine", ArgumentBuilder().arg(data_type).arg(index_type))); + + indent() << reduction_name << ".reduce(\n"; + indent() << kTab << func_args << ");\n"; + } + + void handle(const kir::AllocateFusedReduction* alloc_fused_reduction) final { + // See the runtime file of the fused reduction + enum class ReductionParallelTypeState { Reduce, Iter, Pred, Inactive }; + + using ReductionParallelTypeStateArray = + ParallelTypeMap; + + ReductionParallelTypeStateArray states( + ReductionParallelTypeState::Inactive); + + for (const ParallelType pt : kParallelTypeThreads) { + // It may be better to predicate grid reductions on dimensions they don't + // actively use, however since that should generally be discouraged (they + // should be part of the iter portion of the operation, or they should be + // predciated out) we're just going to assume they're part of the iter + // dimension. This would cause more communication than strictly necessary + // but should not be a common use case. + auto pt_dim = kernel_->summary().parallel_dimension_map_.get(pt); + if (pt_dim == nullptr || pt_dim->isOneInt()) { + continue; + } + // Initialize pt_dim if used to an iter dimension. It may change to a + // reduction or predicated dimension later. + states[pt] = ReductionParallelTypeState::Iter; + } + + for (auto id : alloc_fused_reduction->out()->view()->domain()->domain()) { + auto pt = id->getParallelType(); + if (isParallelTypeThread(pt)) { + auto state = id->isReduction() ? 
ReductionParallelTypeState::Reduce + : ReductionParallelTypeState::Iter; + states[pt] = state; + } + } + + for (const auto predicated_pt : alloc_fused_reduction->threadPredicate()) { + auto& state = states[predicated_pt]; + TORCH_INTERNAL_ASSERT( + state != ReductionParallelTypeState::Reduce, + "Invalid thread predication: ", + predicated_pt); + state = ReductionParallelTypeState::Pred; + } + + ArgumentBuilder flags; + for (auto pt : kParallelTypeThreads) { + flags.arg(static_cast(states[pt])); + } + + // Persistent + flags.arg(true); + + // Broadcast is fused + flags.arg(true); + + const auto reduction_name = + genFusedReductionName(alloc_fused_reduction->out()); + + indent() << genTemplate("fused_reduction::ParallelReduce", flags) << " " + << reduction_name << ";\n"; + } + void handleScope(const kir::Scope& scope) { for (auto expr : scope.exprs()) { OptOutConstDispatch::handle(expr); } } - void handle(const kir::ForLoop* loop) final { - if (loop->iter_domain()->isBroadcast()) { - handleScope(loop->body()); - return; - } else if (loop->vectorize()) { + void handleTrivialLoop(const kir::ForLoop* loop) { + if (loop->vectorize()) { vectorize_scope_ = loop->vectorize(); - handleScope(loop->body()); - vectorize_scope_ = false; - return; - } else if (loop->iter_domain()->isStride()) { - // A stride domain only executes the loop body with the loop - // index being zero. - indent() << "constexpr " - << "nvfuser_index_t" - << " " << gen(loop->index()) << " = 0;\n"; - handleScope(loop->body()); - return; } - - // By default, a parallelized loop would look like: - // - // for (int x = threadIdx.x; x < stop; x += blockDim.x) { - // do_some_comp(x); - // } - // - // When stop is guaranteed to be smaller or equal to the number of - // threads, the for-loop is not necessary. In the above case, we - // would just generate the loop body without the for clause but - // references to the loop index replaced by the loop start value. - // - // When the loop end is the same as the IterDomain extent, the - // assumption can be safely made. This is more conservative than - // necessary since the loop stop value just needs to be <= the - // IterDomain extent. However, at this point, this conservative - // analysis seems sufficient. - if (loop->stop() == loop->iter_domain()->extent() && - loop->iter_domain()->isThread()) { - // Register a replacement of references to the loop index with - // the loop start value. - replacement_map_.insert({loop->index(), loop->start()}); - handleScope(loop->body()); - replacement_map_.erase(loop->index()); - return; + handleScope(loop->body()); + if (loop->vectorize()) { + vectorize_scope_ = false; } + } - if (loop->start()->isZeroInt() && loop->stop()->isOneInt()) { - indent() << "constexpr " - << "nvfuser_index_t" - << " " << gen(loop->index()) << " = 0;\n"; - handleScope(loop->body()); - return; - } else if ( - // Special case handling for a pattern where start == end - 1. 
- loop->start()->definition() != nullptr && - loop->start()->definition()->isA() && - loop->start()->definition()->as()->getBinaryOpType() == - BinaryOpType::Sub && - loop->start()->definition()->as()->lhs() == loop->stop() && - loop->start()->definition()->as()->rhs()->isOneInt()) { - indent() << "const " - << "nvfuser_index_t" - << " " << gen(loop->index()) << " = " << genInline(loop->start()) - << ";\n"; - handleScope(loop->body()); + void handle(const kir::ForLoop* loop) final { + if (loop->isTrivial()) { + handleTrivialLoop(loop); return; } @@ -1259,6 +1732,9 @@ class CudaKernelGenerator : private OptOutConstDispatch { void handle(const kir::Allocate* alloc) final { const auto buffer_dtype = alloc->buffer()->dtype(); + TORCH_INTERNAL_ASSERT(alloc->buffer() != nullptr); + alloc_map_.emplace(alloc->buffer(), alloc); + if (!alloc->buffer()->isA()) { indent() << buffer_dtype << " " << gen(alloc->buffer()) << ";\n"; return; @@ -1273,8 +1749,9 @@ class CudaKernelGenerator : private OptOutConstDispatch { // Allocate alias another Allocate stmt const auto alias_tv = alloc->alias()->buffer()->as(); indent() << "// Alias Allocation - " << alloc->memoryType() << "\n"; - indent() << buffer_dtype << "* " << varName(tv) << " = " - << varName(alias_tv) << ";\n"; + indent() << "auto& " << varName(tv) << " = " << varName(alias_tv) + << ";\n"; + } else { // Standard Memory Allocation switch (tv->getMemoryType()) { @@ -1284,11 +1761,23 @@ class CudaKernelGenerator : private OptOutConstDispatch { case MemoryType::Shared: if (kir::ExpressionEvaluator::isConst(size)) { // Static shared memory - indent() << "__shared__ " << buffer_dtype << " " << varName(tv) - << "[" << genInline(size) << "];\n"; + // Always align to 16B for tensorview buffers + // with any vectorized access. + // TODO: + // This path will be less commonly exercised once we + // start dynamically allocate all the tensors and + // might be removed in a follow up. 
+ auto va = kernel_->summary().vectorized_accesses; + if (va.count(tv)) { + indent() << "__align__(16) "; + } else { + indent(); + } + code_ << "__shared__ " << buffer_dtype << " " << varName(tv) << "[" + << genInline(size) << "];\n"; } else { // Align Offset Position - indent() << "offset = alignBufferSize(offset," + indent() << "offset = alignBufferSize(offset, " << dataTypeSize(buffer_dtype) << ");\n"; // Shared Memory Pointer indent() << buffer_dtype << "* " << varName(tv) @@ -1299,17 +1788,23 @@ class CudaKernelGenerator : private OptOutConstDispatch { << buffer_dtype << "));\n"; } break; - case MemoryType::Local: - indent() << buffer_dtype << " " << varName(tv) << "[" - << genInline(size) << "];\n"; - break; + case MemoryType::Local: { + auto va = kernel_->summary().vectorized_accesses; + if (va.find(tv) != va.end()) { + indent() << "Array<" << buffer_dtype << ", " << genInline(size) + << ", " << va.at(tv) << "> " << varName(tv) << ";\n"; + } else { + indent() << buffer_dtype << " " << varName(tv) << "[" + << genInline(size) << "];\n"; + } + } break; default: TORCH_INTERNAL_ASSERT(false, "Unexpected memory type"); } } } - void handle(const kir::Sync*) final { + void handle(const kir::BlockSync*) final { // Use a custom synchronization method if enabled if (std::getenv("PYTORCH_NVFUSER_USE_BLOCK_SYNC_ATOMIC")) { indent() << "block_sync::sync();\n"; @@ -1318,6 +1813,31 @@ class CudaKernelGenerator : private OptOutConstDispatch { } } + void handle(const kir::GridSync* sync) final { + // Use a custom synchronization method if enabled + bool bidx = sync->syncDims().get(ParallelType::BIDx); + bool bidy = sync->syncDims().get(ParallelType::BIDy); + bool bidz = sync->syncDims().get(ParallelType::BIDz); + auto bool2str = [](bool b) { return (b ? "true" : "false"); }; + std::stringstream sync_str; + sync_str << bool2str(bidx) << ", " << bool2str(bidy) << ", " + << bool2str(bidz); + + std::stringstream sync_segment_size; + sync_segment_size << "index_utils::maskedSize<" << sync_str.str() + << ">(gridDim)"; + + std::stringstream sync_idx; + sync_idx << "index_utils::maskedOffset<" << bool2str(!bidx) << ", " + << bool2str(!bidy) << ", " << bool2str(!bidz) + << ">(gridDim, blockDim)"; + + indent() << "grid_sync::sync<" << sync_str.str() << ", true>(\n"; + indent() << " " << varName(sync->syncBuffer()) << "[" << sync_idx.str() + << "],\n"; + indent() << " " << sync_segment_size.str() << ");\n"; + } + void handle(const kir::InitMagicZero*) final { indent() << "NVFUSER_DEFINE_MAGIC_ZERO\n"; } @@ -1336,8 +1856,9 @@ class CudaKernelGenerator : private OptOutConstDispatch { // Mark when we are inside of a vectorized for-loop bool vectorize_scope_ = false; - //! Holds active replacement mappings during codegen - std::unordered_map replacement_map_; + //! Keep track of Allocate node for Val. Used to determine if Val + //! should be inlined. 
+ std::unordered_map alloc_map_; }; } // namespace diff --git a/torch/csrc/jit/codegen/cuda/compute_at.cpp b/torch/csrc/jit/codegen/cuda/compute_at.cpp index f51e0fe1bc9e98..306f631194f7b5 100644 --- a/torch/csrc/jit/codegen/cuda/compute_at.cpp +++ b/torch/csrc/jit/codegen/cuda/compute_at.cpp @@ -785,16 +785,14 @@ void ComputeAt::updateSiblings() { id->parallelize(sibling_id->getParallelType()); } } - if (tv->getComputeAtPosition() > sibling_tv->getComputeAtPosition()) { - auto sibling_domain = TransformReplay::fullSelfReplay( - sibling_tv->domain(), tv->domain()); - validateDomain(sibling_tv, sibling_domain); - sibling_tv->setDomain(sibling_domain); - sibling_tv->setComputeAt(tv->getComputeAtPosition()); - sibling_tv->setMaxProducer(tv->getMaxProducerPosition()); - auto consumer_tvs = ir_utils::consumerTvsOf(sibling_tv); - consumers_to_update.insert(consumer_tvs.begin(), consumer_tvs.end()); - } + auto sibling_domain = + TransformReplay::fullSelfReplay(sibling_tv->domain(), tv->domain()); + validateDomain(sibling_tv, sibling_domain); + sibling_tv->setDomain(sibling_domain); + sibling_tv->setComputeAt(tv->getComputeAtPosition()); + sibling_tv->setMaxProducer(tv->getMaxProducerPosition()); + auto consumer_tvs = ir_utils::consumerTvsOf(sibling_tv); + consumers_to_update.insert(consumer_tvs.begin(), consumer_tvs.end()); } } diff --git a/torch/csrc/jit/codegen/cuda/compute_at_map.cpp b/torch/csrc/jit/codegen/cuda/compute_at_map.cpp index f46a7495163024..0269c890ba0f5d 100644 --- a/torch/csrc/jit/codegen/cuda/compute_at_map.cpp +++ b/torch/csrc/jit/codegen/cuda/compute_at_map.cpp @@ -256,10 +256,15 @@ void ComputeAtMap::build(Fusion* fusion, GpuLower* gpu_lower) { if (first_output_tv == nullptr) { first_output_tv = c_tv; } else { - // Map multi outputs of an expression to eachother. c is current output, - // and f as first output. Keep consistent with the later section of - // producer and consumers. Which here producer is now "first output", - // and consumer is still consumer. + // Map multi outputs of an expression to each other. c is current + // output, and f as first output. Keep consistent with the later section + // of producer and consumers. Which here producer is now "first output", + // and consumer is still consumer. One exception is how the + // domains left of CA positions are handled in the Parallel + // map. Those domains are not mapped in producer and consumer + // mappings as they do not share loops, but are mapped in the + // case of mapping multiple outputs since they do share the + // same loops. TORCH_INTERNAL_ASSERT( c_tv->getRootDomain().size() == @@ -282,35 +287,14 @@ void ComputeAtMap::build(Fusion* fusion, GpuLower* gpu_lower) { auto c2f_map = replay_FasC.getReplay(); - // If we're creating parallel map, only map the leaf - // axes. Also, the producer axis must be left of the CA - // point. - // Otherwise, map the entire replay map. 
- if (mapping_mode_ == MappingMode::PARALLEL) { - // Mark axes left of compute at point for parallel type tracking - std::unordered_set producer_axes_to_map( - first_output_tv->domain()->domain().begin(), - first_output_tv->domain()->domain().begin() + - first_output_tv->getComputeAtPosition()); - - for (auto c_id : c_tv->domain()->domain()) { - auto it = c2f_map.find(c_id); - if (it == c2f_map.end()) { - continue; - } - auto f_id = it->second; - if (producer_axes_to_map.find(f_id) == producer_axes_to_map.end()) { - continue; - } - mapIds(f_id, c_id); - } - } else { - for (auto entry : c2f_map) { - auto c_id = entry.first; - auto f_id = entry.second; - // Map the id's together - mapIds(f_id, c_id); - } + // Map the entire replay map between the multiple + // consumers even for the Parallel map as they share the same + // loop. + for (auto entry : c2f_map) { + auto c_id = entry.first; + auto f_id = entry.second; + // Map the id's together + mapIds(f_id, c_id); } } @@ -457,16 +441,42 @@ void ComputeAtMap::build(Fusion* fusion, GpuLower* gpu_lower) { int max_concrete_count = -1; int max_broadcast_count = -1; IterDomain* concrete_id = nullptr; + + // Prefer domains appearing after rfactor domains. This matters + // when view merges domains to create a new domain, which becomes + // an rfactor domain. Suppose a broadcast follows the view + // operation and the broadcast domain is merged with the domain + // matching with the rfactor domain, that domain should be chosen + // as the concrete domain as it has the broadcast domain and the + // domain matching with the rfactor domain. The concrete domain + // does not have a history of merge/shift further up from the + // rfactor domain in pre-view tensors, but that should be fine as + // IndexCompute with those pre-view tensors should be able to + // compute indices from their leaf domains. + // See issue #1493 + + // Indicate if the previous ID was an rfactor domain + bool rf_detected = false; for (auto id : *set) { - int concrete_count = n_concrete_ids_.at(id); - if (concrete_count >= max_concrete_count) { - int broadcast_count = n_broadcast_ids_.at(id); - if (concrete_count > max_concrete_count || - broadcast_count > max_broadcast_count) { - max_concrete_count = concrete_count; - max_broadcast_count = broadcast_count; - concrete_id = id; + // If the previous ID is an rfactor, reset the concrete ID with + // this ID no matter how many IDs the previous concrete ID has. 
+ if (rf_detected) { + concrete_id = id; + max_concrete_count = n_concrete_ids_.at(id); + max_broadcast_count = n_broadcast_ids_.at(id); + rf_detected = id->isRFactorProduct(); + } else { + int concrete_count = n_concrete_ids_.at(id); + if (concrete_count >= max_concrete_count) { + int broadcast_count = n_broadcast_ids_.at(id); + if (concrete_count > max_concrete_count || + broadcast_count > max_broadcast_count) { + max_concrete_count = concrete_count; + max_broadcast_count = broadcast_count; + concrete_id = id; + } } + rf_detected = id->isRFactorProduct(); } } diff --git a/torch/csrc/jit/codegen/cuda/contiguity.cpp b/torch/csrc/jit/codegen/cuda/contiguity.cpp new file mode 100644 index 00000000000000..780e4298c6bf52 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/contiguity.cpp @@ -0,0 +1,164 @@ +#include +#include + +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +ContigIDs::ContigIDs( + const std::vector& ids, + const std::vector& root_domain, + const std::vector& root_contiguity) + : root_domain_(root_domain), root_contiguity_(root_contiguity) { + if (ids.empty()) { + return; + } + + TORCH_INTERNAL_ASSERT( + root_domain_.size() == root_contiguity_.size(), + "Arguments don't match ", + root_domain_.size(), + " != ", + root_contiguity_.size()); + + TORCH_INTERNAL_ASSERT( + GpuLower::current() != nullptr, "GpuLower is not found"); + + for (const auto i : c10::irange(root_domain_.size())) { + auto root_domain_i = root_domain_[i]->as(); + // If a root domain has halo, can't use merged domain even if + // both inputs are contiguous. HaloInfo is also initialized for + // rfactor root domains, which should just return "zero" + // RootAxisInfo. This should be safe as no rfactor tensor should + // need halo. + if (root_contiguity_[i] && + !GpuLower::current() + ->haloInfo() + .getRootAxisInfo(root_domain_i) + .hasHalo()) { + contig_ids_.emplace(root_domain_i); + is_contig_root_[root_domain_i] = true; + within_contig_ids_[root_domain_i] = std::unordered_set(); + } else { + is_contig_root_[root_domain_i] = false; + } + root_to_indexed_id_[root_domain_i] = root_domain_i; + } + + auto exprs = StmtSort::getExprs(ids[0]->fusion(), {ids.begin(), ids.end()}); + + for (auto expr : exprs) { + handle(expr); + } +} + +void ContigIDs::handle(Merge* merge) { + // If either input is non-contiguous so is output. + const auto inner = merge->inner(); + const auto outer = merge->outer(); + + if (!isContig(inner) || !isContig(outer)) { + return; + } + + // Grab inputs, make sure they're in root domain, check if they're + // contiguous. 
+
+  auto lhs_inputs =
+      ir_utils::iterDomainInputsOfOrderedAs({outer}, root_domain_);
+  auto rhs_inputs =
+      ir_utils::iterDomainInputsOfOrderedAs({inner}, root_domain_);
+
+  TORCH_INTERNAL_ASSERT(
+      inRoot(lhs_inputs) && inRoot(rhs_inputs),
+      "Found an invalid merge operation, inputs of its arguments are not in the root domain.");
+
+  std::deque<IterDomain*> ordered_inputs(lhs_inputs.begin(), lhs_inputs.end());
+  ordered_inputs.insert(
+      ordered_inputs.end(), rhs_inputs.begin(), rhs_inputs.end());
+
+  // If any root input is not contig, output is not contig
+  if (!(std::all_of(
+          ordered_inputs.begin(),
+          ordered_inputs.end(),
+          [this](IterDomain* id) {
+            return is_contig_root_.at(id) && !id->isBroadcast() &&
+                !id->isReduction();
+          }))) {
+    return;
+  }
+
+  std::deque<IterDomain*> root_copy(root_domain_.begin(), root_domain_.end());
+
+  // Forward to first matching argument
+  while (!root_copy.empty() && !ordered_inputs.empty()) {
+    if (root_copy.front() != ordered_inputs.front()) {
+      root_copy.pop_front();
+    } else {
+      break;
+    }
+  }
+
+  // Forward through all matching arguments
+  while (!root_copy.empty() && !ordered_inputs.empty()) {
+    if (root_copy.front() == ordered_inputs.front()) {
+      root_copy.pop_front();
+      ordered_inputs.pop_front();
+      // This is no longer causing an error in:
+      // ReductionSchedulerMultiDimNonFastest TODO: test reenablement to make
+      // sure it does what's expected
+      // } else if (
+      //     root_copy.front()->isReduction() ||
+      //     root_copy.front()->isBroadcast()) {
+      //   root_copy.pop_front();
+    } else {
+      break;
+    }
+  }
+
+  // If we matched all inputs, the output is contiguous. Only want to keep the
+  // top contig ID, lower ids should be placed in the "within_contig_ids" map
+  // of top id.
+  auto out = merge->out()->as<IterDomain>();
+  if (ordered_inputs.empty()) {
+    if (contig_ids_.find(inner) != contig_ids_.end()) {
+      contig_ids_.erase(inner);
+    }
+
+    if (contig_ids_.find(outer) != contig_ids_.end()) {
+      contig_ids_.erase(outer);
+    }
+
+    contig_ids_.emplace(out);
+
+    std::unordered_set<IterDomain*> within_out;
+    within_out.emplace(inner);
+    if (within_contig_ids_.find(inner) != within_contig_ids_.end()) {
+      auto in_inner = within_contig_ids_.at(inner);
+      within_out.insert(in_inner.begin(), in_inner.end());
+      within_contig_ids_.erase(inner);
+    }
+
+    within_out.emplace(outer);
+    if (within_contig_ids_.find(outer) != within_contig_ids_.end()) {
+      auto in_outer = within_contig_ids_.at(outer);
+      within_out.insert(in_outer.begin(), in_outer.end());
+      within_contig_ids_.erase(outer);
+    }
+
+    within_contig_ids_[out] = within_out;
+
+    for (auto root : lhs_inputs) {
+      root_to_indexed_id_[root] = out;
+    }
+    for (auto root : rhs_inputs) {
+      root_to_indexed_id_[root] = out;
+    }
+  }
+}
+
+} // namespace cuda
+} // namespace fuser
+} // namespace jit
+} // namespace torch
diff --git a/torch/csrc/jit/codegen/cuda/contiguity.h b/torch/csrc/jit/codegen/cuda/contiguity.h
new file mode 100644
index 00000000000000..0379f0c5ecda37
--- /dev/null
+++ b/torch/csrc/jit/codegen/cuda/contiguity.h
@@ -0,0 +1,88 @@
+#pragma once
+
+#include
+
+#include
+
+namespace torch {
+namespace jit {
+namespace fuser {
+namespace cuda {
+
+// A merge is contiguous if:
+// Inputs of outer are to the left in the root domain of the inputs of RHS.
+// All inputs are contiguous in the root domain:
+// - All marked as contiguous
+// - Only gaps between inputs are broadcast or reduction dims
+// There are no split transformations performed on outer or inner
+// All transformations on outer or inner are contiguous merges
+// If these criteria hold, then we can index the input root domains of this
+// merge with the indexing provided to the output of the merge in the backward
+// index pass
+
+class ContigIDs : public OptInDispatch {
+ public:
+  ContigIDs() = delete;
+
+  // Check through the history of ids whose inputs map to root_domain with
+  // contiguity root_contiguity. Return unordered_set of all merges that are
+  // contiguous. Ignore root order is primarily used for predicate generation.
+  // In this case we can linearize indexing of any ID that only consists of
+  // merge operations.
+  ContigIDs(
+      const std::vector<IterDomain*>& ids,
+      const std::vector<IterDomain*>& root_domain,
+      const std::vector<bool>& root_contiguity);
+
+  const std::unordered_set<IterDomain*>& contigIDs() const {
+    return contig_ids_;
+  }
+
+  const std::unordered_map<IterDomain*, std::unordered_set<IterDomain*>>&
+  withinContigIDs() const {
+    return within_contig_ids_;
+  }
+
+  const std::unordered_map<IterDomain*, IterDomain*>& rootToIndexedID() const {
+    return root_to_indexed_id_;
+  }
+
+ private:
+  using OptInDispatch::handle;
+
+  bool inRoot(const std::vector<IterDomain*>& ids) {
+    return std::all_of(ids.begin(), ids.end(), [this](IterDomain* id) {
+      return is_contig_root_.find(id) != is_contig_root_.end();
+    });
+  }
+
+  bool isContig(IterDomain* id) {
+    return contig_ids_.find(id) != contig_ids_.end();
+  }
+
+  // Split outputs are not contiguous, don't need to do anything.
+  void handle(Split*) override {}
+
+  void handle(Merge* merge) override;
+
+ private:
+  //! Root domains to analyze contiguity
+  const std::vector<IterDomain*>& root_domain_;
+  //! Contiguity of root_domain_
+  const std::vector<bool>& root_contiguity_;
+  //! Mapping of root domain to bool indicating contiguity
+  std::unordered_map<IterDomain*, bool> is_contig_root_;
+  // Mark if ids are result of contiguous merges
+  std::unordered_set<IterDomain*> contig_ids_;
+  // Given contiguous domain, return all iter domains within its history.
+  std::unordered_map<IterDomain*, std::unordered_set<IterDomain*>>
+      within_contig_ids_;
+  //! Mapping of root domain to the actual indexed domain, which can
+  //! be itself or a contig merged domain if found.
+ std::unordered_map root_to_indexed_id_; +}; + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/dispatch.cpp b/torch/csrc/jit/codegen/cuda/dispatch.cpp index 1702de93bdd47e..dc7ac6403d657c 100644 --- a/torch/csrc/jit/codegen/cuda/dispatch.cpp +++ b/torch/csrc/jit/codegen/cuda/dispatch.cpp @@ -54,6 +54,9 @@ void Val::dispatch(T handler, Val* val) { case DataType::Int: ptr(handler)->handle(val->as()); return; + case DataType::ComplexDouble: + ptr(handler)->handle(val->as()); + return; default: break; } @@ -101,6 +104,9 @@ void Expr::dispatch(T handler, Expr* expr) { case ExprType::WelfordOp: ptr(handler)->handle(expr->as()); return; + case ExprType::MmaOp: + ptr(handler)->handle(expr->as()); + return; case ExprType::BroadcastOp: ptr(handler)->handle(expr->as()); return; @@ -120,6 +126,9 @@ void Expr::dispatch(T handler, Expr* expr) { case ExprType::GatherOp: ptr(handler)->handle(expr->as()); return; + case ExprType::ViewDtypeOp: + ptr(handler)->handle(expr->as()); + return; case ExprType::ViewOp: ptr(handler)->handle(expr->as()); return; @@ -127,8 +136,11 @@ void Expr::dispatch(T handler, Expr* expr) { case ExprType::Allocate: ptr(handler)->handle(expr->as()); return; - case ExprType::Sync: - ptr(handler)->handle(expr->as()); + case ExprType::BlockSync: + ptr(handler)->handle(expr->as()); + return; + case ExprType::GridSync: + ptr(handler)->handle(expr->as()); return; case ExprType::InitMagicZero: ptr(handler)->handle(expr->as()); @@ -151,6 +163,9 @@ void Expr::dispatch(T handler, Expr* expr) { case ExprType::GridWelford: ptr(handler)->handle(expr->as()); return; + case ExprType::AllocateFusedReduction: + ptr(handler)->handle(expr->as()); + return; default: TORCH_INTERNAL_ASSERT(false, "Unknown exprtype in dispatch!"); } @@ -180,6 +195,9 @@ void Val::constDispatch(T handler, const Val* val) { case DataType::Int: ptr(handler)->handle(val->as()); return; + case DataType::ComplexDouble: + ptr(handler)->handle(val->as()); + return; default: break; } @@ -227,6 +245,9 @@ void Expr::constDispatch(T handler, const Expr* expr) { case ExprType::WelfordOp: ptr(handler)->handle(expr->as()); return; + case ExprType::MmaOp: + ptr(handler)->handle(expr->as()); + return; case ExprType::BroadcastOp: ptr(handler)->handle(expr->as()); return; @@ -246,6 +267,9 @@ void Expr::constDispatch(T handler, const Expr* expr) { case ExprType::GatherOp: ptr(handler)->handle(expr->as()); return; + case ExprType::ViewDtypeOp: + ptr(handler)->handle(expr->as()); + return; case ExprType::ViewOp: ptr(handler)->handle(expr->as()); return; @@ -253,8 +277,11 @@ void Expr::constDispatch(T handler, const Expr* expr) { case ExprType::Allocate: ptr(handler)->handle(expr->as()); return; - case ExprType::Sync: - ptr(handler)->handle(expr->as()); + case ExprType::BlockSync: + ptr(handler)->handle(expr->as()); + return; + case ExprType::GridSync: + ptr(handler)->handle(expr->as()); return; case ExprType::InitMagicZero: ptr(handler)->handle(expr->as()); @@ -277,6 +304,9 @@ void Expr::constDispatch(T handler, const Expr* expr) { case ExprType::GridWelford: ptr(handler)->handle(expr->as()); return; + case ExprType::AllocateFusedReduction: + ptr(handler)->handle(expr->as()); + return; default: TORCH_INTERNAL_ASSERT(false, "Unknown exprtype in dispatch!"); } @@ -317,6 +347,9 @@ void Val::mutatorDispatch(T mutator, Val* val) { case DataType::Int: ptr(mutator)->mutate(val->as()); return; + case DataType::ComplexDouble: + ptr(mutator)->mutate(val->as()); + return; 
default: break; } @@ -364,6 +397,9 @@ void Expr::mutatorDispatch(T mutator, Expr* expr) { case ExprType::WelfordOp: ptr(mutator)->mutate(expr->as()); return; + case ExprType::MmaOp: + ptr(mutator)->mutate(expr->as()); + return; case ExprType::BroadcastOp: ptr(mutator)->mutate(expr->as()); return; @@ -383,6 +419,9 @@ void Expr::mutatorDispatch(T mutator, Expr* expr) { case ExprType::GatherOp: ptr(mutator)->mutate(expr->as()); return; + case ExprType::ViewDtypeOp: + ptr(mutator)->mutate(expr->as()); + return; case ExprType::ViewOp: ptr(mutator)->mutate(expr->as()); return; @@ -390,8 +429,11 @@ void Expr::mutatorDispatch(T mutator, Expr* expr) { case ExprType::Allocate: ptr(mutator)->mutate(expr->as()); return; - case ExprType::Sync: - ptr(mutator)->mutate(expr->as()); + case ExprType::BlockSync: + ptr(mutator)->mutate(expr->as()); + return; + case ExprType::GridSync: + ptr(mutator)->mutate(expr->as()); return; case ExprType::InitMagicZero: ptr(mutator)->mutate(expr->as()); @@ -414,6 +456,9 @@ void Expr::mutatorDispatch(T mutator, Expr* expr) { case ExprType::GridWelford: ptr(mutator)->mutate(expr->as()); return; + case ExprType::AllocateFusedReduction: + ptr(mutator)->mutate(expr->as()); + return; default: TORCH_INTERNAL_ASSERT(false, "Unknown exprtype in dispatch!"); } @@ -530,6 +575,9 @@ void OptOutConstDispatch::handle(const Double* stmt) { void OptOutConstDispatch::handle(const Int* stmt) { unhandled(stmt); } +void OptOutConstDispatch::handle(const ComplexDouble* stmt) { + unhandled(stmt); +} void OptOutConstDispatch::handle(const NamedScalar* stmt) { unhandled(stmt); } @@ -566,6 +614,9 @@ void OptOutConstDispatch::handle(const ReductionOp* stmt) { void OptOutConstDispatch::handle(const WelfordOp* stmt) { unhandled(stmt); } +void OptOutConstDispatch::handle(const MmaOp* stmt) { + unhandled(stmt); +} void OptOutConstDispatch::handle(const BroadcastOp* stmt) { unhandled(stmt); } @@ -585,6 +636,9 @@ void OptOutConstDispatch::handle(const ShiftOp* stmt) { void OptOutConstDispatch::handle(const GatherOp* stmt) { unhandled(stmt); } +void OptOutConstDispatch::handle(const ViewDtypeOp* stmt) { + unhandled(stmt); +} void OptOutConstDispatch::handle(const ViewOp* stmt) { unhandled(stmt); } @@ -592,7 +646,10 @@ void OptOutConstDispatch::handle(const ViewOp* stmt) { void OptOutConstDispatch::handle(const kir::Allocate* stmt) { unhandled(stmt); } -void OptOutConstDispatch::handle(const kir::Sync* stmt) { +void OptOutConstDispatch::handle(const kir::BlockSync* stmt) { + unhandled(stmt); +} +void OptOutConstDispatch::handle(const kir::GridSync* stmt) { unhandled(stmt); } void OptOutConstDispatch::handle(const kir::InitMagicZero* stmt) { @@ -616,6 +673,9 @@ void OptOutConstDispatch::handle(const kir::GridBroadcast* stmt) { void OptOutConstDispatch::handle(const kir::GridWelford* stmt) { unhandled(stmt); } +void OptOutConstDispatch::handle(const kir::AllocateFusedReduction* stmt) { + unhandled(stmt); +} void OptOutDispatch::unhandled(Statement*) {} @@ -629,6 +689,9 @@ void OptOutDispatch::handle(Double* stmt) { void OptOutDispatch::handle(Int* stmt) { unhandled(stmt); } +void OptOutDispatch::handle(ComplexDouble* stmt) { + unhandled(stmt); +} void OptOutDispatch::handle(NamedScalar* stmt) { unhandled(stmt); } @@ -665,6 +728,9 @@ void OptOutDispatch::handle(ReductionOp* stmt) { void OptOutDispatch::handle(WelfordOp* stmt) { unhandled(stmt); } +void OptOutDispatch::handle(MmaOp* stmt) { + unhandled(stmt); +} void OptOutDispatch::handle(BroadcastOp* stmt) { unhandled(stmt); } @@ -684,6 +750,9 @@ void 
OptOutDispatch::handle(ShiftOp* stmt) { void OptOutDispatch::handle(GatherOp* stmt) { unhandled(stmt); } +void OptOutDispatch::handle(ViewDtypeOp* stmt) { + unhandled(stmt); +} void OptOutDispatch::handle(ViewOp* stmt) { unhandled(stmt); } @@ -691,7 +760,10 @@ void OptOutDispatch::handle(ViewOp* stmt) { void OptOutDispatch::handle(kir::Allocate* stmt) { unhandled(stmt); } -void OptOutDispatch::handle(kir::Sync* stmt) { +void OptOutDispatch::handle(kir::BlockSync* stmt) { + unhandled(stmt); +} +void OptOutDispatch::handle(kir::GridSync* stmt) { unhandled(stmt); } void OptOutDispatch::handle(kir::InitMagicZero* stmt) { @@ -715,6 +787,9 @@ void OptOutDispatch::handle(kir::GridBroadcast* stmt) { void OptOutDispatch::handle(kir::GridWelford* stmt) { unhandled(stmt); } +void OptOutDispatch::handle(kir::AllocateFusedReduction* stmt) { + unhandled(stmt); +} } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/dispatch.h b/torch/csrc/jit/codegen/cuda/dispatch.h index 6961ebd6a1584e..c38641cee580a8 100644 --- a/torch/csrc/jit/codegen/cuda/dispatch.h +++ b/torch/csrc/jit/codegen/cuda/dispatch.h @@ -64,6 +64,7 @@ class TensorView; class Bool; class Double; class Int; +class ComplexDouble; class NamedScalar; // Exprs @@ -72,31 +73,31 @@ class BinaryOp; class TernaryOp; class ReductionOp; class WelfordOp; +class MmaOp; class BroadcastOp; class TransposeOp; class ShiftOp; class GatherOp; +class ViewDtypeOp; class ViewOp; // Exprs class Split; class Merge; -class TransposeOp; -class ShiftOp; -class GatherOp; -class ViewOp; namespace kir { class Predicate; class TensorIndex; class Allocate; -class Sync; +class BlockSync; +class GridSync; class ForLoop; class IfThenElse; class GridReduction; class GridBroadcast; class GridWelford; +class AllocateFusedReduction; class InitMagicZero; class UpdateMagicZero; } // namespace kir @@ -120,6 +121,7 @@ class TORCH_CUDA_CU_API OptOutConstDispatch : public PolymorphicBase { virtual void handle(const Bool* stmt); virtual void handle(const Double* stmt); virtual void handle(const Int* stmt); + virtual void handle(const ComplexDouble* stmt); virtual void handle(const NamedScalar* stmt); virtual void handle(const kir::Predicate*); @@ -131,6 +133,7 @@ class TORCH_CUDA_CU_API OptOutConstDispatch : public PolymorphicBase { virtual void handle(const TernaryOp* stmt); virtual void handle(const ReductionOp* stmt); virtual void handle(const WelfordOp* stmt); + virtual void handle(const MmaOp* stmt); virtual void handle(const BroadcastOp* stmt); virtual void handle(const Split* stmt); @@ -138,10 +141,12 @@ class TORCH_CUDA_CU_API OptOutConstDispatch : public PolymorphicBase { virtual void handle(const TransposeOp* stmt); virtual void handle(const ShiftOp* stmt); virtual void handle(const GatherOp* stmt); + virtual void handle(const ViewDtypeOp* stmt); virtual void handle(const ViewOp* stmt); virtual void handle(const kir::Allocate*); - virtual void handle(const kir::Sync*); + virtual void handle(const kir::BlockSync*); + virtual void handle(const kir::GridSync*); virtual void handle(const kir::InitMagicZero*); virtual void handle(const kir::UpdateMagicZero*); virtual void handle(const kir::ForLoop*); @@ -149,6 +154,7 @@ class TORCH_CUDA_CU_API OptOutConstDispatch : public PolymorphicBase { virtual void handle(const kir::GridReduction*); virtual void handle(const kir::GridBroadcast*); virtual void handle(const kir::GridWelford*); + virtual void handle(const kir::AllocateFusedReduction*); }; class TORCH_CUDA_CU_API OptOutDispatch : public PolymorphicBase 
{ @@ -165,6 +171,7 @@ class TORCH_CUDA_CU_API OptOutDispatch : public PolymorphicBase { virtual void handle(Bool* stmt); virtual void handle(Double* stmt); virtual void handle(Int* stmt); + virtual void handle(ComplexDouble* stmt); virtual void handle(NamedScalar* stmt); virtual void handle(IterDomain* stmt); virtual void handle(TensorDomain* stmt); @@ -179,6 +186,7 @@ class TORCH_CUDA_CU_API OptOutDispatch : public PolymorphicBase { virtual void handle(TernaryOp* stmt); virtual void handle(ReductionOp* stmt); virtual void handle(WelfordOp* stmt); + virtual void handle(MmaOp* stmt); virtual void handle(BroadcastOp* stmt); virtual void handle(Split* stmt); @@ -186,10 +194,12 @@ class TORCH_CUDA_CU_API OptOutDispatch : public PolymorphicBase { virtual void handle(TransposeOp* stmt); virtual void handle(ShiftOp* stmt); virtual void handle(GatherOp* stmt); + virtual void handle(ViewDtypeOp* stmt); virtual void handle(ViewOp* stmt); virtual void handle(kir::Allocate* stmt); - virtual void handle(kir::Sync* stmt); + virtual void handle(kir::BlockSync* stmt); + virtual void handle(kir::GridSync* stmt); virtual void handle(kir::InitMagicZero* stmt); virtual void handle(kir::UpdateMagicZero* stmt); virtual void handle(kir::ForLoop* stmt); @@ -197,6 +207,7 @@ class TORCH_CUDA_CU_API OptOutDispatch : public PolymorphicBase { virtual void handle(kir::GridReduction* stmt); virtual void handle(kir::GridBroadcast* stmt); virtual void handle(kir::GridWelford* stmt); + virtual void handle(kir::AllocateFusedReduction* stmt); }; class TORCH_CUDA_CU_API OptInConstDispatch : public OptOutConstDispatch { @@ -254,6 +265,7 @@ class TORCH_CUDA_CU_API OptOutMutator : public PolymorphicBase { virtual void mutate(Bool*); virtual void mutate(Double*); virtual void mutate(Int*); + virtual void mutate(ComplexDouble*); virtual void mutate(NamedScalar*); virtual void mutate(IterDomain*); virtual void mutate(TensorDomain*); @@ -268,6 +280,7 @@ class TORCH_CUDA_CU_API OptOutMutator : public PolymorphicBase { virtual void mutate(TernaryOp*); virtual void mutate(ReductionOp*); virtual void mutate(WelfordOp*); + virtual void mutate(MmaOp*); virtual void mutate(BroadcastOp*); virtual void mutate(Split*); @@ -275,10 +288,12 @@ class TORCH_CUDA_CU_API OptOutMutator : public PolymorphicBase { virtual void mutate(TransposeOp*); virtual void mutate(ShiftOp*); virtual void mutate(GatherOp*); + virtual void mutate(ViewDtypeOp*); virtual void mutate(ViewOp*); virtual void mutate(kir::Allocate*); - virtual void mutate(kir::Sync*); + virtual void mutate(kir::BlockSync*); + virtual void mutate(kir::GridSync*); virtual void mutate(kir::InitMagicZero*); virtual void mutate(kir::UpdateMagicZero*); virtual void mutate(kir::ForLoop*); @@ -286,6 +301,7 @@ class TORCH_CUDA_CU_API OptOutMutator : public PolymorphicBase { virtual void mutate(kir::GridReduction*); virtual void mutate(kir::GridBroadcast*); virtual void mutate(kir::GridWelford*); + virtual void mutate(kir::AllocateFusedReduction*); protected: void removeExpr(IrContainer*, Expr*); diff --git a/torch/csrc/jit/codegen/cuda/evaluator_common.cpp b/torch/csrc/jit/codegen/cuda/evaluator_common.cpp index 0948131956982b..83107569dc54b5 100644 --- a/torch/csrc/jit/codegen/cuda/evaluator_common.cpp +++ b/torch/csrc/jit/codegen/cuda/evaluator_common.cpp @@ -388,7 +388,7 @@ void KernelPrecomputedIntegers::bindTensorMetaData( const at::Tensor& at_tensor) { std::vector> ret; const auto root_domain = - TensorDomain::noReductions(tv->domain()->getRootDomain()); + 
TensorDomain::noReductions(tv->domain()->getMaybeRFactorDomain()); TORCH_INTERNAL_ASSERT( at_tensor.ndimension() == static_cast(root_domain.size()), "Something went wrong configuring launch. Inputs do not match."); diff --git a/torch/csrc/jit/codegen/cuda/executor.cpp b/torch/csrc/jit/codegen/cuda/executor.cpp index 5e6f2d9375e019..a32dbbf73b2485 100644 --- a/torch/csrc/jit/codegen/cuda/executor.cpp +++ b/torch/csrc/jit/codegen/cuda/executor.cpp @@ -13,6 +13,7 @@ #include #include +#include #include #include #include @@ -56,6 +57,18 @@ typedef unsigned long long int uint64_t; )"; } +static const std::string& defineComplexTypes() { + static std::string result = std::string(R"ESCAPE( +#define POS_INFINITY __int_as_float(0x7f800000) +#define INFINITY POS_INFINITY +#define NEG_INFINITY __int_as_float(0xff800000) +#define NAN __int_as_float(0x7fffffff) +)ESCAPE") + + at::cuda::get_traits_string() + at::cuda::get_complex_body_string() + + at::cuda::get_cmath_string() + at::cuda::get_complex_math_string(); + return result; +} + } // namespace std::string FusionExecutor::getStructuredCode(const std::string& kernel) { @@ -70,7 +83,7 @@ std::string FusionExecutor::getStructuredCode(const std::string& kernel) { #endif code += std::string("namespace ") + FusionExecutor::kernelNamespace() + " {\n" + defineIntegerTypes() + defineIndexMode(options_.index_mode) + - executor_utils::kernelPreamble() + kernel + "}\n"; + defineComplexTypes() + executor_utils::kernelPreamble() + kernel + "}\n"; if (isDebugDumpEnabled(DebugDumpOption::CudaKernel)) { std::cout << "\n======= Codegen output for kernel: " << kernelName() @@ -169,12 +182,15 @@ void FusionExecutor::compileFusion( c10::DeviceGuard dg(options_.device); TORCH_INTERNAL_ASSERT( - options.device.is_cuda(), "Provided device to CUDA fuser is the CPU."); - auto properties = at::cuda::getDeviceProperties(options.device.index()); + options_.device.is_cuda(), "Provided device to CUDA fuser is the CPU."); + auto properties = at::cuda::getDeviceProperties(options_.device.index()); max_device_smem = properties->sharedMemPerBlock; warp_size_ = properties->warpSize; - lowered_ = std::make_unique(fusion); + lowered_ = std::make_unique( + fusion, + options_.index_mode == KernelIndexMode::INT64 ? 
DataType::Int + : DataType::Int32); const auto kernel = lowered_->kernel(); fusion_ = lowered_->kernel()->as(); @@ -464,8 +480,12 @@ LaunchParams FusionExecutor::computeLaunchParams( } maximum_value = std::max(maximum_value, *val); } - expr_eval.bind(p_type, maximum_value); - launch_params.bind(maximum_value, p_type); + // Protect for size-0 tensors, they still have a value so would prefer to + // bind nothing than 0 + if (maximum_value > 0) { + expr_eval.bind(p_type, maximum_value); + launch_params.bind(maximum_value, p_type); + } } // Re-run the integer machine with all @@ -552,23 +572,41 @@ FusionExecutor::GlobalBuffers FusionExecutor::allocGlobalVals( } std::vector FusionExecutor::allocOutputs( + const at::ArrayRef& inputs, kir::ExpressionEvaluator& expr_eval, const std::unordered_set& alias_indices) { FUSER_PERF_SCOPE("FusionExecutor::AllocOutputs"); const auto kernel = lowered_->kernel(); // NOLINTNEXTLINE(cppcoreguidelines-init-variables) std::vector outputs; - for (const auto i : c10::irange(kernel->outputs().size())) { - TORCH_INTERNAL_ASSERT( - kernel->outputs()[i]->isA(), - "Cannot allocate outputs that are not tensors."); - auto output = kernel->outputs()[i]->as(); - if (alias_indices.count(i) == 0) { - outputs.push_back( - inferAndAllocOutput(output, expr_eval, options_, false)); + for (const auto out_i : c10::irange(kernel->outputs().size())) { + // Dummy output. + if (kernel->outputs()[out_i]->isFusionInput()) { + for (auto inp_i : c10::irange(kernel->inputs().size())) { + if (kernel->inputs()[inp_i] == kernel->outputs()[out_i]) { + TORCH_INTERNAL_ASSERT( + inp_i < inputs.size(), + "Issue with an input showing up as output, couldn't find input."); + TORCH_INTERNAL_ASSERT( + inputs[inp_i].isTensor(), + "Cannot register a scalar as an output in a fusion."); + outputs.push_back(inputs[inp_i].toTensor()); + break; + } + } } else { - // aliasing to inputs, no need to allocate real output - outputs.push_back(inferAndAlloc(output, {}, expr_eval, options_, false)); + TORCH_INTERNAL_ASSERT( + kernel->outputs()[out_i]->isA(), + "Cannot allocate outputs that are not tensors."); + auto output = kernel->outputs()[out_i]->as(); + if (alias_indices.count(out_i) == 0) { + outputs.push_back( + inferAndAllocOutput(output, expr_eval, options_, false)); + } else { + // aliasing to inputs, no need to allocate real output + outputs.push_back( + inferAndAlloc(output, {}, expr_eval, options_, false)); + } } } return outputs; @@ -753,7 +791,7 @@ std::vector FusionExecutor::runFusion( auto& output_alias_indices = output_alias_indices_entry.get(); - allocated_outputs = allocOutputs(expr_eval, output_alias_indices); + allocated_outputs = allocOutputs(inputs, expr_eval, output_alias_indices); for (const auto& entry : alias_indices) { TORCH_INTERNAL_ASSERT( @@ -826,14 +864,17 @@ std::vector FusionExecutor::runFusion( << "Inputs:" << std::endl; for (const auto& input : inputs) { if (input.isTensor()) { - std::cout << input.toTensor().scalar_type() << " " - << input.toTensor().sizes() << std::endl; + const auto& input_tensor = input.toTensor(); + std::cout << " " << input_tensor.scalar_type() << " " + << input.toTensor().sizes() + << " (strides = " << input.toTensor().strides() << ")" + << std::endl; } } std::cout << "Outputs:" << std::endl; for (const auto& output : allocated_outputs) { std::cout << " " << output.scalar_type() << " " << output.sizes() - << std::endl; + << " (strides = " << output.strides() << ")" << std::endl; } std::cout << "Reduction and semaphore buffers:" << std::endl; for (const 
auto& buffer : global_buffers.buffers) { diff --git a/torch/csrc/jit/codegen/cuda/executor.h b/torch/csrc/jit/codegen/cuda/executor.h index 40accbfb5208d0..a62507e87bfd89 100644 --- a/torch/csrc/jit/codegen/cuda/executor.h +++ b/torch/csrc/jit/codegen/cuda/executor.h @@ -165,6 +165,7 @@ class TORCH_CUDA_CU_API FusionExecutor : public NonCopyable { // skip allocating real storage for those, but still maintain its spot to // maintain the indexing from output aliases to inputs std::vector allocOutputs( + const at::ArrayRef& inputs, kir::ExpressionEvaluator& expr_eval, const std::unordered_set& alias_indices = {}); diff --git a/torch/csrc/jit/codegen/cuda/executor_kernel_arg.cpp b/torch/csrc/jit/codegen/cuda/executor_kernel_arg.cpp index 883fae207c51d2..da5667f9faccdd 100644 --- a/torch/csrc/jit/codegen/cuda/executor_kernel_arg.cpp +++ b/torch/csrc/jit/codegen/cuda/executor_kernel_arg.cpp @@ -88,6 +88,10 @@ std::unique_ptr getTensorArg( return getTensorArg(nDims); case c10::ScalarType::Int: return getTensorArg(nDims); + case c10::ScalarType::ComplexFloat: + return getTensorArg, INDEX_MODE>(nDims); + case c10::ScalarType::ComplexDouble: + return getTensorArg, INDEX_MODE>(nDims); default: TORCH_CHECK( false, @@ -193,6 +197,10 @@ void KernelArgumentHolder::push(const IValue& val) { auto scalar_val = val.toScalar(); switch (scalar_val.type()) { // NOLINTNEXTLINE(bugprone-branch-clone) + case c10::ScalarType::ComplexDouble: + arguments_.push_back( + std::make_unique(scalar_val.toComplexDouble())); + return; case c10::ScalarType::Double: arguments_.push_back(std::make_unique(scalar_val.toDouble())); return; diff --git a/torch/csrc/jit/codegen/cuda/executor_kernel_arg.h b/torch/csrc/jit/codegen/cuda/executor_kernel_arg.h index d457a69adb2505..c135328a3acc1e 100644 --- a/torch/csrc/jit/codegen/cuda/executor_kernel_arg.h +++ b/torch/csrc/jit/codegen/cuda/executor_kernel_arg.h @@ -4,6 +4,7 @@ #include #include #include +#include namespace torch { namespace jit { @@ -18,10 +19,8 @@ struct TensorArgCodegen { }; T* data; - // NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays) - nvfuser_index_t size[N]; - // NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays) - nvfuser_index_t stride[N]; + std::array size; + std::array stride; constexpr int nDims() { return N; } @@ -71,8 +70,7 @@ struct ArgAbstract { struct PhiloxCudaStateArg : public ArgAbstract { at::PhiloxCudaState val_; PhiloxCudaStateArg(at::PhiloxCudaState _val) : val_(_val){}; - // NOLINTNEXTLINE(modernize-use-override,cppcoreguidelines-explicit-virtual-functions) - void* arg() { + void* arg() override { return &val_; } }; @@ -80,8 +78,7 @@ struct PhiloxCudaStateArg : public ArgAbstract { struct LongArg : public ArgAbstract { int64_t val_; explicit LongArg(int64_t _val) : val_(_val) {} - // NOLINTNEXTLINE(modernize-use-override,cppcoreguidelines-explicit-virtual-functions) - void* arg() { + void* arg() override { return &val_; } }; @@ -89,8 +86,15 @@ struct LongArg : public ArgAbstract { struct DoubleArg : public ArgAbstract { double val_; explicit DoubleArg(double _val) : val_(_val) {} - // NOLINTNEXTLINE(modernize-use-override,cppcoreguidelines-explicit-virtual-functions) - void* arg() { + void* arg() override { + return &val_; + } +}; + +struct ComplexDoubleArg : public ArgAbstract { + c10::complex val_; + explicit ComplexDoubleArg(c10::complex _val) : val_(_val) {} + void* arg() override { return &val_; } }; @@ -98,8 +102,7 @@ struct DoubleArg : public ArgAbstract { struct BoolArg : public ArgAbstract 
{ bool val_; explicit BoolArg(bool _val) : val_(_val) {} - // NOLINTNEXTLINE(modernize-use-override,cppcoreguidelines-explicit-virtual-functions) - void* arg() { + void* arg() override { return &val_; } }; diff --git a/torch/csrc/jit/codegen/cuda/executor_utils.cpp b/torch/csrc/jit/codegen/cuda/executor_utils.cpp index 5323036e5df982..d81ce7b2c55c76 100644 --- a/torch/csrc/jit/codegen/cuda/executor_utils.cpp +++ b/torch/csrc/jit/codegen/cuda/executor_utils.cpp @@ -5,20 +5,24 @@ #include #include +#include #include #include #include #include +#include #include #include #include +#include #include #include #include #include #include #include +#include #include #include #include @@ -26,6 +30,9 @@ #include #include #include +#include +#include +#include #include #include @@ -68,9 +75,12 @@ std::string kernelPreamble() { // Base classes and helpers ss << nvfuser_resources::tensor_cu; + ss << nvfuser_resources::type_traits_cu; + ss << nvfuser_resources::array_cu; ss << nvfuser_resources::random_numbers_cu; ss << nvfuser_resources::helpers_cu; ss << nvfuser_resources::index_utils_cu; + ss << nvfuser_resources::tuple_cu; // Synchronization classes if (std::getenv("PYTORCH_NVFUSER_USE_BLOCK_SYNC_ATOMIC")) { @@ -87,6 +97,8 @@ std::string kernelPreamble() { ss << nvfuser_resources::broadcast_cu; ss << nvfuser_resources::welford_cu; ss << nvfuser_resources::warp_cu; + ss << nvfuser_resources::tensorcore_cu; + ss << nvfuser_resources::fused_reduction_cu; // Random utilities ss << nvfuser_resources::PhiloxCudaStateRaw_cu; @@ -123,9 +135,9 @@ bool validateKernelArgTensor( size_t arg_dim = arg.dim(); // Note: This requires current Fusion to be active. // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - size_t param_dim = - TensorDomain::noReductions(param->as()->getRootDomain()) - .size(); + size_t param_dim = TensorDomain::noReductions( + param->as()->getMaybeRFactorDomain()) + .size(); // see [Note - broadcast support in integration] // Because of broadcasting support handled in integration, we relax the rank // check as necessary. @@ -166,6 +178,12 @@ bool validateKernelArgTensor( case at::ScalarType::Bool: match = param_data_type == DataType::Bool; break; + case at::ScalarType::ComplexFloat: + match = param_data_type == DataType::ComplexFloat; + break; + case at::ScalarType::ComplexDouble: + match = param_data_type == DataType::ComplexDouble; + break; default: msg << "Argument element type, " << arg_data_type << ", is not supported." << "\n"; @@ -193,6 +211,10 @@ bool validateKernelArgScalar( case c10::ScalarType::Long: match = param_type == DataType::Int || param_type == DataType::Int32; break; + case c10::ScalarType::ComplexDouble: + match = param_type == DataType::ComplexDouble || + param_type == DataType::ComplexFloat; + break; case c10::ScalarType::Double: match = param_type == DataType::Double || param_type == DataType::Float || param_type == DataType::Half || param_type == DataType::BFloat16; @@ -254,6 +276,10 @@ bool checkSameStride(const std::vector& tensors) { // Return true if all the tensors are contiguous and have the same striding bool checkSameContiguity(const std::vector& tensors) { + if (tensors.size() < 2) { + return true; + } + auto reference = tensors.front(); if (!reference.isTensor()) { return false; @@ -286,6 +312,7 @@ bool checkValidMisalignedTensors( // Only check input tensors return checkSameStride(inp_tensors); } else if (!out_tv.empty() && out_tensors.empty()) { + // out_tensors is empty unless outputs are given to runFusion. 
// Assume out tensors are contiguous return checkSameContiguity(inp_tensors); } else { @@ -350,243 +377,231 @@ void validateKernelOutputs( namespace { -bool canVectorize(const IValue& aten_val, int word_size) { - if (!aten_val.isTensor()) { - return false; - } - - const auto& aten_tensor = aten_val.toTensor(); - - if (reinterpret_cast(aten_tensor.data_ptr()) % - (word_size * aten_tensor.dtype().itemsize()) != - 0) { - return false; - } - - for (size_t i = aten_tensor.ndimension(); i > 0; i--) { - if (aten_tensor.size(i - 1) != 1) { - if (aten_tensor.size(aten_tensor.ndimension() - 1) % word_size != 0 || - aten_tensor.stride(aten_tensor.ndimension() - 1) != 1) { - return false; - } - break; - } - } - - for (auto stride : aten_tensor.strides()) { - if (stride != 1 && stride % word_size != 0) { - return false; - } - } - - return true; -} - -// Returns true if a TV can be used with ParallelType::Vectorize. When -// input or output tensors are involved, the other version of -// canVectorize is used. -bool canVectorize( - TensorView* tv, - int word_size, - kir::ExpressionEvaluator& expr_eval) { - IterDomain* last_root_dim = nullptr; - for (size_t i = tv->getRootDomain().size(); i > 0; i--) { - auto r_id = tv->getRootDomain()[i - 1]; - if (r_id->isReduction() || r_id->isTrivialReduction() || - r_id->isBroadcast()) { - continue; - } - last_root_dim = r_id; - break; - } - - if (last_root_dim == nullptr) { - return false; - } - - auto last_dim_size = expr_eval.evaluate(last_root_dim->extent()); +// Finds a fusion input or output tensor to validate its stides +// for vectorization. +// Returns a pair consisting of a flag indicating it's a fusion input +// and an integer position within in the input or output tensor list. +std::vector> getVectorizedFusionInputOutput( + TensorView* producer_tv, + TensorView* consumer_tv, + Fusion* fusion) { + std::vector> vectorized_input_output; - if (!last_dim_size.has_value()) { - return false; - } + // When the producer is a fusion input, validate only the producer + // and assume the consumer is contiguous. Similarly, when the + // consumer is a fusion output, validate the consumer and assume the + // producer is contiguous. - if (last_dim_size.value() % word_size != 0) { - return false; + if (producer_tv->isFusionInput()) { + auto producer_it = std::find( + fusion->inputs().begin(), fusion->inputs().end(), producer_tv); + TORCH_INTERNAL_ASSERT( + producer_it != fusion->inputs().end(), + "Could not find ", + producer_tv, + " in fusion inputs."); + auto pos = std::distance(fusion->inputs().begin(), producer_it); + vectorized_input_output.push_back( + std::make_pair(true, static_cast(pos))); + } else { + // If not fusion input, assume it's fully contiguous, so nothing + // to check with respect to strides. + TORCH_INTERNAL_ASSERT( + std::all_of( + producer_tv->domain()->contiguity().begin(), + producer_tv->domain()->contiguity().end(), + [](bool contig) { return contig; }), + "Unsupported pattern of vectorization: ", + consumer_tv->definition()->toString()); } - return true; -} - -// Check if there's any split that is non-divisible and vectorized. If -// found, Vectorize is illegal. 
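getVectorizedFusionInputOutput above boils down to a position lookup: find the producer in the fusion's input list and the consumer in its output list, and record (is_input, position) pairs for later stride validation. A standalone sketch of that lookup, with a placeholder Tensor type instead of TensorView:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>

struct Tensor {}; // placeholder for TensorView

std::vector<std::pair<bool, int>> findInputOutputPositions(
    const Tensor* producer,
    const Tensor* consumer,
    const std::vector<const Tensor*>& fusion_inputs,
    const std::vector<const Tensor*>& fusion_outputs) {
  std::vector<std::pair<bool, int>> positions;
  auto in_it = std::find(fusion_inputs.begin(), fusion_inputs.end(), producer);
  if (in_it != fusion_inputs.end()) {
    // (true, pos): the producer is fusion input number `pos`.
    positions.emplace_back(
        true, static_cast<int>(std::distance(fusion_inputs.begin(), in_it)));
  }
  auto out_it =
      std::find(fusion_outputs.begin(), fusion_outputs.end(), consumer);
  if (out_it != fusion_outputs.end()) {
    // (false, pos): the consumer is fusion output number `pos`.
    positions.emplace_back(
        false, static_cast<int>(std::distance(fusion_outputs.begin(), out_it)));
  }
  return positions;
}

int main() {
  Tensor a, b, c;
  std::vector<const Tensor*> inputs{&a, &b};
  std::vector<const Tensor*> outputs{&c};
  for (const auto& p : findInputOutputPositions(&b, &c, inputs, outputs)) {
    std::cout << (p.first ? "input " : "output ") << p.second << "\n";
  }
}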
-void validateVectorizedSplits( - kir::Kernel* kernel, - kir::ExpressionEvaluator& expr_eval) { - for (const auto& extent_factor : kernel->summary().splits_to_validate) { - auto input_extent = expr_eval.evaluate(extent_factor.first); - auto split_factor = expr_eval.evaluate(extent_factor.second); + if (consumer_tv->isFusionOutput()) { + auto consumer_it = std::find( + fusion->outputs().begin(), fusion->outputs().end(), consumer_tv); TORCH_INTERNAL_ASSERT( - input_extent.has_value(), - "Could not check if a split with vectorization is divisible because the extent, ", - extent_factor.first->toString(), - ", is not possible to evaluate."); - TORCH_INTERNAL_ASSERT( - input_extent.has_value(), - "Could not check if a split with vectorization is divisible because the split factor, ", - extent_factor.second->toString(), - ", is not possible to evaluate."); + consumer_it != fusion->outputs().end(), + "Could not find ", + consumer_tv, + " in fusion outputs."); + auto pos = std::distance(fusion->outputs().begin(), consumer_it); + vectorized_input_output.push_back( + std::make_pair(false, static_cast(pos))); + } else { + // If not fusion input, assume it's fully contiguous, so nothing + // to check with respect to strides. TORCH_INTERNAL_ASSERT( - input_extent.value() % split_factor.value() == 0, - "Non-divisible split with vectorization is detected. ", - "Extent: ", - input_extent.value(), - ". Factor: ", - split_factor.value()); + std::all_of( + consumer_tv->domain()->contiguity().begin(), + consumer_tv->domain()->contiguity().end(), + [](bool contig) { return contig; }), + "Unsupported pattern of vectorization: ", + consumer_tv->definition()->toString()); } + + return vectorized_input_output; } -//! Returns the position information of vectorized input/output tensors -//! in the given fusion. +//! Returns the information of vectorized input/output tensors +//! in the given fusion. 
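The divisibility requirement enforced by validateVectorizedSplits (removed here and re-added further below) is simple to state: every recorded (extent, factor) pair must be evaluable at launch time, and the extent must be a multiple of the split factor. A compact sketch, with std::optional standing in for the expression evaluator's possibly-unknown results:

#include <cstdint>
#include <iostream>
#include <optional>
#include <utility>
#include <vector>

bool splitsAreDivisible(
    const std::vector<
        std::pair<std::optional<int64_t>, std::optional<int64_t>>>& splits) {
  for (const auto& [extent, factor] : splits) {
    // Both the extent and the split factor must be concrete values.
    if (!extent.has_value() || !factor.has_value()) {
      return false;
    }
    // A vectorized split must divide its extent evenly.
    if (*extent % *factor != 0) {
      return false;
    }
  }
  return true;
}

int main() {
  std::cout << splitsAreDivisible({{128, 4}, {64, 8}}) << "\n"; // 1
  std::cout << splitsAreDivisible({{130, 4}}) << "\n";          // 0
}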
std::unique_ptr getVectorizedTensorValidationInfo( - Fusion* fusion) { + kir::Kernel* kernel) { auto vectorized_tensor_info_ptr = std::make_unique(); - auto& tv_to_vector_word_size = - vectorized_tensor_info_ptr->tv_to_vector_word_size; - auto& global_inp_misaligned_tv = - vectorized_tensor_info_ptr->global_inp_misaligned_tv; - auto& global_out_misaligned_tv = - vectorized_tensor_info_ptr->global_out_misaligned_tv; - kir::ExpressionEvaluator expr_eval; + for (const auto& vector_info : kernel->summary().vectorized_set_info) { + auto consumer_tv = vector_info.consumer_tv; + auto producer_tv = vector_info.producer_tv; - // Find all vectorized tensors and their word size - for (auto expr : fusion->exprs()) { - if (!expr->isA() || - expr->as()->getUnaryOpType() != UnaryOpType::Set) { - continue; - } - auto uop = expr->as(); - if (!uop->out()->isA() || !uop->in()->isA()) { - continue; - } - auto out_tv = uop->out()->as(); - auto in_tv = uop->in()->as(); - IterDomain* vector_dim = nullptr; - for (auto id : out_tv->domain()->domain()) { - if (id->getParallelType() == ParallelType::Vectorize || - id->getParallelType() == ParallelType::MisalignedVectorize) { - TORCH_INTERNAL_ASSERT( - vector_dim == nullptr, - "Found multiple vectorized dimensions on tensor ", - out_tv); - vector_dim = id; - } - } - if (vector_dim == nullptr) { - continue; - } - auto vector_word_size = expr_eval.evaluate(vector_dim->extent()); - TORCH_INTERNAL_ASSERT( - vector_word_size.has_value(), - "Non constant vector dimension found in ", - out_tv); - - // The expression here must be a UnaryOp::Set, so checking either of the - // input or output tensor should be sufficient. When the output is a - // fusion output, check the tensor as its size information is available - // without using the expression evaluator. - auto tv_to_verify = out_tv->isFusionOutput() ? out_tv : in_tv; - tv_to_vector_word_size[tv_to_verify] = vector_word_size.value(); - - if (vector_dim->getParallelType() == ParallelType::MisalignedVectorize) { + auto vector_dim = vector_info.vectorized_leaf_id; + const auto is_aligned = + vector_dim->getParallelType() == ParallelType::Vectorize; + + // Find fusion inputs and outputs that are used with misaligned + // vectorization. 
+ if (!is_aligned) { TORCH_INTERNAL_ASSERT( - in_tv->isFusionInput() || out_tv->isFusionOutput(), + producer_tv->isFusionInput() || consumer_tv->isFusionOutput(), "MisalignedVectorize is assumed to be used with either input or output tensor"); - if (out_tv->getMemoryType() == MemoryType::Global && - in_tv->getMemoryType() == MemoryType::Local) { - global_out_misaligned_tv.insert(out_tv); + if (consumer_tv->getMemoryType() == MemoryType::Global && + producer_tv->getMemoryType() == MemoryType::Local) { + vectorized_tensor_info_ptr->global_out_misaligned_tv.insert( + consumer_tv); } else if ( - in_tv->getMemoryType() == MemoryType::Global && - out_tv->getMemoryType() == MemoryType::Local) { - global_inp_misaligned_tv.insert(in_tv); + producer_tv->getMemoryType() == MemoryType::Global && + consumer_tv->getMemoryType() == MemoryType::Local) { + vectorized_tensor_info_ptr->global_inp_misaligned_tv.insert( + producer_tv); } else { TORCH_INTERNAL_ASSERT( false, "Unsupported memory configuration for misaligned vectorization."); } } - } - // Check striding information on input and outputs as well as size information - // of all - auto& inp_misaligned_tensors_pos = - vectorized_tensor_info_ptr->inp_misaligned_tensors_pos; - auto& out_misaligned_tensors_pos = - vectorized_tensor_info_ptr->out_misaligned_tensors_pos; - auto& inp_pos_to_word_size_map_to_verify = - vectorized_tensor_info_ptr->inp_pos_to_word_size_map_to_verify; - auto& out_pos_to_word_size_map_to_verify = - vectorized_tensor_info_ptr->out_pos_to_word_size_map_to_verify; - auto& intermediate_tv_to_word_size_map_to_verify = - vectorized_tensor_info_ptr->intermediate_tv_to_word_size_map_to_verify; - - for (auto entry : tv_to_vector_word_size) { - auto tv = entry.first; - auto word_size = entry.second; - if (tv->isFusionInput()) { - auto inp_it = - std::find(fusion->inputs().begin(), fusion->inputs().end(), tv); - TORCH_INTERNAL_ASSERT( - inp_it != fusion->inputs().end(), - "Could not find ", - tv, - " in fusion inputs."); - auto inp_pos = std::distance(fusion->inputs().begin(), inp_it); - - if (global_inp_misaligned_tv.find(tv) != global_inp_misaligned_tv.end()) { - inp_misaligned_tensors_pos.emplace_back(inp_pos); - } else { - // Shouldn't visit same pos twice here, assert ? - inp_pos_to_word_size_map_to_verify[inp_pos] = word_size; - } - } else if (tv->isFusionOutput()) { - auto out_it = - std::find(fusion->outputs().begin(), fusion->outputs().end(), tv); - TORCH_INTERNAL_ASSERT( - out_it != fusion->outputs().end(), - "Could not find ", - tv, - " in provided fusion outputs."); - auto out_pos = std::distance(fusion->outputs().begin(), out_it); - - if (global_out_misaligned_tv.find(tv) != global_out_misaligned_tv.end()) { - out_misaligned_tensors_pos.emplace_back(out_pos); + // Collect information on corresponding fusion input and output + // tensors to verify strides. + auto inp_or_out_info = + getVectorizedFusionInputOutput(producer_tv, consumer_tv, kernel); + + // If both producer and consumer are contig and intermediate, + // nothing to validate with respect to strides. + if (inp_or_out_info.empty()) { + continue; + } + + // Misaligned vectorize only allows from input to local or local + // to output + if (!is_aligned) { + TORCH_INTERNAL_ASSERT(inp_or_out_info.size() == 1); + } + + for (const auto& inp_or_out : inp_or_out_info) { + const bool is_input = inp_or_out.first; + const int pos = inp_or_out.second; + + if (is_aligned) { + auto& pos_list = is_input + ? 
vectorized_tensor_info_ptr->aligned_vectorized_inp_tensor_pos + : vectorized_tensor_info_ptr->aligned_vectorized_out_tensor_pos; + pos_list.push_back(pos); } else { - out_pos_to_word_size_map_to_verify[out_pos] = word_size; + auto& map = is_input + ? vectorized_tensor_info_ptr->inp_misaligned_tensors_pos + : vectorized_tensor_info_ptr->out_misaligned_tensors_pos; + map.emplace_back(pos); } - } else { - // Intermediate tensors. Note that this must be Vectorize as - // MisalignedVectorize is only supported for inputs and outputs. - intermediate_tv_to_word_size_map_to_verify[tv] = word_size; } } return vectorized_tensor_info_ptr; } -} // namespace -// Misaligned vectorization check. Currently misaligned vectorization is limited -// to global-register and register-global load/store patterns. However, this -// could be improved to include shared memory. -void validateVectorizedTensors( +// Make sure the root domain(s) comprising the vectorized leaf domain +// have the (merged) extent that is divisible by the vectorization +// word size. +void validateAlignedVectorizeExtents( + const VectorizedSetInfo& info, + kir::ExpressionEvaluator& expr_eval) { + int64_t vectorized_merged_domain_extent = 1; + for (auto id : info.contig_root_ids) { + auto extent_val = expr_eval.evaluate(id->extent()); + TORCH_INTERNAL_ASSERT( + extent_val.has_value(), + "Error vectorizing, ", + info.consumer_tv->toString(), + " as the extent of a vectorized root domain, ", + id->toString(), + ", is unknown."); + vectorized_merged_domain_extent *= extent_val.value(); + } + + TORCH_INTERNAL_ASSERT( + vectorized_merged_domain_extent % info.word_size == 0, + "Error vectorizing, ", + info.consumer_tv->toString(), + " as the extent of the indexed domain, ", + vectorized_merged_domain_extent, + ", is not divisible by vector word size ", + info.word_size); +} + +void validateAlignedVectorizedFusionInputOutput( + const IValue& aten_val, + int word_size, + TensorView* tv) { + TORCH_INTERNAL_ASSERT(aten_val.isTensor()); + + const auto& aten_tensor = aten_val.toTensor(); + + TORCH_INTERNAL_ASSERT( + reinterpret_cast(aten_tensor.data_ptr()) % + (word_size * aten_tensor.dtype().itemsize()) == + 0, + "Vectorization of ", + tv->toString(), + " not possible as the memory address is not aligned. ", + "Address: ", + aten_tensor.data_ptr(), + ", vector word size: ", + word_size, + ", data type: ", + aten_tensor.dtype()); + + // Traverse strides from the right-most domains. The rightmost + // domain must have stride 1. + int64_t cur_contig_stride = 1; + bool still_rightmost = true; + for (auto i = aten_tensor.ndimension() - 1; i >= 0; --i) { + const auto stride = aten_tensor.strides().at(i); + // If this domain is contiguous, then not necessary to check the + // stride. Otherwise, stride must be 1 if it's rightmost or + // divisible by word_size. + TORCH_INTERNAL_ASSERT( + stride == cur_contig_stride || (still_rightmost && stride == 1) || + (!still_rightmost && stride % word_size == 0), + "Vectorization of ", + tv->toString(), + " with word size ", + word_size, + " not possible due to invalid stride.", + " Domain: ", + tv->axis(i)->toString(), + ", stride: ", + stride) + // If the domain is size-1, the next domain is still considered + // rightmost. 
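A self-contained sketch of the checks that validateAlignedVectorizedFusionInputOutput above performs on a fusion input or output: the base address must be aligned to word_size * itemsize, and walking strides from the right-most dimension, the innermost non-size-1 dimension must have stride 1 while later non-contiguous dimensions must have strides divisible by the word size. Plain size/stride vectors replace the at::Tensor metadata here:

#include <cstdint>
#include <iostream>
#include <vector>

bool canVectorizeAligned(
    std::uintptr_t data_address,
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides,
    int64_t word_size,
    int64_t itemsize) {
  // The pointer itself must be aligned for vectorized loads/stores.
  if (data_address % (word_size * itemsize) != 0) {
    return false;
  }
  int64_t cur_contig_stride = 1;
  bool still_rightmost = true;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    const auto stride = strides[i];
    const bool ok = stride == cur_contig_stride ||
        (still_rightmost && stride == 1) ||
        (!still_rightmost && stride % word_size == 0);
    if (!ok) {
      return false;
    }
    // A size-1 dimension does not change which dimension counts as rightmost.
    still_rightmost = still_rightmost && sizes[i] == 1;
    cur_contig_stride = stride * sizes[i];
  }
  return true;
}

int main() {
  // A contiguous [4, 8] float tensor at a 16-byte-aligned address, word size 4.
  std::cout << canVectorizeAligned(0x1000, {4, 8}, {8, 1}, 4, 4) << "\n"; // 1
  // The same shape with a transposed (non-unit innermost) stride fails.
  std::cout << canVectorizeAligned(0x1000, {4, 8}, {1, 4}, 4, 4) << "\n"; // 0
}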
+ const auto size = aten_tensor.sizes().at(i); + still_rightmost = still_rightmost && size == 1; + cur_contig_stride = stride * size; + } +} + +void validateAlignedVectorizedTensors( kir::Kernel* kernel, const at::ArrayRef& inputs, const std::vector& outputs, caching::ExecutorCompileTimeInfoCache* data_cache, kir::ExpressionEvaluator& expr_eval) { - FUSER_PERF_SCOPE("FusionExecutor::validateVectorizedTensors"); - auto tensor_vectorization_validation_entry = executor_utils::caching::ExecutorCompileTimeEntry< executor_utils::caching::VectorizedTensorValidation>( @@ -594,40 +609,51 @@ void validateVectorizedTensors( return executor_utils::getVectorizedTensorValidationInfo(kernel); }); - // Validate all the canVectorizes: - for (auto it : tensor_vectorization_validation_entry.get() - .inp_pos_to_word_size_map_to_verify) { - TORCH_INTERNAL_ASSERT( - canVectorize(inputs[it.first], it.second), - "Error vectorizing, ", - kernel->inputs()[it.first], - " as input provided does not allowed vectorization by word size, ", - it.second); - } + // Verify extents of aligned vectorized tensors + for (const auto& vec_info : kernel->summary().vectorized_set_info) { + auto in_tv = vec_info.producer_tv; + auto out_tv = vec_info.consumer_tv; - if (outputs.size() > 0) { - for (auto it : tensor_vectorization_validation_entry.get() - .out_pos_to_word_size_map_to_verify) { - TORCH_INTERNAL_ASSERT( - canVectorize(outputs[it.first], it.second), - "Error vectorizing, ", - kernel->outputs()[it.first], - " as output provided does not allowed vectorization by word size, ", - it.second); + if (vec_info.vectorized_leaf_id->getParallelType() == + ParallelType::Vectorize) { + validateAlignedVectorizeExtents(vec_info, expr_eval); } } - for (auto it : tensor_vectorization_validation_entry.get() - .intermediate_tv_to_word_size_map_to_verify) { - auto tv = it.first; - auto vec_width = it.second; - TORCH_INTERNAL_ASSERT( - canVectorize(tv, vec_width, expr_eval), - "Error vectorizing, ", - tv->toString(), - " as the extent of the vectorized axis does not allowed vectorization by word size, ", - vec_width); + // Validate input and output tensors with aligend + // vectorization. + for (auto pos : tensor_vectorization_validation_entry.get() + .aligned_vectorized_inp_tensor_pos) { + auto tv = kernel->inputs().at(pos)->as(); + auto word_size = kernel->summary().vectorized_accesses.at(tv); + validateAlignedVectorizedFusionInputOutput(inputs[pos], word_size, tv); + } + + if (!outputs.empty()) { + for (auto pos : tensor_vectorization_validation_entry.get() + .aligned_vectorized_out_tensor_pos) { + auto tv = kernel->outputs().at(pos)->as(); + auto word_size = kernel->summary().vectorized_accesses.at(tv); + validateAlignedVectorizedFusionInputOutput(outputs[pos], word_size, tv); + } } +} + +// Misaligned vectorization check. Currently misaligned vectorization is limited +// to global-register and register-global load/store patterns. However, this +// could be improved to include shared memory. 
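The tensor_vectorization_validation_entry lookup above follows a compile-time caching pattern: the analysis result is produced by a lambda on first use and reused afterwards. A minimal sketch of that pattern with hypothetical names (the real ExecutorCompileTimeEntry is keyed by an entry class rather than by string):

#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

template <typename T>
class CompileTimeCache {
 public:
  // Returns the cached value for `key`, creating it with `maker` if missing.
  const T& getOrCreate(
      const std::string& key,
      const std::function<std::unique_ptr<T>()>& maker) {
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      it = entries_.emplace(key, maker()).first;
    }
    return *it->second;
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<T>> entries_;
};

int main() {
  CompileTimeCache<int> cache;
  int calls = 0;
  auto maker = [&]() {
    ++calls;
    return std::make_unique<int>(7);
  };
  cache.getOrCreate("vectorized_tensor_validation", maker);
  cache.getOrCreate("vectorized_tensor_validation", maker);
  std::cout << "maker invoked " << calls << " time(s)\n"; // 1
}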
+void validateMisalignedVectorizedTensors( + kir::Kernel* kernel, + const at::ArrayRef& inputs, + const std::vector& outputs, + caching::ExecutorCompileTimeInfoCache* data_cache, + kir::ExpressionEvaluator& expr_eval) { + auto tensor_vectorization_validation_entry = + executor_utils::caching::ExecutorCompileTimeEntry< + executor_utils::caching::VectorizedTensorValidation>( + data_cache, [kernel]() { + return executor_utils::getVectorizedTensorValidationInfo(kernel); + }); std::vector inp_misaligned_tensors; std::vector out_misaligned_tensors; @@ -659,6 +685,51 @@ void validateVectorizedTensors( inp_misaligned_tensors, out_misaligned_tensors), "All global tensors must have the same stride for misaligned vectorization."); +} + +// Check if there's any split that is non-divisible and vectorized. If +// found, Vectorize is illegal. +void validateVectorizedSplits( + kir::Kernel* kernel, + kir::ExpressionEvaluator& expr_eval) { + for (const auto& extent_factor : kernel->summary().splits_to_validate) { + auto input_extent = expr_eval.evaluate(extent_factor.first); + auto split_factor = expr_eval.evaluate(extent_factor.second); + TORCH_INTERNAL_ASSERT( + input_extent.has_value(), + "Could not check if a split with vectorization is divisible because the extent, ", + extent_factor.first->toString(), + ", is not possible to evaluate."); + TORCH_INTERNAL_ASSERT( + input_extent.has_value(), + "Could not check if a split with vectorization is divisible because the split factor, ", + extent_factor.second->toString(), + ", is not possible to evaluate."); + TORCH_INTERNAL_ASSERT( + input_extent.value() % split_factor.value() == 0, + "Non-divisible split with vectorization is detected. ", + "Extent: ", + input_extent.value(), + ". Factor: ", + split_factor.value()); + } +} + +} // namespace + +void validateVectorizedTensors( + kir::Kernel* kernel, + const at::ArrayRef& inputs, + const std::vector& outputs, + caching::ExecutorCompileTimeInfoCache* data_cache, + kir::ExpressionEvaluator& expr_eval) { + FUSER_PERF_SCOPE("FusionExecutor::validateVectorizedTensors"); + + validateAlignedVectorizedTensors( + kernel, inputs, outputs, data_cache, expr_eval); + + validateMisalignedVectorizedTensors( + kernel, inputs, outputs, data_cache, expr_eval); validateVectorizedSplits(kernel, expr_eval); } @@ -686,8 +757,8 @@ kir::ExpressionEvaluator bindKernelInputs( i); const auto aten_tensor = aten_inputs[i].toTensor(); - const auto root_domain = - TensorDomain::noReductions(tensor_input->domain()->getRootDomain()); + const auto root_domain = TensorDomain::noReductions( + tensor_input->domain()->getMaybeRFactorDomain()); TORCH_INTERNAL_ASSERT( aten_tensor.ndimension() == static_cast(root_domain.size()), "Something went wrong configuring launch. Inputs no longer match."); @@ -695,6 +766,11 @@ kir::ExpressionEvaluator bindKernelInputs( for (const auto dim : c10::irange(root_domain.size())) { const auto extent = root_domain[dim]->extent(); const auto value = aten_tensor.sizes()[dim]; + if (value == 0 && tensor_input->uses().empty()) { + // If there's no uses, ignore there's a size-0 dimension. 
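bindKernelInputs above walks each input tensor's root domain and binds the symbolic extents to runtime sizes, skipping size-0 dimensions of unused inputs and rejecting inconsistent re-bindings. A simplified, runnable sketch with integer ids standing in for the symbolic extent Vals:

#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using ExtentId = int; // stand-in for a symbolic extent Val*

void bindSizes(
    std::unordered_map<ExtentId, int64_t>& bindings,
    const std::vector<ExtentId>& extents,
    const std::vector<int64_t>& sizes,
    bool input_has_uses) {
  for (size_t dim = 0; dim < extents.size(); ++dim) {
    const int64_t value = sizes[dim];
    if (value == 0 && !input_has_uses) {
      continue; // ignore size-0 dimensions of inputs that are never used
    }
    if (value == 0) {
      throw std::runtime_error("Cannot handle size-0 dimensions");
    }
    auto it = bindings.find(extents[dim]);
    if (it != bindings.end() && it->second != value) {
      throw std::runtime_error("Inconsistent runtime sizes for the same extent");
    }
    bindings[extents[dim]] = value;
  }
}

int main() {
  std::unordered_map<ExtentId, int64_t> bindings;
  bindSizes(bindings, {0, 1}, {128, 64}, true);
  bindSizes(bindings, {1, 2}, {64, 32}, true); // extent 1 rebinds consistently
  std::cout << bindings.at(2) << "\n"; // 32
}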
+ continue; + } + TORCH_INTERNAL_ASSERT(value != 0, "Cannot handle size-0 dimensions"); bool should_bind = true; if (check_consistency) { const auto prev_value = expr_eval.evaluate(extent); @@ -717,7 +793,9 @@ kir::ExpressionEvaluator bindKernelInputs( // NOLINTNEXTLINE: https://bugs.llvm.org/show_bug.cgi?id=48525 } else if (input->isScalar() && input->dtype() == DataType::Int) { TORCH_INTERNAL_ASSERT( - aten_inputs[i].type()->kind() == c10::TypeKind::IntType); + aten_inputs[i].type()->kind() == c10::TypeKind::IntType, + "kernel expected Scalar Int inputs, but found", + aten_inputs[i].type()->str()); expr_eval.bind(input, aten_inputs[i].toInt()); } } @@ -748,14 +826,19 @@ ExpressionEvaluator bindFusionInputs( "Something went wrong configuring launch. Inputs do not match."); auto aten_tensor = aten_inputs[i].toTensor(); - auto root_dom = TensorDomain::noReductions(cg_tensor->getRootDomain()); + auto root_dom = + TensorDomain::noReductions(cg_tensor->getMaybeRFactorDomain()); TORCH_INTERNAL_ASSERT( aten_tensor.ndimension() == (int64_t)root_dom.size(), "Something went wrong configuring launch. Inputs do not match."); - for (const auto dim : c10::irange(root_dom.size())) { const auto extent = root_dom[dim]->extent(); const auto value = aten_tensor.sizes()[dim]; + if (value == 0 && cg_tensor->uses().empty()) { + // If there's no uses, ignore there's a size-0 dimension. + continue; + } + TORCH_INTERNAL_ASSERT(value != 0, "Cannot handle size-0 dimensions"); const auto prev_value = evaluator.evaluate(extent); if (prev_value.has_value()) { TORCH_CHECK( @@ -774,7 +857,9 @@ ExpressionEvaluator bindFusionInputs( inputs[i]->getValType().value() == ValType::Scalar && inputs[i]->getDataType().value() == DataType::Int) { TORCH_INTERNAL_ASSERT( - aten_inputs[i].type()->kind() == c10::TypeKind::IntType); + aten_inputs[i].type()->kind() == c10::TypeKind::IntType, + "fusion expected Scalar Int inputs, but found", + aten_inputs[i].type()->str()); evaluator.bind(inputs[i], aten_inputs[i].toInt()); } } diff --git a/torch/csrc/jit/codegen/cuda/executor_utils.h b/torch/csrc/jit/codegen/cuda/executor_utils.h index 93deec6343f1fb..eb73643ed8d895 100644 --- a/torch/csrc/jit/codegen/cuda/executor_utils.h +++ b/torch/csrc/jit/codegen/cuda/executor_utils.h @@ -147,15 +147,18 @@ class WarpPaddedParallelExtents { //! VectorizedTensorInfo: //! Auxiliary data type for entry class VectorizedTensorValidation struct VectorizedTensorInfo { + //! Aligned vectorized fusion inputs + std::vector aligned_vectorized_inp_tensor_pos; + //! Aligned vectorized fusion outputs + std::vector aligned_vectorized_out_tensor_pos; + //! Misaligned vectorized input tensors std::unordered_set global_inp_misaligned_tv; + //! Misaligned vectorized output tensors std::unordered_set global_out_misaligned_tv; - std::unordered_map tv_to_vector_word_size; + //! Positions of misaligned input tensors std::vector inp_misaligned_tensors_pos; + //! Positions of misaligned output tensors std::vector out_misaligned_tensors_pos; - std::unordered_map inp_pos_to_word_size_map_to_verify; - std::unordered_map out_pos_to_word_size_map_to_verify; - std::unordered_map - intermediate_tv_to_word_size_map_to_verify; }; //! 
Compile-time info to be cached in each FusionExecutor: diff --git a/torch/csrc/jit/codegen/cuda/fusion.cpp b/torch/csrc/jit/codegen/cuda/fusion.cpp index be686c0d9439ab..f96237ee9d6874 100644 --- a/torch/csrc/jit/codegen/cuda/fusion.cpp +++ b/torch/csrc/jit/codegen/cuda/fusion.cpp @@ -176,9 +176,17 @@ void Fusion::removeVal(Val* val) { void Fusion::addInput(Val* input) { assertInContainer(input, "Cannot register input "); + TORCH_INTERNAL_ASSERT( + input->getDataType() != DataType::Index, + "Data type Index is a local compile time data type only, it cannot be used as an input in case it was generated from another kernel."); + if (input->getValType().value() == ValType::TensorView) { auto tv = input->as(); tv->setMemoryType(MemoryType::Global); + } else if (input->getValType().value() == ValType::Scalar) { + TORCH_CHECK( + !input->isConst(), + "Immediate scalar value cannot be added as an input. It is not necessary to pass it as an input."); } inputs_.push_back(input); @@ -188,6 +196,19 @@ void Fusion::addInput(Val* input) { } void Fusion::addOutput(Val* output) { + // We currently don't support explicitly outputing aliased inputs. This is + // because they are already marked as output for in-place update. It's tricky + // to allow marking them explicitly as real output, since that requires us to + // register/identify output not only by `Val*` pointer, but also by indices; + // it also requires us to magically arrange `outputs_` entries in proper order + // ^^^ this doesn't look intuitive on `outputs_` in fusion. + // I think we can solve this by marking addOutput on io_alias_ keys after + // fusion is fully defined. Tracking this in #1488 + // Apparently we can't do this neither at the time. I think segmentation + // unfortunately would call addOutput after we marked io_alias_ map. + // TORCH_CHECK(io_alias_.count(output) == 0, + // "can't register aliased output as real output"); + assertInContainer(output, "Cannot register output "); if (output->getValType().value() == ValType::TensorView) { auto tv = output->as(); @@ -304,13 +325,13 @@ void Fusion::print() { std::cout << "}\n\n"; } -void Fusion::printKernel() { +void Fusion::printKernel(DataType index_type) { FUSER_PERF_SCOPE("Fusion::printKernel"); TORCH_INTERNAL_ASSERT( !this->isA(), "Cannot \"print kernel\" of a kernel container. ", "This would require lowering during lowering."); - std::cout << codegen::generateCudaKernel(GpuLower(this).kernel()); + std::cout << codegen::generateCudaKernel(GpuLower(this, index_type).kernel()); } void Fusion::printMath(bool from_outputs_only) { @@ -567,6 +588,33 @@ bool Fusion::isAliasCompatible(Val* left, Val* right) { } void Fusion::aliasOutputToInput(Val* output, Val* input) { + // Because we could cast output when input is casted. 
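The new checks in Fusion::addInput above reject two kinds of values: anything typed as the compile-time-only Index data type, and scalars that are already immediate constants. A small sketch with stand-in Val/DataType types, not the real Fusion API:

#include <iostream>
#include <stdexcept>
#include <vector>

enum class DataType { Int, Index, Double };

struct Val {
  DataType dtype;
  bool is_scalar = false;
  bool is_const = false;
};

void addInput(std::vector<Val*>& inputs, Val* input) {
  if (input->dtype == DataType::Index) {
    throw std::invalid_argument(
        "Data type Index is compile-time only and cannot be a fusion input");
  }
  if (input->is_scalar && input->is_const) {
    throw std::invalid_argument(
        "Immediate scalar values do not need to be passed as inputs");
  }
  inputs.push_back(input);
}

int main() {
  std::vector<Val*> inputs;
  Val runtime_scalar{DataType::Double, /*is_scalar=*/true, /*is_const=*/false};
  addInput(inputs, &runtime_scalar); // fine: value is bound at runtime
  Val immediate{DataType::Double, /*is_scalar=*/true, /*is_const=*/true};
  try {
    addInput(inputs, &immediate);
  } catch (const std::invalid_argument& e) {
    std::cout << "rejected: " << e.what() << "\n";
  }
}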
+ TORCH_INTERNAL_ASSERT( + !output->isFusionOutput(), + "Do NOT add aliased output to fusion output outside of `aliasOutputToInput"); + + if (!input->isFusionInput()) { + auto input_expr = input->definition(); + // TORCH_INTERNAL_ASSERT(input_def.etype() == ExprType::UnaryOp, "expected + // unary op for aliased input"); + TORCH_INTERNAL_ASSERT( + input_expr->isA(), "expected unary op for aliased input"); + auto input_uop = input_expr->as(); + TORCH_INTERNAL_ASSERT( + input_uop->getUnaryOpType() == UnaryOpType::Cast, + "expected aliased input to be output of cast op"); + input = input_uop->in(); + } + TORCH_INTERNAL_ASSERT( + input->getDataType().has_value() && output->getDataType().has_value(), + "requires DataType to be available for aliased output to input"); + + if (input->getDataType().value() != output->getDataType().value()) { + output = castOp(input->getDataType().value(), output); + } + // TODO: output should be marked at the end of fusion definition #1488 + addOutput(output); + TORCH_INTERNAL_ASSERT( isAliasCompatible(input, output), "The input and output values are not alias-compatible."); diff --git a/torch/csrc/jit/codegen/cuda/fusion.h b/torch/csrc/jit/codegen/cuda/fusion.h index 2e76e00896b5f3..e67b287288f908 100644 --- a/torch/csrc/jit/codegen/cuda/fusion.h +++ b/torch/csrc/jit/codegen/cuda/fusion.h @@ -135,7 +135,7 @@ class TORCH_CUDA_CU_API Fusion : public IrContainer { void printTransforms(); //! Lower the fusion and print a kernel - void printKernel(); + void printKernel(DataType index_type = DataType::Int); //! Return a list of topologically sorted expressions. This only includes //! exprs required to genereate registered outputs. diff --git a/torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp b/torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp index 0e74ce172f9161..bec8f6e99ea361 100644 --- a/torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp +++ b/torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp @@ -1170,14 +1170,24 @@ std::unique_ptr SegmentedFusion::makeFusion(SegmentedGroup* sg) { fusion_segment->removeOutput(out); } + std::vector view_tvs; for (auto inp : getAllInputs(sg)) { - fusion_segment->addInput(complete_to_segment_map.clone(inp)); + auto clone_tv = complete_to_segment_map.clone(inp); + fusion_segment->addInput(clone_tv); + if (inp->isDefinitionType(ExprType::ViewOp)) { + TORCH_INTERNAL_ASSERT(clone_tv != nullptr && clone_tv->isA()); + view_tvs.push_back(clone_tv->as()); + } } for (auto out : getAllOutputs(sg)) { fusion_segment->addOutput(complete_to_segment_map.clone(out)); } + for (auto tv : view_tvs) { + tv->convertRfactorToRootDomain(); + } + return fusion_segment; } @@ -2715,8 +2725,8 @@ void SegmentCandidateFinder::findSegments() { } } - auto reduction_ops = - ir_utils::getReductionOps(segmented_fusion_->completeFusion()); + auto reduction_ops = ir_utils::getReductionOps( + segmented_fusion_->completeFusion(), true /* ignore_trivial */); auto welford_ops = ir_utils::filterByType(reduction_ops); if (options_.run_translate_welford && @@ -2798,12 +2808,12 @@ void SegmentCandidateFinder::findSegments() { if (options_.run_final_merge) { // TODO: consider interleaving herrmman merge and bruteforce merge, as - // bruteforce merge can introduce - // opportunities for more herrmann merge + // bruteforce merge can introduce opportunities for more herrmann merge finalMerge(); } finalize(); + if (isDebugDumpEnabled(DebugDumpOption::FusionSegmentsDrawing)) { segmented_fusion_->draw(); } @@ -3012,6 +3022,7 @@ void SegmentCandidateFinder::finalize() { // Finalize each 
group, fill in the missing inputs, i.e. tensor dims. for (auto g : groups()) { + g->setHeuristic(deriveHeuristic(g)); g->finalize(); } } diff --git a/torch/csrc/jit/codegen/cuda/fusion_segmenter.h b/torch/csrc/jit/codegen/cuda/fusion_segmenter.h index 63124839fc1e1d..6e8b15cb67b851 100644 --- a/torch/csrc/jit/codegen/cuda/fusion_segmenter.h +++ b/torch/csrc/jit/codegen/cuda/fusion_segmenter.h @@ -129,7 +129,7 @@ class TORCH_CUDA_CU_API SegmentedGroup { int group_id_ = -1; //! The scheduler to use for compiling this group - ScheduleHeuristic heuristic_ = ScheduleHeuristic::PointWise; + ScheduleHeuristic heuristic_ = ScheduleHeuristic::None; //! Exprs that make up the group std::vector exprs_; diff --git a/torch/csrc/jit/codegen/cuda/graph_fuser.cpp b/torch/csrc/jit/codegen/cuda/graph_fuser.cpp index dee3fa50fb4051..92ad8ce80fbf8c 100644 --- a/torch/csrc/jit/codegen/cuda/graph_fuser.cpp +++ b/torch/csrc/jit/codegen/cuda/graph_fuser.cpp @@ -945,7 +945,11 @@ struct CudaGraphFuser { // extended shape expression support to reduction operations // TODO: `aten::sum` is too flexible, we should restrict for a better // match - if (n->kind() == aten::sum) { + // TODO: Add python tests where we check for existing ops and their + // shape expression logic. + static std::unordered_set reduction_ops( + {aten::sum, aten::mean, aten::var, aten::std}); + if (reduction_ops.find(n->kind()) != reduction_ops.end()) { // TODO: expand support to wire non-constant inputs, this is currently // blocked by profiling executor not capable of profiling scalar inputs. TORCH_INTERNAL_ASSERT( @@ -1102,7 +1106,8 @@ struct CudaGraphFuser { // TODO: failure in buildShapeExpressions should not break fusion execution, // we can add a try/catch here to bailout from removeOutputsUsedOnlyInSize. 
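The shape-expression change in graph_fuser.cpp above replaces the single aten::sum check with membership in a static set of reduction ops. A tiny sketch of that membership test, using strings in place of torch::jit Symbols:

#include <iostream>
#include <string>
#include <unordered_set>

bool isShapeExpressionReduction(const std::string& kind) {
  static const std::unordered_set<std::string> reduction_ops{
      "aten::sum", "aten::mean", "aten::var", "aten::std"};
  return reduction_ops.count(kind) != 0;
}

int main() {
  std::cout << isShapeExpressionReduction("aten::mean") << "\n"; // 1
  std::cout << isShapeExpressionReduction("aten::relu") << "\n"; // 0
}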
GRAPH_DEBUG("before build shape expression: ", *graph_); - fusion_value_to_runtime_shape_ = buildShapeExpressions(fusion_group); + auto shape_map = buildShapeExpressions(fusion_group); + fusion_value_to_runtime_shape_.insert(shape_map.begin(), shape_map.end()); GRAPH_DEBUG("after build shape expression: ", *graph_); auto outputs = fusion_group->outputs().vec(); @@ -1113,14 +1118,12 @@ struct CudaGraphFuser { for (int64_t i = static_cast(outputs.size()) - 1; i >= 0; --i) { auto output = outputs[i]; auto soutput = soutputs[i]; - if (usedOnlyInDtypeAndSize(output) && - fusion_value_to_runtime_shape_.count(soutput) > 0) { + if (usedOnlyInDtypeAndSize(output) && shape_map.count(soutput) > 0) { bool has_dtype = usedInDtype(output); auto uses = output->uses(); for (Use u : uses) { if (u.user->matches("aten::size(Tensor self) -> int[]")) { - u.user->output()->replaceAllUsesWith( - fusion_value_to_runtime_shape_.at(soutput)); + u.user->output()->replaceAllUsesWith(shape_map.at(soutput)); u.user->destroy(); } else if (u.user->matches("prim::dtype(Tensor a) -> int")) { continue; @@ -1210,7 +1213,12 @@ struct CudaGraphFuser { for (Node* node : block_->nodes()) { for (Block* sub_block : node->blocks()) { - CudaGraphFuser(sub_block, graph_).run(); + CudaGraphFuser sub_block_cfg(sub_block, graph_); + sub_block_cfg.run(); + // Accumulate runtime shapes for all sub-blocks + fusion_value_to_runtime_shape_.insert( + sub_block_cfg.fusion_value_to_runtime_shape_.begin(), + sub_block_cfg.fusion_value_to_runtime_shape_.end()); } } } @@ -1605,17 +1613,19 @@ void guardFusionGroup( // TODO: Add support for dynamic split to view guard // Path from profile-ivalue to prim::view_copy operation - // profile-ivalue -> Uses: [Constant, CudaFusionGroup] + // profile-ivalue -> Constant -> CudaFusionGroup // Get argument position in CudaFusionGroup // Get argument in subgraph for CudaFusionGroup // CudaFusionGroup argument -> Constant List -> prim::view_copy - auto cuda_fusion_group_arg = profiled_ival->uses().back().offset; - auto subgraph_arg = fusion_graph->inputs()[cuda_fusion_group_arg]; + auto subgraph_arg = fusion_graph->inputs()[offset]; auto constant = subgraph_arg->uses().front().user->output(); + + TORCH_INTERNAL_ASSERT(!constant->uses().empty()); auto view = constant->uses().front().user; TORCH_INTERNAL_ASSERT( view->kind() == prim::view_copy || view->kind() == prim::reshape_copy); + ivalue_check = guardView( fusion, fusion_value_to_runtime_size, @@ -1710,11 +1720,15 @@ void guardFusionGroups( // c. 
restore conditional constant to non-constant for fallback guardFusionGroup(fusion, fusion_value_to_runtime_size); } +} - if (GRAPH_DEBUG_ENABLED) { - GRAPH_DEBUG("Exporting all NVFuser fusions:"); - for (Node* fusion : fusions) { - GRAPH_EXPORT("", fusion->g(attr::Subgraph)); +void dumpFusionGroups(std::shared_ptr& g) { + DepthFirstGraphNodeIterator it(g); + Node* n = nullptr; + GRAPH_DEBUG("Exporting all NVFuser fusions:"); + while ((n = it.next()) != nullptr) { + if (n->kind() == prim::FallbackGraph) { + GRAPH_EXPORT("", n->g(attr::Subgraph)); } } } @@ -2009,23 +2023,6 @@ void ExtractProfileIValue(Node* profile_ivalue) { } } -void traverseProfileIValues( - Block* block, - const std::function& func) { - std::vector profile_ivalues; - for (Node* n : block->nodes()) { - for (Block* b : n->blocks()) { - traverseProfileIValues(b, func); - } - if (n->kind() == prim::profile_ivalue) { - profile_ivalues.push_back(n); - } - } - for (Node* profile_ivalue : profile_ivalues) { - func(profile_ivalue); - } -} - // break `linear` layer into `matmul` and `add_optional`. This allows us to fuse // the binary operation without supporting gemm. // Note that we are not breaking `linear` layer without bias. @@ -2086,48 +2083,58 @@ void decomposeLinearOps(Block* block) { // Replace 'operation' with 'operation_copy' to guard alias operations. // Supports View, Reshape, Squeeze, and Unsqueeze void replaceAliasOpsWithCopy(std::shared_ptr& graph, Block* block) { - static std::unordered_map op_mapping( - {{aten::view, prim::view_copy}, + static std::unordered_map alias_to_copy_mapping( + // TODO: revert disabled aten::view + {// {aten::view, prim::view_copy}, {aten::reshape, prim::reshape_copy}, {aten::squeeze, prim::squeeze_copy}, {aten::unsqueeze, prim::unsqueeze_copy}}); - std::vector maybe_alias_nodes; + std::vector maybe_safe_alias_nodes; for (Node* n : block->nodes()) { for (Block* b : n->blocks()) { replaceAliasOpsWithCopy(graph, b); } - if (op_mapping.find(n->kind()) != op_mapping.end()) { - maybe_alias_nodes.push_back(n); + if (alias_to_copy_mapping.find(n->kind()) != alias_to_copy_mapping.end()) { + maybe_safe_alias_nodes.push_back(n); } } auto alias_db = std::make_unique(graph); - for (Node* n : maybe_alias_nodes) { - if (!alias_db->safeToChangeAliasingRelationship( - n->input(0), n->output(0))) { - continue; - } + auto safeToChangeAliasToCopy = [&alias_db](Node* n) { + return !alias_db->hasWriters(n->input(0)) && + !alias_db->hasWriters(n->output(0)); + }; + + auto replaceAliasWithCopy = [&graph, &alias_db](Node* n) { WithInsertPoint guard(n); - auto op_copy = - graph->insertNode(graph->create(op_mapping[n->kind()], n->inputs(), 1)); - op_copy->output()->setType(n->output(0)->type()); + auto copy_op = graph->insertNode( + graph->create(alias_to_copy_mapping[n->kind()], n->inputs(), 1)); + copy_op->output()->setType(n->output(0)->type()); // adding newly created value into alias_db; - alias_db->createValue(op_copy->output()); + alias_db->createValue(copy_op->output()); - n->output()->replaceAllUsesWith(op_copy->output()); + n->output()->replaceAllUsesWith(copy_op->output()); n->destroy(); + }; + + for (Node* n : maybe_safe_alias_nodes) { + if (!safeToChangeAliasToCopy(n)) { + continue; + } + replaceAliasWithCopy(n); } } -// Revert all 'op_copy' with 'op' except in CudaFusionGroup +// Revert all 'operation_copy' with 'operation' except in CudaFusionGroup // e.g., Any non-fused alias operation including within the prim::FallbackGraph // Supports View, Reshape, Squeeze, and Unsqueeze void 
revertAliasCopyOps(std::shared_ptr& graph, Block* block) { - static std::unordered_map op_mapping( - {{prim::view_copy, aten::view}, + static std::unordered_map copy_to_alias_mapping( + // TODO: revert disabled aten::view + {// {prim::view_copy, aten::view}, {prim::reshape_copy, aten::reshape}, {prim::squeeze_copy, aten::squeeze}, {prim::unsqueeze_copy, aten::unsqueeze}}); @@ -2147,18 +2154,22 @@ void revertAliasCopyOps(std::shared_ptr& graph, Block* block) { revertAliasCopyOps(graph, b); } // Revert any non-fused alias copy ops - if (op_mapping.find(n->kind()) != op_mapping.end()) { + if (copy_to_alias_mapping.find(n->kind()) != copy_to_alias_mapping.end()) { alias_copy_ops.push_back(n); } } - for (Node* n : alias_copy_ops) { + auto replaceCopyWithAlias = [&graph](Node* n) { WithInsertPoint guard(n); - auto reverted_op = - graph->insertNode(graph->create(op_mapping[n->kind()], n->inputs(), 1)); - reverted_op->output()->setType(n->output(0)->type()); - n->output()->replaceAllUsesWith(reverted_op->output()); + auto alias_op = graph->insertNode( + graph->create(copy_to_alias_mapping[n->kind()], n->inputs(), 1)); + alias_op->output()->setType(n->output(0)->type()); + n->output()->replaceAllUsesWith(alias_op->output()); n->destroy(); + }; + + for (Node* n : alias_copy_ops) { + replaceCopyWithAlias(n); } } @@ -2242,6 +2253,67 @@ bool removeInplaceOperations(const std::shared_ptr& graph) { graph, [&](Node* node) { return inplace_ops.count(node->kind()) != 0; }); } +// Recursively traverse blocks, gather all nodes with given symbol, +// and then apply mutator function. +void mutateNode( + Block* block, + Symbol symbol, + const std::function& func) { + // Recursively call mutateNode on blocks + // Gather all nodes with given symbol + std::vector nodes; + for (Node* n : block->nodes()) { + for (Block* b : n->blocks()) { + mutateNode(b, symbol, func); + } + if (n->kind() == symbol) { + nodes.push_back(n); + } + } + + // Apply mutator funcion to every node + for (Node* n : nodes) { + func(n); + } +} + +// For the given CudaFusionGroup, separate nested views and remove any unused, +// intermediate views +void separateNestedViews(Node* cuda_fusion_group) { + TORCH_INTERNAL_ASSERT(cuda_fusion_group->kind() == prim::CudaFusionGroup); + + auto isView = [](Node* node) { + static std::unordered_set alias_op_set( + {prim::view_copy, prim::reshape_copy}); + return alias_op_set.find(node->kind()) != alias_op_set.end(); + }; + + // node -> input / output values + auto isNestedView = [&isView](Node* node) { + return isView(node) && isView(node->input(0)->node()); + }; + + auto subgraph = cuda_fusion_group->g(attr::Subgraph); + for (auto node : subgraph->block()->nodes()) { + if (isNestedView(node)) { + // grandparent -> (view / reshape) parent -> (view / reshape) node + auto parent_value = node->input(0); + auto parent = parent_value->node(); + + auto grandparent_value = parent->input(0); + auto grandparent = grandparent_value->node(); + + // Before: gp -> x -> n + // After: gp -> x / gp -> n + // Delete x if no more uses + node->replaceInputWith(parent_value, grandparent_value); + if (!parent->hasUses()) { + parent->destroy(); + } + } + } +} + } // anonymous namespace void CudaFuseGraph(std::shared_ptr& graph) { @@ -2252,7 +2324,7 @@ void CudaFuseGraph(std::shared_ptr& graph) { // I don't know how to store edge/node in attribute. 
so let's abuse data flow // dependency and add inputs to conditional constant generated by // aten::profile_ivalue - traverseProfileIValues(graph->block(), ExtractProfileIValue); + mutateNode(graph->block(), prim::profile_ivalue, ExtractProfileIValue); GRAPH_DEBUG("insert conditional constant from profile_ivalue: ", *graph); // TODO: we need to properly restore shape information after fusion. @@ -2292,7 +2364,7 @@ void CudaFuseGraph(std::shared_ptr& graph) { alterBatchNormImpls(graph->block()); GRAPH_DEBUG("After _batch_norm_impl_index: ", *graph); - traverseProfileIValues(graph->block(), RemoveProfileIValue); + mutateNode(graph->block(), prim::profile_ivalue, RemoveProfileIValue); GRAPH_DEBUG("Before remove missing profiling: ", *graph); removeFusionWithMissingProfilingInformation(graph->block()); @@ -2302,9 +2374,15 @@ void CudaFuseGraph(std::shared_ptr& graph) { removeOutputUsedOnlyInDtype(graph->block()); GRAPH_DEBUG("After removeOutputUsedOnlyInDtype: ", *graph); + mutateNode(graph->block(), prim::CudaFusionGroup, separateNestedViews); + GRAPH_DEBUG( + "separate nested and delete redundant views in CudaFusionGroup:", *graph); + revertAliasCopyOps(graph, graph->block()); GRAPH_DEBUG("revert alias_copy ops by nvfuser: ", *graph); + dumpFusionGroups(graph); + // After FuseGraph some common subexpressions may come back EliminateCommonSubexpression(graph); // We might have emitted a fair amount of useless shape propagating code, so diff --git a/torch/csrc/jit/codegen/cuda/index_compute.cpp b/torch/csrc/jit/codegen/cuda/index_compute.cpp index 8e151372b7558f..16cc960791c678 100644 --- a/torch/csrc/jit/codegen/cuda/index_compute.cpp +++ b/torch/csrc/jit/codegen/cuda/index_compute.cpp @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -27,203 +28,6 @@ namespace cuda { namespace { -// A merge is contiguous if: -// Inputs of outer are to the left in the root domain of the inputs of RHS. -// All inputs are contiguous in the root domain: -// - All marked as contiguous -// - Only gaps between inputs are broadcast or reductoin dims -// There are no split transformations performed on outer or inner -// All transformations on outer or inner are contiguous merges -// If this criteria holds, then we can index the input root domains of this -// merge with the indexing provided to the output of the merge in the backward -// index pass - -class ContigIDs : public OptInDispatch { - private: - using OptInDispatch::handle; - - // Mark if ids are result of contigous merges - std::unordered_set contig_ids; - // Given contiguous domain, return all iter domains within its history. - std::unordered_map> - within_contig_ids; - const std::vector& root_domain_; - const std::vector& root_contiguity_; - std::unordered_map is_contig_root; - - bool inRoot(const std::vector& ids) { - return std::all_of(ids.begin(), ids.end(), [this](IterDomain* id) { - return is_contig_root.find(id) != is_contig_root.end(); - }); - } - - bool isContig(IterDomain* id) { - return contig_ids.find(id) != contig_ids.end(); - } - - // Split outputs are not contiguous, don't need to do anything. - void handle(Split*) override {} - - void handle(Merge* merge) override { - // If either input is non-contiguous so is output. - const auto inner = merge->inner(); - const auto outer = merge->outer(); - - if (!isContig(inner) || !isContig(outer)) { - return; - } - - // Grab inputs, make sure they're in root domain, check if they're - // contiguous. 
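mutateNode, used above in place of traverseProfileIValues, first collects every matching node across nested blocks and only then applies the mutator, so mutation cannot invalidate the traversal. A hand-rolled sketch with minimal Node/Block types rather than torch::jit's:

#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node;
struct Block {
  std::vector<Node*> nodes;
};
struct Node {
  std::string kind;
  std::vector<Block*> blocks; // nested blocks, e.g. if/loop bodies
};

void mutateNode(
    Block* block,
    const std::string& kind,
    const std::function<void(Node*)>& func) {
  std::vector<Node*> matches;
  for (Node* n : block->nodes) {
    for (Block* b : n->blocks) {
      mutateNode(b, kind, func);
    }
    if (n->kind == kind) {
      matches.push_back(n);
    }
  }
  // Apply the mutator only after the block has been fully walked.
  for (Node* n : matches) {
    func(n);
  }
}

int main() {
  Node inner{"prim::profile_ivalue", {}};
  Block nested{{&inner}};
  Node outer{"prim::If", {&nested}};
  Node top{"prim::profile_ivalue", {}};
  Block graph{{&outer, &top}};
  int count = 0;
  mutateNode(&graph, "prim::profile_ivalue", [&](Node*) { ++count; });
  std::cout << count << "\n"; // 2
}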
- - auto lhs_inputs = - ir_utils::iterDomainInputsOfOrderedAs({outer}, root_domain_); - auto rhs_inputs = - ir_utils::iterDomainInputsOfOrderedAs({inner}, root_domain_); - - TORCH_INTERNAL_ASSERT( - inRoot(lhs_inputs) && inRoot(rhs_inputs), - "Found an invalid merge operation, inputs of its arguments are not in the root domain."); - - std::deque ordered_inputs( - lhs_inputs.begin(), lhs_inputs.end()); - ordered_inputs.insert( - ordered_inputs.end(), rhs_inputs.begin(), rhs_inputs.end()); - - // If any root input is not contig, output is not contig - if (!(std::all_of( - ordered_inputs.begin(), - ordered_inputs.end(), - [this](IterDomain* id) { - return is_contig_root.at(id) && !id->isBroadcast() && - !id->isReduction(); - }))) { - return; - } - - std::deque root_copy(root_domain_.begin(), root_domain_.end()); - - // Forward to first matching argument - while (!root_copy.empty() && !ordered_inputs.empty()) { - if (root_copy.front() != ordered_inputs.front()) { - root_copy.pop_front(); - } else { - break; - } - } - - // Forward through all matching arguments - while (!root_copy.empty() && !ordered_inputs.empty()) { - if (root_copy.front() == ordered_inputs.front()) { - root_copy.pop_front(); - ordered_inputs.pop_front(); - // This is no longer causing an error in: - // ReductionSchedulerMultiDimNonFastest TODO: test reenablement to make - // sure it does what's expected - // } else if ( - // root_copy.front()->isReduction() || - // root_copy.front()->isBroadcast()) { - // root_copy.pop_front(); - } else { - break; - } - } - - // If we matched all inputs, the output is contiguous. Only want to keep the - // top contig ID, lower ids should be placed in the "within_contig_ids" map - // of top id. - auto out = merge->out()->as(); - if (ordered_inputs.empty()) { - if (contig_ids.find(inner) != contig_ids.end()) { - contig_ids.erase(inner); - } - - if (contig_ids.find(outer) != contig_ids.end()) { - contig_ids.erase(outer); - } - - contig_ids.emplace(out); - - std::unordered_set within_out; - within_out.emplace(inner); - if (within_contig_ids.find(inner) != within_contig_ids.end()) { - auto in_inner = within_contig_ids.at(inner); - within_out.insert(in_inner.begin(), in_inner.end()); - within_contig_ids.erase(inner); - } - - within_out.emplace(outer); - if (within_contig_ids.find(outer) != within_contig_ids.end()) { - auto in_outer = within_contig_ids.at(outer); - within_out.insert(in_outer.begin(), in_outer.end()); - within_contig_ids.erase(outer); - } - - within_contig_ids[out] = within_out; - } - } - - public: - ContigIDs() = delete; - - // Check through the history of ids whose inputs map to root_domain with - // contiguity root_contiguity. Return unordered_set of all merges that are - // contiguous. Ignore root order is primarily used for predicate generation. - // In this case we can linearize indexing of any ID that only consists of - // merge operations. - ContigIDs( - const std::vector& ids, - const std::vector& root_domain, - const std::vector& root_contiguity) - : root_domain_(root_domain), root_contiguity_(root_contiguity) { - if (ids.empty()) { - return; - } - - TORCH_INTERNAL_ASSERT( - root_domain_.size() == root_contiguity_.size(), - "Arguments don't match ", - root_domain_.size(), - " != ", - root_contiguity_.size()); - - for (const auto i : c10::irange(root_domain_.size())) { - // If a root domain has halo, can't use merged domain even if - // both inputs are contiguous. HaloInfo is also initialized for - // rfactor root domains, which should just return "zero" - // RootAxisInfo. 
This should be safe as no rfactor tensor should - // need halo. - if (root_contiguity_[i] && - !GpuLower::current() - ->haloInfo() - .getRootAxisInfo(root_domain_[i]) - .hasHalo()) { - auto root_domain_i = root_domain_[i]->as(); - contig_ids.emplace(root_domain_i); - within_contig_ids[root_domain_i] = std::unordered_set(); - is_contig_root[root_domain_[i]] = true; - } else { - is_contig_root[root_domain_[i]] = false; - } - } - - auto exprs = StmtSort::getExprs(ids[0]->fusion(), {ids.begin(), ids.end()}); - - for (auto expr : exprs) { - handle(expr); - } - } - - const std::unordered_set contigIDs() const { - return contig_ids; - } - - const std::unordered_map> - withinContigIDs() const { - return within_contig_ids; - } -}; - // Update the HaloInfo mappings for a reference tensor by propagating // the halo information from the consumer tensor. void updateHaloInfoForReference( @@ -395,7 +199,7 @@ Val* getProducerOffsetWithGather( // producer offset: window_index - padding auto producer_offset = SimplifyingIrBuilder::subExpr( - window_idx, IrBuilder::create(pad_width)); + window_idx, SimplifyingIrBuilder::create(pad_width)); return producer_offset; } @@ -496,14 +300,14 @@ Val* getProducerIndexWithPartialSplit( if (consumer_offset->isZeroInt()) { return producer_index; } else { - return IrBuilder::addExpr(producer_index, consumer_offset); + return SimplifyingIrBuilder::addExpr(producer_index, consumer_offset); } } // Non-global case. Difference of the split offsets must be // accounted. - auto diff = IrBuilder::subExpr(consumer_offset, producer_offset); + auto diff = SimplifyingIrBuilder::subExpr(consumer_offset, producer_offset); kir::ExpressionEvaluator ee; auto diff_eval = ee.evaluate(diff); // We currently only allow constant offsetting @@ -513,8 +317,8 @@ Val* getProducerIndexWithPartialSplit( return producer_index; } - return IrBuilder::addExpr( - producer_index, IrBuilder::create(diff_eval.value())); + return SimplifyingIrBuilder::addExpr( + producer_index, SimplifyingIrBuilder::create(diff_eval.value())); } } // namespace @@ -564,13 +368,14 @@ void IndexCompute::handle(Split* split) { index_map_[in_id] = outer_ind; extent_map_[in_id] = getExtent(outer_id); } else { - index_map_[in_id] = IrBuilder::addExpr( - IrBuilder::mulExpr(outer_ind, getExtent(inner_id)), inner_ind); + index_map_[in_id] = SimplifyingIrBuilder::addExpr( + SimplifyingIrBuilder::mulExpr(outer_ind, getExtent(inner_id)), + inner_ind); // The extent should be updated only when its allocation is // partial, i.e., zero_merged_in is true. See PR #1270. 
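The index expressions built just above through SimplifyingIrBuilder encode simple integer arithmetic: a split's input index is reconstructed as outer * inner_extent + inner, and a merge's output index decomposes back into a quotient and a remainder. A plain-integer sketch of that round trip:

#include <cassert>
#include <cstdint>
#include <utility>

int64_t splitInputIndex(int64_t outer_idx, int64_t inner_idx, int64_t inner_extent) {
  return outer_idx * inner_extent + inner_idx;
}

std::pair<int64_t, int64_t> mergeOutputIndex(int64_t out_idx, int64_t inner_extent) {
  return {out_idx / inner_extent, out_idx % inner_extent};
}

int main() {
  // Split a 12-element domain by a factor of 4: element 7 lives at (1, 3).
  assert(splitInputIndex(1, 3, 4) == 7);
  // Decomposing the flat index recovers the same (outer, inner) pair.
  auto decomposed = mergeOutputIndex(7, 4);
  assert(decomposed.first == 1 && decomposed.second == 3);
  return 0;
}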
if (zero_merged_in) { - extent_map_[in_id] = - IrBuilder::mulExpr(getExtent(outer_id), getExtent(inner_id)); + extent_map_[in_id] = SimplifyingIrBuilder::mulExpr( + getExtent(outer_id), getExtent(inner_id)); } } } @@ -679,8 +484,8 @@ void IndexCompute::handle(Merge* merge) { zero_merged_in_.emplace(inner_id); zero_merged_in_.emplace(outer_id); } else { - index_map_[outer_id] = IrBuilder::divExpr(out_ind, inner_extent); - index_map_[inner_id] = IrBuilder::modExpr(out_ind, inner_extent); + index_map_[outer_id] = SimplifyingIrBuilder::divExpr(out_ind, inner_extent); + index_map_[inner_id] = SimplifyingIrBuilder::modExpr(out_ind, inner_extent); } } @@ -724,7 +529,8 @@ IndexCompute::IndexCompute( ContigIDs contig_finder( td_->domain(), td_->getMaybeRFactorDomain(), root_contiguity); contig_ids = contig_finder.contigIDs(); - auto within_contig = contig_finder.withinContigIDs(); + root_to_contig_id_ = contig_finder.rootToIndexedID(); + const auto& within_contig = contig_finder.withinContigIDs(); for (auto contig_id : contig_ids) { if (index_map_.find(contig_id) != index_map_.end()) { TORCH_INTERNAL_ASSERT( @@ -734,6 +540,10 @@ IndexCompute::IndexCompute( } } } + } else { + for (auto root_id : td_->getMaybeRFactorDomain()) { + root_to_contig_id_[root_id] = root_id; + } } } @@ -744,7 +554,7 @@ void IndexCompute::run() { traverseFrom(td_->fusion(), domain_vals, false); } -Val* IndexCompute::getExtent(IterDomain* id) { +Val* IndexCompute::getExtent(IterDomain* id) const { // Pick from extent_map_ if available. Previously parallel // dimensions were ued (e.g., blockDim.x), however, it would result // in out-of-bounds errors when the extent of IterDomain is smaller @@ -768,7 +578,8 @@ IndexCompute IndexCompute::updateIndexCompute( const TensorDomain* new_td, const std::unordered_map& id_map, const std::vector& root_contiguity, - const std::unordered_map& reference_halo_extent_map) { + const std::unordered_map& reference_halo_extent_map) + const { FUSER_PERF_SCOPE("GpuLower::Lower::updateIndexCompute"); std::unordered_map updated_index_map; @@ -852,10 +663,13 @@ class UpdateLeafIndices : public IterVisitor { } auto factor = split->factor(); - index_map_[inner_id] = IrBuilder::modExpr(index_map_[in_id], factor); + index_map_[inner_id] = + SimplifyingIrBuilder::modExpr(index_map_[in_id], factor); extent_map_[inner_id] = factor; - index_map_[outer_id] = IrBuilder::divExpr(index_map_[in_id], factor); - extent_map_[outer_id] = IrBuilder::ceilDivExpr(getExtent(in_id), factor); + index_map_[outer_id] = + SimplifyingIrBuilder::divExpr(index_map_[in_id], factor); + extent_map_[outer_id] = + SimplifyingIrBuilder::ceilDivExpr(getExtent(in_id), factor); } void handle(Merge* merge) override { @@ -874,12 +688,13 @@ class UpdateLeafIndices : public IterVisitor { TORCH_INTERNAL_ASSERT( index_map_.find(inner_id) != index_map_.end(), "Inner ID not found"); - index_map_[out_id] = IrBuilder::mulExpr( + index_map_[out_id] = SimplifyingIrBuilder::mulExpr( index_map_[inner_id], - IrBuilder::mulExpr(index_map_[outer_id], getExtent(inner_id))); + SimplifyingIrBuilder::mulExpr( + index_map_[outer_id], getExtent(inner_id))); extent_map_[out_id] = - IrBuilder::mulExpr(getExtent(outer_id), getExtent(inner_id)); + SimplifyingIrBuilder::mulExpr(getExtent(outer_id), getExtent(inner_id)); } // return extent_map_[id] if exists, else return id->extent() @@ -906,8 +721,8 @@ Val* getHaloExtentOfRootAxis(IterDomain* id, Val* normal_extent = nullptr) { const auto& halo = GpuLower::current()->haloInfo().getRootAxisInfo(id); if 
(halo.hasHalo()) { - auto halo_extent = - IrBuilder::addExpr(normal_extent, IrBuilder::create(halo.width())); + auto halo_extent = SimplifyingIrBuilder::addExpr( + normal_extent, SimplifyingIrBuilder::create(halo.width())); return halo_extent; } else { return normal_extent; @@ -959,8 +774,8 @@ void IndexSwizzle::run() { auto idx_to_swizzle_i = indexMap().at(id_to_swizzle_i); auto idx_to_swizzle_j = indexMap().at(id_to_swizzle_j); - auto swizzled_idx = IrBuilder::modExpr( - IrBuilder::addExpr(idx_to_swizzle_i, idx_to_swizzle_j), + auto swizzled_idx = SimplifyingIrBuilder::modExpr( + SimplifyingIrBuilder::addExpr(idx_to_swizzle_i, idx_to_swizzle_j), id_to_swizzle_j->extent()); index_map_[id_to_swizzle_j] = swizzled_idx; swizzled_ids_.insert(id_to_swizzle_j); @@ -1012,6 +827,13 @@ indexMapFromTV( std::unordered_map loop_to_ind_map; + // Check if the current op has an implicit loop implemented + // within an mma instruction. + bool within_mma_loops = + std::any_of(loops.begin(), loops.end(), [](kir::ForLoop* fl) { + return fl->iter_domain()->isMma(); + }); + // When indexed as a producer, the parallel types of the the // producer domains may not be the same as those of the loops, but // that's still valid parallelization. However, in that case, using @@ -1047,8 +869,15 @@ indexMapFromTV( for (auto loop : loops) { Val* idx = nullptr; - const auto same_parallel_type = - as_consumer || find_matching_parallel_domain(loop->iter_domain()); + const auto same_parallel_type = as_consumer || + find_matching_parallel_domain(loop->iter_domain()) || + // Note && TODO: + // mma swizzled lane_id does not map naturally from producer + // to consumer but they should still be detected as same + // parallel type. In a follow up may want to extent + // find_matching_parallel_domain to cover this case. + (within_mma_loops && + loop->iter_domain()->getParallelType() == ParallelType::TIDx); // See also LoopNestGenerator::pushAlloc. // NOLINTNEXTLINE(bugprone-branch-clone) if (!within_alloc) { @@ -1076,18 +905,22 @@ indexMapFromTV( // Similarly for local memory tensors, zero replacement can be // only done when there's a matching domain with the same // parallel type - (loop->iter_domain()->isThread() && is_local && same_parallel_type) || - loop->vectorize()) { + (loop->iter_domain()->isThread() && is_local && same_parallel_type)) { idx = GpuLower::current()->kernel()->zeroVal(); - if (!loop->vectorize()) { - zero_loops.insert(loop); - } + zero_loops.insert(loop); } else { idx = loop->index(); } + // If the loop is trivial, the loop index can only be the loop + // start value. + if (idx == loop->index() && loop->isTrivial()) { + idx = loop->start(); + } + if (loop == double_buffer_loop) { - idx = IrBuilder::addExpr(idx, GpuLower::current()->kernel()->oneVal()); + idx = SimplifyingIrBuilder::addExpr( + idx, GpuLower::current()->kernel()->oneVal()); } loop_to_ind_map[loop] = idx; @@ -1192,6 +1025,130 @@ std::unordered_map indexMapReferenceTo( return index_map_ref_to_producer; } +Val* hoistConsumerIndex( + IterDomain* consumer_root_id, + const TensorView* consumer_tv, + const IndexCompute& consumer_indexing, + TensorDomain* ref_td, + const IndexCompute& ref_indexing, + const std::vector& loops, + Val* index) { + // If index has no defining expression, there's nothing to hoist + if (disableIndexHoisting() || index->definition() == nullptr) { + return index; + } + + // The old swizzle interface, which should be deprecated, is not + // supported. 
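The per-loop index selection in indexMapFromTV above can be summarized as: loops whose index may be replaced by zero use 0, trivial loops use their start value, and the double-buffer loop reads one iteration ahead. A simplified sketch with plain flags instead of kir::ForLoop queries (the zero-replacement conditions themselves are elided):

#include <cstdint>
#include <iostream>

struct LoopInfo {
  int64_t index = 0;
  int64_t start = 0;
  bool replace_with_zero = false; // e.g. unswitched or matching thread dims
  bool trivial = false;           // loops that are never materialized
  bool double_buffer = false;
};

int64_t loopIndex(const LoopInfo& loop) {
  int64_t idx = loop.replace_with_zero ? 0 : loop.index;
  if (idx == loop.index && loop.trivial) {
    idx = loop.start; // a trivial loop can only ever be at its start value
  }
  if (loop.double_buffer) {
    idx += 1; // the double-buffered stage indexes one iteration ahead
  }
  return idx;
}

int main() {
  LoopInfo trivial_loop{/*index=*/5, /*start=*/2, false, /*trivial=*/true, false};
  std::cout << loopIndex(trivial_loop) << "\n"; // 2
  LoopInfo db_loop{/*index=*/3, /*start=*/0, false, false, /*double_buffer=*/true};
  std::cout << loopIndex(db_loop) << "\n"; // 4
}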
+ if (consumer_tv->swizzleType() != SwizzleType::NoSwizzle) { + return index; + } + + // Find the true indexed domain, which can be a merged contiguous domain. + auto indexed_consumer_id_it = + consumer_indexing.rootToContigID().find(consumer_root_id); + TORCH_INTERNAL_ASSERT( + indexed_consumer_id_it != consumer_indexing.rootToContigID().end(), + "Consumer indexed ID not found: ", + consumer_root_id->toString()); + auto indexed_consumer_id = indexed_consumer_id_it->second; + + // Insert the index into the common index map. A previously inserted + // val can be returned. + auto common_index = GpuLower::current() + ->commonIndexMap() + .insert( + indexed_consumer_id, + consumer_tv->domain(), + ref_td, + ref_indexing.indexMap(), + loops, + index) + .first; + + return common_index; +} + +std::unordered_map invertOneToOneMap( + const std::unordered_map& map) { + std::unordered_map inverted; + for (const auto& kv : map) { + bool inserted = inverted.emplace(kv.second, kv.first).second; + TORCH_INTERNAL_ASSERT( + inserted, + "Multiple mappings to the same value detected: ", + kv.second->toString()); + } + return inverted; +} + +Val* hoistProducerIndex( + IterDomain* producer_root_id, + const TensorView* producer_tv, + const IndexCompute& producer_indexing, + const TensorView* consumer_tv, + const std::unordered_map& p2c_map, + TensorDomain* ref_td, + const IndexCompute& ref_indexing, + const std::vector& loops, + Val* index) { + // If index has no defining expression, there's nothing to hoist + if (disableIndexHoisting() || index->definition() == nullptr) { + return index; + } + + // The old swizzle interface, which should be deprecated, is not + // supported. + if (producer_tv->swizzleType() != SwizzleType::NoSwizzle) { + return index; + } + + auto indexed_producer_id_it = + producer_indexing.rootToContigID().find(producer_root_id); + TORCH_INTERNAL_ASSERT( + indexed_producer_id_it != producer_indexing.rootToContigID().end(), + "Producer indexed ID not found: ", + producer_root_id->toString()); + auto indexed_producer_id = indexed_producer_id_it->second; + + // Use the corresponding consumer domain to find matching + // for-loops. Note that there's no CA mapping with the producer + // domains as the producer TensorDomain is a temporary replay + // domain. + + auto indexed_consumer_id_it = p2c_map.find(indexed_producer_id); + + // There can be no corresponding consumer ID. For example, consider: + // consumer: [b1, i2, i3] + // producer: [i2, i3]. + // Suppose the consumer is transformed as: + // consumer: [(b1*i2)*i3] + // Then the producer would be transformed when indexed: + // producer: [i2*i3] + // Assuming i2 and i3 are contiguous, the producer indexing is done + // with the merged i2*i3 domain, but there's no domain in the + // consumer that maps with the producer indexed domain. + // It seems non-trivial to support patterns like this. Skip for now. + if (indexed_consumer_id_it == p2c_map.end()) { + return index; + } + + IterDomain* indexed_consumer_id = indexed_consumer_id_it->second; + + auto common_index = GpuLower::current() + ->commonIndexMap() + .insert( + indexed_consumer_id, + consumer_tv->domain(), + ref_td, + ref_indexing.indexMap(), + loops, + index) + .first; + + return common_index; +} + } // namespace std::vector Index::getGlobalProducerStridedIndices( @@ -1219,16 +1176,17 @@ std::vector Index::getGlobalProducerStridedIndices( // Map everything we can from reference to producer using compute at index // map. Use consumer as a proxy between producer and the generated reference. 
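// Illustrative summary (not part of the original comment): the index is first computed on the reference tensor, mapped onto the consumer through the compute-at index map built below, and then carried over to the producer through the best-effort replay map (c2p_map) and its inverse (p2c_map).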
std::unordered_map index_map_ref_to_producer; - { - // This replay has to be consistent with compute at index map. - BestEffortReplay replay_producer_as_consumer( - producer_tv->domain()->domain(), - consumer_tv->domain()->domain(), - pairwise_map.mapConsumerToProducer( - consumer_tv->domain(), producer_tv->domain())); - const auto& c2p_map = replay_producer_as_consumer.getReplay(); + // This replay has to be consistent with compute at index map. + BestEffortReplay replay_producer_as_consumer( + producer_tv->domain()->domain(), + consumer_tv->domain()->domain(), + pairwise_map.mapConsumerToProducer( + consumer_tv->domain(), producer_tv->domain())); + const auto& c2p_map = replay_producer_as_consumer.getReplay(); + const auto p2c_map = invertOneToOneMap(c2p_map); + { std::unordered_map index_map_ref_to_consumer = indexMapReferenceTo( consumer_tv, gpu_lower->caIndexMap(), reference_id_map); @@ -1302,7 +1260,8 @@ std::vector Index::getGlobalProducerStridedIndices( } std::stringstream ss; ss << "T" << producer_tv->name() << ".stride[" << stride_i++ << "]"; - strides[i] = IrBuilder::create(ss.str(), DataType::Int); + strides[i] = + SimplifyingIrBuilder::create(ss.str(), DataType::Int); } } @@ -1343,12 +1302,13 @@ std::vector Index::getGlobalProducerStridedIndices( // by extent of this dimension auto root_dim_extent = getHaloExtentOfRootAxis(root_dom[dim]); cur_contig_stride = - IrBuilder::mulExpr(cur_contig_stride, root_dim_extent); + SimplifyingIrBuilder::mulExpr(cur_contig_stride, root_dim_extent); } else { // If non contiguous dimension, keep local stride information, set cur // stride to local stride * local raw extent auto root_dim_extent = getHaloExtentOfRootAxis(root_dom[dim]); - cur_contig_stride = IrBuilder::mulExpr(strides[dim], root_dim_extent); + cur_contig_stride = + SimplifyingIrBuilder::mulExpr(strides[dim], root_dim_extent); } } @@ -1380,6 +1340,18 @@ std::vector Index::getGlobalProducerStridedIndices( auto root_ind = producer_indexing.indexMap().at(root_dom[i]); + // index hoist must be done before the adjustments for halo + root_ind = hoistProducerIndex( + root_dom[i], + producer_tv, + producer_indexing, + consumer_tv, + p2c_map, + reference.domain, + ref_compute, + loops, + root_ind); + root_ind = getProducerIndexWithHalo(producer_tv, i, root_ind, consumer_tv); root_ind = getProducerIndexWithGather( @@ -1396,9 +1368,10 @@ std::vector Index::getGlobalProducerStridedIndices( if (root_ind->isZeroInt()) { continue; } else { - auto strided_ind = IrBuilder::mulExpr(root_ind, strides[i]); + auto strided_ind = SimplifyingIrBuilder::mulExpr(root_ind, strides[i]); if (i == root_dom.size() - 1 && vectorize_shift != nullptr) { - strided_inds[i] = IrBuilder::addExpr(strided_ind, vectorize_shift); + strided_inds[i] = + SimplifyingIrBuilder::addExpr(strided_ind, vectorize_shift); } else { strided_inds[i] = strided_ind; } @@ -1434,25 +1407,25 @@ std::vector Index::getNonGlobalProducerStridedIndices( // the allocation position of the producer, and to figure out which producer // indices are mapped to consumer trivial reductions. std::unordered_map p2c_alloc_map; - { - // We want to play producer as consumer instead of the other way around - // since consumer may have some broadcasted axes producer doesn't have - // merged into loops producer may use. If we did consumer as producer we - // wouldn't have this information in the mapping. 
- auto replay_PasC = BestEffortReplay::replayPasC( - producer_tv, consumer_tv, -1, pairwise_map); - - auto c2p_map = replay_PasC.getReplay(); - - // Grab consumer domain entries and reverse replay map. TODO: Maybe - // TransformReplay::replayPasC could return this map - for (auto id : consumer_tv->domain()->domain()) { - auto c2p_it = c2p_map.find(id); - if (c2p_it != c2p_map.end()) { - auto c_id = c2p_it->first; - auto p_id = c2p_it->second; - p2c_alloc_map[p_id] = c_id; - } + + // We want to play producer as consumer instead of the other way around + // since consumer may have some broadcasted axes producer doesn't have + // merged into loops producer may use. If we did consumer as producer we + // wouldn't have this information in the mapping. + auto replay_PasC = + BestEffortReplay::replayPasC(producer_tv, consumer_tv, -1, pairwise_map); + + const auto& c2p_map = replay_PasC.getReplay(); + const auto p2c_map = invertOneToOneMap(c2p_map); + + // Grab consumer domain entries and reverse replay map. TODO: Maybe + // TransformReplay::replayPasC could return this map + for (auto id : consumer_tv->domain()->domain()) { + auto c2p_it = c2p_map.find(id); + if (c2p_it != c2p_map.end()) { + auto c_id = c2p_it->first; + auto p_id = c2p_it->second; + p2c_alloc_map[p_id] = c_id; } } @@ -1641,6 +1614,18 @@ std::vector Index::getNonGlobalProducerStridedIndices( auto root_ind_i = index_map.at(root_dom[i]); + // index hoist must be done before the adjustments for halo + root_ind_i = hoistProducerIndex( + root_dom[i], + producer_tv, + producer_indexing, + consumer_tv, + c2p_map, + reference.domain, + ref_compute, + loops, + root_ind_i); + root_ind_i = getProducerIndexWithHalo(producer_tv, i, root_ind_i, consumer_tv); @@ -1685,13 +1670,13 @@ std::vector Index::getNonGlobalProducerStridedIndices( if (stride == nullptr) { stride = root_ext_j; } else { - stride = IrBuilder::mulExpr(stride, root_ext_j); + stride = SimplifyingIrBuilder::mulExpr(stride, root_ext_j); } } } if (stride != nullptr) { - strided_inds[i] = IrBuilder::mulExpr(root_ind_i, stride); + strided_inds[i] = SimplifyingIrBuilder::mulExpr(root_ind_i, stride); } else { strided_inds[i] = root_ind_i; } @@ -1701,12 +1686,14 @@ std::vector Index::getNonGlobalProducerStridedIndices( auto db_loop = gpu_lower->doubleBufferInfo().getDoubleBufferLoop( producer_tv, loops, true); if (db_loop != nullptr) { - auto db_switch_index = - IrBuilder::modExpr(db_loop->index(), IrBuilder::create(2)); + auto loop_index = + db_loop->isTrivial() ? 
db_loop->start() : db_loop->index(); + auto db_switch_index = SimplifyingIrBuilder::modExpr( + loop_index, SimplifyingIrBuilder::create(2)); auto original_alloc_size = gpu_lower->doubleBufferInfo().getOriginalAllocSize(producer_tv); auto db_strided_index = - IrBuilder::mulExpr(db_switch_index, original_alloc_size); + SimplifyingIrBuilder::mulExpr(db_switch_index, original_alloc_size); strided_inds.push_back(db_strided_index); } } @@ -1845,6 +1832,16 @@ std::vector Index::getGlobalConsumerStridedIndices( auto root_ind = consumer_indexing.indexMap().at(root_dom[i]); + // index hoist must be done before the adjustments for halo + root_ind = hoistConsumerIndex( + root_dom[i], + consumer_tv, + consumer_indexing, + reference.domain, + ref_compute, + loops, + root_ind); + root_ind = SimplifyingIrBuilder::addExpr( root_ind, getGlobalConsumerOffsetWithPartialSplit(root_dom[i])); @@ -1979,11 +1976,21 @@ std::vector Index::getNonGlobalConsumerStridedIndices( " id: ", root_dom[i]->toString()); - const auto root_ind_i = index_map.at(root_dom[i]); + auto root_ind_i = index_map.at(root_dom[i]); if (root_ind_i->isZeroInt()) { continue; } + // index hoist must be done before the adjustments for halo + root_ind_i = hoistConsumerIndex( + root_dom[i], + consumer_tv, + consumer_indexing, + reference.domain, + ref_compute, + loops, + root_ind_i); + // Compute striding for this index. Val* stride = nullptr; for (const auto j : c10::irange(i + 1, root_dom.size())) { @@ -2012,13 +2019,13 @@ std::vector Index::getNonGlobalConsumerStridedIndices( if (stride == nullptr) { stride = root_ext_j; } else { - stride = IrBuilder::mulExpr(stride, root_ext_j); + stride = SimplifyingIrBuilder::mulExpr(stride, root_ext_j); } } } if (stride != nullptr) { - strided_inds[i] = IrBuilder::mulExpr(root_ind_i, stride); + strided_inds[i] = SimplifyingIrBuilder::mulExpr(root_ind_i, stride); } else { strided_inds[i] = root_ind_i; } @@ -2037,13 +2044,14 @@ std::vector Index::getNonGlobalConsumerStridedIndices( auto db_loop = gpu_lower->doubleBufferInfo().getDoubleBufferLoop( consumer_tv, loops, true); if (db_loop != nullptr) { - auto db_switch_index = IrBuilder::subExpr( + auto db_switch_index = SimplifyingIrBuilder::subExpr( gpu_lower->kernel()->oneVal(), - IrBuilder::modExpr(db_loop->index(), IrBuilder::create(2))); + SimplifyingIrBuilder::modExpr( + db_loop->index(), SimplifyingIrBuilder::create(2))); auto original_alloc_size = gpu_lower->doubleBufferInfo().getOriginalAllocSize(consumer_tv); auto db_strided_index = - IrBuilder::mulExpr(db_switch_index, original_alloc_size); + SimplifyingIrBuilder::mulExpr(db_switch_index, original_alloc_size); strided_inds.push_back(db_strided_index); } } @@ -2085,7 +2093,8 @@ kir::TensorIndex* Index::getProducerIndex( const TensorView* consumer, const std::vector& loops) { auto strided_indices = getProducerStridedIndices(producer, consumer, loops); - return IrBuilder::create(producer, strided_indices); + return SimplifyingIrBuilder::create( + producer, strided_indices); } std::vector Index::getConsumerStridedIndices( @@ -2113,7 +2122,8 @@ kir::TensorIndex* Index::getConsumerIndex( const TensorView* consumer, const std::vector& loops) { auto strided_indices = getConsumerStridedIndices(consumer, loops); - return IrBuilder::create(consumer, strided_indices); + return SimplifyingIrBuilder::create( + consumer, strided_indices); } namespace { @@ -2363,8 +2373,8 @@ std::pair getStartAndStopOffsetsForShift( } return { - IrBuilder::create(start_offset), - IrBuilder::create(stop_offset)}; + 
SimplifyingIrBuilder::create(start_offset), + SimplifyingIrBuilder::create(stop_offset)}; } std::pair getStartAndStopOffsetsForGather( @@ -2627,6 +2637,15 @@ auto getPredicateReferenceIndexing( } } + for (const auto loop : loops) { + auto& idx = loop_to_ind_map.at(loop); + // If the loop is trivial, the loop index can only be the loop + // start value. + if (idx == loop->index() && loop->isTrivial()) { + idx = loop->start(); + } + } + if (double_buffer_axis != nullptr) { auto db_loop = GpuLower::current()->doubleBufferInfo().getDoubleBufferLoop( double_buffer_axis, loops, true); @@ -2639,7 +2658,7 @@ auto getPredicateReferenceIndexing( // unswitch. In that case, it is not necessary to move ahead the // index for double buffering. if (cur_index == db_loop->index()) { - loop_to_ind_map[db_loop] = IrBuilder::addExpr( + loop_to_ind_map[db_loop] = SimplifyingIrBuilder::addExpr( cur_index, GpuLower::current()->kernel()->oneVal()); } } @@ -2813,8 +2832,7 @@ bool canOmitStopPredicate( } } - // Omit only when both the index and extent are "simple". - if (!(index_simple && contig_id->extent()->definition() == nullptr)) { + if (!index_simple) { return false; } @@ -2827,14 +2845,20 @@ bool canOmitStopPredicate( auto stop_offset_val = stop_offset->as()->value(); - auto halo_ext = gpu_lower->haloInfo().getRootAxisInfo(contig_id).width(); - // If they are not compile-time constant, can't prove the // condition. if (!stop_offset_val.has_value()) { return false; } + // Note that when a root domain is halo extended, it is the domain + // to be predicated, not its merged contig id even if it exists. So, + // if contig_id does not have root axis info, contig_id is + // guaranteed to have no halo. + auto halo_ext = gpu_lower->haloInfo().hasRootAxisInfo(contig_id) + ? gpu_lower->haloInfo().getRootAxisInfo(contig_id).width() + : 0; + if (halo_ext + stop_offset_val.value() > 0) { return false; } @@ -2858,6 +2882,61 @@ bool canOmitStopPredicate( return true; } +std::pair hoistPredicates( + Val* start_index, + Val* stop_index, + const std::vector& loops, + kir::ForLoop* unswitch_or_vec_loop, + IterDomain* predicated_consumer_id, + TensorView* predicated_consumer_tv, + TensorDomain* ref_td, + const std::unordered_map& ref_start_index_map, + const std::unordered_map& ref_stop_index_map) { + const std::pair same_indices{start_index, stop_index}; + + if (disableIndexHoisting()) { + return same_indices; + } + + const auto start_is_same_as_stop = stop_index == start_index; + + Val* hoisted_stop_index = nullptr; + + if (stop_index->definition() == nullptr) { + // If the index doens't have an expression, nothing to hoist + hoisted_stop_index = stop_index; + } else { + bool inserted = false; + std::tie(hoisted_stop_index, inserted) = + GpuLower::current()->commonIndexMap().insert( + predicated_consumer_id, + predicated_consumer_tv->domain(), + ref_td, + ref_stop_index_map, + loops, + stop_index); + } + + Val* hoisted_start_index = nullptr; + if (start_is_same_as_stop) { + hoisted_start_index = hoisted_stop_index; + } else if (start_index->definition() == nullptr) { + hoisted_start_index = start_index; + } else { + bool inserted = false; + std::tie(hoisted_start_index, inserted) = + GpuLower::current()->commonIndexMap().insert( + predicated_consumer_id, + predicated_consumer_tv->domain(), + ref_td, + ref_start_index_map, + loops, + start_index); + } + + return {hoisted_start_index, hoisted_stop_index}; +} + } // namespace // Returns predicates and the concrete (by loop map) root domains they cover @@ -2908,10 +2987,13 @@ 
std::pair, ReferenceTensor> Index:: // If not unswitch, share the same indexing map as the stop index // map + const auto& ref_start_indexing = is_unswitch + ? getPredicateReferenceIndexing( + loops, reference, unswitch_or_vec_loop, db_axis, true) + : ref_stop_indexing; + std::unordered_map consumer_start_index_map; if (is_unswitch) { - auto ref_start_indexing = getPredicateReferenceIndexing( - loops, reference, unswitch_or_vec_loop, db_axis, true); const auto consumer_start_indexing = ref_start_indexing.updateIndexCompute( consumer_tv->domain(), ref_2_consumer, @@ -2986,6 +3068,17 @@ std::pair, ReferenceTensor> Index:: auto stop_index = consumer_stop_indexing_it->second; auto start_index = consumer_start_index_map.at(contig_id); + std::tie(start_index, stop_index) = hoistPredicates( + start_index, + stop_index, + loops, + unswitch_or_vec_loop, + contig_id, + consumer_tv, + reference.domain, + ref_start_indexing.indexMap(), + ref_stop_indexing.indexMap()); + // Build predicates for start positions as: // start_index + start_offset >= 0 auto start_offset = simplifyStartOffset(info.start_offset_); diff --git a/torch/csrc/jit/codegen/cuda/index_compute.h b/torch/csrc/jit/codegen/cuda/index_compute.h index 27f1c911bde122..32aa3421ae8b28 100644 --- a/torch/csrc/jit/codegen/cuda/index_compute.h +++ b/torch/csrc/jit/codegen/cuda/index_compute.h @@ -69,7 +69,7 @@ class IndexCompute : public BackwardVisitor { void handle(Expr*) override; // return extent_map_[id] if exists, else return id->extent() - Val* getExtent(IterDomain* id); + Val* getExtent(IterDomain* id) const; //! True if a domain is not used to index bool isZero(IterDomain* id) const; @@ -105,6 +105,9 @@ class IndexCompute : public BackwardVisitor { // IDs that are a result of contiguous merges std::unordered_set contig_ids; + // Map from root to contig domains + std::unordered_map root_to_contig_id_; + // Mentions if we should propagate an index down a particular IterDomain path // if there's an option std::unordered_set preferred_paths_; @@ -130,6 +133,10 @@ class IndexCompute : public BackwardVisitor { return zero_merged_in_; } + const std::unordered_map& rootToContigID() const { + return root_to_contig_id_; + } + // Propagate back from _td using initial_index_map IndexCompute( const TensorDomain* _td, @@ -148,7 +155,7 @@ class IndexCompute : public BackwardVisitor { const std::unordered_map& id_map, const std::vector& _root_contiguity, const std::unordered_map& reference_halo_extent_map = - {}); + {}) const; virtual void run(); }; diff --git a/torch/csrc/jit/codegen/cuda/index_reference_replay.cpp b/torch/csrc/jit/codegen/cuda/index_reference_replay.cpp index 27e5b93e94e29c..a0e346f8892c61 100644 --- a/torch/csrc/jit/codegen/cuda/index_reference_replay.cpp +++ b/torch/csrc/jit/codegen/cuda/index_reference_replay.cpp @@ -40,7 +40,7 @@ IterDomain* IndexReferenceReplay::idCopy(IterDomain* id) { // reduction. All we care about are the transformations, and trying to make // sure we track correctly a replaying with consistent reduction/broadcast // domains is challenging and unnecessary. 
- auto copied_id = IrBuilder::create( + auto copied_id = SimplifyingIrBuilder::create( id->container(), id->start(), id->extent(), id->getParallelType()); replayed_ids_.emplace_back(copied_id); return copied_id; @@ -59,13 +59,13 @@ void IndexReferenceReplay::handle(Split* split) { // Don't produce the same values multiple times auto ref_outer = concreteToRefId(toConcrete(split->outer())); auto ref_inner = concreteToRefId(toConcrete(split->inner())); - if (ref_id_produced_.find(ref_outer) != ref_id_consumed_.end() || - ref_id_produced_.find(ref_inner) != ref_id_consumed_.end()) { + if (ref_id_produced_.find(ref_outer) != ref_id_produced_.end() || + ref_id_produced_.find(ref_inner) != ref_id_produced_.end()) { return; } // Replay the provided split operation and add it to the reference DAG - IrBuilder::create( + SimplifyingIrBuilder::create( split->container(), ref_outer, ref_inner, @@ -92,12 +92,13 @@ void IndexReferenceReplay::handle(Merge* merge) { // Don't produce the same values multiple times auto ref_out = concreteToRefId(toConcrete(merge->out())); - if (ref_id_produced_.find(ref_out) != ref_id_consumed_.end()) { + if (ref_id_produced_.find(ref_out) != ref_id_produced_.end()) { return; } // Replay the provided merge operation and add it to the reference DAG - IrBuilder::create(merge->container(), ref_out, ref_outer, ref_inner); + SimplifyingIrBuilder::create( + merge->container(), ref_out, ref_outer, ref_inner); // Mark producers and consumers ref_id_consumed_.emplace(ref_outer); @@ -218,7 +219,7 @@ TensorDomain* IndexReferenceReplay::computeReplay() { loops_replayed_domain.begin(), loops_replayed_domain.end(), [](IterDomain* id) { return id->definition() != nullptr; })) { - auto domain = IrBuilder::create( + auto domain = SimplifyingIrBuilder::create( // If there was no replay only return a domain with a root domain. loops_replayed_domain); return domain; @@ -253,7 +254,7 @@ TensorDomain* IndexReferenceReplay::computeReplay() { } // Create and return the reference. - auto domain = IrBuilder::create( + auto domain = SimplifyingIrBuilder::create( std::vector( root_domain_ids.begin(), root_domain_ids.end()), loops_replayed_domain); @@ -276,17 +277,23 @@ IndexCompute getReferenceIndexing( auto loop = loop_structure[loop_i]; auto ind = loop->index(); - initial_index_map[ref_axis] = ind; - if (loop->vectorize()) { - initial_index_map[ref_axis] = GpuLower::current()->kernel()->zeroVal(); - } else if (double_buffer_loop == loop) { + // If the loop is trivial, only the start value is used + if (loop->isTrivial()) { + initial_index_map[ref_axis] = loop->start(); + } else { + initial_index_map[ref_axis] = ind; + } + + if (double_buffer_loop == loop) { + TORCH_INTERNAL_ASSERT( + !loop->isTrivial(), "The double buffer loop must be materialized"); // This version of getReferenceIndexing is only used for // indexing global tensors. When indexing global producers, the // index for a double buffered loop needs to be incremented. The // parameter double_buffer_loop should be nullptr when indexing // global consumers tensors. 
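// For example (illustrative): if the double-buffer loop has index i, the reference axis is indexed with (i + 1) below, so global producer loads are issued one iteration ahead of the corresponding consumer reads.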
- initial_index_map[ref_axis] = - IrBuilder::addExpr(ind, GpuLower::current()->kernel()->oneVal()); + initial_index_map[ref_axis] = SimplifyingIrBuilder::addExpr( + initial_index_map[ref_axis], GpuLower::current()->kernel()->oneVal()); } if (Index::protectWithMagicZero(loop, ref_axis, ind)) { @@ -297,7 +304,7 @@ IndexCompute getReferenceIndexing( // Add magic zero to a fairly inner most index if (magic_zero_loop >= 0) { auto ref_id = reference_tensor->axis(magic_zero_loop); - initial_index_map[ref_id] = IrBuilder::addExpr( + initial_index_map[ref_id] = SimplifyingIrBuilder::addExpr( initial_index_map[ref_id], FusionGuard::getCurFusion()->magicZeroVal()); } diff --git a/torch/csrc/jit/codegen/cuda/interface.cpp b/torch/csrc/jit/codegen/cuda/interface.cpp index d21004ae154278..1292f4b7ed02ab 100644 --- a/torch/csrc/jit/codegen/cuda/interface.cpp +++ b/torch/csrc/jit/codegen/cuda/interface.cpp @@ -90,6 +90,11 @@ bool profileNode(const Node* node) { getFuserInterface()->fn_profile_n(node); } +bool skipNode(const std::string& symbol_str, bool flip) { + return getFuserInterface()->fn_skip_n != nullptr && + getFuserInterface()->fn_skip_n(symbol_str, flip); +} + //! [ Note -- type guard logic in CudaFusionGuard ] //! //! CudaFusionGuard is used to Guard input tensor to `CudaFusionGroup` so that //! extra attention should be paid to contiguity across size-1 //! dimensions. //! c. size check: +//! c.1 broadcast check: //! making sure that broadcast semantics are identical. So we want to //! make sure a given dimension either are both size-1 for `tensor` & //! `guard_tensor_type`, or are both non-size-1. //! This is due to the fact that we specialize size-1 dimension as //! broadcasted dimension while translating PyTorch tensor to Fusion IR. +//! c.2 size-0 check: +//! we don't specialize this on codegen, but we do specialize fusion +//! logic for size-0 on reductions, hence the check //! bool complyWith( const at::Tensor& tensor, @@ -133,13 +142,19 @@ bool complyWith( // check a. if num_dimension check fails or scalar type check fails if (*guard_tensor_type->dim() != static_cast(tensor.ndimension()) || (guard_tensor_type->scalarType().has_value() && - (guard_tensor_type->scalarType().value() != tensor.scalar_type()))) { + (guard_tensor_type->scalarType().value() != tensor.scalar_type())) || + (guard_tensor_type->device().has_value() && + (guard_tensor_type->device().value() != tensor.device())) || + (guard_tensor_type->requiresGrad().has_value() && + guard_tensor_type->requiresGrad().value() != + (tensor.requires_grad() && at::GradMode::is_enabled()))) { return false; } // TODO: should we get symbolic_size instead and check for size // consistency across tensors as well? 
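// Illustrative example of checks c.1 and c.2 below (hypothetical shapes): with guard sizes [4, 1, 8], a runtime tensor of shape [4, 1, 8] passes; [4, 3, 8] fails c.1 because the guard specialized the middle dimension as a broadcast (size-1) dimension; [4, 0, 8] fails c.2 because size-0 inputs take a specialized fusion path.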
const auto& sizes = guard_tensor_type->sizes(); + // see [ Note -- stirde_properties in tensor type ] const auto& stride_properties = guard_tensor_type->stride_properties(); const auto& t_sizes = tensor.sizes(); @@ -207,12 +222,18 @@ bool complyWith( } } - // check c, we go along semantic ordered dimensions + // check c.1, we go along semantic ordered dimensions // check broadcast / size-1: bool guard_bcast = sizes[j].has_value() && sizes[j].value() == 1; if (guard_bcast != (t_sizes[j] == 1)) { return false; } + + // check c.2, check for size-0 + bool guard_size_0 = sizes[j].has_value() && sizes[j].value() == 0; + if (guard_size_0 != (t_sizes[j] == 0)) { + return false; + } } return true; @@ -675,7 +696,7 @@ RegisterOperators reg_infer_unsqueeze_size({ // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables) RegisterOperators reg_infer_squeeze_dim_size({ Operator( - "prim::infer_squeeze_size(int[] a, int dim) -> int[]", + "prim::infer_squeeze_size.dim(int[] a, int dim) -> int[]", [](const Node* node) -> Operation { return [](Stack& stack) { auto dim = pop(stack).toInt(); @@ -696,7 +717,7 @@ RegisterOperators reg_infer_squeeze_dim_size({ // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables) RegisterOperators reg_infer_squeeze_size({ Operator( - "prim::infer_squeeze_size.dim(int[] a) -> int[]", + "prim::infer_squeeze_size(int[] a) -> int[]", [](const Node* node) -> Operation { return [](Stack& stack) { auto size = pop(stack).toIntVector(); diff --git a/torch/csrc/jit/codegen/cuda/interface.h b/torch/csrc/jit/codegen/cuda/interface.h index 8afa854ea5cf46..a24d3f3b043276 100644 --- a/torch/csrc/jit/codegen/cuda/interface.h +++ b/torch/csrc/jit/codegen/cuda/interface.h @@ -19,10 +19,10 @@ namespace cuda { TORCH_API std::atomic& getCudaFusionGuardMode(); -C10_EXPORT bool getSingletonFusion(); -C10_EXPORT bool setSingletonFusion(bool value); -C10_EXPORT bool getHorizontalFusion(); -C10_EXPORT bool setHorizontalFusion(bool value); +TORCH_API bool getSingletonFusion(); +TORCH_API bool setSingletonFusion(bool value); +TORCH_API bool getHorizontalFusion(); +TORCH_API bool setHorizontalFusion(bool value); // dummy struct to allow API registration struct CudaFuserInterface { @@ -32,19 +32,22 @@ struct CudaFuserInterface { bool (*fn_can_fuse_n)(const Node*) = nullptr; void (*fn_insert_profile_inodes)(ProfilingRecord* pr) = nullptr; bool (*fn_profile_n)(const Node*) = nullptr; + bool (*fn_skip_n)(const std::string&, bool flip) = nullptr; }; // Get interface, this is used by registration and user facing API internally -C10_EXPORT CudaFuserInterface* getFuserInterface(); +TORCH_API CudaFuserInterface* getFuserInterface(); -C10_EXPORT void compileFusionGroup(Node* fusion_node); -C10_EXPORT void runFusionGroup(const Node* fusion_node, Stack& stack); -C10_EXPORT void fuseGraph(std::shared_ptr&); -C10_EXPORT bool canFuseNode(const Node* node); -C10_EXPORT void InsertProfileNodesForCUDAFuser(ProfilingRecord* pr); -C10_EXPORT bool profileNode(const Node* node); +TORCH_API void compileFusionGroup(Node* fusion_node); +TORCH_API void runFusionGroup(const Node* fusion_node, Stack& stack); +TORCH_API void fuseGraph(std::shared_ptr&); +TORCH_API bool canFuseNode(const Node* node); +TORCH_API void InsertProfileNodesForCUDAFuser(ProfilingRecord* pr); +TORCH_API bool profileNode(const Node* node); -C10_EXPORT bool complyWith( +TORCH_API bool skipNode(const std::string& symbol_str, bool flip = true); + +TORCH_API bool complyWith( const at::Tensor& tensor, const c10::TensorTypePtr& 
guard_tensor_type); diff --git a/torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp b/torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp index 6a094c104df34d..39434ff993721b 100644 --- a/torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp @@ -103,6 +103,19 @@ const std::vector& Val::uses() const { return uses_; } +void Val::resolveIndexDtype() { + TORCH_INTERNAL_ASSERT( + vtype_ == ValType::TensorView, + "Resolving index type is currently only supported on tensor view values."); + TORCH_INTERNAL_ASSERT( + dtype_ == DataType::Index, + "Can only resolve index type if a tensor has an Index DataType."); + TORCH_INTERNAL_ASSERT( + container()->isA(), + "Index type can only be resolved at compile time."); + dtype_ = container()->as()->indexType(); +} + namespace { // Traverse definition of all values involved in constructing the provided val. @@ -180,6 +193,16 @@ bool Val::isOneInt() const { return int_val.has_value() && int_val.value() == 1; } +bool Val::isDefinitionType(ExprType expression_type) const { + if (definition() != nullptr) { + auto def_expr_type = definition()->getExprType(); + if (def_expr_type.has_value() && def_expr_type.value() == expression_type) { + return true; + } + } + return false; +} + c10::optional Val::getDataType() const { TORCH_INTERNAL_ASSERT( dtype_ != DataType::Null, "Value does not have a data type."); diff --git a/torch/csrc/jit/codegen/cuda/ir_base_nodes.h b/torch/csrc/jit/codegen/cuda/ir_base_nodes.h index 1b8444fae46203..70f0b8f80fe538 100644 --- a/torch/csrc/jit/codegen/cuda/ir_base_nodes.h +++ b/torch/csrc/jit/codegen/cuda/ir_base_nodes.h @@ -266,6 +266,9 @@ class TORCH_CUDA_CU_API Val : public Statement { return definition_; } + // Determine if value definition matches given expression type + bool isDefinitionType(ExprType expression_type) const; + const std::vector& uses() const; bool isFusionInput() const { @@ -309,13 +312,13 @@ class TORCH_CUDA_CU_API Val : public Statement { definition_ = expr; } + void resolveIndexDtype(); + protected: friend Fusion; // NOLINTNEXTLINE(cppcoreguidelines-non-private-member-variables-in-classes) const ValType vtype_; - // NOLINTNEXTLINE(cppcoreguidelines-non-private-member-variables-in-classes) - const DataType dtype_; // TODO: Add fusion passkey for this void setIsFusionInput(bool is_fusion_input) { @@ -333,6 +336,11 @@ class TORCH_CUDA_CU_API Val : public Statement { } private: + // There's only one instance where dtype can change, and that's through + // resolving the index data type from nvfuser to either Int or Int32 for + // welford operations. + DataType dtype_; + // Following is managed by Fusion and can change. 
bool is_fusion_input_ = false; bool is_fusion_output_ = false; diff --git a/torch/csrc/jit/codegen/cuda/ir_builder.cpp b/torch/csrc/jit/codegen/cuda/ir_builder.cpp index 17a4e59cfb625b..c17ff0de44a49b 100644 --- a/torch/csrc/jit/codegen/cuda/ir_builder.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_builder.cpp @@ -47,6 +47,7 @@ IR_BUILDER_INSTANTIATE(TensorView) IR_BUILDER_INSTANTIATE(Bool) IR_BUILDER_INSTANTIATE(Double) IR_BUILDER_INSTANTIATE(Int) +IR_BUILDER_INSTANTIATE(ComplexDouble) IR_BUILDER_INSTANTIATE(NamedScalar) // Exprs @@ -55,12 +56,14 @@ IR_BUILDER_INSTANTIATE(Merge) IR_BUILDER_INSTANTIATE(TransposeOp) IR_BUILDER_INSTANTIATE(ShiftOp) IR_BUILDER_INSTANTIATE(GatherOp) +IR_BUILDER_INSTANTIATE(ViewDtypeOp) IR_BUILDER_INSTANTIATE(ViewOp) IR_BUILDER_INSTANTIATE(UnaryOp) IR_BUILDER_INSTANTIATE(BinaryOp) IR_BUILDER_INSTANTIATE(TernaryOp) IR_BUILDER_INSTANTIATE(ReductionOp) IR_BUILDER_INSTANTIATE(WelfordOp) +IR_BUILDER_INSTANTIATE(MmaOp) IR_BUILDER_INSTANTIATE(BroadcastOp) Val* IrBuilder::newResult(DataType dtype) { @@ -268,6 +271,61 @@ Val* SimplifyingIrBuilder::subExpr(Val* lhs, Val* rhs) { return addExpr(lhs, negExpr(rhs)); } +Val* SimplifyingIrBuilder::mulExpr(Int* lhs, Int::ScalarType rhs) { + if (rhs == 0) { + return lhs->container()->zeroVal(); + } else if (rhs == 1) { + return lhs; + } else if (lhs == nullptr) { + return IrBuilder::create(rhs); + } else if (lhs->isConst()) { + return IrBuilder::create(lhs->value().value() * rhs); + } else { + return IrBuilder::mulExpr(lhs, IrBuilder::create(rhs)); + } +} + +Val* SimplifyingIrBuilder::mulExpr(Val* lhs, Int::ScalarType rhs) { + auto lhs_int = dynamic_cast(lhs); + if (lhs_int != nullptr) { + return mulExpr(lhs_int, rhs); + } else { + return IrBuilder::mulExpr(lhs, IrBuilder::create(rhs)); + } +} + +Val* SimplifyingIrBuilder::mulExpr(Int* lhs, Int* rhs) { + if (rhs == nullptr) { + return lhs; + } else if (lhs == nullptr) { + return rhs; + } else if (lhs->isConst()) { + return mulExpr(rhs, lhs->value().value()); + } else if (rhs->isConst()) { + return mulExpr(lhs, rhs->value().value()); + } else { + return IrBuilder::mulExpr(lhs, rhs); + } +} + +Val* SimplifyingIrBuilder::mulExpr(Val* lhs, Val* rhs) { + TORCH_INTERNAL_ASSERT(lhs != nullptr || rhs != nullptr); + if (lhs == nullptr || lhs->isOneInt()) { + return rhs; + } else if (rhs == nullptr || rhs->isOneInt()) { + return lhs; + } else if (lhs->isZeroInt() || rhs->isZeroInt()) { + return lhs->container()->zeroVal(); + } + auto lhs_int = dynamic_cast(lhs); + auto rhs_int = dynamic_cast(rhs); + if (lhs_int != nullptr && rhs_int != nullptr) { + return mulExpr(lhs_int, rhs_int); + } else { + return IrBuilder::mulExpr(lhs, rhs); + } +} + Val* SimplifyingIrBuilder::andExpr(Val* lhs, Val* rhs) { TORCH_INTERNAL_ASSERT(!(lhs == nullptr && rhs == nullptr)); diff --git a/torch/csrc/jit/codegen/cuda/ir_builder.h b/torch/csrc/jit/codegen/cuda/ir_builder.h index 5087f2832a99df..f122232f8fb8eb 100644 --- a/torch/csrc/jit/codegen/cuda/ir_builder.h +++ b/torch/csrc/jit/codegen/cuda/ir_builder.h @@ -116,6 +116,10 @@ class TORCH_CUDA_CU_API SimplifyingIrBuilder : public IrBuilder { static Val* addExpr(Int* lhs, Int* rhs); static Val* addExpr(Val* lhs, Val* rhs); static Val* subExpr(Val* lhs, Val* rhs); + static Val* mulExpr(Int* lhs, Int::ScalarType rhs); + static Val* mulExpr(Val* lhs, Int::ScalarType rhs); + static Val* mulExpr(Int* lhs, Int* rhs); + static Val* mulExpr(Val* lhs, Val* rhs); static Val* andExpr(Val* lhs, Val* rhs); static Val* maxExpr(Val* lhs, Val* rhs); static Val* minExpr(Val* lhs, 
Val* rhs); diff --git a/torch/csrc/jit/codegen/cuda/ir_cloner.cpp b/torch/csrc/jit/codegen/cuda/ir_cloner.cpp index 8a1717e8d059dd..1ddc4feb90dacc 100644 --- a/torch/csrc/jit/codegen/cuda/ir_cloner.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_cloner.cpp @@ -76,6 +76,10 @@ void IrCloner::handle(const Int* i) { clone_ = IrBuilder::clone(i, this); } +void IrCloner::handle(const ComplexDouble* c) { + clone_ = IrBuilder::clone(c, this); +} + void IrCloner::handle(const NamedScalar* named_scalar) { clone_ = IrBuilder::clone(named_scalar, this); } @@ -108,6 +112,10 @@ void IrCloner::handle(const WelfordOp* op) { clone_ = IrBuilder::clone(op, this); } +void IrCloner::handle(const MmaOp* op) { + clone_ = IrBuilder::clone(op, this); +} + void IrCloner::handle(const TransposeOp* op) { clone_ = IrBuilder::clone(op, this); } @@ -120,6 +128,10 @@ void IrCloner::handle(const GatherOp* op) { clone_ = IrBuilder::clone(op, this); } +void IrCloner::handle(const ViewDtypeOp* op) { + clone_ = IrBuilder::clone(op, this); +} + void IrCloner::handle(const ViewOp* op) { clone_ = IrBuilder::clone(op, this); } diff --git a/torch/csrc/jit/codegen/cuda/ir_cloner.h b/torch/csrc/jit/codegen/cuda/ir_cloner.h index 1755b9e95632fe..3f50cd48e93bf6 100644 --- a/torch/csrc/jit/codegen/cuda/ir_cloner.h +++ b/torch/csrc/jit/codegen/cuda/ir_cloner.h @@ -65,6 +65,7 @@ class TORCH_CUDA_CU_API IrCloner : private OptInConstDispatch { void handle(const Bool*) override; void handle(const Double*) override; void handle(const Int*) override; + void handle(const ComplexDouble*) override; void handle(const NamedScalar*) override; void handle(const UnaryOp*) override; @@ -73,9 +74,11 @@ class TORCH_CUDA_CU_API IrCloner : private OptInConstDispatch { void handle(const BroadcastOp*) override; void handle(const ReductionOp*) override; void handle(const WelfordOp*) override; + void handle(const MmaOp*) override; void handle(const TransposeOp*) override; void handle(const ShiftOp*) override; void handle(const GatherOp*) override; + void handle(const ViewDtypeOp*) override; void handle(const ViewOp*) override; void handle(const Split*) override; diff --git a/torch/csrc/jit/codegen/cuda/ir_graphviz.cpp b/torch/csrc/jit/codegen/cuda/ir_graphviz.cpp index 7511fbd4d6d595..941bf22dea7633 100644 --- a/torch/csrc/jit/codegen/cuda/ir_graphviz.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_graphviz.cpp @@ -371,6 +371,10 @@ void IrGraphGenerator::handle(const Int* i) { printValue(i, IrNodeLabel::gen(i, detail_level_)); } +void IrGraphGenerator::handle(const ComplexDouble* i) { + printValue(i, IrNodeLabel::gen(i, detail_level_)); +} + void IrGraphGenerator::handle(const NamedScalar* i) { printValue(i, IrNodeLabel::gen(i, detail_level_)); } diff --git a/torch/csrc/jit/codegen/cuda/ir_graphviz.h b/torch/csrc/jit/codegen/cuda/ir_graphviz.h index f9b3adf703d14c..e5bbcac9157dc7 100644 --- a/torch/csrc/jit/codegen/cuda/ir_graphviz.h +++ b/torch/csrc/jit/codegen/cuda/ir_graphviz.h @@ -79,6 +79,7 @@ class TORCH_CUDA_CU_API IrGraphGenerator : private OptInConstDispatch { void handle(const Bool*) override; void handle(const Double*) override; void handle(const Int*) override; + void handle(const ComplexDouble*) override; void handle(const NamedScalar*) override; void handle(const UnaryOp*) override; diff --git a/torch/csrc/jit/codegen/cuda/ir_interface_nodes.h b/torch/csrc/jit/codegen/cuda/ir_interface_nodes.h index 28478c64d91efe..bfc76acdfccd37 100644 --- a/torch/csrc/jit/codegen/cuda/ir_interface_nodes.h +++ b/torch/csrc/jit/codegen/cuda/ir_interface_nodes.h @@ -53,9 +53,9 
@@ class TORCH_CUDA_CU_API Bool : public Val { const c10::optional maybe_value_; }; -//! A Float64 value. For now we don't have any other type besides -//! Float64. This value can be a symbolic value (defined after the kernel -//! is compiled) or a constant value (inlined into the kernel definition). +//! A Float64 value. This value can be a symbolic value (defined after the +//! kernel is compiled) or a constant value (inlined into the kernel +//! definition). class TORCH_CUDA_CU_API Double : public Val { public: using ScalarType = double; @@ -114,6 +114,39 @@ class TORCH_CUDA_CU_API Int : public Val { const c10::optional maybe_value_; }; +//! An c10::complex value. This value can be a symbolic value (defined +//! after the kernel is compiled) or a constant value (inlined into the kernel +//! definition). +class TORCH_CUDA_CU_API ComplexDouble : public Val { + public: + using ScalarType = c10::complex; + + ComplexDouble(IrBuilderPasskey passkey); + + explicit ComplexDouble(IrBuilderPasskey passkey, ScalarType value); + + explicit ComplexDouble( + IrBuilderPasskey passkey, + c10::optional value); + + ComplexDouble(const ComplexDouble* src, IrCloner* ir_cloner); + + bool isSymbolic() const { + return !(maybe_value_.has_value()); + } + bool isConst() const final { + return maybe_value_.has_value(); + } + c10::optional value() const { + return maybe_value_; + } + + bool sameAs(const Statement* other) const override; + + private: + const c10::optional maybe_value_; +}; + //! Mode during propagation of computeAt, standard will throw an error if //! computeAt position provided can't be satisfied, best effort will lower the //! computeAt position as needed during traversal, most inlined will increase @@ -176,6 +209,13 @@ class TORCH_CUDA_CU_API TensorView : public Val { return domain_; } + //! This is for a TensorView with an rFactor domain that is an input to a + //! fusion segment. We convert the rfactor domain into a new root domain. + //! Any dynamic-sized rfactor iterDomains are given a new symbolic extent. + //! Concrete integer extents are kept. Output TensorViews of any subsequent + //! expressions that use this TensorView are also updated. + void convertRfactorToRootDomain(); + void setContiguity(const std::vector& contig) { domain()->setContiguity(contig); } @@ -400,6 +440,24 @@ class TORCH_CUDA_CU_API TensorView : public Val { return is_double_buffered_; } + //! Fill in mma options in scheduling time. + //! Each mma op in Fusion IR must be configured once before lowering. + //! Mma options are configuration parameters used in lowering to mma + //! instrinsics, mainly the type of mma macro to use and input data layout + //! etc. + //! + //! TODO: This step will very likely be removed in a follow up PR. All of + //! the options configured here could actually be inferred from fusion IR + //! once we are feature complete. + void configureMma(MmaOptions options); + + //! Transforms the innermost iterdomains according to the given mma swizzle, + //! this should be used on the tvs that are either inputs/outputs of an + //! MmaOp, or any tv's that are involved in prolog/epilog fusions and need to + //! have a matching thread swizzle with the mma operand/result. + //! More detail on usage see [WarpMmaSwizzler] in scheduler/mma_utils.h . 
+ void applyMmaSwizzle(MmaOptions options); + friend TORCH_CUDA_CU_API TransformPropagator; friend TORCH_CUDA_CU_API TransformReplay; friend TORCH_CUDA_CU_API OptOutMutator; diff --git a/torch/csrc/jit/codegen/cuda/ir_internal_nodes.h b/torch/csrc/jit/codegen/cuda/ir_internal_nodes.h index bb494148be2135..41f16978779768 100644 --- a/torch/csrc/jit/codegen/cuda/ir_internal_nodes.h +++ b/torch/csrc/jit/codegen/cuda/ir_internal_nodes.h @@ -5,6 +5,7 @@ #include #include #include +#include #include //! Nodes in here should generally not be used by users. They should be behind @@ -150,7 +151,8 @@ class TORCH_CUDA_CU_API ReductionOp : public Expr { BinaryOpType reduction_op_type, Val* init, Val* out, - Val* in); + Val* in, + bool is_fused = false); ReductionOp(const ReductionOp* src, IrCloner* ir_cloner); @@ -168,6 +170,10 @@ class TORCH_CUDA_CU_API ReductionOp : public Expr { return reduction_op_type_; } + bool isFused() const { + return is_fused_; + } + bool sameAs(const Statement* other) const override; private: @@ -175,6 +181,8 @@ class TORCH_CUDA_CU_API ReductionOp : public Expr { Val* const init_ = nullptr; Val* const out_ = nullptr; Val* const in_ = nullptr; + //! True if using the fused reduction kernel + bool is_fused_ = false; }; //! Welford Scan operation. @@ -190,7 +198,8 @@ class TORCH_CUDA_CU_API WelfordOp : public Expr { Val* init_N, Val* in_avg, Val* in_var, - Val* in_N); + Val* in_N, + bool is_fused = false); WelfordOp(const WelfordOp* src, IrCloner* ir_cloner); @@ -250,6 +259,10 @@ class TORCH_CUDA_CU_API WelfordOp : public Expr { return !init_N_->isZeroInt(); } + bool isFused() const { + return is_fused_; + } + private: Val* const out_avg_; Val* const out_var_; @@ -260,6 +273,63 @@ class TORCH_CUDA_CU_API WelfordOp : public Expr { Val* const in_avg_; Val* const in_var_; Val* const in_N_; + //! True if using the fused reduction kernel (not implemented yet) + bool is_fused_ = false; +}; + +//! 
Fused Matmul operation +class TORCH_CUDA_CU_API MmaOp : public Expr { + public: + MmaOp(IrBuilderPasskey, Val* out, Val* in_a, Val* in_b, Val* init); + + MmaOp( + IrBuilderPasskey, + Val* out, + Val* in_a, + Val* in_b, + Val* init, + MmaOptions options); + + MmaOp(const MmaOp* src, IrCloner* ir_cloner); + + Val* out() const { + return out_; + } + + Val* inA() const { + return in_a_; + } + + Val* inB() const { + return in_b_; + } + + Val* init() const { + return init_; + } + + const auto& options() const { + TORCH_INTERNAL_ASSERT(options_.has_value(), "MmaOp not configured:", this); + return options_.value(); + } + + bool sameAs(const Statement* const other) const override; + + auto accStride() const { + TORCH_INTERNAL_ASSERT(options_.has_value(), "MmaOp not configured:", this); + return options_->accumulator_stride; + } + + void configureOptions(MmaOptions options) { + options_ = options; + } + + private: + Val* const out_ = nullptr; + Val* const in_a_ = nullptr; + Val* const in_b_ = nullptr; + Val* const init_ = nullptr; + c10::optional options_ = c10::nullopt; }; class TORCH_CUDA_CU_API TransposeOp : public Expr { @@ -429,6 +499,34 @@ class TORCH_CUDA_CU_API GatherOp : public Expr { std::vector> pad_width_; }; +class TORCH_CUDA_CU_API ViewDtypeOp : public Expr { + public: + ViewDtypeOp( + IrBuilderPasskey, + TensorView* out, + TensorView* in, + DataType dtype); + + ViewDtypeOp(const ViewDtypeOp* src, IrCloner* ir_cloner); + + TensorView* out() const { + return out_; + } + + TensorView* in() const { + return in_; + } + + DataType dtype() const { + return dtype_; + } + + private: + TensorView* const out_ = nullptr; + TensorView* const in_ = nullptr; + DataType dtype_; +}; + class TORCH_CUDA_CU_API ViewOp : public Expr { public: ViewOp(IrBuilderPasskey, TensorView* out, TensorView* in); @@ -662,6 +760,50 @@ class TORCH_CUDA_CU_API IterDomain : public Val { return definition() == nullptr; } + //! Marks that this id represents a + //! instruction loop, mma use only. + //! + //! An instruction loop can be considered a generalization of + //! vectorization. It also represents a loop that's implemented + //! by an instruction and should not be realized by codegen and + //! cannot be inlined with. + //! As an example, if a mma macro, call it mma_eg implements: + //! for m in M + //! for n in N + //! for k in K + //! C[m,n] += A[m,k]*B[k,n], + //! But the generated code should simply be: + //! mma_eg(C,A,B) + //! without the 3 level loopnest, i.e. they're instruction loops. + //! + //! In the actual mma macros, the loopnests it implements is a + //! transformed version of above to match the mma swizzle. + //! So it's different implicit loopnest for different macros. + //! WarpMmaSwizzler will label the instruction loops case-by-case. + bool isMma() const { + return parallel_type_ == ParallelType::Mma; + } + + bool isMmaSwizzled() const { + return is_mma_swizzled_; + } + + //! Used by WarpMmaSwizzler, this is an utility for WarpMmaSwizzler + //! to lock the thread swizzled iterdomains. + //! Only true for the iterdomains produced by WarpMmaSwizzler. + //! Mma ops require specific swizzle patterns + //! and this label utility is to prevent any further transform on the + //! iterdomains involved in the swizzle so that the pattern remain correct in + //! generated code. + //! + //! Note: + //! Used only through WarpMmaSwizzler only and mma validation relies on + //! this + //! flag being set on the correct iterdomains. 
+ void toMmaSwizzled() { + is_mma_swizzled_ = true; + } + protected: friend TensorDomain; friend ReplayTransformations; @@ -682,6 +824,11 @@ class TORCH_CUDA_CU_API IterDomain : public Val { // TODO: Remove only used in kernel IR because IterDomains don't maintain // definitions of split/merge. bool is_simple_ = true; + + //! Tracks if this id represents a thread swizzled loop or + //! models an implicit loop within instructions. Should not make + //! any changes once an id is warp mapped. + bool is_mma_swizzled_ = false; }; //! TensorDomain holds a vector of IterDomains. It holds an IterDomain for every diff --git a/torch/csrc/jit/codegen/cuda/ir_iostream.cpp b/torch/csrc/jit/codegen/cuda/ir_iostream.cpp index 8c0e1022308320..0ca27be650ca73 100644 --- a/torch/csrc/jit/codegen/cuda/ir_iostream.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_iostream.cpp @@ -146,33 +146,29 @@ void IrPrinter::handle(const TensorDomain* td) { } void IrPrinter::handle(const TensorView* tv) { - if (tv->nDims() == 0) { - os_ << typePrefix(tv->getDataType().value()) << varName(tv); - } else { - os_ << "T" << varName(tv); - switch (tv->getMemoryType()) { - case MemoryType::Global: - os_ << "_g"; - break; - case MemoryType::Shared: - os_ << "_s"; - break; - case MemoryType::Local: - os_ << "_l"; - break; - } - handle(tv->domain()); + os_ << "T" << varName(tv); + switch (tv->getMemoryType()) { + case MemoryType::Global: + os_ << "_g"; + break; + case MemoryType::Shared: + os_ << "_s"; + break; + case MemoryType::Local: + os_ << "_l"; + break; + } + handle(tv->domain()); - if (tv->getComputeAtPosition() > 0) { - os_ << " ca_pos( "; - os_ << tv->getComputeAtPosition(); - os_ << " )"; - } - if (tv->getMaxProducerPosition() > 0) { - os_ << " produce_pos( "; - os_ << tv->getMaxProducerPosition(); - os_ << ")"; - } + if (tv->getComputeAtPosition() > 0) { + os_ << " ca_pos( "; + os_ << tv->getComputeAtPosition(); + os_ << " )"; + } + if (tv->getMaxProducerPosition() > 0) { + os_ << " produce_pos( "; + os_ << tv->getMaxProducerPosition(); + os_ << ")"; } } @@ -225,6 +221,25 @@ void IrPrinter::handle(const Int* i) { } } +void IrPrinter::handle(const ComplexDouble* c) { + if (print_inline_) { + if (auto def = c->definition()) { + os_ << "( "; + handle(def); + os_ << " )"; + return; + } + } + + if (c->isSymbolic()) { + os_ << "c" << varName(c); + } else { + os_ << "std::complex" + << std::setprecision(std::numeric_limits::max_digits10) + << *(c->value()); + } +} + void IrPrinter::handle(const NamedScalar* ns) { os_ << ns->name(); } @@ -377,7 +392,8 @@ void IrPrinter::handle(const TernaryOp* top) { void IrPrinter::handle(const ReductionOp* rop) { indent() << rop->out() << " = reduction( " << rop->in() << ", op = " << rop->getReductionOpType() - << ", initial value = " << rop->init() << " )\n"; + << ", initial value = " << rop->init() + << ", fused = " << rop->isFused() << " )\n"; } void IrPrinter::handle(const WelfordOp* wop) { @@ -395,6 +411,7 @@ void IrPrinter::handle(const WelfordOp* wop) { os_ << "\n initial value = " << wop->initAvg() << "(Avg)\n " << wop->initVar() << "(Var)\n " << wop->initN() << "(N)"; } + os_ << "\n fused = " << wop->isFused(); os_ << " )\n"; } @@ -439,6 +456,11 @@ void IrPrinter::handle(const ShiftOp* sop) { << "}, {" << sop->padWidth() << "} )\n"; } +void IrPrinter::handle(const MmaOp* mma) { + indent() << mma->out() << " = mma(" << mma->inA() << "," << mma->inB(); + os_ << ")\n"; +} + void IrPrinter::handle(const GatherOp* op) { indent() << op->out() << " = gather( " << op->in() << ", {"; bool no_comma = 
true; @@ -461,6 +483,11 @@ void IrPrinter::handle(const GatherOp* op) { os_ << "} )\n"; } +void IrPrinter::handle(const ViewDtypeOp* top) { + indent() << top->out() << " = view.dtype( " << top->in() << ", " + << top->dtype() << " )\n"; +} + void IrPrinter::handle(const ViewOp* top) { indent() << top->out() << " = view( " << top->in() << " )\n"; } @@ -540,11 +567,17 @@ void IrPrinter::handle(const kir::Allocate* node) { } } -void IrPrinter::handle(const kir::Sync* node) { - indent() << "SYNC(war_hazard=" << boolLiteral(node->isWarHazardSync()) +void IrPrinter::handle(const kir::BlockSync* node) { + indent() << "BLOCKSYNC(war_hazard=" << boolLiteral(node->isWarHazardSync()) << ")\n"; } +void IrPrinter::handle(const kir::GridSync* node) { + indent() << "GRIDSYNC(" << node->syncDims().toString() << ", "; + handle(node->syncBuffer()); + os_ << ")\n"; +} + void IrPrinter::handle(const kir::ForLoop* node) { indent() << "FOR "; handle(node->index()); @@ -566,7 +599,19 @@ void IrPrinter::handle(const kir::IfThenElse* node) { } void IrPrinter::handle(const kir::GridBroadcast* node) { - TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); + const auto* broadcast_op = node->broadcast_op(); + indent(); + handle(broadcast_op->out()); + os_ << " = " + << "GRID_BROADCAST(in="; + handle(broadcast_op->in()); + os_ << ")\n"; + indent() << kTab << ".broadcast_buffer="; + handle(node->broadcast_buffer()->buffer()); + os_ << "\n"; + indent() << kTab << ".sync_buffer="; + handle(node->sync_buffer()->buffer()); + os_ << "\n"; } void IrPrinter::handle(const kir::GridReduction* node) { @@ -579,8 +624,19 @@ void IrPrinter::handle(const kir::GridReduction* node) { handle(reduction_op->in()); os_ << ", init="; handle(reduction_op->init()); - os_ << ", pred="; - handle(reduction_op->predicate()); + os_ << ", read_pred="; + if (reduction_op->predicate() != nullptr) { + handle(reduction_op->predicate()); + } else { + os_ << "nullptr"; + } + os_ << ")\n"; + os_ << ", write_pred="; + if (reduction_op->writePredicate() != nullptr) { + handle(reduction_op->writePredicate()); + } else { + os_ << "nullptr"; + } os_ << ")\n"; indent() << kTab << ".reduction_buffer="; handle(node->reduction_buffer()->buffer()); @@ -588,8 +644,19 @@ void IrPrinter::handle(const kir::GridReduction* node) { indent() << kTab << ".sync_buffer="; handle(node->sync_buffer()->buffer()); os_ << "\n"; - indent() << kTab << ".grid_pred="; - handle(node->predicate()); + indent() << kTab << ".grid_read_pred="; + if (node->predicate() != nullptr) { + handle(node->predicate()); + } else { + os_ << "nullptr"; + } + os_ << "\n"; + indent() << kTab << ".grid_write_pred="; + if (node->writePredicate() != nullptr) { + handle(node->writePredicate()); + } else { + os_ << "nullptr"; + } os_ << "\n"; } @@ -619,8 +686,19 @@ void IrPrinter::handle(const kir::GridWelford* node) { os_ << " initN="; handle(welford_op->initN()); } - indent() << ", pred="; - handle(welford_op->predicate()); + indent() << ", read_pred="; + if (welford_op->predicate() != nullptr) { + handle(welford_op->predicate()); + } else { + os_ << "nullptr"; + } + os_ << ")\n"; + indent() << ", write_pred="; + if (welford_op->writePredicate() != nullptr) { + handle(welford_op->writePredicate()); + } else { + os_ << "nullptr"; + } os_ << ")\n"; indent() << kTab << ".var_buffer="; handle(node->var_buffer()->buffer()); @@ -632,8 +710,19 @@ void IrPrinter::handle(const kir::GridWelford* node) { indent() << kTab << ".sync_buffer="; handle(node->sync_buffer()->buffer()); os_ << "\n"; - indent() << kTab << 
".grid_pred="; - handle(node->predicate()); + indent() << kTab << ".grid_read_pred="; + if (node->predicate() != nullptr) { + handle(node->predicate()); + } else { + os_ << "nullptr"; + } + os_ << "\n"; + indent() << kTab << ".grid_write_pred="; + if (node->writePredicate() != nullptr) { + handle(node->writePredicate()); + } else { + os_ << "nullptr"; + } os_ << "\n"; } @@ -645,6 +734,12 @@ void IrPrinter::handle(const kir::UpdateMagicZero* node) { indent() << "NVFUSER_UPDATE_MAGIC_ZERO\n"; } +void IrPrinter::handle(const kir::AllocateFusedReduction* node) { + indent() << "AllocateFusedReduction(reduction buffer="; + handle(node->out()); + os_ << ")\n"; +} + void IrTransformPrinter::handle(Fusion* f) { auto all_vals = f->usedMathVals(); diff --git a/torch/csrc/jit/codegen/cuda/ir_iostream.h b/torch/csrc/jit/codegen/cuda/ir_iostream.h index f8c07886114f16..e25e6ef0f865d3 100644 --- a/torch/csrc/jit/codegen/cuda/ir_iostream.h +++ b/torch/csrc/jit/codegen/cuda/ir_iostream.h @@ -79,6 +79,7 @@ class TORCH_CUDA_CU_API IrPrinter : public OptInConstDispatch { void handle(const Bool*) final; void handle(const Double*) final; void handle(const Int*) final; + void handle(const ComplexDouble*) final; void handle(const NamedScalar*) final; void handle(const UnaryOp*) final; @@ -86,10 +87,12 @@ class TORCH_CUDA_CU_API IrPrinter : public OptInConstDispatch { void handle(const TernaryOp*) final; void handle(const ReductionOp*) final; void handle(const WelfordOp*) final; + void handle(const MmaOp*) final; void handle(const BroadcastOp*) final; void handle(const TransposeOp*) final; void handle(const ShiftOp*) final; void handle(const GatherOp*) final; + void handle(const ViewDtypeOp*) final; void handle(const ViewOp*) final; void handle(const kir::Predicate*) final; @@ -101,9 +104,11 @@ class TORCH_CUDA_CU_API IrPrinter : public OptInConstDispatch { void handle(const kir::ForLoop*) final; void handle(const kir::IfThenElse*) final; void handle(const kir::Allocate*) final; - void handle(const kir::Sync*) final; + void handle(const kir::BlockSync*) final; + void handle(const kir::GridSync*) final; void handle(const kir::InitMagicZero*) final; void handle(const kir::UpdateMagicZero*) final; + void handle(const kir::AllocateFusedReduction*) final; // IR math printer overrides these to prevent them from printing, keep // override diff --git a/torch/csrc/jit/codegen/cuda/ir_nodes.cpp b/torch/csrc/jit/codegen/cuda/ir_nodes.cpp index 884b6a6e0eca79..44f2e29df5e9a4 100644 --- a/torch/csrc/jit/codegen/cuda/ir_nodes.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_nodes.cpp @@ -152,6 +152,36 @@ bool Int::sameAs(const Statement* other) const { return false; } +ComplexDouble::ComplexDouble(IrBuilderPasskey passkey) + : Val(passkey, ValType::Scalar, DataType::ComplexDouble), + maybe_value_{c10::nullopt} {} + +ComplexDouble::ComplexDouble(IrBuilderPasskey passkey, ScalarType value) + : Val(passkey, ValType::Scalar, DataType::ComplexDouble), + maybe_value_{value} {} + +ComplexDouble::ComplexDouble( + IrBuilderPasskey passkey, + c10::optional value) + : Val(passkey, ValType::Scalar, DataType::ComplexDouble), + maybe_value_{value} {} + +ComplexDouble::ComplexDouble(const ComplexDouble* src, IrCloner* ir_cloner) + : Val(src, ir_cloner), maybe_value_(src->maybe_value_) {} + +bool ComplexDouble::sameAs(const Statement* other) const { + if (this == other) { + return true; + } + if (!other->isA()) { + return false; + } + const auto other_complex = other->as(); + if (isConst() && other_complex->isConst()) + return *value() == 
*(other_complex->value()); + return false; +} + UnaryOp::UnaryOp(IrBuilderPasskey passkey, UnaryOpType type, Val* out, Val* in) : Expr(passkey, ExprType::UnaryOp), unary_op_type_{type}, @@ -351,12 +381,14 @@ ReductionOp::ReductionOp( BinaryOpType reduction_op_type, Val* init, Val* out, - Val* in) + Val* in, + bool is_fused) : Expr(passkey, ExprType::ReductionOp), reduction_op_type_(reduction_op_type), init_(init), out_(out), - in_(in) { + in_(in), + is_fused_(is_fused) { TORCH_CHECK( out->getValType().value() == ValType::TensorView || out->getValType().value() == ValType::TensorIndex); @@ -393,7 +425,8 @@ WelfordOp::WelfordOp( Val* init_N, Val* in_avg, Val* in_var, - Val* in_N) + Val* in_N, + bool is_fused) : Expr(passkey, ExprType::WelfordOp), out_avg_(out_avg), out_var_(out_var), @@ -403,7 +436,8 @@ WelfordOp::WelfordOp( init_N_(init_N), in_avg_(in_avg), in_var_(in_var), - in_N_(in_N) { + in_N_(in_N), + is_fused_(is_fused) { // Check output type TORCH_INTERNAL_ASSERT( out_avg->getValType().value() == ValType::TensorView || @@ -472,7 +506,8 @@ WelfordOp::WelfordOp(const WelfordOp* src, IrCloner* ir_cloner) init_N_(ir_cloner->clone(src->init_N_)), in_avg_(ir_cloner->clone(src->in_avg_)), in_var_(src->in_var_ ? ir_cloner->clone(src->in_var_) : nullptr), - in_N_(ir_cloner->clone(src->in_N_)) {} + in_N_(ir_cloner->clone(src->in_N_)), + is_fused_(src->is_fused_) {} namespace { inline bool sameOptionalVal(Val* a, Val* b) { @@ -495,12 +530,75 @@ bool WelfordOp::sameAs(const Statement* other) const { return false; } +MmaOp::MmaOp( + IrBuilderPasskey passkey, + Val* out, + Val* in_a, + Val* in_b, + Val* init) + : Expr(passkey, ExprType::MmaOp), + out_(out), + in_a_(in_a), + in_b_(in_b), + init_(init) { + // Check output type + TORCH_INTERNAL_ASSERT( + out->getValType().value() == ValType::TensorView || + out->getValType().value() == ValType::TensorIndex); + + TORCH_INTERNAL_ASSERT( + in_a->getValType().value() == ValType::TensorView || + in_a->getValType().value() == ValType::TensorIndex, + in_a->getValType().value()); + + TORCH_INTERNAL_ASSERT( + in_b->getValType().value() == ValType::TensorView || + in_b->getValType().value() == ValType::TensorIndex, + in_b->getValType().value()); + + addOutput(out); + addInput(in_a); + addInput(in_b); +} + +MmaOp::MmaOp( + IrBuilderPasskey passkey, + Val* out, + Val* in_a, + Val* in_b, + Val* init, + MmaOptions options) + : MmaOp(passkey, out, in_a, in_b, init) { + options_ = options; +} + +MmaOp::MmaOp(const MmaOp* src, IrCloner* ir_cloner) + : Expr(src, ir_cloner), + out_(ir_cloner->clone(src->out_)), + in_a_(ir_cloner->clone(src->in_a_)), + in_b_(ir_cloner->clone(src->in_b_)), + init_(ir_cloner->clone(src->init_)), + options_(src->options_) {} + +bool MmaOp::sameAs(const Statement* other) const { + if (this == other) { + return true; + } + if (auto other_mma = dynamic_cast(other)) { + return out_->sameAs(other_mma->out_) && in_a_->sameAs(other_mma->in_a_) && + in_b_->sameAs(other_mma->in_b_) && init_->sameAs(other_mma->init_) && + options_ == other_mma->options_; + } + return false; +} + ReductionOp::ReductionOp(const ReductionOp* src, IrCloner* ir_cloner) : Expr(src, ir_cloner), reduction_op_type_(src->reduction_op_type_), init_(ir_cloner->clone(src->init_)), out_(ir_cloner->clone(src->out_)), - in_(ir_cloner->clone(src->in_)) {} + in_(ir_cloner->clone(src->in_)), + is_fused_(src->is_fused_) {} bool ReductionOp::sameAs(const Statement* other) const { if (this == other) { @@ -697,6 +795,22 @@ int GatherOp::gatherAxis(int axis) const { return 
int(windowShape().size()) + axis; } +ViewDtypeOp::ViewDtypeOp( + IrBuilderPasskey passkey, + TensorView* out, + TensorView* in, + DataType dtype) + : Expr(passkey, ExprType::ViewDtypeOp), out_(out), in_(in), dtype_(dtype) { + addOutput(out); + addInput(in); +} + +ViewDtypeOp::ViewDtypeOp(const ViewDtypeOp* src, IrCloner* ir_cloner) + : Expr(src, ir_cloner), + out_(ir_cloner->clone(src->out_)), + in_(ir_cloner->clone(src->in_)), + dtype_(src->dtype()) {} + ViewOp::ViewOp(IrBuilderPasskey passkey, TensorView* out, TensorView* in) : Expr(passkey, ExprType::ViewOp), out_(out), in_(in) { addOutput(out); @@ -767,7 +881,8 @@ IterDomain::IterDomain(const IterDomain* src, IrCloner* ir_cloner) iter_type_(src->iter_type_), is_rfactor_domain_(src->is_rfactor_domain_), is_padded_dimension_(src->is_padded_dimension_), - padded_to_size_(src->padded_to_size_) {} + padded_to_size_(src->padded_to_size_), + is_mma_swizzled_(src->is_mma_swizzled_) {} bool IterDomain::sameAs(const Statement* other) const { if (other == this) { @@ -978,6 +1093,12 @@ void IterDomain::parallelize(ParallelType t) { extent(), " ."); } + + if (isMmaSwizzled()) { + TORCH_CHECK( + t == ParallelType::Vectorize, + "Parallel type other than vectorize not allowed for warp mapped ids"); + } } bool IterDomain::maybePartial() const { @@ -1314,6 +1435,10 @@ void TensorDomain::split( "Partial split is only allowed with root domains"); } + TORCH_INTERNAL_ASSERT( + !id->isMmaSwizzled(), + "Further transformation on warp mapped id's not allowed."); + auto split_ids = IterDomain::split(id, factor, inner_split, trim_out_of_bounds); domain_.erase(domain_.begin() + axis_); @@ -1349,6 +1474,10 @@ void TensorDomain::merge(int axis_o, int axis_i) { IterDomain* first = axis(axis_o); IterDomain* second = axis(axis_i); + TORCH_INTERNAL_ASSERT( + !first->isMmaSwizzled() && !second->isMmaSwizzled(), + "Further transformation on warp mapped id's not allowed."); + IterDomain* merged_id = IterDomain::merge(first, second); domain_.erase(domain_.begin() + axis_i); diff --git a/torch/csrc/jit/codegen/cuda/ir_utils.cpp b/torch/csrc/jit/codegen/cuda/ir_utils.cpp index 004cfa23dff43c..6abc8cce0723b7 100644 --- a/torch/csrc/jit/codegen/cuda/ir_utils.cpp +++ b/torch/csrc/jit/codegen/cuda/ir_utils.cpp @@ -254,6 +254,21 @@ struct SubstituteInExpr : public OptInDispatch { gather_expr->padWidth()); } + void handle(ViewDtypeOp* view_expr) final { + TORCH_INTERNAL_ASSERT( + substitute_->isA(), + "All args to view must be TensorView, but received a non-TensorView for replacement: ", + substitute_); + auto in = reference_->sameAs(view_expr->in()) + ? substitute_->as() + : view_expr->in(); + auto out = reference_->sameAs(view_expr->out()) + ? substitute_->as() + : view_expr->out(); + expr_ = IrBuilder::create( + view_expr->container(), out, in, view_expr->dtype()); + } + void handle(ViewOp* view_expr) final { TORCH_INTERNAL_ASSERT( substitute_->isA(), @@ -309,7 +324,29 @@ struct SubstituteInExpr : public OptInDispatch { init_N, in_avg, in_var, - in_N); + in_N, + welford_expr->isFused()); + } + + void handle(MmaOp* mma_expr) final { + TORCH_INTERNAL_ASSERT( + substitute_->isA(), + "All args to MmaOp must be TensorView, but received a non-TensorView for replacement: ", + substitute_); + auto in_a = reference_->sameAs(mma_expr->inA()) + ? substitute_->as() + : mma_expr->inA(); + auto in_b = reference_->sameAs(mma_expr->inB()) + ? substitute_->as() + : mma_expr->inB(); + auto out = reference_->sameAs(mma_expr->out()) + ? 
substitute_->as() + : mma_expr->out(); + auto init = reference_->sameAs(mma_expr->init()) + ? substitute_->as() + : mma_expr->init(); + expr_ = IrBuilder::create( + mma_expr->container(), out, in_a, in_b, init, mma_expr->options()); } private: @@ -434,7 +471,7 @@ std::vector allTvs(Fusion* fusion) { return uniqueEntries({used_tvs.begin(), used_tvs.end()}); } -std::vector getReductionOps(Fusion* fusion) { +std::vector getReductionOps(Fusion* fusion, bool ignore_trivial) { std::vector red_ops; for (auto expr : fusion->exprs()) { const Val* out_val = nullptr; @@ -452,8 +489,9 @@ std::vector getReductionOps(Fusion* fusion) { if (std::any_of( out_tv->getRootDomain().begin(), out_tv->getRootDomain().end(), - [](IterDomain* id) { - return id->isReduction() && !id->isTrivialReduction(); + [&ignore_trivial](IterDomain* id) { + return id->isReduction() && + !(ignore_trivial && id->isTrivialReduction()); })) { red_ops.push_back(expr); } @@ -461,6 +499,73 @@ std::vector getReductionOps(Fusion* fusion) { return red_ops; } +namespace { + +class ValReplacementMutator : private OptOutMutator { + public: + ValReplacementMutator( + Fusion* fusion, + const std::unordered_map& replacement_map) + : replacement_map_(replacement_map) { + FusionGuard fg(fusion); + + // Welford makes this a little annoying since it holds a count which is + // typically not used by anything else. If we don't grab that count, then it + // would be a tensorview that doesn't get updated extents. Therefore, first + // grab all leaves towards outputs and grab stmts from there. + auto stmts = StmtSort::getStmts(fusion, allLeafOuts(fusion), true); + for (auto stmt : stmts) { + mutate(stmt); + } + } + + private: + using OptOutMutator::mutate; + void mutate(Val* val) final { + if (replacement_map_.find(val) == replacement_map_.end()) { + return OptOutMutator::mutate(val); + } + auto replaced_val = replacement_map_.at(val); + registerMutation(val, replaced_val); + } + + std::vector allLeafOuts(Fusion* fusion) { + auto exprs = StmtSort::getExprs(fusion, true); + std::unordered_set inputs; + std::unordered_set outputs; + std::vector ordered_outputs; + for (auto expr : exprs) { + inputs.insert(expr->inputs().begin(), expr->inputs().end()); + outputs.insert(expr->outputs().begin(), expr->outputs().end()); + ordered_outputs.insert( + ordered_outputs.end(), + expr->outputs().begin(), + expr->outputs().end()); + } + for (auto input : inputs) { + outputs.erase(input); + } + + std::vector ordered_leaf_outs; + for (auto out : ordered_outputs) { + if (outputs.find(out) != outputs.end()) { + ordered_leaf_outs.push_back(out); + } + } + return ordered_leaf_outs; + } + + const std::unordered_map& replacement_map_; +}; + +} // namespace + +void replaceValue( + Fusion* fusion, + const std::unordered_map& replacement_map) { + ValReplacementMutator(fusion, replacement_map); +} + } // namespace ir_utils } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/ir_utils.h b/torch/csrc/jit/codegen/cuda/ir_utils.h index 1bf3f27ec0b9bc..dd5c9dd13e83ae 100644 --- a/torch/csrc/jit/codegen/cuda/ir_utils.h +++ b/torch/csrc/jit/codegen/cuda/ir_utils.h @@ -12,6 +12,11 @@ namespace fuser { namespace cuda { namespace ir_utils { +// Replace values in fusion using ValReplacementMutator +void replaceValue( + Fusion*, + const std::unordered_map& replacement_map); + template class FilterIterator { public: @@ -178,7 +183,9 @@ TORCH_CUDA_CU_API std::vector outputTvsOf( // returns all tensor views in fusion that are used between outputs and inputs. 
TORCH_CUDA_CU_API std::vector allTvs(Fusion* fusion); -TORCH_CUDA_CU_API std::vector getReductionOps(Fusion* fusion); +TORCH_CUDA_CU_API std::vector getReductionOps( + Fusion* fusion, + bool ignore_trivial = true); } // namespace ir_utils } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/iter_visitor.cpp b/torch/csrc/jit/codegen/cuda/iter_visitor.cpp index 894b40f79e3fa1..34345600b465d4 100644 --- a/torch/csrc/jit/codegen/cuda/iter_visitor.cpp +++ b/torch/csrc/jit/codegen/cuda/iter_visitor.cpp @@ -83,6 +83,10 @@ class RecursiveDependencies : public OptInDispatch { simpleVal(stmt); } + void handle(ComplexDouble* stmt) final { + simpleVal(stmt); + } + void handle(NamedScalar* stmt) final { simpleVal(stmt); } @@ -593,6 +597,9 @@ class DependentVals : public IterVisitor { std::unordered_set outs_; // Boundary where we want to stop searching beyond + // TODO: Based on the todo below, shouldn't we stop just at the definition of? + // If we really wanted to make this traverse left, wouldn't we first check + // which outputs are outputs dependent on of? std::unordered_set boundary_; std::vector next(Val* v) override { @@ -616,6 +623,11 @@ class DependentVals : public IterVisitor { } // optimization to limit search path + // TODO: Is this valid? Couldn't something like: + // out0 = of + val0 + // out1 = out0 + val1 + // out2 = TernaryOp(out1, val0, of) + // Hide the dep of out1 on of? void createBoundary() { for (auto v_of : of_) { for (auto v_expr : v_of->uses()) { diff --git a/torch/csrc/jit/codegen/cuda/kernel.cpp b/torch/csrc/jit/codegen/cuda/kernel.cpp index b9062f5bc458fb..54963709bd1cb5 100644 --- a/torch/csrc/jit/codegen/cuda/kernel.cpp +++ b/torch/csrc/jit/codegen/cuda/kernel.cpp @@ -49,7 +49,7 @@ class KernelIrScanner : private IrVisitor { handle(out); } } - void handle(Sync* sync) final { + void handle(BlockSync* sync) final { // TODO: Move to a dedicated validation pass // which is not on the common execution/compilation path if (sync->isWarHazardSync()) { @@ -57,6 +57,10 @@ class KernelIrScanner : private IrVisitor { } } + void handle(GridSync* sync) final { + summary_.has_cooperative_grid_reduction = true; + } + void handle(Allocate* allocate) final { switch (allocate->memoryType()) { case MemoryType::Global: @@ -276,6 +280,12 @@ void Kernel::finalize(std::vector top_level_exprs) { warp_padded_parallel_info_ = GpuLower::current()->getWarpPaddedParallelInfo(); ValidateAllocation::validate(this); analyze(); + // Make sure this is after analyze as it sets summary_ + summary_.vectorized_accesses = GpuLower::current()->vectorizedAccesses(); + summary_.vectorized_set_info = GpuLower::current()->vectorizedSetInfo(); + summary_.sync_map = GpuLower::current()->syncMap(); + summary_.parallel_dimension_map_ = + GpuLower::current()->parallelDimensionMap(); } void Kernel::analyze() { @@ -345,6 +355,10 @@ void Kernel::registerExpr(Expr* expr) { Fusion::registerExpr(expr); } +std::vector& KernelInternalProxy::topLevelExprs() { + return kernel_->top_level_exprs_; +} + } // namespace kir } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/kernel.h b/torch/csrc/jit/codegen/cuda/kernel.h index 0c8bbdef9dfdfd..4930da1a287212 100644 --- a/torch/csrc/jit/codegen/cuda/kernel.h +++ b/torch/csrc/jit/codegen/cuda/kernel.h @@ -5,8 +5,11 @@ #include #include #include +#include #include +#include #include +#include #include #include @@ -78,12 +81,31 @@ struct KernelSummary { //! Effective ParallelTypes of broadcast ops std::unordered_map broadcast_parallel_types; + + //! 
Track which tensor views are inputs or outputs of a vectorized operation + //! and their maximum vectorized access size + std::unordered_map vectorized_accesses; + + // Sync map is needed to figure out if global memory buffers need to be marked + // as volatile because they're used for communication. + SyncMap sync_map; + + // Parallel dimension map needed to set the correct properties of grid buffers + // (is a dim inactive) + ParallelDimensionMap parallel_dimension_map_; + + //! Track information on vectorized set operations for runtime validation + std::vector vectorized_set_info; }; +class KernelInternalProxy; + //! Container for a lowered Kernel IR //! // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) class TORCH_CUDA_CU_API Kernel final : public Fusion { + friend KernelInternalProxy; + public: // Kernel starts by grabbing all the nodes from the provided fusion. // Kernel is not SSA, if a definition is not set, we should update it, but @@ -91,7 +113,9 @@ class TORCH_CUDA_CU_API Kernel final : public Fusion { // we do something like generate an initialization statement for a reduction // TV, we may want to continue to do fusion like analysis on the original // expression. - Kernel(Fusion* fusion) : Fusion(*fusion) {} + // TODO: Assert index type is int or int32 + Kernel(Fusion* fusion, DataType index_type = DataType::Int) + : Fusion(*fusion), index_type_(index_type) {} Kernel() = delete; @@ -102,8 +126,7 @@ class TORCH_CUDA_CU_API Kernel final : public Fusion { //! Finalize a kernel definition //! //! At this point we have a complete kernel definition and we can - //! run analysis passes to build a KernelSummary - //! + //! run analysis passes to build a KernelSummary. void finalize(std::vector top_level_exprs); const std::vector& topLevelExprs() const { @@ -114,6 +137,10 @@ class TORCH_CUDA_CU_API Kernel final : public Fusion { return summary_; } + DataType indexType() const { + return index_type_; + } + //! Checks if parallel type is padded bool isParallelTypePadded(ParallelType ptype) const { return ptype == ParallelType::TIDx && @@ -140,16 +167,32 @@ class TORCH_CUDA_CU_API Kernel final : public Fusion { // Analyze the kernel IR and caches the summary of interesting data void analyze(); - private: // Top level statements std::vector top_level_exprs_; // Summary of interesting kernel data KernelSummary summary_; + // Is this kernel being compiled with int32 or int64 indexing. This + // information is required to resolve DataType::Index + DataType index_type_ = DataType::Int; + WarpPaddedParallelInfo warp_padded_parallel_info_; }; +//! A special debugging proxy for Kernel. +//! +//! Should not be used for other than testing and debugging. 
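+//!
+//! Minimal usage sketch (illustrative only; `kernel` is assumed to be a
+//! valid kir::Kernel* owned by the test):
+//!   KernelInternalProxy proxy(kernel);
+//!   auto& exprs = proxy.topLevelExprs();  // mutable view of the kernel's
+//!                                         // top-level expressions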
+class TORCH_CUDA_CU_API KernelInternalProxy { + public: + KernelInternalProxy(Kernel* kernel) : kernel_(kernel) {} + + std::vector& topLevelExprs(); + + private: + Kernel* kernel_ = nullptr; +}; + } // namespace kir } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/kernel_cache.cpp b/torch/csrc/jit/codegen/cuda/kernel_cache.cpp index c1c113dbbc4ac3..6a1d50462f957a 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_cache.cpp +++ b/torch/csrc/jit/codegen/cuda/kernel_cache.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -118,7 +119,11 @@ InputsIdLookup::IdLookupReturn InputsIdLookup::lookupId( } FusionExecutorCache::FusionExecutorCache(std::unique_ptr fusion) - : fusion_(std::move(fusion)) {} + : fusion_(std::move(fusion)) { + for (const auto& indices : fusion_->getOutputAliasIndices()) { + aliased_output_indices_.insert(indices); + } +} // Note [ Permutation support in nvfuser ] // @@ -187,6 +192,12 @@ std::vector FusionExecutorCache::runFusionWithInputs( outputs[pair.first] = outputs[pair.first].permute(pair.second); } + int offset = 0; + for (const auto& v : aliased_output_indices_) { + outputs.erase(outputs.begin() + v - offset); + offset++; + } + return outputs; } @@ -634,6 +645,8 @@ void GraphCache::createFusion(const std::shared_ptr& graph) { fusion_executor_cache_ = std::make_unique(parseJitIR(graph)); + + num_of_outputs_ = graph->outputs().size(); } // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) @@ -642,6 +655,8 @@ GraphCache::GraphCache(const std::shared_ptr& graph) { TORCH_INTERNAL_ASSERT( IsNewExecutorEnabled(), "legacy executor is not supported by nvfuser"); + GRAPH_DEBUG("GraphCache constructor: ", this); + GRAPH_DUMP("GraphCache created for graph", graph); createFusion(graph); } @@ -649,7 +664,16 @@ std::vector GraphCache::runGraphWithInputs( const at::ArrayRef& inputs) { FUSER_PERF_SCOPE("GraphCache::runGraphWithInputs"); - return fusion_executor_cache_->runFusionWithInputs(inputs); + GRAPH_DEBUG("running GraphCache: ", this); + auto outputs = fusion_executor_cache_->runFusionWithInputs(inputs); + TORCH_INTERNAL_ASSERT( + outputs.size() == num_of_outputs_, + "FusionExecutorCache returned ", + outputs.size(), + " outputs, doesn't match computational graph, which requires ", + num_of_outputs_); + + return outputs; } } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/kernel_cache.h b/torch/csrc/jit/codegen/cuda/kernel_cache.h index cba42f99dc4c36..71dd6c3592d00a 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_cache.h +++ b/torch/csrc/jit/codegen/cuda/kernel_cache.h @@ -410,6 +410,11 @@ class TORCH_CUDA_CU_API FusionExecutorCache { //! TODO: this can be largely expanded to look at complete //! caching profiles. Currently it just makes it easier to test FusionKernelRuntime* most_recent_runtime_ = nullptr; + + //! indices of fusion outputs that are aliased to inputs. These are used only + //! to support in-place update and should have been dropped before pushing + //! outputs to stack. + std::set aliased_output_indices_; }; class GraphCache { @@ -426,15 +431,15 @@ class GraphCache { const at::ArrayRef& inputs); private: - //! Computation graph; - std::shared_ptr graph_; - //! construct FusionExecutorCache void createFusion(const std::shared_ptr& graph); private: //! FusionExecutorCache that performs schedule and kernel execution; std::unique_ptr fusion_executor_cache_; + + //! 
num of outputs + size_t num_of_outputs_ = 0; }; } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/kernel_ir.cpp b/torch/csrc/jit/codegen/cuda/kernel_ir.cpp index 5d2eb44f8a8cb9..46fdc78aade718 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_ir.cpp +++ b/torch/csrc/jit/codegen/cuda/kernel_ir.cpp @@ -78,13 +78,21 @@ TensorIndex::TensorIndex( } } -Sync::Sync(IrBuilderPasskey passkey, bool war_sync) - : Expr(passkey, ExprType::Sync), war_sync_(war_sync) { +BlockSync::BlockSync(IrBuilderPasskey passkey, bool war_sync) + : Expr(passkey, ExprType::BlockSync), war_sync_(war_sync) { TORCH_INTERNAL_ASSERT( passkey.ir_container_->isA(), "IR type only valid for Kernel container."); } +GridSync::GridSync( + IrBuilderPasskey passkey, + ParallelTypeBitmap sync_dims, + Val* sync_buffer) + : Expr(passkey, ExprType::GridSync), + sync_dims_(sync_dims), + sync_buffer_(sync_buffer) {} + InitMagicZero::InitMagicZero(IrBuilderPasskey passkey) : Expr(passkey, ExprType::InitMagicZero) { TORCH_INTERNAL_ASSERT( @@ -206,7 +214,8 @@ ForLoop::ForLoop(IrBuilderPasskey passkey, IterDomain* iter_domain) nullptr, nullptr, nullptr, - isParallelTypeVectorize(iter_domain->getParallelType()), + !iter_domain->isBroadcast() && + isParallelTypeVectorize(iter_domain->getParallelType()), nullptr, false) { TORCH_INTERNAL_ASSERT( @@ -298,6 +307,51 @@ Val* ForLoop::step() const { return step_; } +bool ForLoop::isTrivial() const { + // These loops are not materialized + if (vectorize() || iter_domain()->isBroadcast() || + iter_domain()->isStride() || iter_domain()->isMma()) { + return true; + } + + // By default, a parallelized loop would look like: + // + // for (int x = threadIdx.x; x < stop; x += blockDim.x) { + // do_some_comp(x); + // } + // + // When stop is guaranteed to be smaller or equal to the number of + // threads, the for-loop is not necessary. In the above case, we + // would just generate the loop body without the for clause but + // references to the loop index replaced by the loop start value. + // + // When the loop end is the same as the IterDomain extent, the + // assumption can be safely made. This is more conservative than + // necessary since the loop stop value just needs to be <= the + // IterDomain extent. However, at this point, this conservative + // analysis seems sufficient. 
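+  //
+  // Illustrative example (not taken from the surrounding code): an
+  // IterDomain parallelized with TIDx whose extent equals blockDim.x would
+  // otherwise produce
+  //   for (int x = threadIdx.x; x < blockDim.x; x += blockDim.x) { ... }
+  // which executes at most once per thread, so only the body is emitted
+  // with the loop index replaced by threadIdx.x.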
+ if (stop() == iter_domain()->extent() && iter_domain()->isThread()) { + return true; + } + + // Extent-1 loop: for (int i = 0; i < 1; ++i) { + if (start()->isZeroInt() && stop()->isOneInt() && step()->isOneInt()) { + return true; + } + + // Another extent-1 loop: for (int i = N - 1; i < N; ++i) { + if (start()->definition() != nullptr && + start()->definition()->isA() && + start()->definition()->as()->getBinaryOpType() == + BinaryOpType::Sub && + start()->definition()->as()->lhs() == stop() && + start()->definition()->as()->rhs()->isOneInt()) { + return true; + } + + return false; +} + IfThenElse::IfThenElse(IrBuilderPasskey passkey, Predicate* cond) : Expr(passkey, ExprType::IfThenElse), then_body_(this), else_body_(this) { setPredicate(cond); @@ -419,6 +473,50 @@ GridWelford::GridWelford( "IR type only valid for Kernel container."); } +AllocateFusedReduction::AllocateFusedReduction( + IrBuilderPasskey passkey, + GridReduction* grid_reduction) + : Expr(passkey, ExprType::AllocateFusedReduction), + grid_expr_(grid_reduction) { + TORCH_INTERNAL_ASSERT( + passkey.ir_container_->isA(), + "IR type only valid for Kernel container."); +} + +AllocateFusedReduction::AllocateFusedReduction( + IrBuilderPasskey passkey, + GridWelford* grid_welford) + : Expr(passkey, ExprType::AllocateFusedReduction), + grid_expr_(grid_welford) { + TORCH_INTERNAL_ASSERT( + passkey.ir_container_->isA(), + "IR type only valid for Kernel container."); +} + +TensorIndex* AllocateFusedReduction::out() const { + TORCH_INTERNAL_ASSERT(grid_expr_ != nullptr); + if (auto grid_reduction = dynamic_cast(grid_expr_)) { + return grid_reduction->reduction_op()->out()->as(); + } else if (auto grid_welford = dynamic_cast(grid_expr_)) { + return grid_welford->welford_op()->out()->as(); + } else { + TORCH_INTERNAL_ASSERT( + false, "Invalid grid expression: ", grid_expr_->toString()); + } +} + +const ParallelTypeBitmap& AllocateFusedReduction::threadPredicate() const { + TORCH_INTERNAL_ASSERT(grid_expr_ != nullptr); + if (auto grid_reduction = dynamic_cast(grid_expr_)) { + return grid_reduction->threadPredicate(); + } else if (auto grid_welford = dynamic_cast(grid_expr_)) { + return grid_welford->threadPredicate(); + } else { + TORCH_INTERNAL_ASSERT( + false, "Invalid grid expression: ", grid_expr_->toString()); + } +} + } // namespace kir } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/kernel_ir.h b/torch/csrc/jit/codegen/cuda/kernel_ir.h index ad6be90bf98a58..bc714e5d87e470 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_ir.h +++ b/torch/csrc/jit/codegen/cuda/kernel_ir.h @@ -52,7 +52,8 @@ class TensorIndex; // Expressions class Allocate; -class Sync; +class BlockSync; +class GridSync; class InitMagicZero; class UpdateMagicZero; class ForLoop; @@ -60,6 +61,7 @@ class IfThenElse; class GridReduction; class GridBroadcast; class GridWelford; +class AllocateFusedReduction; // Expr container class Scope; @@ -143,7 +145,7 @@ class TORCH_CUDA_CU_API TensorIndex final : public Val { public: TensorIndex( IrBuilderPasskey, - const fuser::cuda::TensorView* view, + const TensorView* view, std::vector indices); std::vector::size_type nDims() const { @@ -240,9 +242,9 @@ class TORCH_CUDA_CU_API Allocate final : public Expr { // // TODO(kir): change name to SyncThreads as we could have other barriers. 
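// (BlockSync below is the block-level barrier; device-wide synchronization
// is handled by the separate GridSync expression further down, which goes
// through a global sync buffer and implies a cooperative grid launch.)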
// -class TORCH_CUDA_CU_API Sync final : public Expr { +class TORCH_CUDA_CU_API BlockSync final : public Expr { public: - explicit Sync(IrBuilderPasskey passkey, bool war_sync = false); + explicit BlockSync(IrBuilderPasskey passkey, bool war_sync = false); bool isWarHazardSync() const { return war_sync_; @@ -253,6 +255,28 @@ class TORCH_CUDA_CU_API Sync final : public Expr { bool war_sync_ = false; }; +// Synchronize all blocks in device, implies cooperative group launch is +// required. +class TORCH_CUDA_CU_API GridSync final : public Expr { + public: + explicit GridSync( + IrBuilderPasskey passkey, + ParallelTypeBitmap sync_dims, + Val* sync_buffer); + + ParallelTypeBitmap syncDims() const { + return sync_dims_; + } + + Val* syncBuffer() const { + return sync_buffer_; + } + + private: + ParallelTypeBitmap sync_dims_; + Val* sync_buffer_ = nullptr; +}; + // Simply prints "DEFINE_MAGIC_ZERO" in the code in accordance with magic_zero // in helpers.cu class TORCH_CUDA_CU_API InitMagicZero final : public Expr { @@ -408,6 +432,9 @@ class TORCH_CUDA_CU_API ForLoop final : public Expr { unroll_required_ = true; } + //! True if no actual for-loop is materialized + bool isTrivial() const; + private: //! Returns if a loop could be unrolled. bool isUnrollable() const; @@ -603,6 +630,30 @@ class TORCH_CUDA_CU_API GridWelford final : public Expr { ParallelTypeBitmap thread_predicate_; }; +// Allocate an instance of the fused reduction class. +class TORCH_CUDA_CU_API AllocateFusedReduction final : public Expr { + public: + explicit AllocateFusedReduction( + IrBuilderPasskey passkey, + GridReduction* grid_reduction); + + explicit AllocateFusedReduction( + IrBuilderPasskey passkey, + GridWelford* grid_welford); + + Expr* gridExpr() const { + return grid_expr_; + } + + TensorIndex* out() const; + + const ParallelTypeBitmap& threadPredicate() const; + + private: + //! 
GridReduction or GridWelford + Expr* grid_expr_ = nullptr; +}; + } // namespace kir } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.cpp b/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.cpp index bfc4794e299b4e..a64b07da4a0538 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.cpp +++ b/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.cpp @@ -17,15 +17,18 @@ std::vector IrVisitor::handle(const std::vector& exprs) { void IrVisitor::handle(ForLoop* fl) { for_loops_.push_back(fl); scope_.push_back(&fl->body()); + scope_exprs_.push_back(fl); auto body_exprs = std::vector(fl->body().exprs()); for (auto expr : body_exprs) { handle(expr); } + scope_exprs_.pop_back(); scope_.pop_back(); for_loops_.pop_back(); } void IrVisitor::handle(IfThenElse* ite) { + scope_exprs_.push_back(ite); scope_.push_back(&ite->thenBody()); auto then_exprs = std::vector(ite->thenBody().exprs()); for (auto expr : then_exprs) { @@ -39,10 +42,11 @@ void IrVisitor::handle(IfThenElse* ite) { handle(expr); } scope_.pop_back(); + scope_exprs_.pop_back(); } std::vector ExprMutator::mutate(bool reverse_order) { - if (insertions_.empty() && replacements_.empty()) { + if (insertions_.empty() && replacements_.empty() && removal_.empty()) { return exprs_; } @@ -107,6 +111,22 @@ std::vector ExprMutator::mutate(bool reverse_order) { } } + for (auto removal_info : removal_) { + if (removal_info.scope == nullptr) { + auto pos_it = + std::find(exprs_.begin(), exprs_.end(), removal_info.reference); + TORCH_INTERNAL_ASSERT( + pos_it != exprs_.end(), "Issue finding expression to remove."); + exprs_.erase(pos_it); + } else { + TORCH_INTERNAL_ASSERT( + removal_info.scope->contains(removal_info.reference), + "Expression to remove is not found in the given scope: ", + removal_info.reference->toString()); + removal_info.scope->erase(removal_info.reference); + } + } + insertions_.clear(); replacements_.clear(); @@ -132,8 +152,12 @@ void ExprMutator::registerMutation( mutation.mode = mode; if (mode == MutationMode::BEFORE || mode == MutationMode::AFTER) { insertions_.push_back(mutation); - } else { + } else if (mode == MutationMode::REPLACE) { replacements_.push_back(mutation); + } else if (mode == MutationMode::REMOVE) { + removal_.push_back(mutation); + } else { + TORCH_INTERNAL_ASSERT(false, "Invalid mutation type"); } } @@ -158,6 +182,10 @@ void ExprMutator::registerReplace( registerMutation(reference, new_expr, scope, MutationMode::REPLACE); } +void ExprMutator::registerRemove(Expr* expr_to_remove, Scope* scope) { + registerMutation(expr_to_remove, nullptr, scope, MutationMode::REMOVE); +} + void ExprMutator::registerInsertBefore(Expr* reference, Expr* new_expr) { Scope* scope = scope_.empty() ? nullptr : scope_.back(); registerInsertBefore(reference, new_expr, scope); @@ -173,6 +201,11 @@ void ExprMutator::registerReplace(Expr* reference, Expr* new_expr) { registerReplace(reference, new_expr, scope); } +void ExprMutator::registerRemove(Expr* expr_to_remove) { + Scope* scope = scope_.empty() ? 
nullptr : scope_.back(); + registerRemove(expr_to_remove, scope); +} + } // namespace kir } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h b/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h index 2140498af14009..d665c4a6fdf539 100644 --- a/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h +++ b/torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h @@ -41,14 +41,15 @@ class TORCH_CUDA_CU_API IrVisitor : public OptOutDispatch { protected: std::vector for_loops_; std::vector scope_; + std::vector scope_exprs_; std::vector exprs_; }; // Base Expr Mutator class that visits all nodes with IrVisitor, and then -// inserts new expressions or replaces expressions based on insertion/replace -// maps provided. These replacement maps are expected to accumulate during an -// initial traversal, then runs an insertion based on them after the overloaded -// traversal. +// inserts new expressions, replaces expressions based on insertion/replace +// maps provided or removes existing expressions. These replacement +// maps are expected to accumulate during an initial traversal, then +// runs an insertion based on them after the overloaded traversal. // // Order of mutations may be important, mutations are ordered according to the // following rules: @@ -61,6 +62,8 @@ class TORCH_CUDA_CU_API IrVisitor : public OptOutDispatch { // Before/After insertions are done before Expr replacements, so reference for // insertions must be on pre-replaced Exprs // +// Removal of expressions is done after replacements. +// // To place in a scope that is empty, simply provide a nullptr reference // Since insertions are done in order, it's possible to insert an expression in // an empty scope, and then use that inserted scope as a reference for @@ -79,6 +82,7 @@ class ExprMutator : public IrVisitor { void registerInsertBefore(Expr* reference, Expr* new_expr, Scope* scope); void registerInsertAfter(Expr* reference, Expr* new_expr, Scope* scope); void registerReplace(Expr* reference, Expr* new_expr, Scope* scope); + void registerRemove(Expr* expr_to_remove, Scope* scope); // Registration function which need to be called "in place" during visiting. // I.E. 
@@ -87,9 +91,10 @@ class ExprMutator : public IrVisitor { void registerInsertBefore(Expr* reference, Expr* new_expr); void registerInsertAfter(Expr* reference, Expr* new_expr); void registerReplace(Expr* reference, Expr* new_expr); + void registerRemove(Expr* expr_to_remove); private: - enum class MutationMode { BEFORE, AFTER, REPLACE }; + enum class MutationMode { BEFORE, AFTER, REPLACE, REMOVE }; void registerMutation( Expr* ref, @@ -109,6 +114,9 @@ class ExprMutator : public IrVisitor { // Track replacements as they're registered std::vector replacements_; + + // Track removal as they're registered + std::vector removal_; }; } // namespace kir diff --git a/torch/csrc/jit/codegen/cuda/lower2device.cpp b/torch/csrc/jit/codegen/cuda/lower2device.cpp index 21eb6e02fb8ef0..de54e7b50434d1 100644 --- a/torch/csrc/jit/codegen/cuda/lower2device.cpp +++ b/torch/csrc/jit/codegen/cuda/lower2device.cpp @@ -184,7 +184,7 @@ void GpuLower::collectPaddedParallelDims() { } } -void GpuLower::lower(Fusion* fusion) { +void GpuLower::lower(Fusion* fusion, DataType index_type) { FUSER_PERF_SCOPE("GpuLower::lower"); TORCH_INTERNAL_ASSERT(fusion != nullptr); TORCH_INTERNAL_ASSERT( @@ -199,58 +199,85 @@ void GpuLower::lower(Fusion* fusion) { } } lower_guard(this); // Copy fusion into a new kernel for processing - kernel_ = std::make_unique(fusion); + kernel_ = std::make_unique(fusion, index_type); // Alias the fusion kernel caries around as a view of itself. fusion_ = kernel_.get(); + // Convert tensor views of DataType::Index type to either Int or Int32 + for (auto tv : ir_utils::allTvs(fusion_)) { + if (tv->dtype() == DataType::Index) { + tv->resolveIndexDtype(); + } + } + FusionGuard fg(fusion_); // prepare for lowering validateIr(fusion_); + // Checks if any TIDx dim is marked as padded to a warp. Also checks if we can + // determine the padding is explicitly a single warp. collectPaddedParallelDims(); + // Replaces integers that are tensor sizes by named scalars as "T0.size[0]" replaceSymbolicSizes(fusion_); + // Traverse through reductions and termine if any iteration domains are + // trivial reductions. Add these iteration domains to trivial_reduction_info_ + // which simply holds a map of which axes are trivial and which are not. trivial_reduction_info_.build(fusion_); - trivialReductionReplacement(fusion_, trivialReductionInfo()); + // Replaces trivial reduction expressions (all id's being reduced are trivial) + // with set unary op + trivialReductionReplacement(fusion_, trivial_reduction_info_); // In the future we may directly use this map, but for now it will propagate - // and validate (to some extent) the parallelization strategy. - // This is the first time nodes will be lowered to kir nodes. Since for now we - // propagate the parallel strategy in some instances, we need to do it before - // lowering. + // and validate (to some extent) the parallelization strategy. Map only axes + // to the left of compute at position, forward broadcast in replay. ca_parallel_map_ = ComputeAtMap(ComputeAtMap::MappingMode::PARALLEL); ca_parallel_map_.build(fusion_, current()); - // Want to run this after parallel map is created - validateVectorize(fusion_); - - // Generate mappings to generate indices + // Generate mappings to generate indices. Maps all iteration domains but + // doesn't map any broadcast iteration domains, nor forward them in replay. 
ca_index_map_ = ComputeAtMap(ComputeAtMap::MappingMode::INDEX); ca_index_map_.build(fusion_, current()); - // Generate mappings to generate and map to loop nests + // Generate mappings to generate and map to loop nests. Maps all iteration + // domains, forwards broadcasts, ensures root domain mappings exist (aren't + // replaced in forwarding). ca_loop_map_ = ComputeAtMap(ComputeAtMap::MappingMode::LOOP); ca_loop_map_.build(fusion_, current()); + // Used in parallel dimension map + concretized_broadcast_domains_.build(fusion_); + parallelDimensionMap().build(fusion_); if (isDebugDumpEnabled(DebugDumpOption::ParallelDimensions)) { std::cout << "Parallel dimension map:" << std::endl; std::cout << parallel_dimension_map_.toString() << std::endl; } - concretized_broadcast_domains_.build(fusion_); + // Validate mma data format and compatibility if any on the fusion. + validateMma(fusion_); // Compute thread predicates. Depends on parallel_dimension_map_ thread_pred_map_.build(fusion_); - // Depends on thread_pred_map_ - validateParallelize(fusion_); + // Fuse cetain patterns of reductions, such as a grid reduction + // followed by a grid broadcast. Only depends on parallelization and + // thread predicate map. + fuseReductions(fusion_); // Scan the whole fusion and build mappings about halo extensions of // all IterDomains haloInfo().build(fusion_); + // Want to run this after parallel map and halo info map are + // created. vectorized_accesses_ and vectorized_set_info_ are filled. + validateAndCollectVectorizeInfo(fusion_); + + // Depends on thread_pred_map_, validates parallelization collects which + // tensor views need WAR or RAW syncs + sync_map_.build(fusion_); + partialSplitMap().build(fusion_); validatePartialSplit(fusion_); @@ -312,14 +339,20 @@ void GpuLower::lower(Fusion* fusion) { const auto exprs_conditional_loops = generateConditionalFromPredicate(exprs_with_fused_broadcast); + const auto exprs_common_index_allocated = + allocateCommonIndices(exprs_conditional_loops); + // Insert fake zero updates to make sure nvrtc doesn't blow out register use // on index and predicate reuse - const auto exprs_register_adjusted = insertMagicZero(exprs_conditional_loops); + const auto exprs_register_adjusted = + insertMagicZero(exprs_common_index_allocated); const auto exprs_cleaned_up_loops = KIRCleaner::cleanUp(exprs_register_adjusted); - // We now have the lowered expressions, finalize the kernel IR + // We now have the lowered expressions, finalize the kernel IR. This function + // will also copy over some relevant information for code generation from + // GpuLower. kernel_->finalize(exprs_cleaned_up_loops); } diff --git a/torch/csrc/jit/codegen/cuda/lower2device.h b/torch/csrc/jit/codegen/cuda/lower2device.h index b97c6ac18373c3..6273c0e2d6a9ba 100644 --- a/torch/csrc/jit/codegen/cuda/lower2device.h +++ b/torch/csrc/jit/codegen/cuda/lower2device.h @@ -8,8 +8,11 @@ #include #include #include +#include +#include #include #include +#include #include #include #include @@ -18,9 +21,12 @@ #include #include #include +#include #include #include +#include +#include namespace torch { namespace jit { @@ -38,9 +44,12 @@ class TORCH_CUDA_CU_API GpuLower : public NonCopyable { public: GpuLower() = delete; + // GpuLower lowers the provided fusion into a kernel which can be translated + // into cuda code. index_type allows to compile the kernel based on int32 + // indexing instead of int64 for additional performance. 
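+  //
+  // Minimal sketch of the intended call site (illustrative; `fusion` is
+  // assumed to be a valid, scheduled Fusion*):
+  //   GpuLower gpu_lower(fusion, DataType::Int);
+  //   kir::Kernel* kernel = gpu_lower.kernel();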
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) - explicit GpuLower(Fusion* fusion) { - lower(fusion); + explicit GpuLower(Fusion* fusion, DataType index_type = DataType::Int) { + lower(fusion, index_type); } kir::Kernel* kernel() const; @@ -57,6 +66,12 @@ class TORCH_CUDA_CU_API GpuLower : public NonCopyable { return thread_pred_map_; } + // Returns non-const reference. Necessary to reset a predicate flag + // when a broadcast expression is fused into a reduction. + ThreadPredicateMap& threadPredMap() { + return thread_pred_map_; + } + const ComputeAtMap& caLoopMap() const { return ca_loop_map_; } @@ -125,8 +140,36 @@ class TORCH_CUDA_CU_API GpuLower : public NonCopyable { return double_buffer_info_; } + CommonIndexMap& commonIndexMap() { + return common_index_map_; + } + + const auto& vectorizedAccesses() const { + return vectorized_accesses_; + } + + auto& vectorizedAccesses() { + return vectorized_accesses_; + } + + const auto& vectorizedSetInfo() const { + return vectorized_set_info_; + } + + auto& vectorizedSetInfo() { + return vectorized_set_info_; + } + + FusedReductionInfo& fusedReductionInfo() { + return fused_reduction_info_; + } + + const SyncMap& syncMap() const { + return sync_map_; + } + private: - void lower(Fusion* fusion); + void lower(Fusion* fusion, DataType index_type); // Goes through the parallelized iterdomains of the used TVs and find // the parallel dimensions that need to be padded to a multiples of @@ -152,6 +195,16 @@ class TORCH_CUDA_CU_API GpuLower : public NonCopyable { PartialSplitMap partial_split_map_; NonDivisibleSplitInfo non_divisible_split_info_; DoubleBufferInfo double_buffer_info_; + CommonIndexMap common_index_map_; + FusedReductionInfo fused_reduction_info_; + SyncMap sync_map_; + + // Track which tensor views are inputs or outputs of a vectorized operation + // and their maximum vectorized access size + // std::unordered_map vectorized_accesses_; + std::unordered_map vectorized_accesses_; + // Info on each vectorized set op + std::vector vectorized_set_info_; Fusion* fusion_ = nullptr; }; diff --git a/torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp b/torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp index 17a2db069d865c..32da48bf51417a 100644 --- a/torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp @@ -920,6 +920,31 @@ class AllocateReuseModifier { continue; } + if (alloc_info->alloc_expr->buffer()->isA()) { + if (!alloc_info->alloc_expr->buffer()->isA()) { + continue; + } + auto this_tv = alloc_info->alloc_expr->buffer()->as(); + auto reuse_tv = alloc_info->alloc_expr->buffer()->as(); + // Check that either both tv's are vectorized acceses, or neither are. 
+ // Vectorized allocations require correct alignment so they can only + // alias with other allocations with the right alignment + const auto& va = GpuLower::current()->vectorizedAccesses(); + if ((va.find(this_tv) == va.end()) != + (va.find(reuse_tv) == va.end())) { + return false; + } + + // Shared memory is all aligned to 128 bits, local memory might not be + if (this_tv->getMemoryType() == MemoryType::Local && + va.find(this_tv) != va.end()) { + // Make sure alignment matches + if (va.at(this_tv) != va.at(reuse_tv)) { + return false; + } + } + } + // TODO: // Outer interval based sharing supports arbitrary re-indexing into // the same buffer and would require additional syncs if fully diff --git a/torch/csrc/jit/codegen/cuda/lower_allocation.cpp b/torch/csrc/jit/codegen/cuda/lower_allocation.cpp index c03848ccff86e9..bb2f8b173fdee0 100644 --- a/torch/csrc/jit/codegen/cuda/lower_allocation.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_allocation.cpp @@ -453,6 +453,8 @@ class AllocationInserter : public kir::ExprMutator { default_val == nullptr, "Reduction should not have a default initialization value for predicate elimination."); init = expr->as()->init(); + } else if (expr->isA()) { + init = expr->as()->init(); } else if (expr->isA()) { TORCH_INTERNAL_ASSERT( default_val == nullptr, diff --git a/torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp b/torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp index c8110413de7430..571ba62a545baf 100644 --- a/torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp @@ -407,7 +407,7 @@ class DoubleBufferInserter : private kir::ExprMutator { // RAW sync is not inserted for double buffered tensors. The only // exception is the prologue load. if (write_to_smem) { - auto sync = IrBuilder::create(); + auto sync = IrBuilder::create(); registerInsertBefore(double_buffer_loop, sync); } diff --git a/torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp b/torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp index 84c72c08185d7b..cd5a589f13ad6e 100644 --- a/torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp @@ -683,9 +683,9 @@ struct LocalDomainSorter { // Return if id0 should be before id1 inline bool operator()(IterDomain* id0, IterDomain* id1) { auto concrete_id_0 = - GpuLower::current()->caLoopMap().getConcreteMappedID(id0); + GpuLower::current()->caParallelMap().getConcreteMappedID(id0); auto concrete_id_1 = - GpuLower::current()->caLoopMap().getConcreteMappedID(id1); + GpuLower::current()->caParallelMap().getConcreteMappedID(id1); if (concrete_id_dependencies_.find(concrete_id_0) != concrete_id_dependencies_.end()) { @@ -840,7 +840,7 @@ ExprGroup* ExprSegmentationSorter::makeMergedNode( if (producer_of_consumer_edge->isA()) { auto tv = producer_of_consumer_edge->as(); for (const auto tv_i : c10::irange(tv->getComputeAtPosition())) { - ca_ids.emplace(GpuLower::current()->caLoopMap().getConcreteMappedID( + ca_ids.emplace(GpuLower::current()->caParallelMap().getConcreteMappedID( tv->axis(tv_i))); } } @@ -855,7 +855,7 @@ ExprGroup* ExprSegmentationSorter::makeMergedNode( if (consumer_of_producer_edge->isA()) { auto tv = consumer_of_producer_edge->as(); for (const auto tv_i : c10::irange(tv->getMaxProducerPosition())) { - pa_ids.emplace(GpuLower::current()->caLoopMap().getConcreteMappedID( + pa_ids.emplace(GpuLower::current()->caParallelMap().getConcreteMappedID( tv->axis(tv_i))); } } @@ -866,7 +866,7 @@ ExprGroup* ExprSegmentationSorter::makeMergedNode( auto 
ordered_ids = getLocalDomainOrdering( joined_groups->exprs(), - GpuLower::current()->caLoopMap(), + GpuLower::current()->caParallelMap(), all_ca_pa_ids, concrete_id_dependencies); @@ -914,7 +914,7 @@ bool canReducePA(ExprGroup* group) { // it can't decide if it can be reduced bool has_matching_pa = false; for (const auto i : c10::irange(consumer_tv->getMaxProducerPosition())) { - if (GpuLower::current()->caLoopMap().areMapped( + if (GpuLower::current()->caParallelMap().areMapped( consumer_tv->axis(i), group_pa_last_id)) { has_matching_pa = true; break; @@ -931,7 +931,7 @@ bool canReducePA(ExprGroup* group) { static_cast(producer_tv->getComputeAtPosition()); producer_pos_i > 0; producer_pos_i--) { - if (GpuLower::current()->caLoopMap().areMapped( + if (GpuLower::current()->caParallelMap().areMapped( producer_tv->axis(producer_pos_i - 1), group_pa_last_id)) { return false; } @@ -1027,7 +1027,7 @@ void ExprSegmentationSorter::initializeForLoopDependencies() { tv_id_i--) { auto tv_id = tv->axis((int)(tv_id_i - 1)); auto concrete_id = - GpuLower::current()->caLoopMap().getConcreteMappedID(tv_id); + GpuLower::current()->caParallelMap().getConcreteMappedID(tv_id); if (concrete_id_dependencies.find(concrete_id) == concrete_id_dependencies.end()) { @@ -1039,7 +1039,7 @@ void ExprSegmentationSorter::initializeForLoopDependencies() { // Loops after tv_id are dependent on tv_id dependencies.emplace( - GpuLower::current()->caLoopMap().getConcreteMappedID(tv_id)); + GpuLower::current()->caParallelMap().getConcreteMappedID(tv_id)); } } @@ -1067,27 +1067,62 @@ void ExprSegmentationSorter::initializeForLoopDependencies() { std::back_inserter(to_visit), [](const auto& concrete_dep_entry) { return concrete_dep_entry.first; }); + size_t inf_loop_counter = to_visit.size(); + bool failed = false; + while (!to_visit.empty()) { auto id = to_visit.front(); to_visit.pop_front(); + if (inf_loop_counter-- == 0) { + failed = true; + break; + } + auto& dependencies = concrete_id_dependencies.at(id); - bool ready = std::all_of( - dependencies.begin(), dependencies.end(), [&visited](IterDomain* id) { - return visited.count(id); - }); + bool ready = dependencies.empty() || + std::all_of(dependencies.begin(), + dependencies.end(), + [&visited](IterDomain* id) { return visited.count(id); }); if (!ready) { to_visit.push_back(id); continue; } + inf_loop_counter = to_visit.size(); + for (auto dependency : dependencies) { auto dep_of_dep = concrete_id_dependencies.at(dependency); dependencies.insert(dep_of_dep.begin(), dep_of_dep.end()); } visited.emplace(id); } + if (failed) { + std::cerr + << "ERROR: Iteration domain sorting has failed, infinite loop detected." 
+ << std::endl; + std::cerr << "Failed to sort out: " << std::endl; + for (auto entry : to_visit) { + std::cerr << entry->toString(); + if (entry != to_visit.back()) { + std::cerr << ", "; + } + } + + std::cerr << "Depdencies: " << std::endl; + for (const auto& dep_entry : concrete_id_dependencies) { + std::cerr << " Deps of " << dep_entry.first->toString() << std::endl + << " "; + + for (auto dep : dep_entry.second) { + std::cerr << dep->toString() << ", "; + } + std::cerr << std::endl; + } + + TORCH_INTERNAL_ASSERT(false); + } } // Checks if the for loop associated with the concrete ID is ready to be @@ -1145,7 +1180,7 @@ bool ExprSegmentationSorter::supportedMerge(ExprGroup* sg1, ExprGroup* sg2) { return false; } - const auto& loop_map = GpuLower::current()->caLoopMap(); + const auto& parallel_map = GpuLower::current()->caParallelMap(); // If inner loop dependencies have not been resolved, cannot merge. if (!loopReady(producer_ca_domain.back()) || @@ -1182,11 +1217,11 @@ bool ExprSegmentationSorter::supportedMerge(ExprGroup* sg1, ExprGroup* sg2) { continue; } - if (!loop_map.areMapped(compute_at_dim, producer_ca_domain.back())) { + if (!parallel_map.areMapped(compute_at_dim, producer_ca_domain.back())) { continue; } - if (loop_map.areMapped(compute_at_dim, consumer_pa_domain.back())) { + if (parallel_map.areMapped(compute_at_dim, consumer_pa_domain.back())) { return true; } } diff --git a/torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp b/torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp new file mode 100644 index 00000000000000..cf6458ea0980c3 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp @@ -0,0 +1,312 @@ +#include +#include +#include +#include + +#include + +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +namespace { + +//! An instance of reduction patterns to fuse +class FusedReductionBroadcastInfo : public PolymorphicBase { + public: + FusedReductionBroadcastInfo(ReductionOp* reduction, bool with_broadcast) + : reductions_({reduction}), with_broadcast_({with_broadcast}) {} + + FusedReductionBroadcastInfo(WelfordOp* welford, bool with_broadcast) + : reductions_({welford}), with_broadcast_({with_broadcast}) {} + + const std::vector& reductions() const { + return reductions_; + } + + const std::vector& withBroadcast() const { + return with_broadcast_; + } + + private: + // Holds ReductionOp or WelfordOp. Can be multiple in the case of + // horizontal fusion + std::vector reductions_; + // True each reduction also broadcasts + std::vector with_broadcast_; +}; + +//! Inspect a fusion to detect eligible sequences of expressions to +//! use the fused reduction kernel +class FusionInspector : private IterVisitor { + public: + static std::vector run(Fusion* fusion) { + FusionInspector inspector(fusion); + return inspector.fusion_list_; + } + + private: + FusionInspector(Fusion* fusion) { + traverse(fusion); + } + + using IterVisitor::handle; + + void handle(ReductionOp* rop) final { + /// If it's a grid reduction, keep track of tensors that depend on + /// this reduction. + // Only consider when out is on register as that is assumed in the + // fused reduction kernel. + auto out = rop->out()->as(); + if (out->getMemoryType() == MemoryType::Local && + out->domain()->hasGridReduction()) { + reduction_dep_[out].insert(rop); + } + } + + void handle(WelfordOp* wop) final { + /// If it's a grid reduction, keep track of tensors that depend on + /// this reduction. 
+ // Only consider when out is on register as that is assumed in the + // fused reduction kernel. + auto out = wop->out()->as(); + if (out->getMemoryType() == MemoryType::Local && + out->domain()->hasGridReduction()) { + reduction_dep_[out].insert(wop); + } + } + + void handle(Expr* expr) final { + IterVisitor::handle(expr); + for (auto in_tv : ir_utils::filterByType(expr->inputs())) { + for (auto reduction_op : reduction_dep_[in_tv]) { + if (fused_exprs_.find(reduction_op) != fused_exprs_.end()) { + continue; + } + for (auto out_tv : + ir_utils::filterByType(expr->outputs())) { + reduction_dep_[out_tv].insert(reduction_op); + } + } + } + } + + // In the case of welford, use the fused broadcast reduction when at + // least one of the outputs is broadcast. + void handle(BroadcastOp* bop) final { + // Detect a pattern where a reduction is followed by a broadcast + auto bop_out = bop->out()->as(); + auto bop_in = bop->in()->as(); + + for (Expr* preceding_expr : reduction_dep_[bop_in]) { + auto parallel_reduction_axes = + getReductionParallelTypeStates(preceding_expr); + + // If not matching, propagate the reduction further down to + // subsequent expressions + if (!isBroadcastFuseable(bop_out, parallel_reduction_axes)) { + continue; + } + + if (fused_exprs_.find(preceding_expr) != fused_exprs_.end()) { + // Already added to the fusion list. This can happen with + // welford as there can be multiple broadcast consumer + // expressions. + continue; + } + + if (preceding_expr->isA()) { + fusion_list_.emplace_back(preceding_expr->as(), true); + } else { + fusion_list_.emplace_back(preceding_expr->as(), true); + } + + fused_exprs_.insert(preceding_expr); + } + } + + ParallelTypeBitmap getReductionParallelTypeStates(Expr* expr) { + ParallelTypeBitmap parallel_reduction_axes; + + for (auto id : ir_utils::getTvOutput(expr)->domain()->domain()) { + auto pt = id->getParallelType(); + if (id->isReduction() && isParallelTypeThread(pt)) { + parallel_reduction_axes.set(pt); + } + } + + return parallel_reduction_axes; + } + + // Requires reduction parallel dimensions to exactly match parallel broadcast + // dimensions + bool isBroadcastFuseable( + TensorView* broadcast_out, + const ParallelTypeBitmap& parallel_reduction_axes) { + const auto broadcast_parallel_types = + GpuLower::current()->threadPredMap().getParallelBroadcastDomains( + broadcast_out); + + // If no parallel broadcast, nothing to fuse + if (broadcast_parallel_types.none()) { + return false; + } + + // Make sure the broadcast parallel types are the types reduced by + // the preceding reduction op + for (auto id : broadcast_out->domain()->domain()) { + auto pt = id->getParallelType(); + if (!isParallelTypeThread(pt)) { + continue; + } + // Parallel broadcast must be included in reduction_states + if (id->isBroadcast() && broadcast_parallel_types.get(pt)) { + if (!parallel_reduction_axes.get(pt)) { + return false; + } + } + } + + return true; + } + + private: + //! List of expression sequences to fuse + std::vector fusion_list_; + //! Keep track of fused reduction/welford exprs to avoid duplication + std::unordered_set fused_exprs_; + //! Keep track of ReductionOp/WelfordOp expressions that are + //! (indirectly) input to a tensor + std::unordered_map> reduction_dep_; +}; + +//! Transform a fusion to use the fused reduction kernel. 
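+//!
+//! For each detected pattern the original ReductionOp or WelfordOp is
+//! re-created with its fused flag set, and when a matching parallel
+//! broadcast was found, the reduced IterDomains are marked as allreduce so
+//! the broadcast effectively lowers to a plain set operation.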
+class FusionTransformer { + public: + static void run( + Fusion* fusion, + const std::vector& fusion_list) { + FusionTransformer transformer(fusion, fusion_list); + } + + private: + FusionTransformer( + Fusion* fusion, + const std::vector& fusion_list) + : fusion_(fusion), fusion_list_(fusion_list) { + transform(); + } + + void transform() { + for (const auto& info : fusion_list_) { + transform(info); + } + // If the thread predicate map is modified, rebuild the + // map. build() only updates mappings that need to be updated. + if (thread_pred_map_modified_) { + GpuLower::current()->threadPredMap().build(fusion_); + } + } + + void transform(const FusedReductionBroadcastInfo& info) { + TORCH_INTERNAL_ASSERT( + info.reductions().size() == 1, "Horizontal fusion not supported yet"); + + for (const auto i : c10::irange(info.reductions().size())) { + const auto expr = info.reductions().at(i); + const auto with_broadcast = info.withBroadcast().at(i); + Expr* fused_expr = nullptr; + + if (auto reduction = dynamic_cast(expr)) { + TORCH_INTERNAL_ASSERT(!reduction->isFused()); + + auto red_op_type = reduction->getReductionOpType(); + auto init = reduction->init(); + auto out = reduction->out(); + auto in = reduction->in(); + + fusion_->removeExpr(reduction); + + fused_expr = + IrBuilder::create(red_op_type, init, out, in, true); + } else if (auto welford = dynamic_cast(expr)) { + TORCH_INTERNAL_ASSERT(!welford->isFused()); + + auto out_avg = welford->outAvg(); + auto out_var = welford->outVar(); + auto out_n = welford->outN(); + auto init_avg = welford->initAvg(); + auto init_var = welford->initVar(); + auto init_n = welford->initN(); + auto in_avg = welford->inAvg(); + auto in_var = welford->inVar(); + auto in_n = welford->inN(); + + fusion_->removeExpr(welford); + + fused_expr = IrBuilder::create( + out_avg, + out_var, + out_n, + init_avg, + init_var, + init_n, + in_avg, + in_var, + in_n, + true); + } + + TORCH_INTERNAL_ASSERT(fused_expr != nullptr); + + // Do not just remove the broadcast but just reset the thread + // predicate of the broadcast op. Since fusion is applied only + // when all parallel broadcast domains are to be parallel + // reduction, all parallel types can be reset. + if (with_broadcast) { + // It may be just fine to remove the broadcast expr, but + // technically speaking that would violate the root domain mapping + // as broadcast domains would appear in the consumer of the + // broadcast output tensor without a broadcast expression. 
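+      //
+      // Instead, each reduction IterDomain of the fused expression is
+      // recorded as an allreduce domain and the reduction output is marked
+      // as updated, so the thread predicate map is rebuilt once at the end
+      // of transform().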
+ for (auto reduction_out : + ir_utils::filterByType(fused_expr->outputs())) { + for (auto id : reduction_out->domain()->domain()) { + if (id->isReduction()) { + GpuLower::current()->fusedReductionInfo().markAsAllreduce(id); + GpuLower::current()->threadPredMap().markAsUpdated(reduction_out); + thread_pred_map_modified_ = true; + } + } + } + } + } + } + + private: + Fusion* fusion_ = nullptr; + const std::vector& fusion_list_; + bool thread_pred_map_modified_ = false; +}; + +} // namespace + +void fuseReductions(Fusion* fusion) { + auto fusion_list = FusionInspector::run(fusion); + FusionTransformer::run(fusion, fusion_list); +} + +void FusedReductionInfo::markAsAllreduce(IterDomain* id) { + allreduce_ids_.insert(id); +} + +bool FusedReductionInfo::isAllreduce(IterDomain* id) const { + return allreduce_ids_.find(id) != allreduce_ids_.end(); +} + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/lower_fused_reduction.h b/torch/csrc/jit/codegen/cuda/lower_fused_reduction.h new file mode 100644 index 00000000000000..97cd5f66086752 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/lower_fused_reduction.h @@ -0,0 +1,34 @@ +#pragma once + +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +//! Keep track of certain patterns of reductions. +//! +//! - Allreduce IterDomain: reduced and broadcast domain. +class FusedReductionInfo { + public: + void markAsAllreduce(IterDomain* id); + + bool isAllreduce(IterDomain* id) const; + + private: + // Reduction IterDomains that are also broadcast + std::unordered_set allreduce_ids_; +}; + +//! Detect reductions and broadcasts that are eligible for the fused +//! reduction kernel. When found, the predicate flags of the broadcast +//! is unset, which effectively makes the broadcast just a unary set +//! op. +//! TODO: Consider moving the warp-based fused reduction here. +void fuseReductions(Fusion*); + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp b/torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp index fa84d1006a16b8..dd4a06dfb3f829 100644 --- a/torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp @@ -91,6 +91,15 @@ class UnaryOpInserter : private kir::ExprMutator { gop, IrBuilder::create(container, UnaryOpType::Set, out, in)); } + void handle(ViewDtypeOp* vop) final { + auto out = vop->out(); + auto in = vop->in(); + auto container = out->container(); + registerReplace( + vop, + IrBuilder::create(container, UnaryOpType::EraseType, out, in)); + } + void handle(ViewOp* vop) final { auto out = vop->out(); auto in = vop->in(); diff --git a/torch/csrc/jit/codegen/cuda/lower_index.cpp b/torch/csrc/jit/codegen/cuda/lower_index.cpp index b0ef14079c436d..5db1999a3a0635 100644 --- a/torch/csrc/jit/codegen/cuda/lower_index.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_index.cpp @@ -37,6 +37,11 @@ void IndexLowering::pushBack(Expr* expr) { } } +void IndexLowering::insertAtTopLevel(Expr* expr) { + TORCH_INTERNAL_ASSERT(!lowered_exprs_.empty()); + lowered_exprs_.insert(lowered_exprs_.end() - 1, expr); +} + void IndexLowering::handle(const kir::IfThenElse* ite) { const auto prev_scope = active_scope_; @@ -101,7 +106,11 @@ namespace { // Get the size of the temporary work buffer for grid communication, this can be // grid reduction, broadcast, or grid welford. 
-Val* getGridCommWorkBufferSize(const TensorDomain* td) { +// expansion_factor can be optionally passed to expand the allocation +// size. For example, FusedReduction should double the work buffer size. +Val* getGridCommWorkBufferSize( + const TensorDomain* td, + int expansion_factor = 1) { // The buffer size is the number of thread blocks multiplied by the // number of threads not used for reduction domains. // Note: Previously it was calculated based on the shape of the @@ -111,7 +120,11 @@ Val* getGridCommWorkBufferSize(const TensorDomain* td) { // size if the parallel dimensions are exact, but otherwise, just // computing the buffer size based on the tensor shape isn't // sufficient since there could be extra threads/blocks. - Val* buffer_size = GpuLower::current()->kernel()->oneVal(); + TORCH_INTERNAL_ASSERT( + expansion_factor >= 1, "Invalid expansion factor: ", expansion_factor); + Val* buffer_size = expansion_factor == 1 + ? GpuLower::current()->kernel()->oneVal() + : IrBuilder::create(expansion_factor); for (auto pt : kParallelTypeThreads) { auto pt_dim = GpuLower::current()->parallelDimensionMap().get(pt); if (pt_dim == nullptr || pt_dim->isOneInt()) { @@ -172,89 +185,122 @@ void IndexLowering::handle(const ReductionOp* rop) { const auto out_tv = rop->out()->as(); const auto out_domain = out_tv->domain(); - const bool is_block_reduce = out_domain->hasBlockReduction(); - const bool is_grid_reduce = out_domain->hasGridReduction(); - - // If we do a grid reduction we can't have a reduction axis that is not bound - // to a grid or block dim () - if (is_grid_reduce) { - TORCH_INTERNAL_ASSERT( - std::none_of( - out_domain->domain().begin(), - out_domain->domain().end(), - [](IterDomain* id) { - return !id->isThread() && id->isReduction() && - !id->extent()->isOneInt(); - }), - "Found a reduction stage that has both a non-parallelized ", - "reduction and a grid reduction. This is not supported, ", - "please use rfactor to do the serialized reduction first, ", - "then the grid reduction."); - } + const bool has_block_reduce = out_domain->hasBlockReduction(); + const bool has_grid_reduce = out_domain->hasGridReduction(); const auto out = lowerDstIndex(rop->out()); const auto in = lowerSrcIndex(rop->in(), rop->out()); - ReductionOp* block_reduction_op = nullptr; + // Serial reduction + if (!has_block_reduce && !has_grid_reduce) { + pushBack( + IrBuilder::create(rop->getReductionOpType(), out, out, in)); + return; + } - if (is_block_reduce) { - block_reduction_op = IrBuilder::create( - rop->getReductionOpType(), rop->init(), out, in); - if (rop->predicate()) { - block_reduction_op->setPredicate(rop->predicate()); - } - if (rop->writePredicate()) { - block_reduction_op->setWritePredicate(rop->writePredicate()); - } - pushBack(block_reduction_op); + ReductionOp* indexed_rop = IrBuilder::create( + rop->getReductionOpType(), rop->init(), out, in, rop->isFused()); + if (rop->predicate()) { + indexed_rop->setPredicate(rop->predicate()); + } + if (rop->writePredicate()) { + indexed_rop->setWritePredicate(rop->writePredicate()); } - if (is_grid_reduce) { - const auto reduce_buffer = allocGlobalBufferForGridComm( - getGridCommWorkBufferSize(out_domain), out->dtype(), false); - - const auto sync_buffer = allocGlobalBufferForGridComm( - getGridSyncBufferSize(out_domain), DataType::Int, true); - - const auto grid_reduction_op = (block_reduction_op == nullptr) - ? 
IrBuilder::create( - rop->getReductionOpType(), rop->init(), out, in) - : block_reduction_op; - - // The thread predicate for GridReduction needs to be set - // separately from the main predicate. Do not combine them like - // other expressions. - const auto& thread_pred = - GpuLower::current()->threadPredMap().getPredicatedParallelTypes(out_tv); - auto grid_reduction = IrBuilder::create( - grid_reduction_op, reduce_buffer, sync_buffer); - grid_reduction->setThreadPredicate(thread_pred); - - if (rop->predicate()) { - // If preceded by a blockReduce, all thread blocks should have - // valid inputs to gridReduce. In fact, using the original - // predicate does not work when the write predicate of the - // blockReduce is different from the read predicate. - if (is_block_reduce) { - grid_reduction->setPredicate(IrBuilder::create( - GpuLower::current()->kernel()->trueVal())); - } else { - grid_reduction->setPredicate(rop->predicate()); - } - } + // If not grid reduction, just append the new ReductionOp node + if (!has_grid_reduce) { + pushBack(indexed_rop); + return; + } + + handleGridReduction(indexed_rop); +} + +void IndexLowering::handleGridReduction(ReductionOp* indexed_rop) { + const auto out_tv = indexed_rop->out()->as()->view(); + const auto out_domain = out_tv->domain(); + + TORCH_INTERNAL_ASSERT(out_domain->hasGridReduction()); + + // If we do a grid reduction we can't have a reduction axis that is not bound + // to a grid or block dim. + TORCH_INTERNAL_ASSERT( + std::none_of( + out_domain->domain().begin(), + out_domain->domain().end(), + [](IterDomain* id) { + return !id->isThread() && id->isReduction() && + !id->extent()->isOneInt(); + }), + "Found a reduction stage that has both a non-parallelized ", + "reduction and a grid reduction. This is not supported, ", + "please use rfactor to do the serialized reduction first, ", + "then the grid reduction."); + + // When using the fused reduction in a loop, the global work buffer + // is double buffered to save global synchronizations. + auto is_within_a_loop = std::any_of( + out_domain->domain().begin(), + out_domain->domain().end(), + [](IterDomain* id) { return !isTrivialIterDomain(id); }); + + const auto reduce_buffer = allocGlobalBufferForGridComm( + getGridCommWorkBufferSize( + out_domain, indexed_rop->isFused() && is_within_a_loop ? 2 : 1), + indexed_rop->out()->dtype(), + false); + + const auto sync_buffer = allocGlobalBufferForGridComm( + getGridSyncBufferSize(out_domain), DataType::Int, true); - if (rop->writePredicate()) { - grid_reduction->setWritePredicate(rop->writePredicate()); + const bool block_reduce_separated = + out_domain->hasBlockReduction() && !indexed_rop->isFused(); + + // The thread predicate for GridReduction needs to be set + // separately from the main predicate. Do not combine them like + // other expressions. + const auto& thread_pred = + GpuLower::current()->threadPredMap().getPredicatedParallelTypes(out_tv); + + auto grid_reduction = IrBuilder::create( + indexed_rop, reduce_buffer, sync_buffer); + + grid_reduction->setThreadPredicate(thread_pred); + + // If preceded by a blockReduce, all thread blocks should have + // valid inputs to gridReduce. In fact, using the original + // predicate does not work when the write predicate of the + // blockReduce is different from the read predicate. 
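Here the grid-communication work buffer is sized with an expansion factor of 2 whenever the fused (allreduce) reduction sits inside a non-trivial loop, so consecutive iterations can alternate between the two halves of the buffer rather than paying an extra grid-wide synchronization to reclaim it. A rough, hedged model of that sizing (the real code walks the parallel dimension map; the extents below are placeholders):

```cpp
// Rough model: one work-buffer slot per participating block/thread along the
// non-reduced parallel dimensions, doubled when the fused reduction runs
// inside a loop so iteration i+1 can fill its half while iteration i's half
// is still being read.
#include <cstdint>
#include <vector>

int64_t toyGridCommWorkBufferSize(
    const std::vector<int64_t>& active_parallel_extents,
    bool fused_reduction_inside_loop) {
  int64_t size = fused_reduction_inside_loop ? 2 : 1; // expansion_factor
  for (int64_t extent : active_parallel_extents) {
    if (extent > 1) {
      size *= extent;
    }
  }
  return size;
}
```

`handleGridWelford` below applies the same doubling to its avg/var/N buffers.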
+ if (indexed_rop->predicate()) { + if (block_reduce_separated) { + grid_reduction->setPredicate(IrBuilder::create( + GpuLower::current()->kernel()->trueVal())); + } else { + grid_reduction->setPredicate(indexed_rop->predicate()); } + } - pushBack(reduce_buffer); - pushBack(sync_buffer); - pushBack(grid_reduction); + if (indexed_rop->writePredicate()) { + grid_reduction->setWritePredicate(indexed_rop->writePredicate()); } - if (!is_block_reduce && !is_grid_reduce) { - pushBack( - IrBuilder::create(rop->getReductionOpType(), out, out, in)); + // Push back the reduction op when block reduction is done + // separately. Otherwise, the reduction op is just referenced from + // the grid reduction op. + if (block_reduce_separated) { + pushBack(indexed_rop); + } + + pushBack(reduce_buffer); + pushBack(sync_buffer); + pushBack(grid_reduction); + + if (indexed_rop->isFused()) { + // When using the fused reduction, allocate the reduction object at + // the outer-most scope + auto fused_reduction_alloc_reduction = + IrBuilder::create(grid_reduction); + insertAtTopLevel(fused_reduction_alloc_reduction); } } @@ -264,12 +310,12 @@ void IndexLowering::handle(const WelfordOp* wop) { const auto out_tv = wop->outAvg()->as(); const auto out_domain = out_tv->domain(); - const bool is_block_reduce = out_domain->hasBlockReduction(); - const bool is_grid_reduce = out_domain->hasGridReduction(); + const bool has_block_reduce = out_domain->hasBlockReduction(); + const bool has_grid_reduce = out_domain->hasGridReduction(); // If we do a grid reduction we can't have a reduction axis that is not bound // to a grid or block dim () - if (is_grid_reduce) { + if (has_grid_reduce) { TORCH_INTERNAL_ASSERT( std::none_of( out_domain->domain().begin(), @@ -298,7 +344,7 @@ void IndexLowering::handle(const WelfordOp* wop) { auto out_var = lowerDstIndex(wop->outVar()); auto out_N = lowerDstIndex(wop->outN()); - WelfordOp* welford_op = IrBuilder::create( + WelfordOp* indexed_wop = IrBuilder::create( out_avg, out_var, out_N, @@ -307,70 +353,111 @@ void IndexLowering::handle(const WelfordOp* wop) { wop->initN(), in_avg, in_var, - in_N); + in_N, + wop->isFused()); - WelfordOp* block_welford_op = nullptr; + if (wop->predicate()) { + indexed_wop->setPredicate(wop->predicate()); + } + if (wop->writePredicate()) { + indexed_wop->setWritePredicate(wop->writePredicate()); + } - if (is_block_reduce) { - block_welford_op = welford_op; - if (wop->predicate()) { - block_welford_op->setPredicate(wop->predicate()); - } - if (wop->writePredicate()) { - block_welford_op->setWritePredicate(wop->writePredicate()); - } - pushBack(block_welford_op); + // Serial welford + if (!has_block_reduce && !has_grid_reduce) { + pushBack(indexed_wop); + return; } - if (is_grid_reduce) { - // Buffer allocation - const auto work_buffer_size = getGridCommWorkBufferSize(out_domain); - - const auto out_var_buffer = - allocGlobalBufferForGridComm(work_buffer_size, out_var->dtype(), false); - const auto out_avg_buffer = - allocGlobalBufferForGridComm(work_buffer_size, out_avg->dtype(), false); - const auto out_N_buffer = - allocGlobalBufferForGridComm(work_buffer_size, out_N->dtype(), false); - - const auto sync_buffer = allocGlobalBufferForGridComm( - getGridSyncBufferSize(out_domain), DataType::Int, true); - - // Grid Welford instantiation - const auto grid_welford_op = - (block_welford_op == nullptr) ? welford_op : block_welford_op; - - // The thread predicate for GridReduction needs to be set - // separately from the main predicate. 
Do not combine them like - // other expressions. - const auto& thread_pred = - GpuLower::current()->threadPredMap().getPredicatedParallelTypes(out_tv); - - auto grid_welford = IrBuilder::create( - grid_welford_op, - out_var_buffer, - out_avg_buffer, - out_N_buffer, - sync_buffer); - - grid_welford->setThreadPredicate(thread_pred); - - if (wop->predicate()) { - grid_welford->setPredicate(wop->predicate()); + // Block-only welford + if (!has_grid_reduce) { + pushBack(indexed_wop); + return; + } + + handleGridWelford(indexed_wop); +} + +void IndexLowering::handleGridWelford(WelfordOp* indexed_wop) { + const auto out_tv = indexed_wop->out()->as()->view(); + const auto out_domain = out_tv->domain(); + + // Buffer allocation + // When using the fused reduction in a loop, the global work buffer + // is double buffered to save global synchronizations. + auto is_within_a_loop = std::any_of( + out_domain->domain().begin(), + out_domain->domain().end(), + [](IterDomain* id) { return !isTrivialIterDomain(id); }); + + const auto work_buffer_size = getGridCommWorkBufferSize( + out_domain, indexed_wop->isFused() && is_within_a_loop ? 2 : 1); + + const auto out_var_buffer = allocGlobalBufferForGridComm( + work_buffer_size, indexed_wop->outVar()->dtype(), false); + const auto out_avg_buffer = allocGlobalBufferForGridComm( + work_buffer_size, indexed_wop->outAvg()->dtype(), false); + const auto out_N_buffer = allocGlobalBufferForGridComm( + work_buffer_size, indexed_wop->outN()->dtype(), false); + + const auto sync_buffer = allocGlobalBufferForGridComm( + getGridSyncBufferSize(out_domain), DataType::Int, true); + + // The thread predicate for GridReduction needs to be set + // separately from the main predicate. Do not combine them like + // other expressions. + const auto& thread_pred = + GpuLower::current()->threadPredMap().getPredicatedParallelTypes(out_tv); + + auto grid_welford = IrBuilder::create( + indexed_wop, out_var_buffer, out_avg_buffer, out_N_buffer, sync_buffer); + + grid_welford->setThreadPredicate(thread_pred); + + const bool block_reduce_separated = + out_domain->hasBlockReduction() && !indexed_wop->isFused(); + + if (indexed_wop->predicate()) { + if (block_reduce_separated) { + grid_welford->setPredicate(IrBuilder::create( + GpuLower::current()->kernel()->trueVal())); + } else { + grid_welford->setPredicate(indexed_wop->predicate()); } + } - pushBack(out_var_buffer); - pushBack(out_avg_buffer); - pushBack(out_N_buffer); - pushBack(sync_buffer); - pushBack(grid_welford); + if (indexed_wop->writePredicate()) { + grid_welford->setWritePredicate(indexed_wop->writePredicate()); } - if (!is_block_reduce && !is_grid_reduce) { - pushBack(welford_op); + if (block_reduce_separated) { + pushBack(indexed_wop); + } + + pushBack(out_var_buffer); + pushBack(out_avg_buffer); + pushBack(out_N_buffer); + pushBack(sync_buffer); + pushBack(grid_welford); + + if (indexed_wop->isFused()) { + // When using the fused reduction, allocate the reduction object at + // the outer-most scope + auto fused_reduction_alloc_reduction = + IrBuilder::create(grid_welford); + insertAtTopLevel(fused_reduction_alloc_reduction); } } +void IndexLowering::handle(const MmaOp* mma) { + const auto a = lowerSrcIndex(mma->inA(), mma->out()); + const auto b = lowerSrcIndex(mma->inB(), mma->out()); + const auto out = lowerDstIndex(mma->out()); + auto mma_indexed = + IrBuilder::create(out, a, b, mma->init(), mma->options()); + pushBack(mma_indexed); +} + void IndexLowering::handle(const BroadcastOp* bop) { 
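`handleGridWelford` allocates three communication buffers (avg, var, N) because a Welford partial result is a triple that has to be merged across blocks, not a single scalar. For reference, a host-side version of that merge (the standard parallel Welford combine; the `var` buffer is assumed to carry the unnormalized sum of squared deviations, with the count kept separately in `N`):

```cpp
// Host-side reference for combining two Welford partial results
// (mean, M2 = unnormalized variance, count). This is the merge that the
// per-block avg/var/N work buffers exist to communicate.
struct WelfordTriple {
  double avg = 0.0;
  double m2 = 0.0;      // sum of squared deviations from the mean
  long long n = 0;
};

WelfordTriple combineWelford(const WelfordTriple& a, const WelfordTriple& b) {
  WelfordTriple out;
  out.n = a.n + b.n;
  if (out.n == 0) {
    return out;
  }
  const double delta = b.avg - a.avg;
  const double nb_over_n = static_cast<double>(b.n) / static_cast<double>(out.n);
  out.avg = a.avg + delta * nb_over_n;
  out.m2 = a.m2 + b.m2 + delta * delta * static_cast<double>(a.n) * nb_over_n;
  return out;
}
```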
TORCH_INTERNAL_ASSERT(ir_utils::isTvOp(bop)); @@ -423,9 +510,14 @@ void IndexLowering::handle(const kir::Allocate* allocate) { pushBack(const_cast(allocate)); // NOLINT } -void IndexLowering::handle(const kir::Sync* sync) { +void IndexLowering::handle(const kir::BlockSync* sync) { + // TODO(kir): remove the need for const_cast + pushBack(const_cast(sync)); // NOLINT +} + +void IndexLowering::handle(const kir::GridSync* sync) { // TODO(kir): remove the need for const_cast - pushBack(const_cast(sync)); // NOLINT + pushBack(const_cast(sync)); // NOLINT } void IndexLowering::generate(const std::vector& exprs) { diff --git a/torch/csrc/jit/codegen/cuda/lower_index.h b/torch/csrc/jit/codegen/cuda/lower_index.h index 2f3af0061e1898..78d6bb2a02fb78 100644 --- a/torch/csrc/jit/codegen/cuda/lower_index.h +++ b/torch/csrc/jit/codegen/cuda/lower_index.h @@ -30,23 +30,32 @@ class TORCH_CUDA_CU_API IndexLowering : private OptOutConstDispatch { void pushBack(Expr*); + // Insert an expression before the current top-level expression. + void insertAtTopLevel(Expr* expr); + void handle(const UnaryOp*) final; void handle(const BinaryOp*) final; void handle(const TernaryOp*) final; void handle(const ReductionOp*) final; void handle(const WelfordOp*) final; + void handle(const MmaOp*) final; void handle(const BroadcastOp*) final; void handle(const kir::ForLoop*) final; void handle(const kir::IfThenElse*) final; void handle(const kir::Allocate*) final; - void handle(const kir::Sync*) final; + void handle(const kir::BlockSync*) final; + void handle(const kir::GridSync*) final; void generate(const std::vector& exprs); Val* lowerSrcIndex(Val* val, Val* dst) const; + Val* lowerDstIndex(Val* dst) const; + void handleGridReduction(ReductionOp* new_rop); + void handleGridWelford(WelfordOp* new_wop); + private: std::vector lowered_exprs_; diff --git a/torch/csrc/jit/codegen/cuda/lower_index_hoist.cpp b/torch/csrc/jit/codegen/cuda/lower_index_hoist.cpp new file mode 100644 index 00000000000000..699c887816f8d6 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/lower_index_hoist.cpp @@ -0,0 +1,326 @@ +#include +#include +#include +#include + +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +namespace { + +// Return leaf domains of a given domain. +std::unordered_set getUsedLeafIds( + IterDomain* id, + TensorDomain* td) { + const auto all_vals_between = DependencyCheck::getAllValsBetween( + {id}, {td->domain().begin(), td->domain().end()}); + + std::unordered_set used_leaf_ids; + + for (const auto leaf : td->domain()) { + if (std::find(all_vals_between.begin(), all_vals_between.end(), leaf) != + all_vals_between.end()) { + used_leaf_ids.insert(leaf); + } + } + + TORCH_INTERNAL_ASSERT( + !used_leaf_ids.empty(), + "No used id found: ", + id->toString(), + ", ", + td->toString()); + + return used_leaf_ids; +} + +} // namespace + +CommonIndexKey::CommonIndexKey( + IterDomain* consumer_indexed_id, + TensorDomain* consumer_td, + TensorDomain* ref_td, + const std::unordered_map& ref_index_map, + const std::vector& loops) { + auto gpu_lower = GpuLower::current(); + + concrete_indexed_id_ = + gpu_lower->caIndexMap().getConcreteMappedID(consumer_indexed_id); + + const auto consumer_leaf_ids = + getUsedLeafIds(consumer_indexed_id, consumer_td); + + // Convert to Parallel concrete IDs to find matching loops. 
+ std::unordered_set concrete_leaf_ids; + for (auto& id : consumer_leaf_ids) { + concrete_leaf_ids.insert( + gpu_lower->caParallelMap().getConcreteMappedID(id)); + } + + // Find used loops and their index vals + for (const auto i : c10::irange(loops.size())) { + auto loop = loops.at(i); + auto loop_id = + gpu_lower->caParallelMap().getConcreteMappedID(loop->iter_domain()); + auto it = concrete_leaf_ids.find(loop_id); + if (it != concrete_leaf_ids.end()) { + // This leaf reference id is used for indexing the consumer id + used_loops_.push_back(loop); + auto index_it = ref_index_map.find(ref_td->axis(i)); + TORCH_INTERNAL_ASSERT( + index_it != ref_index_map.end(), + "Index not found for leaf ID, ", + ref_td->axis(i)->toString()); + loop_index_vals_.push_back(index_it->second); + } + } + + TORCH_INTERNAL_ASSERT( + !used_loops_.empty(), + "No loop used for indexing found. ", + consumer_indexed_id->toString()); + + TORCH_INTERNAL_ASSERT( + consumer_leaf_ids.size() == used_loops_.size(), + "consumer_leaf_ids.size() = ", + consumer_leaf_ids.size(), + ", used_loops_.size() == ", + used_loops_.size(), + ", loops.size() == ", + loops.size()); +} + +bool CommonIndexKey::operator==(const CommonIndexKey& other) const { + auto gpu_lower = GpuLower::current(); + + if (concrete_indexed_id_ != other.concrete_indexed_id_) { + return false; + } + + if (used_loops_.size() != other.used_loops_.size()) { + return false; + } + + for (const auto i : c10::irange(used_loops_.size())) { + auto lhs_loop = used_loops_.at(i); + auto rhs_loop = other.used_loops_.at(i); + if (lhs_loop == rhs_loop) { + continue; + } + if (gpu_lower->caLoopMap().areMapped( + lhs_loop->iter_domain(), rhs_loop->iter_domain()) && + lhs_loop->isTrivial() && rhs_loop->isTrivial()) { + continue; + } + return false; + } + + for (const auto i : c10::irange(loop_index_vals_.size())) { + auto lhs_index = loop_index_vals_.at(i); + auto rhs_index = other.loop_index_vals_.at(i); + if (lhs_index == rhs_index) { + continue; + } + // Initial index variables can have some additions such as magic + // zero and "1" when used in producer indexing for double buffered + // tensors. Thus, the initial variables themselves may be + // different, and its components need to be examined. An easy way + // is to flatten them to strings as follows. 
+ auto lhs_str = loop_index_vals_.at(i)->toInlineString(); + auto rhs_str = other.loop_index_vals_.at(i)->toInlineString(); + if (lhs_str == rhs_str) { + continue; + } + + return false; + } + + return true; +} + +std::string CommonIndexKey::toString() const { + TORCH_INTERNAL_ASSERT(concrete_indexed_id_ != nullptr); + std::stringstream ss; + ss << "CommonIndexKey: " << concrete_indexed_id_->toString(); + ss << ", { "; + for (auto loop : used_loops_) { + ss << loop->iter_domain()->toString() << " "; + } + ss << "}"; + ss << ", { "; + for (auto val : loop_index_vals_) { + ss << val->toString() << " "; + } + ss << "}"; + return ss.str(); +} + +std::pair CommonIndexMap::insert( + IterDomain* indexed_consumer_id, + TensorDomain* consumer_td, + TensorDomain* ref_td, + const std::unordered_map& ref_index_map, + const std::vector& loops, + Val* index) { + if (index->definition() == nullptr) { + // Only expression is eligible to hoist + return {index, false}; + } + + const CommonIndexKey key( + indexed_consumer_id, consumer_td, ref_td, ref_index_map, loops); + + Val* hoisted_index = nullptr; + bool new_index_inserted = false; + + // If already mapped, return the previously mapped index + auto it = common_index_map_.find(key); + if (it != common_index_map_.end()) { + hoisted_index = it->second; + new_index_inserted = false; + ++use_counts_.at(key); + } else { + common_index_map_.emplace(key, index); + hoisted_index = index; + new_index_inserted = true; + use_counts_[key] = 1; + } + + return {hoisted_index, new_index_inserted}; +} + +namespace { + +//! Insertion point of allocation +struct CommonIndexInsertionInfo { + Expr* ref = nullptr; + kir::Scope* scope = nullptr; +}; + +// Inserts allocations of hoisted indices +class CommonIndexInserter : private kir::ExprMutator { + public: + static std::vector run( + const std::vector& exprs, + const CommonIndexMap& common_indices) { + CommonIndexInserter inserter(exprs, common_indices); + return inserter.exprs_; + } + + private: + CommonIndexInserter( + const std::vector& exprs, + const CommonIndexMap& common_index_map) + : common_index_map_(common_index_map) { + // Create a map to keys from loops where they should be inserted + for (const auto& kv : common_index_map.commonIndexMap()) { + const auto& key = kv.first; + // Only consider indices used multiple times + if (!usedMultipleTimes(key)) { + continue; + } + TORCH_INTERNAL_ASSERT(!key.usedLoops().empty()); + auto insertion_loop = key.usedLoops().back(); + innermost_used_loop_map_[insertion_loop].push_back(key); + } + + traverseAndInsert(exprs); + } + + CommonIndexInsertionInfo findInsertionPoint( + const CommonIndexKey& key, + kir::ForLoop* current_loop) const { + CommonIndexInsertionInfo info; + + // Allocation must be inside any used non-trivial loop. Since the + // loop index value is constant if a loop is trivial, allocation + // does not need to be inside trivial loops. + for (const auto loop : key.usedLoops()) { + if (!loop->isTrivial()) { + info.ref = loop->body()[0]; + info.scope = &(loop->body()); + } + } + + // If no non-trivial used loop is found, insert at the top-level + // scope just before the outer-most loop. + if (info.ref == nullptr) { + info.ref = scope_exprs_.empty() ? 
current_loop : scope_exprs_.at(0); + info.scope = nullptr; + } + + return info; + } + + using kir::ExprMutator::handle; + + void handle(kir::ForLoop* loop) final { + auto innermost_loop_map_it = innermost_used_loop_map_.find(loop); + if (innermost_loop_map_it == innermost_used_loop_map_.end()) { + kir::ExprMutator::handle(loop); + return; + } + + for (const auto& key : innermost_loop_map_it->second) { + const auto common_index = common_index_map_.commonIndexMap().at(key); + + // Insert only when the index is used multiple times and is not + // yet inserted. + if (inserted_indices_.find(common_index) != inserted_indices_.end()) { + continue; + } + + auto alloc = IrBuilder::create( + common_index, + MemoryType::Local, + GpuLower::current()->kernel()->oneVal()); + const auto common_index_def = common_index->definition(); + TORCH_INTERNAL_ASSERT( + common_index_def != nullptr, + "Hosted index must have a definition. ", + common_index->toString()); + + const auto insertion_info = findInsertionPoint(key, loop); + registerInsertBefore(insertion_info.ref, alloc, insertion_info.scope); + registerInsertBefore( + insertion_info.ref, common_index_def, insertion_info.scope); + + // Track inserted index + inserted_indices_.emplace(common_index); + } + + kir::ExprMutator::handle(loop); + } + + bool usedMultipleTimes(const CommonIndexKey& key) { + auto it = common_index_map_.useCounts().find(key); + TORCH_INTERNAL_ASSERT( + it != common_index_map_.useCounts().end(), + "Key not found in the use-count map: ", + key.toString()); + return it->second > 1; + } + + private: + const CommonIndexMap& common_index_map_; + //! Map to CommonIndexKeys from their innermost used loops + std::unordered_map> + innermost_used_loop_map_; + //! Keep track of inserted indices + std::unordered_set inserted_indices_; +}; + +} // namespace + +std::vector allocateCommonIndices(const std::vector& exprs) { + return CommonIndexInserter::run(exprs, GpuLower::current()->commonIndexMap()); +} + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/lower_index_hoist.h b/torch/csrc/jit/codegen/cuda/lower_index_hoist.h new file mode 100644 index 00000000000000..5e0256f9e84498 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/lower_index_hoist.h @@ -0,0 +1,121 @@ +#pragma once + +#include + +#include +#include +#include + +// Hoisting common index subexpressions +// +// Class CommonIndexMap is updated during the lowering as new indices +// are inserted. An index is uniquely identified with CommonIndexKey, +// which consists of the concrete ID of the indexed/predicated domain, +// the for-loops used in the index, and the index vals of the use +// for-loops. +// +// Once all indices are inserted to CommonIndexMap, allocations of the +// the hoisted indices are inserted by allocateCommonIndices. Note +// that this assumes that the CUDA code generator does not inline a +// scalar Val with allocation (PR #1434). + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +//! Class to represent unique indexed domains for index +//! hoisting. Uniquenesss is determined with the indexed domain +//! itself, the for-loops and their index values. +class CommonIndexKey { + friend struct CommonIndexKeyHash; + + public: + //! \param consumer_indexed_id Indexed consumer domain + //! \param consumer_td TensorDomain of consumer_indexed_id + //! \param ref_td Reference domain at the time of indexing + //! \param ref_index_map Index map of the reference domain + //! 
\param loops Loop structure where this id is indexed + CommonIndexKey( + IterDomain* consumer_indexed_id, + TensorDomain* consumer_td, + TensorDomain* ref_td, + const std::unordered_map& ref_index_map, + const std::vector& loops); + + const IterDomain* concreteIndexedId() const { + return concrete_indexed_id_; + } + + const std::vector& usedLoops() const { + return used_loops_; + } + + const std::vector& loopIndexVals() const { + return loop_index_vals_; + } + + bool operator==(const CommonIndexKey& other) const; + + std::string toString() const; + + private: + //! Concrete domain of indexed domain + IterDomain* concrete_indexed_id_ = nullptr; + //! Loops used for the index + std::vector used_loops_; + //! Loop index vals for the used loops + std::vector loop_index_vals_; +}; + +struct CommonIndexKeyHash { + std::size_t operator()(const CommonIndexKey& key) const { + auto h = std::hash{}(key.concrete_indexed_id_); + // NOTE: do not use other fields as the pointers can be different + // even when two keys can share the same index + return h; + } +}; + +//! Map to hold hoisted common indices +class TORCH_CUDA_CU_API CommonIndexMap { + public: + //! Register an indexd consumer domain to hoist + //! + //! Returns a corresponding hoisted index and a flag indicating if a + //! new index is inserted. + //! + //! Consumer domains are used even for producer indexing since + //! producer domains in producer indexing are temporary replay + //! domains. + std::pair insert( + IterDomain* indexed_consumer_id, + TensorDomain* consumer_td, + TensorDomain* ref_td, + const std::unordered_map& ref_index_map, + const std::vector& loops, + Val* index); + + const auto& commonIndexMap() const { + return common_index_map_; + } + + const auto& useCounts() const { + return use_counts_; + } + + private: + //! Map to hold hoisted common indices + std::unordered_map + common_index_map_; + std::unordered_map use_counts_; +}; + +//! Insert allocations of hoisted indices. Must be called after +//! collecting all common indices. +std::vector allocateCommonIndices(const std::vector& exprs); + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp b/torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp index 77be88183eccb4..1acf33150cc401 100644 --- a/torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp @@ -145,7 +145,21 @@ class WarSyncInserter : private kir::ExprMutator { kir::ExprMutator::handle(ite); } - void handle(kir::Sync* sync) final { + void handle(kir::BlockSync* sync) final { + // Register the sync for the active for loop + sync_hit_.back() = true; + // Run through the active allocations, if a read was hit, register there was + // a sync after the read. If there's subsequent reads on this buffer the + // sync_after_read will be cleared. 
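Stepping back to the index-hoisting files introduced just above (`lower_index_hoist.cpp`/`.h`): a common index is keyed by the indexed concrete domain plus the loops and loop index values it uses, and its allocation and definition are emitted once, at the innermost non-trivial loop that uses it (or at the top level). At the generated-code level the effect is ordinary common-subexpression elimination on indices; a simplified, self-contained illustration (not generated code, names made up):

```cpp
// Simplified illustration of what hoisting a common index buys: an index
// expression shared by several accesses in the same loop nest is computed
// into one named local and reused, instead of being re-derived at every use.
void addWithHoistedIndex(
    float* out,
    const float* a,
    const float* b,
    int n,
    int offset,
    int stride) {
  for (int i = 0; i < n; ++i) {
    const int idx = offset + i * stride; // the hoisted common index
    out[idx] = a[idx] + b[idx];          // all uses share one computation
  }
}
```

The sync-insertion changes in `lower_insert_syncs.cpp` continue below.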
+ for (auto& entry : smem_allocations_) { + auto& alloc_stack = entry.second; + if (alloc_stack.back().read_hit) { + alloc_stack.back().sync_after_read = true; + } + } + } + + void handle(kir::GridSync* sync) final { // Register the sync for the active for loop sync_hit_.back() = true; // Run through the active allocations, if a read was hit, register there was @@ -191,9 +205,11 @@ class WarSyncInserter : private kir::ExprMutator { // Mark write has been hit for all output tvs auto out_tvs = ir_utils::filterByType(expr->outputs()); for (auto out_tv : out_tvs) { - if (out_tv->getMemoryType() != MemoryType::Shared) { + if (out_tv->getMemoryType() != MemoryType::Shared || + GpuLower::current()->syncMap().needsRawSync(out_tv).none()) { continue; } + auto& entry = getMemInfo(out_tv); // If this is the first write and there's a sync in one of the loops after @@ -207,9 +223,11 @@ class WarSyncInserter : private kir::ExprMutator { // Mark read was hit, if sync_after_read was set, clear it. auto inp_tvs = ir_utils::filterByType(expr->inputs()); for (auto inp_tv : inp_tvs) { - if (inp_tv->getMemoryType() != MemoryType::Shared) { + if (inp_tv->getMemoryType() != MemoryType::Shared || + GpuLower::current()->syncMap().needsRawSync(inp_tv).none()) { continue; } + auto& entry = getMemInfo(inp_tv); entry.read_hit = true; // Clear the sync_after_read if it was set because there was another write @@ -223,10 +241,7 @@ class WarSyncInserter : private kir::ExprMutator { sync_hit_.push_back(false); // If there is no real iterating loop WAR syncs aren't necessary - within_iter_loop_ = within_iter_loop_ || - !(for_loop->iter_domain()->isThread() || - for_loop->iter_domain()->isBroadcast() || - for_loop->iter_domain()->extent()->isOneInt()); + within_iter_loop_ = within_iter_loop_ || !for_loop->isTrivial(); // Process the expressions in the for loop kir::ExprMutator::handle(for_loop); @@ -260,7 +275,7 @@ class WarSyncInserter : private kir::ExprMutator { // WAR Sync is necessary in this loop, register its insertion. if (insert_sync) { - auto sync_expr = IrBuilder::create(true); + auto sync_expr = IrBuilder::create(true); kir::ExprMutator::registerInsertAfter( for_loop->body().exprs().back(), sync_expr, &for_loop->body()); handle(sync_expr); @@ -376,15 +391,56 @@ class ValidatePlacementAfterWrites : private kir::IrVisitor { const std::unordered_set& writes_; }; +namespace { + +Val* getGridSyncBufferSize(const ParallelTypeBitmap& ptb) { + // See the comment above for getGridCommWorkBufferSize. 
+ TORCH_INTERNAL_ASSERT( + ptb.hasBID(), + "Detected needing a grid sync but no grid bits set in bitmap."); + Val* buffer_size = GpuLower::current()->kernel()->oneVal(); + for (auto pt : kParallelTypeBIDs) { + if (!ptb.get(pt)) { + continue; + } + auto pt_dim = GpuLower::current()->parallelDimensionMap().get(pt); + if (pt_dim == nullptr || pt_dim->isOneInt()) { + continue; + } + buffer_size = IrBuilder::mulExpr(buffer_size, pt_dim); + } + return buffer_size; +} + +// Copied from lower_index.cpp, may be worth either removing this function and +// doing it inline or reusing the function from lower_index.cpp +kir::Allocate* allocGlobalBufferForGridComm( + Val* buffer_size, + DataType dtype, + bool zero_init) { + const std::vector new_buffer_ids = { + IrBuilder::create( + GpuLower::current()->kernel()->zeroVal(), buffer_size)}; + const auto buffer_domain = IrBuilder::create(new_buffer_ids); + const auto buffer_tv = + IrBuilder::create(buffer_domain, dtype, MemoryType::Global); + return IrBuilder::create( + buffer_tv, buffer_tv->getMemoryType(), nullptr, zero_init); +} + +} // namespace + class ReadAfterWriteSyncs : public kir::ExprMutator { private: using kir::ExprMutator::handle; //! Traverse up the loop stack from loops_it and if a halo loop is //! found, place a given sync expr before the outer-most halo loop. + // TODO: What needs to be done here for gmem comm? bool insertBeforeHaloLoop( std::vector::iterator loops_it, - kir::Sync* sync_expr, + Expr* sync_expr, + Expr* maybe_alloc, const std::unordered_set& writes) { std::vector::iterator halo_loop_it; bool halo_loop_found = false; @@ -424,6 +480,10 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { auto place_in = *(halo_loop_it - 1); kir::ExprMutator::registerInsertBefore( halo_loop, sync_expr, &place_in->body()); + if (maybe_alloc != nullptr) { + kir::ExprMutator::registerInsertBefore( + halo_loop, maybe_alloc, &place_in->body()); + } } return true; @@ -435,7 +495,8 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { return; } - if (sync_after_.size() > 0 && sync_after_.front() == expr) { + if (sync_after_.size() > 0 && sync_after_.front().first == expr) { + auto sync_bitmap = sync_after_.front().second; sync_after_.pop_front(); auto last_writes = last_writes_.front(); last_writes_.pop_front(); @@ -450,8 +511,16 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { // TODO: This may be a common operation, could be worth making a utility // out of or saving state for tensor view ID -> for loop // TODO: Explicitly test the 3 cases below - - auto sync_expr = IrBuilder::create(); + Expr* sync_expr = nullptr; + kir::Allocate* maybe_alloc = nullptr; + if (sync_bitmap.hasBID()) { + maybe_alloc = allocGlobalBufferForGridComm( + getGridSyncBufferSize(sync_bitmap), DataType::Int, true); + sync_expr = IrBuilder::create( + sync_bitmap, maybe_alloc->buffer()); + } else { + sync_expr = IrBuilder::create(); + } if (out_tv->getComputeAtPosition() == 0) { // Sync should be placed at global scope, after its outer most loop if // it has one. 
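`ReadAfterWriteSyncs` now records, per expression, a bitmap of the parallel types that actually communicate and derives the sync from it: any blockIdx bit means a `GridSync` plus a freshly allocated global semaphore buffer, otherwise a `BlockSync` suffices. A toy version of that decision (the six-bit layout is an assumption; the real code uses `ParallelTypeBitmap::hasBID()`):

```cpp
// Toy model of the sync-kind decision made when a RAW dependency is found.
#include <bitset>

enum class ToySyncKind { None, Block, Grid };

using ToyParallelMask = std::bitset<6>; // bits 0-2: TIDx/y/z, bits 3-5: BIDx/y/z

ToySyncKind toySyncKindFor(ToyParallelMask raw_comm_dims) {
  if (raw_comm_dims.none()) {
    return ToySyncKind::None; // no cross-thread communication detected
  }
  const ToyParallelMask bid_bits(0b111000); // BID bits under the assumed layout
  if ((raw_comm_dims & bid_bits).any()) {
    return ToySyncKind::Grid; // grid sync plus a global semaphore buffer
  }
  return ToySyncKind::Block;  // __syncthreads-style block sync is enough
}
```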
@@ -465,8 +534,10 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { "Tried to place after, ", place_after->toString(), ", but could not find this expression at the global scope."); - - registerInsertAfter(*(place_after_it + 1), sync_expr, nullptr); + registerInsertAfter(*(place_after_it), sync_expr, nullptr); + if (maybe_alloc != nullptr) { + registerInsertAfter(place_after, maybe_alloc, nullptr); + } } else { // Find the last loop in computeAt of out_tv, this is the loop where we // would place an allocation for out_tv @@ -485,7 +556,8 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { TORCH_INTERNAL_ASSERT(loops_it != for_loops_.end()); // block sync must be placed before halo-extended loops - if (insertBeforeHaloLoop(loops_it, sync_expr, last_writes)) { + if (insertBeforeHaloLoop( + loops_it, sync_expr, maybe_alloc, last_writes)) { return; } @@ -503,6 +575,9 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { } registerInsertAfter(place_after, sync_expr, &place_in->body()); + if (maybe_alloc != nullptr) { + registerInsertAfter(place_after, maybe_alloc, &place_in->body()); + } } } } @@ -514,11 +589,6 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { "this pass should be run before any conditionals are placed in code."); } - // Clear the modify status for all shared memory buffers - static void cleanSharedMemory(std::unordered_map& smem) { - smem.clear(); - } - // Return a set of expressions that modify shared-memory // tensors. Expressions are excluded when syncthreads are already // placed. @@ -526,7 +596,13 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { const std::unordered_map& smem, const std::vector& tvs) const { std::unordered_set last_writes; - for (auto tv : tvs) { + for (auto tv : ir_utils::filterByType(tvs)) { + if (GpuLower::current()->syncMap().needsRawSync(tv).none()) { + continue; + } + if (tv->getMemoryType() != MemoryType::Shared) { + continue; + } auto it = smem.find(tv); if (it != smem.end()) { last_writes.insert(it->second); @@ -535,10 +611,27 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { return last_writes; } + std::unordered_set isModifiedGlobalMemory( + const std::unordered_map& gmem, + const std::vector& tvs) const { + std::unordered_set last_writes; + for (auto tv : ir_utils::filterByType(tvs)) { + if (GpuLower::current()->syncMap().needsRawSync(tv).none()) { + continue; + } + auto it = gmem.find(tv); + if (it != gmem.end()) { + last_writes.insert(it->second); + } + } + return last_writes; + } + ReadAfterWriteSyncs(const std::vector& _exprs) { // Fusion shared_memory values // Tracks if shared memory is modified std::unordered_map smem; + std::unordered_map gmem; // Flatten all the expressions auto flattened_exprs = ExprFlattener::flatten(_exprs); @@ -549,14 +642,36 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { continue; } - auto last_writes = isModifiedSharedMemory(smem, expr->inputs()); - if (!last_writes.empty()) { + auto last_gmem_writes = isModifiedGlobalMemory(gmem, expr->inputs()); + if (!last_gmem_writes.empty()) { TORCH_INTERNAL_ASSERT( prev_tv_expr != nullptr, "Can't require sync on inputs, however, detected it's needed."); - sync_after_.push_back(prev_tv_expr); - last_writes_.push_back(last_writes); - cleanSharedMemory(smem); + ParallelTypeBitmap bitmap; + for (auto entry : gmem) { + TORCH_INTERNAL_ASSERT(entry.first->isA()); + auto sync_bits = GpuLower::current()->syncMap().needsRawSync( + entry.first->as()); + bitmap |= sync_bits; + } + // Temporarily do full grid sync. 
+ sync_after_.emplace_back(std::make_pair(prev_tv_expr, bitmap)); + last_writes_.push_back(last_gmem_writes); + gmem.clear(); + } + + auto last_smem_writes = isModifiedSharedMemory(smem, expr->inputs()); + if (!last_smem_writes.empty()) { + TORCH_INTERNAL_ASSERT( + prev_tv_expr != nullptr, + "Can't require sync on inputs, however, detected it's needed."); + ParallelTypeBitmap bitmap; + bitmap.set(ParallelType::TIDx); + bitmap.set(ParallelType::TIDy); + bitmap.set(ParallelType::TIDz); + sync_after_.emplace_back(std::make_pair(prev_tv_expr, bitmap)); + last_writes_.push_back(last_smem_writes); + smem.clear(); } for (auto tv : ir_utils::filterByType(expr->outputs())) { @@ -567,6 +682,9 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { !tv->isDoubleBuffered()) { smem[tv] = expr; } + if (tv->getMemoryType() == MemoryType::Global) { + gmem[tv] = expr; + } } prev_tv_expr = expr; @@ -580,7 +698,7 @@ class ReadAfterWriteSyncs : public kir::ExprMutator { private: //! Keep track of expressions that must be followed by syncthreads - std::deque sync_after_; + std::deque> sync_after_; //! Keep track of write expressions that must be placed before //! syncthreads. diff --git a/torch/csrc/jit/codegen/cuda/lower_predicate.cpp b/torch/csrc/jit/codegen/cuda/lower_predicate.cpp index cd34c56b510e7c..166f38d6cf56f7 100644 --- a/torch/csrc/jit/codegen/cuda/lower_predicate.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_predicate.cpp @@ -126,6 +126,12 @@ class ConditionalFromPredicateModifier : public kir::IrVisitor { } }; +void assertOnWarpOps(const Expr* expr) { + TORCH_INTERNAL_ASSERT( + !expr->isA(), + "Mma op: cannot eliminate predicate for mma op, tiling not valid"); +} + } // namespace std::vector generateConditionalFromPredicate( @@ -151,6 +157,8 @@ class PredicateAnalyzer : public OptOutDispatch { // of the parallelized axis is the actual size of the axis, not // the number of threads. Since the number of threads can be // larger than the axis size, it's not safe to skip predication + + // Check that parallel dimension will not generate out of bound index if (!(producer->getMemoryType() == MemoryType::Local && consumer->getMemoryType() == MemoryType::Local)) { return true; @@ -355,6 +363,10 @@ void PredicateElimination::handle(Expr* expr) { } if (needsPredicate(expr)) { + // Warp primitives are currently limited to un-predicated usage, + // predicating these ops will require extra steps to ensure that + // the whole warp will get the same value. 
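The `assertOnWarpOps` guard added in `lower_predicate.cpp` reflects the comment above: warp-level primitives (and the mma path) assume every lane executes them, so an expression that still needs a predicate cannot be lowered to them. A host-side model of a shuffle-xor butterfly all-reduce makes the hazard concrete; it is an illustration only, not code from this change:

```cpp
// Host-side model of a warp butterfly (shuffle-xor) all-reduce. Every lane
// must participate at every step; if a predicate masked some lanes out,
// their partners would read stale partial sums and the skipped lanes would
// never receive the final value. Hence warp/mma ops stay un-predicated.
#include <array>

std::array<float, 32> butterflyAllReduce(std::array<float, 32> lanes) {
  for (int offset = 16; offset > 0; offset /= 2) {
    const std::array<float, 32> snapshot = lanes; // pre-shuffle state
    for (int lane = 0; lane < 32; ++lane) {
      // models __shfl_xor_sync: add the partner lane's current partial sum
      lanes[lane] = snapshot[lane] + snapshot[lane ^ offset];
    }
  }
  return lanes; // each lane now holds the sum over all 32 lanes
}
```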
+ assertOnWarpOps(expr); return; } @@ -392,6 +404,11 @@ void PredicateElimination::handle(Expr* expr) { continue; } + if (expr->isA()) { + setReductionInitValue(input, expr->as()->init()); + continue; + } + // If an input does not need a predicate either, then it should // have some value, so no need to set a default value if (non_predicated_exprs_.find(input_def) != non_predicated_exprs_.end()) { diff --git a/torch/csrc/jit/codegen/cuda/lower_replace_size.cpp b/torch/csrc/jit/codegen/cuda/lower_replace_size.cpp index 582b6d91d067af..beec550e537f6e 100644 --- a/torch/csrc/jit/codegen/cuda/lower_replace_size.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_replace_size.cpp @@ -147,61 +147,6 @@ std::unordered_map getSimplificationMap(Fusion* fusion) { return extent_to_min_input_id_extent; } -std::vector allLeafOuts(Fusion* fusion) { - auto exprs = StmtSort::getExprs(fusion, true); - std::unordered_set inputs; - std::unordered_set outputs; - std::vector ordered_outputs; - for (auto expr : exprs) { - inputs.insert(expr->inputs().begin(), expr->inputs().end()); - outputs.insert(expr->outputs().begin(), expr->outputs().end()); - ordered_outputs.insert( - ordered_outputs.end(), expr->outputs().begin(), expr->outputs().end()); - } - for (auto input : inputs) { - outputs.erase(input); - } - - std::vector ordered_leaf_outs; - for (auto out : ordered_outputs) { - if (outputs.find(out) != outputs.end()) { - ordered_leaf_outs.push_back(out); - } - } - return ordered_leaf_outs; -} - -class ValReplacementMutator : private OptOutMutator { - public: - ValReplacementMutator( - Fusion* fusion, - const std::unordered_map& replacement_map) - : replacement_map_(replacement_map) { - FusionGuard fg(fusion); - - // Welford makes this a little annoying since it holds a count which is - // typically not used by anything else. If we don't grab that count, then it - // would be a tensorview that doesn't get updated extents. Therefore, first - // grab all leaves towards outputs and grab stmts from there. - auto stmts = StmtSort::getStmts(fusion, allLeafOuts(fusion), true); - for (auto stmt : stmts) { - mutate(stmt); - } - } - - private: - using OptOutMutator::mutate; - void mutate(Val* val) final { - if (replacement_map_.find(val) == replacement_map_.end()) { - return OptOutMutator::mutate(val); - } - auto replaced_val = replacement_map_.at(val); - registerMutation(val, replaced_val); - } - - const std::unordered_map& replacement_map_; -}; - } // namespace void replaceSymbolicSizes(Fusion* fusion) { @@ -279,7 +224,7 @@ void replaceSymbolicSizes(Fusion* fusion) { } // Run mutation on the fusion with the tensor_dim_map - ValReplacementMutator(fusion, tensor_dim_map); + ir_utils::replaceValue(fusion, tensor_dim_map); } } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/lower_sync_information.cpp b/torch/csrc/jit/codegen/cuda/lower_sync_information.cpp new file mode 100644 index 00000000000000..8ab11140f497a8 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/lower_sync_information.cpp @@ -0,0 +1,451 @@ + +#include +#include +#include + +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +namespace { + +// Validate parallelization of a single tensor +void validateParallelizationOfTensor(TensorView* tv) { + // Each ParallelType can be used only once. 
+ ParallelTypeBitmap pt_map; + for (size_t i = 0; i < tv->nDims(); ++i) { + auto axis = tv->axis(i); + auto ptype = axis->getParallelType(); + if (!isParallelTypeThread(ptype)) { + continue; + } + + // It doesn't matter if this axis is a non-concretized broadcast + // TODO: merging broadcast and non-broadcast + if (axis->isBroadcast() && + !GpuLower::current()->concretizedBroadcastDomains().isConcretized( + axis)) { + continue; + } + + TORCH_INTERNAL_ASSERT( + !pt_map.get(ptype), + "Multiple use of ", + ptype, + " in tensor t", + tv->name(), + ": ", + tv); + pt_map.set(ptype); + } + + // If this tensor is predicated by a paralel type, it should not be + // used to parallelize any domain of this tensor + + const auto thread_pred = + GpuLower::current()->threadPredMap().getPredicateInfo(tv); + + auto predicated_parallel_types = pt_map & thread_pred.limited_types; + + TORCH_INTERNAL_ASSERT( + predicated_parallel_types.none(), + "Invalid parallelization of tensor t", + tv->name(), + ". The tensor is parallelized with ", + predicated_parallel_types.toString(), + ", but it's invalid to use the types as the tensor is also predicated with them.", + ", thread pred: ", + thread_pred.limited_types.toString()); +} + +//! Return true if axis is derived from a root axis that is an input +//! to a CA leaf axis. +bool derivedFromRootCAAxes(TensorView* tv, IterDomain* axis) { + std::vector ca_axes( + tv->domain()->domain().begin(), + tv->domain()->domain().begin() + tv->getComputeAtPosition()); + + auto ca_root_vals = IterVisitor::getInputsTo( + std::vector(ca_axes.begin(), ca_axes.end())); + + auto root_vals = IterVisitor::getInputsTo({axis}); + + return std::any_of( + root_vals.begin(), root_vals.end(), [&ca_root_vals](auto root) { + return std::find(ca_root_vals.begin(), ca_root_vals.end(), root) != + ca_root_vals.end(); + }); +} + +} // namespace + +void SyncMap::build(Fusion* fusion) { + FUSER_PERF_SCOPE("GpuLower::Lower::validateParallelize"); + FusionGuard fg(fusion); + + const auto& par_map = GpuLower::current()->caParallelMap(); + const auto& loop_map = GpuLower::current()->caLoopMap(); + const auto& index_map = GpuLower::current()->caIndexMap(); + const auto& pred_map = GpuLower::current()->threadPredMap(); + + auto exprs = StmtSort::getExprs(fusion); + + // Run through expressions and check for communication across threads/blocks + // occuring from producer to consumer of the expression + for (auto expr : exprs) { + if (!ir_utils::isTvOp(expr)) { + continue; + } + + // Validate parallelization of each consumer by itself + for (auto consumer : ir_utils::filterByType(expr->outputs())) { + validateParallelizationOfTensor(consumer); + } + + // It's probably enough to just check all producers to one consumer as + // multi-consumers are guaranteed to be transformed/parallelized the same, + // but to be conservative for now checking every producer <-> consumer + // relationship. + for (auto producer : ir_utils::filterByType(expr->inputs())) { + // Parallelization on input tensors have no effect. 
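`validateParallelizationOfTensor` above enforces two per-tensor invariants: each thread/block parallel type may bind at most one axis (non-concretized broadcast axes are exempt), and no axis may use a parallel type the tensor is also predicated on. The duplicate-use part reduces to a small bit-set check; a toy version with parallel types encoded as small integers (an assumed encoding, not the real enum):

```cpp
// Toy version of the "each ParallelType used only once per tensor" check.
#include <bitset>
#include <vector>

bool toyHasUniqueParallelTypes(const std::vector<int>& axis_parallel_types) {
  std::bitset<6> seen; // one bit per thread/block parallel type (assumed 0..5)
  for (int pt : axis_parallel_types) {
    if (pt < 0) {
      continue; // serial / non-thread axis
    }
    if (seen.test(static_cast<size_t>(pt))) {
      return false; // same parallel type bound to two axes
    }
    seen.set(static_cast<size_t>(pt));
  }
  return true;
}
```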
+ if (producer->isFusionInput()) { + continue; + } + + ParallelTypeBitmap raw_dims; + + const auto parallel_bcast_doms = + pred_map.getParallelBroadcastDomains(producer); + + // Stash information about parallelized producer iteration domains + std::vector producer_parallel_ids( + ParallelTypeBitmap::kNumParallelTypes, nullptr); + ParallelTypeBitmap producer_parallel_bitmap; + + // Tracking for quick check later + std::unordered_set producer_within_compute_at; + + for (const auto producer_i : c10::irange(producer->nDims())) { + auto producer_axis = producer->axis(producer_i); + auto producer_ptype = + par_map.getConcreteMappedID(producer_axis)->getParallelType(); + + if (!isParallelTypeThread(producer_ptype)) { + continue; + } + + // Producer reductions shouldn't map to consumers + if (producer_axis->isReduction()) { + continue; + } + + if (producer_i < producer->getComputeAtPosition()) { + producer_within_compute_at.emplace(producer_axis); + } + + producer_parallel_bitmap.set(producer_ptype); + producer_parallel_ids[getParallelTypeBitMapOffset(producer_ptype)] = + producer_axis; + } + + for (auto consumer : + ir_utils::filterByType(expr->outputs())) { + // Stash information about parallelized consumer iteration domains + std::vector consumer_parallel_ids( + ParallelTypeBitmap::kNumParallelTypes, nullptr); + ParallelTypeBitmap consumer_parallel_bitmap; + + for (const auto consumer_i : c10::irange(consumer->nDims())) { + auto consumer_axis = consumer->axis(consumer_i); + auto consumer_ptype = + par_map.getConcreteMappedID(consumer_axis)->getParallelType(); + + if (!isParallelTypeThread(consumer_ptype)) { + continue; + } + + // When the consumer axis is a broadcast, it is not really + // parallelized unless thread-predicated and eventually concretized + if (consumer_axis->isBroadcast() && + (!parallel_bcast_doms.get(consumer_ptype) || + !GpuLower::current() + ->concretizedBroadcastDomains() + .isConcretized(consumer_axis))) { + continue; + } + + consumer_parallel_bitmap.set(consumer_ptype); + consumer_parallel_ids[getParallelTypeBitMapOffset(consumer_ptype)] = + consumer_axis; + } + + // At this point each parallel type that's present in the consumer or + // the producer will be present in their corresponding `_parallel_ids` + // map going from parallel index type (only size 6 for grid/block dims) + // to the iteration domain of that parallel type. + for (auto parallel_type : kParallelTypeThreads) { + // TIDx is reserved for lane_id in the case of mma ops. + // It is swizzled and handled separately in validateMma. 
+ if (parallel_type == ParallelType::TIDx && expr->isA()) { + continue; + } + + auto parallel_type_i = getParallelTypeBitMapOffset(parallel_type); + + auto p_id = producer_parallel_ids[parallel_type_i]; + auto c_id = consumer_parallel_ids[parallel_type_i]; + + if (p_id == nullptr && c_id == nullptr) { + continue; + } else if (p_id != nullptr && c_id != nullptr) { + if (loop_map.areMapped(p_id, c_id)) { + const auto halo_info = GpuLower::current()->haloInfo(); + + if (halo_info.hasHaloWidth(p_id) != + halo_info.hasHaloWidth(c_id) || + (halo_info.hasHaloWidth(p_id) && + halo_info.hasHaloWidth(c_id) && + halo_info.getHaloWidth(p_id) != + halo_info.getHaloWidth(c_id))) { + raw_dims.set(parallel_type); + continue; + } + } + } else { + if (p_id != nullptr) { + auto it = std::find_if( + consumer->domain()->domain().begin(), + consumer->domain()->domain().end(), + [&](IterDomain* c_id) { + return loop_map.areMapped(p_id, c_id); + }); + + // If there isn't a mapping from producer to a consumer domain, + // need to assume there's communication across this parallel + // dimension. + c_id = it == consumer->domain()->domain().end() ? nullptr : *it; + // i.e. if producer is parallelized across threadIdx.x in a + // certain split, if the consumer doesn't map to this split, + // then we need to assume it has to be in smem with proper + // syncs. + } else { + auto it = std::find_if( + producer->domain()->domain().begin(), + producer->domain()->domain().end(), + [&](IterDomain* p_id) { + return loop_map.areMapped(p_id, c_id); + }); + if (it == producer->domain()->domain().end()) { + // Can't infer anything if producer doesn't have a matching axis + // to parallel consumer dim. + continue; + } + p_id = *it; + } + } + + // Comm pattern options (when parallel types don't have matching + // axes) and required memory, Chart is producer parallel type, + // consumer parallel type Parallel types are Serial(S), + // threadIdx(T), blockIdx(B), Memory required for the producer is + // Local(L), Shared(S), Global(G), Sync is None (N/A), blockSync(B), + // grid_sync(G) + // + // P C Mem Req Sync Type + // S S L N/A + // S T L N/A + // S B L N/A + // T S S B + // T T S B + // T B S B + // B S G G + // B T G G + // B B G G + + auto producer_ptype = + par_map.getConcreteMappedID(p_id)->getParallelType(); + auto consumer_ptype = c_id == nullptr + ? ParallelType::Serial + : par_map.getConcreteMappedID(c_id)->getParallelType(); + + if (!p_id->isBroadcast() && isParallelTypeThread(producer_ptype) && + !(isParallelTypeThread(consumer_ptype) && + parallel_bcast_doms.get(consumer_ptype)) && + // Being in compute at means consumer and producer rely on the + // same loop size + !producer_within_compute_at.count(p_id) && + // For usage of derivedFromRootCAAxes check + // NVFuserTest.FusionAdvancedIndexing1_CUDA + (c_id == nullptr || !derivedFromRootCAAxes(producer, p_id))) { + // There must be a consumer axis that uses the same indexing + // with the same parallel type as the producer axis. The index + // map is used to to find such an axis. In addition, even when + // no mapped axis is found in the index map, but when an mapped + // axis exists in the loop map, the producer and consumer axes + // may still use the same indexing. That only happens when the + // producer is derived from a root axis that is an input to any + // leaf CA axes. 
In such a case, the axis in the reference + // tensor that maps to the producer axis is created based on the + // consumer, so both the producer and consumer axes should have + // the same indexing. See issue #995 as well as the + // FusionValidateParallelize6 test for a concrete example. + auto it = std::find_if( + consumer->domain()->domain().begin(), + consumer->domain()->domain().end(), + [&](IterDomain* c_id_) { + return index_map.areMapped(p_id, c_id_); + }); + if (it == consumer->domain()->domain().end()) { + if (isParallelTypeThread(producer_ptype)) { + raw_dims.set(producer_ptype); + } + if (isParallelTypeThread(consumer_ptype)) { + raw_dims.set(consumer_ptype); + } + } + } + + // In shift or gather operations, if a thread or block + // domain's root ID is shifted or gathered, it can overlap + // in shared or global memory. This doesn't + // require a RAW sync since each thread would still write every value + // it would read, but it can require a WAR sync for Shared Memory. + // Since there isn't a separate structure for WAR than RAW for now + // we'll flag it on RAW which will trigger the WAR. + // See test FusionValidateParallelizeShift_CUDA for a + // concrete example where this sync is required. + if ((expr->getExprType() == ExprType::GatherOp || + expr->getExprType() == ExprType::ShiftOp) && + producer->getMemoryType() == MemoryType::Shared && + isParallelTypeThreadDim(producer_ptype)) { + std::unordered_set shifted_rfactor_ids; + if (expr->getExprType() == ExprType::GatherOp) { + auto gather_op = expr->as(); + for (auto root_i : + c10::irange(producer->getMaybeRFactorDomain().size())) { + auto rfactor_id = producer->getMaybeRFactorDomain()[root_i]; + // If the window shape is 1, it just copies the + // producer to the consumer + if (gather_op->windowShape()[root_i] != 1) { + shifted_rfactor_ids.insert(rfactor_id); + } + } + } else if (expr->getExprType() == ExprType::ShiftOp) { + auto shift_op = expr->as(); + for (auto root_i : + c10::irange(producer->getMaybeRFactorDomain().size())) { + auto rfactor_id = producer->getMaybeRFactorDomain()[root_i]; + // If the shift offset is 0, it doesn't actually shift + if (shift_op->offsets()[root_i] != 0) { + shifted_rfactor_ids.insert(rfactor_id); + } + } + } + + // Grab all values between shifted rfactor domains and p_id so we + // can identify which rfactor domains are inputs to the p_id + auto p_id_dep_vals = + DependencyCheck::getAllValsBetween(shifted_rfactor_ids, {p_id}); + // If this shifted rfactor domain is an input to p_id, we + // must have a WAR sync. Mark raw sync so it will be generated. + if (!p_id_dep_vals.empty()) { + raw_dims.set(producer_ptype); + } + } + + // If same parallel type and mapped, no need for syncs unless + // producer is in smem, producer parallel type is a thread + // dimension, and consumer concretizes the dimension. This sync is + // due to the redundant predicate omission in lower thread + // predicate. 
+          auto redundant_preds = GpuLower::current()
+                                     ->threadPredMap()
+                                     .getPredicateInfo(producer)
+                                     .redundant_types;
+
+          if (p_id->isBroadcast() &&
+              GpuLower::current()->concretizedBroadcastDomains().isConcretized(
+                  p_id) &&
+              producer->getMemoryType() == MemoryType::Shared &&
+              redundant_preds.hasTID()) {
+            redundant_preds.clearAllBID();
+            raw_dims |= redundant_preds;
+            continue;
+          }
+
+          // When the producer axis is a broadcast, it is not really
+          // parallelized unless thread-predicated and concretized
+          if (isParallelTypeThread(producer_ptype) && p_id->isBroadcast() &&
+              (!parallel_bcast_doms.get(producer_ptype) ||
+               !GpuLower::current()
+                    ->concretizedBroadcastDomains()
+                    .isConcretized(p_id))) {
+            continue;
+          }
+
+          // If matching dims and matching parallel types, no comm is necessary.
+          if (producer_ptype == consumer_ptype &&
+              loop_map.areMapped(p_id, c_id)) {
+            continue;
+          }
+
+          // Set parallel dimensions that communication is occurring over.
+          if (isParallelTypeThread(producer_ptype)) {
+            raw_dims.set(producer_ptype);
+          }
+        } // end for ptypes
+
+        if (raw_dims.hasBID()) {
+          TORCH_INTERNAL_ASSERT(
+              producer->getMemoryType() == MemoryType::Global,
+              "Inconsistent parallelization found between TV",
+              producer->name(),
+              " (",
+              producer->toString(),
+              ") and TV",
+              consumer->name(),
+              "(",
+              consumer->toString(),
+              "). Producer is required to be in Global Memory based on parallelization strategy.");
+        } else if (raw_dims.hasTID()) {
+          TORCH_INTERNAL_ASSERT(
+              producer->getMemoryType() == MemoryType::Global ||
+                  producer->getMemoryType() == MemoryType::Shared,
+              "Inconsistent parallelization found between TV",
+              producer->name(),
+              " (",
+              producer->toString(),
+              ") and TV",
+              consumer->name(),
+              "(",
+              consumer->toString(),
+              "). Producer is required to be in Global or Shared Memory based on parallelization strategy.");
+        }
+
+      } // end for consumers
+
+      if (raw_dims.any()) {
+        needs_raw_sync_[producer] = raw_dims;
+      }
+
+    } // end producer
+  }
+}
+
+std::string SyncMap::toString() const {
+  std::stringstream ss;
+  ss << "TVs requiring RAW:" << std::endl;
+  for (auto entry : needs_raw_sync_) {
+    ss << "  " << entry.first->toString() << " :: " << entry.second.toString()
+       << std::endl;
+  }
+  return ss.str();
+}
+
+} // namespace cuda
+} // namespace fuser
+} // namespace jit
+} // namespace torch
diff --git a/torch/csrc/jit/codegen/cuda/lower_sync_information.h b/torch/csrc/jit/codegen/cuda/lower_sync_information.h
new file mode 100644
index 00000000000000..09fcf9eabd7f34
--- /dev/null
+++ b/torch/csrc/jit/codegen/cuda/lower_sync_information.h
@@ -0,0 +1,45 @@
+#pragma once
+
+#include
+#include
+
+#include
+
+namespace torch {
+namespace jit {
+namespace fuser {
+namespace cuda {
+
+class SyncMap {
+ public:
+  std::string toString() const;
+
+  //! Validates that all tensors are consistently parallelized. Basically,
+  //! when a producer axis is threaded, either with threadIdx or
+  //! blockIdx, there must be a mapped consumer axis with the
+  //! same ParallelType, with some exceptions.
+  //!
+  //! This function assumes Loop and Parallel ComputeAtMaps are already
+  //! built as they are used to validate consistency.
+  //!
+  //! Fills needs_raw_sync with output TVs that need a RAW sync when they are
+  //! in smem or gmem. The value stored for each TV is the set of parallel
+  //! dimensions being communicated across.
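+  //! A minimal usage sketch (hypothetical caller, for illustration only):
+  //!
+  //!   SyncMap sync_map;
+  //!   sync_map.build(fusion);
+  //!   auto raw = sync_map.needsRawSync(producer_tv);
+  //!   if (raw.hasTID()) {
+  //!     // a block sync is needed before consumers of producer_tv read it
+  //!   }
+  //!   if (raw.hasBID()) {
+  //!     // a grid sync is needed and producer_tv must be in global memory
+  //!   }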
+ void build(Fusion* fusion); + + ParallelTypeBitmap needsRawSync(TensorView* tv) const { + auto it = needs_raw_sync_.find(tv); + if (it != needs_raw_sync_.end()) { + return it->second; + } + return ParallelTypeBitmap(); + } + + private: + std::unordered_map needs_raw_sync_; +}; + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp b/torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp index 8721490feb7917..7f77182bd71713 100644 --- a/torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp @@ -146,6 +146,21 @@ ParallelTypeBitmap getReductionPredicateForUnusedParallelTypes( void ThreadPredicateMap::updateBitSet(const Expr* expr) { FUSER_PERF_SCOPE("GpuLower::Lower::ThreadPredicateMap::updateBitSet"); + // If all of the inputs are not updated and all of the outputs have + // already mappings, don't do anything + if (std::all_of( + ir_utils::filterByType(expr->inputs()).begin(), + ir_utils::filterByType(expr->inputs()).end(), + [this](TensorView* tv) { + return updated_tvs_.find(tv) == updated_tvs_.end(); + }) && + std::all_of( + ir_utils::filterByType(expr->outputs()).begin(), + ir_utils::filterByType(expr->outputs()).end(), + [this](TensorView* tv) { return find(tv) != end(); })) { + return; + } + // Which predicates were set for the inputs ParallelTypeBitmap input_preds; @@ -181,7 +196,8 @@ void ThreadPredicateMap::updateBitSet(const Expr* expr) { for (auto id : tv_inp->domain()->domain()) { if (id->isThread()) { id_ptypes.set(id->getParallelType()); - if (id->isReduction()) { + if (id->isReduction() && + !GpuLower::current()->fusedReductionInfo().isAllreduce(id)) { id_reductions.set(id->getParallelType()); } if (id->isBroadcast() && @@ -228,9 +244,8 @@ void ThreadPredicateMap::updateBitSet(const Expr* expr) { // Run through outputs and set bitset predicates for (auto* out_tv : ir_utils::filterByType(expr->outputs())) { - TORCH_INTERNAL_ASSERT(find(out_tv) == end()); auto redundant_types = avoidRedundantWrites(out_tv); - insert(out_tv, output_preds, redundant_types); + update(out_tv, output_preds, redundant_types); } } @@ -240,12 +255,13 @@ void ThreadPredicateMap::build(Fusion* fusion) { // Initialize mapping for input tensors for (auto inp : fusion->inputs()) { if (auto tv = dynamic_cast(inp)) { - insert(tv, ParallelTypeBitmap(), ParallelTypeBitmap()); + update(tv, ParallelTypeBitmap(), ParallelTypeBitmap()); } } for (auto expr : fusion->exprs()) { updateBitSet(expr); } + updated_tvs_.clear(); } ThreadPredicateMap::const_iterator ThreadPredicateMap::find( @@ -284,17 +300,31 @@ ParallelTypeBitmap ThreadPredicateMap::getPredicatedParallelTypes( return pred_info.limited_types | pred_info.redundant_types; } -void ThreadPredicateMap::insert( +bool ThreadPredicateMap::update( const TensorView* tv, - const ParallelTypeBitmap& valid_types, + const ParallelTypeBitmap& limited_types, const ParallelTypeBitmap& redundant_types) { - insert(tv, {valid_types, redundant_types}); + return update(tv, {limited_types, redundant_types}); } -void ThreadPredicateMap::insert( +bool ThreadPredicateMap::update( const TensorView* tv, const PredicateInfo& pred_info) { - thread_predicates_.insert({tv, pred_info}); + auto existing_mapping_it = thread_predicates_.find(tv); + if (existing_mapping_it != end()) { + PredicateInfo& existing_info = existing_mapping_it->second; + if (existing_info == pred_info) { + return false; + } else { + existing_info = 
pred_info; + markAsUpdated(tv); + return true; + } + } else { + thread_predicates_.insert({tv, pred_info}); + markAsUpdated(tv); + return true; + } } Bool* ThreadPredicateMap::getPredicate(const TensorView* tv) const { @@ -333,6 +363,10 @@ ParallelTypeBitmap ThreadPredicateMap::getParallelBroadcastDomains( return parallel_broadcast & at(tv).limited_types; } +void ThreadPredicateMap::markAsUpdated(const TensorView* tv) { + updated_tvs_.insert(tv); +} + void ThreadPredicateMap::print() const { std::cout << "\nThreadPredicateMap\n"; std::cout << "--------------------------------\n"; diff --git a/torch/csrc/jit/codegen/cuda/lower_thread_predicate.h b/torch/csrc/jit/codegen/cuda/lower_thread_predicate.h index 0d7a2685b32150..2fb115953c6e75 100644 --- a/torch/csrc/jit/codegen/cuda/lower_thread_predicate.h +++ b/torch/csrc/jit/codegen/cuda/lower_thread_predicate.h @@ -48,6 +48,10 @@ class TORCH_CUDA_CU_API ThreadPredicateMap { ParallelTypeBitmap limited_types; // Parallel types where only one thread/block is enough. ParallelTypeBitmap redundant_types; + bool operator==(const PredicateInfo& other) const { + return limited_types == other.limited_types && + redundant_types == other.redundant_types; + } }; using MapType = std::unordered_map; @@ -78,6 +82,10 @@ class TORCH_CUDA_CU_API ThreadPredicateMap { //! blockBroadcast unless it is predicated by limited_types_ ParallelTypeBitmap getParallelBroadcastDomains(const TensorView* tv) const; + //! Mark tv as updated so that rebuilding the map should recompute + //! its predicates and those of its dependents. + void markAsUpdated(const TensorView* tv); + void print() const; //! Generate a Bool value from PredicateInfo. @@ -94,17 +102,19 @@ class TORCH_CUDA_CU_API ThreadPredicateMap { const PredicateInfo& at(const TensorView* tv) const; PredicateInfo& at(const TensorView* tv); - //! Insert a new mapping - void insert( + //! Update a mapping + bool update( const TensorView* tv, - const ParallelTypeBitmap& valid_types, + const ParallelTypeBitmap& limited_types, const ParallelTypeBitmap& redundant_types); - //! Insert a new mapping - void insert(const TensorView* tv, const PredicateInfo& pred_and_src); + //! Update a mapping + bool update(const TensorView* tv, const PredicateInfo& pred_and_src); private: MapType thread_predicates_; + //! 
Keep track of updated tensors that need predicates to be computed + std::unordered_set updated_tvs_; }; } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp b/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp index a8905b4d4047e8..9922b243e4eedd 100644 --- a/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp @@ -18,6 +18,7 @@ namespace { bool analyzeIfDerivedFromTrivialReduction(TensorView* tv, IterDomain* id); +// Checks the producer of tv to see if the bool traverseToRFactorTensor(TensorView* tv, IterDomain* root_id) { TORCH_INTERNAL_ASSERT( root_id->definition() == nullptr, "Not root IterDomain: ", root_id); @@ -29,6 +30,7 @@ bool traverseToRFactorTensor(TensorView* tv, IterDomain* root_id) { const auto& inputs = tv->definition()->inputs(); + // Check the reduction expression that produces tv if (inputs.size() != 1 || !inputs[0]->isA() || (tv->definition()->getExprType() != ExprType::ReductionOp && tv->definition()->getExprType() != ExprType::WelfordOp)) { @@ -63,8 +65,10 @@ bool analyzeIfDerivedFromTrivialReduction(TensorView* tv, IterDomain* id) { continue; } // If not possible to prove the root ID is trivial, see if the ID - // is derived from a rfactor tensor and, if so, continue the - // analysis at the rfactor tensor. + // is derived from a rfactor tensor. This may mean that the iteration domain + // was merged or split in another expression through rfactor. Trace back + // through rfactor expressions to find original roots and determine there if + // trivial. if (!traverseToRFactorTensor(tv, root_id)) { return false; } diff --git a/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.h b/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.h index 9ccbc2f78285d0..655d64a0417973 100644 --- a/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.h +++ b/torch/csrc/jit/codegen/cuda/lower_trivial_reductions.h @@ -20,6 +20,8 @@ class TORCH_CUDA_CU_API TrivialReductionInfo { void build(Fusion* fusion); bool isDerived(IterDomain* id) const; + + // TODO: Not used, cleanup bool isDerivedFromRoot(IterDomain* id) const; private: diff --git a/torch/csrc/jit/codegen/cuda/lower_utils.cpp b/torch/csrc/jit/codegen/cuda/lower_utils.cpp index ba2f618efae06e..49852aff5e8320 100644 --- a/torch/csrc/jit/codegen/cuda/lower_utils.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_utils.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -92,10 +93,12 @@ bool isTvOp(const Expr* expr) { expr->getExprType().value() == ExprType::TernaryOp || expr->getExprType().value() == ExprType::ReductionOp || expr->getExprType().value() == ExprType::WelfordOp || + expr->getExprType().value() == ExprType::MmaOp || expr->getExprType().value() == ExprType::BroadcastOp || expr->getExprType().value() == ExprType::TransposeOp || expr->getExprType().value() == ExprType::ShiftOp || expr->getExprType().value() == ExprType::GatherOp || + expr->getExprType().value() == ExprType::ViewDtypeOp || expr->getExprType().value() == ExprType::ViewOp || expr->getExprType().value() == ExprType::GridReduction || expr->getExprType().value() == ExprType::GridBroadcast || @@ -334,48 +337,21 @@ BasicAllocInfo getAllocInformation( namespace { -class ReplaceExprInput : public OptOutDispatch { +class ReplaceExprInput : private kir::ExprMutator { public: - using OptOutDispatch::handle; - static Expr* replace( - Expr* expr, - const std::unordered_map& replacement_map) { - ReplaceExprInput replacer(expr, 
replacement_map); - TORCH_INTERNAL_ASSERT(expr != nullptr); - replacer.handle(expr); - TORCH_INTERNAL_ASSERT(replacer.replaced_expr_ != nullptr); - auto ret_expr = replacer.replaced_expr_; - - // Copy predicates if the original expr is predicated - if (ret_expr != expr) { - ret_expr->setPredicate(expr->predicate()); - ret_expr->setWritePredicate(expr->writePredicate()); - } - return ret_expr; - } - static std::vector replace( - const std::vector& scope, + const std::vector& exprs, const std::unordered_map& replacement_map) { - std::vector ret_expr; - ret_expr.reserve(scope.size()); - - for (auto expr : scope) { - ret_expr.push_back(replace(expr, replacement_map)); - } - - return ret_expr; + ReplaceExprInput replacer(replacement_map); + replacer.traverseAndInsert(exprs); + return replacer.exprs_; } private: - // TODO: Replace this with mutator, example of this is done in replace - // symbolic sizes - ReplaceExprInput( - Expr* expr, - const std::unordered_map& replacement_map) - : replacement_map_(replacement_map) { - replaced_expr_ = expr; - } + ReplaceExprInput(const std::unordered_map& replacement_map) + : replacement_map_(replacement_map) {} + + using kir::ExprMutator::handle; c10::optional> getMaybeInputReplacementMap( Expr* expr) { @@ -398,93 +374,77 @@ class ReplaceExprInput : public OptOutDispatch { } } - // IR visitor interface - void handle(kir::ForLoop* for_loop) final { - auto new_for_loop = IrBuilder::create(for_loop); - - auto replaced_loop_body = - replace(for_loop->body().exprs(), replacement_map_); - - for (auto new_expr : replaced_loop_body) { - new_for_loop->body().push_back(new_expr); - } - replaced_expr_ = new_for_loop; - } - - void handle(kir::IfThenElse* ite) final { - auto new_ite = IrBuilder::create(ite->predicate()); - auto replaced_then_body = - replace(ite->thenBody().exprs(), replacement_map_); - for (auto new_expr : replaced_then_body) { - new_ite->thenBody().push_back(new_expr); - } - if (ite->hasElse()) { - auto replaced_else_body = - replace(ite->elseBody().exprs(), replacement_map_); - for (auto new_expr : replaced_else_body) { - new_ite->elseBody().push_back(new_expr); - } - } - replaced_expr_ = new_ite; + // Copy predicates and register expression replacement + void registerReplaceWithPredicate(Expr* old_expr, Expr* new_expr) { + new_expr->setPredicate(old_expr->predicate()); + new_expr->setWritePredicate(old_expr->writePredicate()); + registerReplace(old_expr, new_expr); } void handle(UnaryOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->getUnaryOpType(), node->out(), replaced_inputs.value().at(node->in())); + registerReplaceWithPredicate(node, replacement); } } + void handle(BinaryOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->getBinaryOpType(), node->out(), replaced_inputs.value().at(node->lhs()), replaced_inputs.value().at(node->rhs())); + registerReplaceWithPredicate(node, replacement); } } void handle(TernaryOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->getTernaryOpType(), node->out(), replaced_inputs.value().at(node->in1()), replaced_inputs.value().at(node->in2()), replaced_inputs.value().at(node->in3())); + 
registerReplaceWithPredicate(node, replacement); } } void handle(ReductionOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->getReductionOpType(), node->init(), node->out(), - replaced_inputs.value().at(node->in())); + replaced_inputs.value().at(node->in()), + node->isFused()); + registerReplaceWithPredicate(node, replacement); } } void handle(BroadcastOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->out(), replaced_inputs.value().at(node->in()), node->getBroadcastDimFlags()); + registerReplaceWithPredicate(node, replacement); } } void handle(WelfordOp* node) final { auto replaced_inputs = getMaybeInputReplacementMap(node); if (replaced_inputs.has_value()) { - replaced_expr_ = IrBuilder::create( + auto replacement = IrBuilder::create( node->outAvg(), node->outVar(), node->outN(), @@ -494,11 +454,24 @@ class ReplaceExprInput : public OptOutDispatch { replaced_inputs.value().at(node->inAvg()), replaced_inputs.value().at(node->inVar()), replaced_inputs.value().at(node->inN())); + registerReplaceWithPredicate(node, replacement); + } + } + + void handle(MmaOp* node) final { + auto replaced_inputs = getMaybeInputReplacementMap(node); + if (replaced_inputs.has_value()) { + auto replacement = IrBuilder::create( + node->out(), + replaced_inputs.value().at(node->inA()), + replaced_inputs.value().at(node->inB()), + node->init(), + node->options()); + registerReplaceWithPredicate(node, replacement); } } private: - Expr* replaced_expr_ = nullptr; const std::unordered_map& replacement_map_; }; @@ -510,6 +483,15 @@ std::vector replaceInputsInExpr( return ReplaceExprInput::replace(exprs, replacement_map); } +bool isTrivialIterDomain(IterDomain* id) { + auto pt = id->getParallelType(); + return id->isReduction() || id->isBroadcast() || id->isStride() || + (id->extent()->isOneInt() && id->start()->isZeroInt()) || + pt == ParallelType::Vectorize || + (isParallelTypeThread(pt) && + !GpuLower::current()->haloInfo().hasHaloWidth(id)); +} + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/lower_utils.h b/torch/csrc/jit/codegen/cuda/lower_utils.h index 4ed6c25e731a5b..39fec2aef103ec 100644 --- a/torch/csrc/jit/codegen/cuda/lower_utils.h +++ b/torch/csrc/jit/codegen/cuda/lower_utils.h @@ -137,6 +137,9 @@ std::vector replaceInputsInExpr( const std::vector& exprs, const std::unordered_map& replacement_map); +// True if an IterDomain does not materialize a loop +bool isTrivialIterDomain(IterDomain* id); + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/lower_validation.cpp b/torch/csrc/jit/codegen/cuda/lower_validation.cpp index 25ba76ee71b2da..5f30cb513f55a7 100644 --- a/torch/csrc/jit/codegen/cuda/lower_validation.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_validation.cpp @@ -1,5 +1,6 @@ #include +#include #include #include #include @@ -10,6 +11,7 @@ #include #include +#include #include namespace torch { @@ -260,6 +262,116 @@ class VectorizeValidator : public OptInDispatch { domains_.insert(m->inner()); } + // For the producer tensor, it's indexed first by transformed like + // the consumer. So, to find its contig merged domain, use the + // consumer TensorDomain with the producer contiguity info. 
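+  // Illustrative example (hypothetical tensors): if the consumer roots
+  // [I0, I1] map to producer roots [J0, J1], the returned vector is
+  // {contiguity(J0), contiguity(J1)}, where each flag is looked up at the
+  // position of the mapped root in the producer's rfactor domain, so the
+  // flags come back in consumer-root order even if the producer stores its
+  // roots in a different order.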
+ static std::vector mapProducerContiguity( + TensorView* producer_tv, + TensorView* consumer_tv) { + const auto c2p = PairwiseRootDomainMap(producer_tv, consumer_tv) + .mapConsumerToProducer( + consumer_tv->domain(), producer_tv->domain()); + + std::vector producer_contiguity; + + for (auto consumer_root_id : consumer_tv->getRootDomain()) { + auto producer_root_id = c2p.at(consumer_root_id); + auto producer_root_it = std::find( + producer_tv->getMaybeRFactorDomain().begin(), + producer_tv->getMaybeRFactorDomain().end(), + producer_root_id); + TORCH_INTERNAL_ASSERT( + producer_root_it != producer_tv->getMaybeRFactorDomain().end()); + auto producer_root_id_offset = std::distance( + producer_tv->getMaybeRFactorDomain().begin(), producer_root_it); + producer_contiguity.push_back( + producer_tv->domain()->contiguity().at(producer_root_id_offset)); + } + + return producer_contiguity; + } + + //! Find the contig root domains that a vectorized leaf domain + //! depends on. + static void fillVectorizedContigRootDomains( + TensorView* consumer_tv, + VectorizedSetInfo& info) { + auto producer_tv = + consumer_tv->definition()->inputs().at(0)->as(); + + // For each of the producer and consumer vectorized root domains, + // find the contig merged domain if exists. The extent of the + // domain is the size that must be divisible by the vectorization + // word size. Both of the producer and consumer domains must be + // divisible, so pick the one that has the smaller number of + // merged domains. + + ContigIDs consumer_contig_finder( + consumer_tv->domain()->domain(), + consumer_tv->getRootDomain(), + consumer_tv->domain()->contiguity()); + + // info.vectorized_root_id is validated at this point to be the + // last concrete root domain in consumer. + auto consumer_root_id = info.vectorized_root_id; + + // Find the root domains that are dependency of the merged contig domain. + auto consumer_indexed_it = + consumer_contig_finder.rootToIndexedID().find(consumer_root_id); + TORCH_INTERNAL_ASSERT( + consumer_indexed_it != consumer_contig_finder.rootToIndexedID().end(), + "Contiguity information not found for root domain: ", + consumer_root_id->toString()); + auto consumer_indexed_id = consumer_indexed_it->second; + // Actual indexed root domains for this consumer root domain. If + // contig merge is done, multiple root domains are included. + std::unordered_set consumer_indexed_root_ids; + if (consumer_indexed_id == consumer_root_id) { + // Indexed domain is equal to the root domain, meaning no contig + // merge is involved. + consumer_indexed_root_ids.insert(consumer_root_id); + } else { + auto consumer_within_contig_it = + consumer_contig_finder.withinContigIDs().find(consumer_indexed_id); + TORCH_INTERNAL_ASSERT( + consumer_within_contig_it != + consumer_contig_finder.withinContigIDs().end()); + consumer_indexed_root_ids = consumer_within_contig_it->second; + } + + // Note: we use the consumer domain with the producer + // contiguity. 
+ ContigIDs producer_contig_finder( + consumer_tv->domain()->domain(), + consumer_tv->getRootDomain(), + mapProducerContiguity(producer_tv, consumer_tv)); + + auto producer_indexed_it = + producer_contig_finder.rootToIndexedID().find(consumer_root_id); + TORCH_INTERNAL_ASSERT( + producer_indexed_it != producer_contig_finder.rootToIndexedID().end(), + "Contiguity information not found for root domain: ", + consumer_root_id->toString()); + auto producer_indexed_id = producer_indexed_it->second; + std::unordered_set producer_indexed_root_ids; + if (producer_indexed_id == consumer_root_id) { + producer_indexed_root_ids.insert(consumer_root_id); + } else { + auto producer_within_contig_it = + producer_contig_finder.withinContigIDs().find(producer_indexed_id); + TORCH_INTERNAL_ASSERT( + producer_within_contig_it != + producer_contig_finder.withinContigIDs().end()); + producer_indexed_root_ids = producer_within_contig_it->second; + } + + // Pick the smaller merged domain + info.contig_root_ids = + consumer_indexed_root_ids.size() < producer_indexed_root_ids.size() + ? consumer_indexed_root_ids + : producer_indexed_root_ids; + } + private: std::unordered_set domains_; IterDomain* vectorized_id_ = nullptr; @@ -284,8 +396,10 @@ class VectorizeValidator : public OptInDispatch { } } - // If no vectorized id's found simply return; - if (v_id == nullptr) { + // If no vectorized ids found simply return. If vectorized access is + // broadcast, it won't generate an actual vector instruction, so can safely + // be ignore + if (v_id == nullptr || v_id->isBroadcast()) { return; } @@ -318,7 +432,10 @@ class VectorizeValidator : public OptInDispatch { vector_size, " however, vector sizes only upto and including 16 bytes are supported."); - auto replay_exprs = StmtSort::getExprs(fusion, {v_id}, false); + auto replay_exprs = DependencyCheck::getAllExprsBetween( + {tv->getMaybeRFactorDomain().begin(), + tv->getMaybeRFactorDomain().end()}, + {v_id}); VectorizeValidator validator(v_id); @@ -376,12 +493,56 @@ class VectorizeValidator : public OptInDispatch { "Vectorized dim has to be from a contiguous inner most position: ", tv, "\n"); + + // Save info required to lowering and runtime validation + auto consumer_word_size_it = + GpuLower::current()->vectorizedAccesses().find(tv); + if (consumer_word_size_it != + GpuLower::current()->vectorizedAccesses().end()) { + consumer_word_size_it->second = std::max( + (int)vector_size_optional.value(), consumer_word_size_it->second); + } else { + GpuLower::current()->vectorizedAccesses().emplace( + tv, (int)vector_size_optional.value()); + } + auto producer_tv = tv->definition()->inputs().at(0)->as(); + auto producer_word_size_it = + GpuLower::current()->vectorizedAccesses().find(producer_tv); + if (producer_word_size_it != + GpuLower::current()->vectorizedAccesses().end()) { + producer_word_size_it->second = std::max( + (int)vector_size_optional.value(), producer_word_size_it->second); + } else { + GpuLower::current()->vectorizedAccesses().emplace( + producer_tv, (int)vector_size_optional.value()); + } + + VectorizedSetInfo vectorized_set_info; + vectorized_set_info.consumer_tv = tv; + vectorized_set_info.producer_tv = producer_tv; + // Note that VectorizedSetInfo is about each instance of + // vectorized set operations, so the word size is the size of this + // specific vectorized set. 
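+    // Illustrative example: a float tensor copied with a Vectorize extent of
+    // 4 records word_size = 4, i.e. a 16-byte access, which is the largest
+    // width accepted by the byte-size check earlier in this function.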
+ vectorized_set_info.word_size = (int)vector_size_optional.value(); + vectorized_set_info.vectorized_leaf_id = v_id; + vectorized_set_info.vectorized_root_id = validator.vectorized_id_; + // For aligned vectorize, the extent of a vectorized domain must + // be divisible by the vector word size. The domain is usually + // just one of the root domains, but can be a merged domain of + // contiguous domains. + if (!misaligned_vectorize) { + fillVectorizedContigRootDomains(tv, vectorized_set_info); + } + GpuLower::current()->vectorizedSetInfo().emplace_back(vectorized_set_info); } }; } // namespace -void validateVectorize(Fusion* fusion) { +// Uses ContigIDs to find root contig domains that a vectorized domain +// depends on. As ContigIDs depends on HaloInfo, this must be done +// after HaloInfo is created. +void validateAndCollectVectorizeInfo(Fusion* fusion) { FUSER_PERF_SCOPE("GpuLower::Lower::validateVectorize"); FusionGuard fg(fusion); @@ -443,6 +604,10 @@ void validateVectorize(Fusion* fusion) { "TensorView: ", tv); } + // Validate the vectorized domain maps to the innermost domain of + // tv. Note that we don't need to validate its producer tv as + // both Vectorize and MisalignedVectorize can only be used with + // UnaryOp::Set. if (has_vectorize_dim || has_misaligned_vectorize_dim) { VectorizeValidator::validate(tv); } @@ -451,176 +616,6 @@ void validateVectorize(Fusion* fusion) { namespace { -// Validate parallelization of a single tensor -void validateParallelizationOfTensor(TensorView* tv) { - // Each ParallelType can be used only once. - ParallelTypeBitmap pt_map; - for (size_t i = 0; i < tv->nDims(); ++i) { - auto axis = tv->axis(i); - auto ptype = axis->getParallelType(); - if (!isParallelTypeThread(ptype)) { - continue; - } - - // It doesn't matter if this axis is a non-concretized broadcast - // TODO: merging broadcast and non-broadcast - if (axis->isBroadcast() && - !GpuLower::current()->concretizedBroadcastDomains().isConcretized( - axis)) { - continue; - } - - TORCH_INTERNAL_ASSERT( - !pt_map.get(ptype), - "Multiple use of ", - ptype, - " in tensor t", - tv->name(), - ": ", - tv); - pt_map.set(ptype); - } - - // If this tensor is predicated by a paralel type, it should not be - // used to parallelize any domain of this tensor - - const auto thread_pred = - GpuLower::current()->threadPredMap().getPredicateInfo(tv); - - auto predicated_parallel_types = pt_map & thread_pred.limited_types; - - TORCH_INTERNAL_ASSERT( - predicated_parallel_types.none(), - "Invalid parallelization of tensor t", - tv->name(), - ". 
The tensor is parallelized with ", - predicated_parallel_types.toString(), - ", but it's invalid to use the types as the tensor is also predicated with them.", - ", thread pred: ", - thread_pred.limited_types.toString()); -} - -} // namespace - -void validateParallelize(Fusion* fusion) { - FUSER_PERF_SCOPE("GpuLower::Lower::validateParallelize"); - FusionGuard fg(fusion); - - const auto& par_map = GpuLower::current()->caParallelMap(); - const auto& loop_map = GpuLower::current()->caLoopMap(); - const auto& pred_map = GpuLower::current()->threadPredMap(); - - auto exprs = StmtSort::getExprs(fusion); - - for (auto expr : exprs) { - if (!ir_utils::isTvOp(expr)) { - continue; - } - // Validate parallelization of each consumer by itself - for (auto consumer : ir_utils::filterByType(expr->outputs())) { - validateParallelizationOfTensor(consumer); - } - // Validate parallelization between a producer and a consumer - for (auto producer : ir_utils::filterByType(expr->inputs())) { - // Parallelization on input tensors have no effect. - if (producer->isFusionInput()) { - continue; - } - const auto parallel_bcast_doms = - pred_map.getParallelBroadcastDomains(producer); - for (const auto i : c10::irange(producer->nDims())) { - // If a producer axis is threaded, either with threadIdx or - // blockIdx, there must be a mapped consumer axis with the - // same ParallelType. An exception is when the producer is - // allocated on shared memory and its parallelized with - // threadIdx. In that case, there is no parallelization - // constraint on the consumer as syncthreads will be inserted - // when necessary. - auto producer_axis = producer->axis(i); - auto producer_ptype = - par_map.getConcreteMappedID(producer_axis)->getParallelType(); - if (!isParallelTypeThread(producer_ptype)) { - continue; - } - // When the producer axis is a broadcast, it is not really - // parallelized unless thread-predicated - if (producer_axis->isBroadcast() && - !parallel_bcast_doms.get(producer_ptype)) { - continue; - } - // No constraint on the consumer tensor when the producer - // axis is parallelized with threadIdx and allocates on - // shared memory - if (isParallelTypeThreadDim(producer_ptype) && - producer->getMemoryType() == MemoryType::Shared) { - continue; - } - // There should be also nothing to validate when the producer - // axis is reduction. - if (producer_axis->isReduction()) { - continue; - } - // There must be a consumer axis that uses the same indexing - // with the same parallel type as the producer axis. The loop - // map is used to to find such an axis. Broadcast forwarding - // does not cause any inconsistent parallelization as indexing - // takes care of the forwarding. - for (auto consumer : - ir_utils::filterByType(expr->outputs())) { - auto it = std::find_if( - consumer->domain()->domain().begin(), - consumer->domain()->domain().end(), - [&](IterDomain* consumer_axis) { - return loop_map.areMapped(producer_axis, consumer_axis); - }); - TORCH_INTERNAL_ASSERT( - it != consumer->domain()->domain().end(), - "Inconsistent parallelization found between TV", - producer->name(), - " (", - producer, - ") and TV", - consumer->name(), - "(", - consumer, - "). ", - "TV", - consumer->name(), - " does not have a matching axis for parallelized producer axis, ", - producer_axis, - ". 
CA Map: ", - loop_map.toString()); - auto consumer_axis = *it; - auto consumer_ptype = - par_map.getConcreteMappedID(consumer_axis)->getParallelType(); - TORCH_INTERNAL_ASSERT( - producer_ptype == consumer_ptype, - "Inconsistent parallelization found between TV", - producer->name(), - " (", - producer, - ") and TV", - consumer->name(), - "(", - consumer, - "). " - "Producer axis, ", - producer_axis, - " is parallelized with ", - stringifyThread(producer_ptype), - ", but the parallel type of its matching consumer axis, ", - consumer_axis, - " is ", - stringifyThread(consumer_ptype), - "."); - } - } - } - } -} - -namespace { - // Backward propagation of partial ranges from outputs to // inputs. Necessary to determine required ranges to compute. // @@ -802,6 +797,95 @@ void validatePartialSplit(Fusion* fusion) { } } +namespace { + +//! Utility to make sure targeted gpu capability is +//! higher than provided major.minor. +void validateMinimumArch(int major, int minor) { + auto prop = at::cuda::getCurrentDeviceProperties(); + TORCH_INTERNAL_ASSERT(prop->major >= major); + if (prop->major == major) { + TORCH_INTERNAL_ASSERT(prop->minor >= minor); + } +} + +//! Validates that the operand and result tensors +//! of mma ops are swizzled and also validates +//! specialization of tidx as lane id. +void validateMmaTensors(MmaOp* mma) { + bool tidx_validated = false; + std::vector to_validate = { + mma->inA()->as(), + mma->inB()->as(), + mma->out()->as()}; + + for (auto tv : to_validate) { + for (auto id : tv->domain()->domain()) { + auto ptype = id->getParallelType(); + if (ptype == ParallelType::TIDx) { + TORCH_INTERNAL_ASSERT( + id->isMmaSwizzled(), + "TIDx for mma input/output must be set by WarpMmaSwizzler", + id, + tv); + if (!tidx_validated) { + // Check that TIDx is exact lane_id + const auto& paralel_dim_map = + GpuLower::current()->parallelDimensionMap(); + TORCH_INTERNAL_ASSERT( + paralel_dim_map.isExact(ptype) && + paralel_dim_map.get(ptype)->getInt().has_value() && + paralel_dim_map.get(ptype)->getInt().value() == + at::cuda::warp_size(), + "TIDx is reserved for lane id in mma kernels, and it needs to be exactly a warp"); + tidx_validated = true; + } + } + } + } + + // Note: this check will be relaxed in a follow up. + auto validate_operand_ids = [](const TensorView* tv) { + TORCH_INTERNAL_ASSERT( + std::all_of( + tv->domain()->domain().begin() + tv->getComputeAtPosition(), + tv->domain()->domain().end(), + [](IterDomain* id) { + return id->isMmaSwizzled() || + (id->isBroadcast() && + id->getParallelType() == ParallelType::Serial); + }), + "All id's on the right of CA pos needs to be mma-swizzled by WarpMmaSwizzler\n", + tv); + }; + + validate_operand_ids(mma->inA()->as()); + validate_operand_ids(mma->inB()->as()); +} + +} // namespace + +//! Validate data format and GPU arch compatibility of scheduled +//! mma operators on the fusion. 
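+//! For example, a fusion scheduled with MmaOptions::MacroType::Volta_16_16_4
+//! is only accepted when the target GPU reports compute capability 7.0 or
+//! newer; any other macro type is currently rejected.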
+void validateMma(Fusion* fusion) { + auto exprs = StmtSort::getExprs(fusion); + + for (auto expr : exprs) { + if (auto mma = dynamic_cast(expr)) { + validateMmaTensors(mma); + + switch (mma->options().macro) { + case MmaOptions::MacroType::Volta_16_16_4: + validateMinimumArch(7, 0); + break; + default: + TORCH_INTERNAL_ASSERT(false, "validate mma: unsupported macro"); + break; + } + } + } +} + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/lower_validation.h b/torch/csrc/jit/codegen/cuda/lower_validation.h index 115df13c32201e..a3009fc5dddc32 100644 --- a/torch/csrc/jit/codegen/cuda/lower_validation.h +++ b/torch/csrc/jit/codegen/cuda/lower_validation.h @@ -11,16 +11,9 @@ namespace cuda { void validateIr(Fusion* fusion); -void validateVectorize(Fusion* fusion); - -//! Validates all tensors are consistently parallelized. Basically, -//! when a producer axis is threaded, either with threadIdx or -//! blockIdx, there must be a mapped consumer axis with the -//! same ParallelType with some exceptions. -//! -//! This function assumes Loop and Parallel ComputeAtMaps are already -//! built as they are used to validate consistency. -void validateParallelize(Fusion* fusion); +//! Validate vectorization and collect information on vectorization +//! used in code generation as well as runtime validation. +void validateAndCollectVectorizeInfo(Fusion* fusion); //! Validates partial split expressions. Partial split only uses an //! inner subdomain specified by start and stop offsets, ignoring the @@ -30,6 +23,10 @@ void validateParallelize(Fusion* fusion); //! calculated that are necessary for output values. void validatePartialSplit(Fusion* fusion); +//! Validate data format and GPU arch compatibility of scheduled +//! mma operators on the fusion. +void validateMma(Fusion* fusion); + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/lower_warp_reduce.cpp b/torch/csrc/jit/codegen/cuda/lower_warp_reduce.cpp index 630d3128e783d6..1d87790c014fb8 100644 --- a/torch/csrc/jit/codegen/cuda/lower_warp_reduce.cpp +++ b/torch/csrc/jit/codegen/cuda/lower_warp_reduce.cpp @@ -13,6 +13,46 @@ namespace cuda { namespace { +//! A helper class for EliminateDeadBroadcastAndAllocate. Eliminate +//! dead Allocate and Broadcast detected by EliminateDeadBroadcastAndAllocate. +class DeadTvEliminator : private kir::ExprMutator { + public: + static std::vector run( + const std::vector& exprs, + const std::unordered_set& dead_tvs) { + return DeadTvEliminator(exprs, dead_tvs).exprs_; + } + + private: + DeadTvEliminator( + const std::vector& exprs, + const std::unordered_set& dead_tvs) + : dead_tvs_(dead_tvs) { + traverseAndInsert(exprs); + } + + using kir::ExprMutator::handle; + + void handle(kir::Allocate* allocate) final { + if (auto buffer_tv = dynamic_cast(allocate->buffer())) { + if (dead_tvs_.count(buffer_tv)) { + registerRemove(allocate); + } + } + } + + void handle(BroadcastOp* broadcast) final { + if (auto out_ti = dynamic_cast(broadcast->out())) { + if (dead_tvs_.count(out_ti->view())) { + registerRemove(broadcast); + } + } + } + + private: + const std::unordered_set& dead_tvs_; +}; + //! A simple DCE for eliminating the //! parallel broadcasts that has been fused //! 
and their corresponding allocations @@ -20,14 +60,13 @@ class EliminateDeadBroadcastAndAllocate { public: static std::vector run(const std::vector& exprs) { EliminateDeadBroadcastAndAllocate dce(exprs); - return dce.result_exprs_; + return DeadTvEliminator::run(exprs, dce.dead_tvs_); } private: EliminateDeadBroadcastAndAllocate(const std::vector& exprs) { findLiveTvs(exprs); findDeadTvs(); - eliminateDeadCode(exprs); } void findLiveTvs(const std::vector& exprs) { @@ -70,93 +109,10 @@ class EliminateDeadBroadcastAndAllocate { } } - void eliminateDeadCode(const std::vector& exprs) { - result_exprs_ = eliminateDeadCodeInScope(exprs); - } - - bool shouldEliminate(Expr* expr) { - if (auto allocate = dynamic_cast(expr)) { - if (auto buffer_tv = dynamic_cast(allocate->buffer())) { - if (dead_tvs_.count(buffer_tv)) { - return true; - } - } - } else if (auto broadcast = dynamic_cast(expr)) { - if (auto out_ti = dynamic_cast(broadcast->out())) { - if (dead_tvs_.count(out_ti->view())) { - return true; - } - } - } - return false; - } - - //! Returns a new vector of exprs with dead exprs - //! eliminated. - std::vector eliminateDeadCodeInScope(const std::vector& exprs) { - std::vector result_exprs; - - for (auto expr : exprs) { - auto result_expr = expr; - if (auto for_loop = dynamic_cast(expr)) { - result_expr = eliminateDeadCode(for_loop); - } else if (auto ite = dynamic_cast(expr)) { - result_expr = eliminateDeadCode(ite); - } else { - if (shouldEliminate(expr)) { - result_expr = nullptr; - } - } - - // Push the result expr if not eliminated - if (result_expr) { - result_exprs.push_back(result_expr); - } - } - - return result_exprs; - } - - kir::ForLoop* eliminateDeadCode(kir::ForLoop* for_loop) { - auto new_loop_body = eliminateDeadCodeInScope(for_loop->body().exprs()); - if (new_loop_body.empty()) { - return nullptr; - } - - // TODO: we will need a kernel_ir cloner to make this - // kind of logic re-usable. - auto new_loop = scope_utils::cloneForLoop(for_loop); - - for (auto expr : new_loop_body) { - new_loop->body().push_back(expr); - } - return new_loop; - } - - kir::IfThenElse* eliminateDeadCode(kir::IfThenElse* ite) { - auto new_then_body = eliminateDeadCodeInScope(ite->thenBody().exprs()); - auto new_else_body = eliminateDeadCodeInScope(ite->elseBody().exprs()); - if (new_then_body.empty() && new_else_body.empty()) { - return nullptr; - } - - auto new_ite = scope_utils::cloneIfThenElse(ite); - - for (auto expr : new_then_body) { - new_ite->thenBody().push_back(expr); - } - for (auto expr : new_else_body) { - new_ite->elseBody().push_back(expr); - } - return new_ite; - } - private: std::unordered_set live_tvs_; std::unordered_set dead_tvs_; std::unordered_set candidate_tv_set_; - - std::vector result_exprs_; }; //! A pass to eliminate redundant parallel broadcasts that are consumers @@ -220,6 +176,7 @@ class FuseBroadcastWithWarpReduce : private kir::IrVisitor { } } } + kir::IrVisitor::handle(expr); } bool openLoopNestLevel(IterDomain* id) { diff --git a/torch/csrc/jit/codegen/cuda/manager.cpp b/torch/csrc/jit/codegen/cuda/manager.cpp index 0f5967c004d103..94a57959d57d55 100644 --- a/torch/csrc/jit/codegen/cuda/manager.cpp +++ b/torch/csrc/jit/codegen/cuda/manager.cpp @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -182,7 +183,6 @@ void compileCudaFusionGroup(Node* fusion_node) { // node only insert meta information after itself). 
PropagateShapesOnGraph(graph); TypePropagate(graph); - PropagateShapesOnGraph(graph); int32_t fusion_cache_id = CudaFusionManager::getManager().registerOrGetCacheId(graph); @@ -209,7 +209,7 @@ void runCudaFusionGroup(const Node* fusion_node, Stack& stack) { FUSER_PERF_SCOPE("nvFuser::Manager::runCudaFusionGroup"); // Fallback to use if anything goes wrong - auto take_fallback = [&]() { + auto take_fallback = [&](Stack& stack) { // copying graph here since we are eliminating shape information; auto copied_graph = fusion_node->g(attr::Subgraph)->copy(); EraseShapeInformation(copied_graph); @@ -217,6 +217,24 @@ void runCudaFusionGroup(const Node* fusion_node, Stack& stack) { InterpreterState{Code(copied_graph, "fallback_cuda_fuser")}.run(stack); }; + c10::optional stack_copy; + auto compare_callback = getCudaFuserComparisonCallback(); + if (compare_callback.run_fallback) { + // make a copy of the stack + int64_t inputs_size = + static_cast(fusion_node->g(attr::Subgraph)->inputs().size()); + TORCH_INTERNAL_ASSERT(stack.size() >= inputs_size); + stack_copy = Stack(); + stack_copy->insert( + stack_copy->end(), stack.begin(), stack.end() - inputs_size); + // deepcopy the last (inputs_size) stack items + std::transform( + stack.end() - inputs_size, + stack.end(), + std::back_inserter(*stack_copy), + [](const c10::IValue& ivalue) { return ivalue.deepcopy(); }); + } + auto run_fusion = [&]() { TORCH_CHECK( fusion_node->kind() == prim::CudaFusionGroup, @@ -253,11 +271,45 @@ void runCudaFusionGroup(const Node* fusion_node, Stack& stack) { "Failed for some reason. To debug try disable codegen fallback path" "via setting the env variable" "`export PYTORCH_NVFUSER_DISABLE_FALLBACK=1`"); - take_fallback(); + take_fallback(stack); } } else { run_fusion(); } + + if (compare_callback.callback != nullptr) { + Stack fused_outputs; + Stack fallback_outputs; + int64_t output_count = + static_cast(fusion_node->g(attr::Subgraph)->outputs().size()); + TORCH_CHECK( + output_count <= stack.size(), + "Expected ", + output_count, + " outputs but found only ", + stack.size(), + " items on the stack"); + + fused_outputs.insert( + fused_outputs.begin(), stack.end() - output_count, stack.end()); + + if (stack_copy) { + take_fallback(*stack_copy); + TORCH_CHECK( + stack_copy->size() == stack.size(), + "Fused graph returns stack with ", + stack.size(), + " items, compared to ", + stack_copy->size(), + " from unfused graph"); + fallback_outputs.insert( + fallback_outputs.begin(), + stack_copy->end() - output_count, + stack_copy->end()); + } + auto graph_str = fusion_node->g(attr::Subgraph)->toString(); + compare_callback.callback(fused_outputs, fallback_outputs, graph_str); + } } } // namespace cuda diff --git a/torch/csrc/jit/codegen/cuda/mma_type.cpp b/torch/csrc/jit/codegen/cuda/mma_type.cpp new file mode 100644 index 00000000000000..3751cdea6bcf67 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/mma_type.cpp @@ -0,0 +1,139 @@ +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +MmaBuilder::MmaBuilder( + MmaOptions::MacroType macro, + MatMulTileOptions gemm_tile) { + option_.macro = macro; + // Calculate accumulator stride, will be removed once transpose swizzle ready + int outer_stride = gemm_tile.warp_tile.n / gemm_tile.instruction_tile.n; + switch (macro) { + // Numbers depend on actual output layout of mma instruction + case MmaOptions::MacroType::Volta_16_16_4: + option_.accumulator_stride = outer_stride * 4; + break; + default: + TORCH_CHECK(false, "unsupported macro"); + break; + } 
+} + +MmaBuilder& MmaBuilder::layout(MmaOptions::MmaInputLayout layout) { + option_.operand_layout = layout; + return *this; +} + +MmaBuilder& MmaBuilder::operand(MmaOptions::Operand a_or_b) { + option_.operand = a_or_b; + return *this; +} + +// TODO: validate op config +MmaOptions MmaBuilder::build() const { + return option_; +} + +bool isVolta(MmaOptions::MacroType macro) { + return macro == MmaOptions::MacroType::Volta_16_16_4; +} + +bool isTuring(MmaOptions::MacroType macro) { + return macro == MmaOptions::MacroType::Turing_16_8_16; +} + +bool isAmpere(MmaOptions::MacroType macro) { + return false; +} + +int getOutputRegisterSize(MmaOptions::MacroType macro) { + switch (macro) { + case MmaOptions::MacroType::Volta_16_16_4: + return 8; + break; + default: + TORCH_INTERNAL_ASSERT(false, "unknown macro"); + break; + } + return -1; +} + +int getInputARegisterSize(MmaOptions::MacroType macro) { + switch (macro) { + case MmaOptions::MacroType::Volta_16_16_4: + return 4; + break; + default: + TORCH_INTERNAL_ASSERT(false, "unknown macro"); + break; + } + return -1; +} + +int getInputBRegisterSize(MmaOptions::MacroType macro) { + switch (macro) { + case MmaOptions::MacroType::Volta_16_16_4: + return 4; + break; + default: + TORCH_INTERNAL_ASSERT(false, "unknown macro"); + break; + } + return -1; +} + +bool isOperandTransposed(MmaOptions options) { + switch (options.operand) { + case MmaOptions::Operand::A: + return options.operand_layout == MmaOptions::MmaInputLayout::TT || + options.operand_layout == MmaOptions::MmaInputLayout::TN; + case MmaOptions::Operand::B: + return options.operand_layout == MmaOptions::MmaInputLayout::TT || + options.operand_layout == MmaOptions::MmaInputLayout::NT; + default: + TORCH_CHECK(false, "isOperandTransposed: please specify operand"); + } + return false; +} + +std::string toString(MmaOptions::MmaInputLayout input_layout) { + std::stringstream ss; + switch (input_layout) { + case MmaOptions::MmaInputLayout::TT: + ss << "TT"; + break; + case MmaOptions::MmaInputLayout::TN: + ss << "TN"; + break; + case MmaOptions::MmaInputLayout::NT: + ss << "NT"; + break; + default: + TORCH_INTERNAL_ASSERT(false, "unsupported operand layout"); + } + return ss.str(); +} + +std::string toString(MmaOptions::MacroType mt) { + std::stringstream ss; + switch (mt) { + case MmaOptions::MacroType::NoMMA: + ss << "NoOp"; + break; + case MmaOptions::MacroType::Volta_16_16_4: + ss << "M16N16K4"; + break; + default: + TORCH_INTERNAL_ASSERT(false, "undefined mma type"); + break; + } + return ss.str(); +} + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/mma_type.h b/torch/csrc/jit/codegen/cuda/mma_type.h new file mode 100644 index 00000000000000..5f42d41ded65e1 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/mma_type.h @@ -0,0 +1,132 @@ +#pragma once +#include +#include + +namespace torch { +namespace jit { +namespace fuser { +namespace cuda { + +//! Utility data structure for recording gemm tiles +struct GemmTile { + int m, n, k; + GemmTile(int m_, int n_, int k_) : m(m_), n(n_), k(k_) {} + + bool operator==(const GemmTile& other) { + return m == other.m && n == other.n && k == other.k; + } + + GemmTile operator/(const GemmTile& other) { + return GemmTile(m / other.m, n / other.n, k / other.k); + } +}; + +//! 
Utility data structure for recording the tile sizes of a matmul at the
+//! CTA, warp, and instruction levels
+struct TORCH_CUDA_CU_API MatMulTileOptions {
+  GemmTile cta_tile = GemmTile(128, 128, 32);
+  GemmTile warp_tile = GemmTile(64, 64, 32);
+  GemmTile instruction_tile = GemmTile(16, 8, 16);
+
+  MatMulTileOptions() = default;
+  MatMulTileOptions(
+      GemmTile cta_tile_,
+      GemmTile warp_tile_,
+      GemmTile instruction_tile_)
+      : cta_tile(cta_tile_),
+        warp_tile(warp_tile_),
+        instruction_tile(instruction_tile_) {}
+
+  bool operator==(const MatMulTileOptions& other) {
+    return cta_tile == other.cta_tile && warp_tile == other.warp_tile &&
+        instruction_tile == other.instruction_tile;
+  }
+};
+
+//! Information for configuring and lowering mma ops
+struct MmaOptions {
+  //! Type of mma intrinsic macro to use
+  //!  This selects which mma intrinsic from the runtime strings
+  //!  is generated to implement the mma op. The current plan
+  //!  is to have exactly one macro for each
+  //!  (arch, datatype, operand layout) triple, though there
+  //!  exist multiple possibilities for some cases, e.g. for Turing and fp16
+  //!  one can use 16_8_8 or 16_8_16.
+  //!  Will consider adding more choices that the scheduler can pick from
+  //!  when our perf target becomes more fine-grained, which is more likely in
+  //!  latency-bound kernels.
+  enum class MacroType {
+    NoMMA = 0,
+    Volta_16_16_4,
+    Turing_16_8_16, // placeholder for turing/ampere mma
+    Ampere_16_8_8 // placeholder for tf32
+  };
+
+  //! [Operand Layout Convention]
+  //! Operand layout, T=transposed/row_major, N=normal/col_major
+  //! We don't support calling NN mma directly since it implies
+  //! a fused transpose. The user needs to swap the operands and use
+  //! TT mma to make the transpose explicit.
+  //! Ordered by position of K
+  //! NT : K,M x K,N -> K,M,N
+  //! TT : M,K x K,N -> M,K,N
+  //! TN : M,K x N,K -> M,N,K
+  enum class MmaInputLayout { NT = 0, TT, TN };
+
+  //! Utility to annotate which input of mma this option struct describes
+  enum class Operand { NotOperand = 0, A, B };
+
+  //! Utility to annotate which mma macro this config uses.
+  MacroType macro = MacroType::NoMMA;
+
+  //! Utility to annotate transposition of operands
+  MmaInputLayout operand_layout = MmaInputLayout::TT;
+
+  //! Utility to annotate which input of mma this option struct describes
+  Operand operand = Operand::A;
+
+  //! Accumulator register stride, will be removed when the swizzle op
+  //! is introduced and the output can be labeled with a transpose swizzle.
+  int accumulator_stride = 0;
+
+  bool operator==(const MmaOptions& other) const {
+    return macro == other.macro && operand_layout == other.operand_layout &&
+        operand == other.operand &&
+        accumulator_stride == other.accumulator_stride;
+  }
+};
+
+//! User interface for generating mma options for an mma op
+class TORCH_CUDA_CU_API MmaBuilder {
+ public:
+  MmaBuilder(MmaOptions::MacroType macro, MatMulTileOptions gemm_tile);
+  MmaBuilder& layout(MmaOptions::MmaInputLayout layout);
+  MmaBuilder& operand(MmaOptions::Operand a_or_b);
+  MmaOptions build() const;
+
+ private:
+  MmaOptions option_;
+};
+
+//! GPU arch check for macro type
+bool isVolta(MmaOptions::MacroType macro);
+bool isTuring(MmaOptions::MacroType macro);
+bool isAmpere(MmaOptions::MacroType macro);
+
+//! Returns true if the given option describes a transposed operand
+bool isOperandTransposed(MmaOptions options);
+
+// Unpacked constants from macro type:
+// exact numbers are defined by each individual instruction.
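+// For example, MacroType::Volta_16_16_4 unpacks to an output register size of
+// 8 and input register sizes of 4 for both the A and B operands (see the
+// definitions in mma_type.cpp above).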
+int getOutputRegisterSize(MmaOptions::MacroType macro); +int getInputARegisterSize(MmaOptions::MacroType macro); +int getInputBRegisterSize(MmaOptions::MacroType macro); + +// MMA stringify utils +std::string toString(MmaOptions::MacroType macro); +std::string toString(MmaOptions::MmaInputLayout input_layout); +std::string toString(MmaOptions::MacroType mt); + +} // namespace cuda +} // namespace fuser +} // namespace jit +} // namespace torch diff --git a/torch/csrc/jit/codegen/cuda/mutator.cpp b/torch/csrc/jit/codegen/cuda/mutator.cpp index c24e444eb566ec..5e397a5bfa116b 100644 --- a/torch/csrc/jit/codegen/cuda/mutator.cpp +++ b/torch/csrc/jit/codegen/cuda/mutator.cpp @@ -51,6 +51,8 @@ void OptOutMutator::mutate(Double* d) {} void OptOutMutator::mutate(Int* i) {} +void OptOutMutator::mutate(ComplexDouble* c) {} + void OptOutMutator::mutate(NamedScalar* ns) {} void OptOutMutator::mutate(IterDomain* id) { @@ -181,7 +183,8 @@ void OptOutMutator::mutate(ReductionOp* rop) { auto container = rop->container(); auto rop_type = rop->getReductionOpType(); container->removeExpr(rop); - IrBuilder::create(container, rop_type, init, out, in); + IrBuilder::create( + container, rop_type, init, out, in, rop->isFused()); } namespace { @@ -230,7 +233,26 @@ void OptOutMutator::mutate(WelfordOp* wop) { init_N, in_avg, in_var, - in_N); + in_N, + wop->isFused()); +} + +void OptOutMutator::mutate(MmaOp* mma) { + Val* out = maybeMutated(mma->out()); + Val* in_a = maybeMutated(mma->inA()); + Val* in_b = maybeMutated(mma->inB()); + Val* init = mma->init(); + + if (out->sameAs(mma->out()) && in_a->sameAs(mma->inA()) && + in_b->sameAs(mma->inB())) { + return; + } + + auto container = mma->container(); + auto options = mma->options(); + container->removeExpr(mma); + auto new_mma = + IrBuilder::create(container, out, in_a, in_b, init, options); } void OptOutMutator::mutate(BroadcastOp* bop) { @@ -291,6 +313,19 @@ void OptOutMutator::mutate(GatherOp* op) { IrBuilder::create(container, out, in, window_shape, pad_width); } +void OptOutMutator::mutate(ViewDtypeOp* vop) { + TensorView* out = maybeMutated(vop->out())->as(); + TensorView* in = maybeMutated(vop->in())->as(); + + if (out->sameAs(vop->out()) && in->sameAs(vop->in())) { + return; + } + + auto container = vop->container(); + container->removeExpr(vop); + IrBuilder::create(container, out, in, vop->dtype()); +} + void OptOutMutator::mutate(ViewOp* vop) { TensorView* out = maybeMutated(vop->out())->as(); TensorView* in = maybeMutated(vop->in())->as(); @@ -344,7 +379,10 @@ void OptOutMutator::mutate(Merge* m) { void OptOutMutator::mutate(kir::Allocate*) { TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); } -void OptOutMutator::mutate(kir::Sync*) { +void OptOutMutator::mutate(kir::BlockSync*) { + TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); +} +void OptOutMutator::mutate(kir::GridSync*) { TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); } void OptOutMutator::mutate(kir::InitMagicZero*) { @@ -368,6 +406,9 @@ void OptOutMutator::mutate(kir::GridBroadcast*) { void OptOutMutator::mutate(kir::GridWelford*) { TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); } +void OptOutMutator::mutate(kir::AllocateFusedReduction*) { + TORCH_INTERNAL_ASSERT(false, "Not implemented yet."); +} void OptOutMutator::removeExpr(IrContainer* container, Expr* expr) { container->removeExpr(expr); diff --git a/torch/csrc/jit/codegen/cuda/nvfuser.cmake b/torch/csrc/jit/codegen/cuda/nvfuser.cmake new file mode 100644 index 00000000000000..5dc211eb4f6cee --- /dev/null +++ 
b/torch/csrc/jit/codegen/cuda/nvfuser.cmake @@ -0,0 +1,58 @@ +if(BUILD_SPLIT_CUDA) + set(TORCHLIB_FLAVOR torch_cuda_cu) # chose torch_cuda_cu here since JIT is in torch_cuda_cpp +elseif(USE_CUDA) + set(TORCHLIB_FLAVOR torch_cuda) +elseif(USE_ROCM) + set(TORCHLIB_FLAVOR torch_hip) +endif() + +# The list of NVFUSER runtime files +list(APPEND NVFUSER_RUNTIME_FILES + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/array.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_reduction.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_sync_atomic.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/block_sync_default.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/broadcast.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/fp16_support.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/fused_reduction.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/bf16_support.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_broadcast.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_reduction.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/grid_sync.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/helpers.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/index_utils.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/random_numbers.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/tensor.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/tuple.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/type_traits.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/welford.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/warp.cu + ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/runtime/tensorcore.cu + ${TORCH_ROOT}/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh + ${TORCH_ROOT}/aten/src/ATen/cuda/detail/UnpackRaw.cuh +) + +file(MAKE_DIRECTORY "${CMAKE_BINARY_DIR}/include/nvfuser_resources") + +# "stringify" NVFUSER runtime sources +# (generate C++ header files embedding the original input as a string literal) +set(NVFUSER_STRINGIFY_TOOL "${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/tools/stringify_file.py") +foreach(src ${NVFUSER_RUNTIME_FILES}) + get_filename_component(filename ${src} NAME_WE) + set(dst "${CMAKE_BINARY_DIR}/include/nvfuser_resources/${filename}.h") + add_custom_command( + COMMENT "Stringify NVFUSER runtime source file" + OUTPUT ${dst} + DEPENDS ${src} + COMMAND ${PYTHON_EXECUTABLE} ${NVFUSER_STRINGIFY_TOOL} -i ${src} -o ${dst} + ) + add_custom_target(nvfuser_rt_${filename} DEPENDS ${dst}) + add_dependencies(${TORCHLIB_FLAVOR} nvfuser_rt_${filename}) + + # also generate the resource headers during the configuration step + # (so tools like clang-tidy can run w/o requiring a real build) + execute_process(COMMAND + ${PYTHON_EXECUTABLE} ${NVFUSER_STRINGIFY_TOOL} -i ${src} -o ${dst}) +endforeach() + +target_include_directories(${TORCHLIB_FLAVOR} PRIVATE "${CMAKE_BINARY_DIR}/include") diff --git a/torch/csrc/jit/codegen/cuda/ops/alias.cpp b/torch/csrc/jit/codegen/cuda/ops/alias.cpp index 14aff510911e24..cc3220c742feb5 100644 --- a/torch/csrc/jit/codegen/cuda/ops/alias.cpp +++ b/torch/csrc/jit/codegen/cuda/ops/alias.cpp @@ -52,11 +52,39 @@ TensorView* applyViewTransforms( } // namespace +TensorView* view(TensorView* x, DataType dtype) { + if (x->getDataType() == dtype) { + return x; + } + + // TODO: support view(dtype) for dtypes of different size. 
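+  // For example, reinterpreting Float data as another 4-byte type is
+  // accepted, while viewing Float as Half (2 bytes) would change the element
+  // count and is rejected by the size check below.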
+ TORCH_INTERNAL_ASSERT( + dataTypeSize(x->getDataType().value()) == dataTypeSize(dtype), + "Currently, aten::view only supports viewing the data as a type with the same size."); + + std::vector out_domain; + auto inp_domain = TensorDomain::noReductions(x->getMaybeRFactorDomain()); + out_domain.reserve(inp_domain.size()); + for (auto d : inp_domain) { + out_domain.push_back(d->clone()); + } + auto out = IrBuilder::create( + x->container(), + IrBuilder::create( + out_domain, std::vector(out_domain.size(), true)), + dtype); + + IrBuilder::create(x->container(), out, x, dtype); + return out; +} + TensorView* view( TensorView* x, const std::vector& original_sizes, const std::vector& new_sizes) { - TORCH_INTERNAL_ASSERT(x->nDims() == original_sizes.size()); + TORCH_INTERNAL_ASSERT( + TensorDomain::noReductions(x->getMaybeRFactorDomain()).size() == + original_sizes.size()); auto analyze_view = analyzeView(x, original_sizes, new_sizes); @@ -90,8 +118,7 @@ TensorView* squeeze(TensorView* x, const std::vector& sizes, int dim) { if (dim < 0) { dim = (int)(x->nDims()) + dim; } - TORCH_INTERNAL_ASSERT(dim >= 0 && dim < x->nDims()); - if (sizes[dim] == 1) { + if (dim >= 0 && dim < x->nDims() && sizes[dim] == 1) { return sum(x, {dim}); } else { return set(x); diff --git a/torch/csrc/jit/codegen/cuda/ops/alias.h b/torch/csrc/jit/codegen/cuda/ops/alias.h index 8003e3268b3285..30f3de2f228b34 100644 --- a/torch/csrc/jit/codegen/cuda/ops/alias.h +++ b/torch/csrc/jit/codegen/cuda/ops/alias.h @@ -16,6 +16,8 @@ namespace jit { namespace fuser { namespace cuda { +TORCH_CUDA_CU_API TensorView* view(TensorView* x, DataType dtype); + TORCH_CUDA_CU_API TensorView* view( TensorView* x, const std::vector& original_sizes, diff --git a/torch/csrc/jit/codegen/cuda/ops/composite.cpp b/torch/csrc/jit/codegen/cuda/ops/composite.cpp index c01b7230625596..3c7c713e734d33 100644 --- a/torch/csrc/jit/codegen/cuda/ops/composite.cpp +++ b/torch/csrc/jit/codegen/cuda/ops/composite.cpp @@ -49,18 +49,6 @@ TensorView* dropout_backward(TensorView* dy, TensorView* mask, Val* scale) { return dx; } -Val* softplus(Val* x, Val* beta, Val* threshold) { - TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); - TORCH_INTERNAL_ASSERT(beta != nullptr, "Beta is invalid."); - TORCH_INTERNAL_ASSERT( - threshold != nullptr, "Threshold is not a valid Double."); - - auto op_beta = mul(x, beta); - auto maybe_result = div(log1p(exp(op_beta)), beta); - auto y = where(gt(op_beta, threshold), x, maybe_result); - return y; -} - LstmResult lstm( TensorView* prev_cell, TensorView* in_x, @@ -85,7 +73,53 @@ LstmResult lstm( return {cell, hidden}; } -Val* fast_gelu(Val* x) { +TensorView* softplus(TensorView* x, Val* beta, Val* threshold) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + TORCH_INTERNAL_ASSERT(beta != nullptr, "Beta is invalid."); + TORCH_INTERNAL_ASSERT( + threshold != nullptr, "Threshold is not a valid Double."); + + auto op_beta = mul(x, beta); + auto maybe_result = div(log1p(exp(op_beta)), beta); + auto y = where(gt(op_beta, threshold), x, maybe_result); + return y; +} + +TensorView* gelu(TensorView* x) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid"); + + auto kappa = IrBuilder::create(x->container(), M_SQRT1_2); + auto half = IrBuilder::create(x->container(), 0.5); + auto one = IrBuilder::create(x->container(), 1.); + + auto cdf = mul(half, add(one, erf(mul(x, kappa)))); + auto y = mul(x, cdf); + return y; +} + +TensorView* gelu_backward(TensorView* dy, TensorView* x) { + TORCH_INTERNAL_ASSERT(dy != nullptr, 
"Grad Output is invalid."); + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid"); + + constexpr double kAlpha = M_2_SQRTPI * M_SQRT1_2 * 0.5; + const double kHalf = 0.5; + + auto cdf_1 = mul(x, IrBuilder::create(x->container(), M_SQRT1_2)); + auto cdf_2 = erf(cdf_1); + auto cdf_3 = add(cdf_2, IrBuilder::create(x->container(), 1.)); + auto cdf_4 = mul(cdf_3, IrBuilder::create(x->container(), kHalf)); + + auto pdf_1 = mul(x, x); + auto pdf_2 = mul(pdf_1, IrBuilder::create(x->container(), -kHalf)); + auto pdf_3 = exp(pdf_2); + + auto out = addcmul( + cdf_4, x, pdf_3, IrBuilder::create(x->container(), kAlpha)); + auto dx = mul(out, dy); + return dx; +} + +TensorView* tanh_gelu(TensorView* x) { TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid"); constexpr double kBeta = M_SQRT2 * M_2_SQRTPI * 0.5; @@ -104,7 +138,7 @@ Val* fast_gelu(Val* x) { return y; } -Val* fast_gelu_backward(Val* dy, Val* x) { +TensorView* tanh_gelu_backward(TensorView* dy, TensorView* x) { TORCH_INTERNAL_ASSERT(dy != nullptr, "Grad Output is invalid."); TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid"); @@ -139,29 +173,7 @@ Val* fast_gelu_backward(Val* dy, Val* x) { return dx; } -Val* gelu_backward(Val* dy, Val* x) { - TORCH_INTERNAL_ASSERT(dy != nullptr, "Grad Output is invalid."); - TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid"); - - constexpr double kAlpha = M_2_SQRTPI * M_SQRT1_2 * 0.5; - const double kHalf = 0.5; - - auto cdf_1 = mul(x, IrBuilder::create(x->container(), M_SQRT1_2)); - auto cdf_2 = erf(cdf_1); - auto cdf_3 = add(cdf_2, IrBuilder::create(x->container(), 1.)); - auto cdf_4 = mul(cdf_3, IrBuilder::create(x->container(), kHalf)); - - auto pdf_1 = mul(x, x); - auto pdf_2 = mul(pdf_1, IrBuilder::create(x->container(), -kHalf)); - auto pdf_3 = exp(pdf_2); - - auto out = addcmul( - cdf_4, x, pdf_3, IrBuilder::create(x->container(), kAlpha)); - auto dx = mul(out, dy); - return dx; -} - -Val* tanh_backward(Val* dy, Val* tanh_x) { +TensorView* tanh_backward(TensorView* dy, TensorView* tanh_x) { TORCH_INTERNAL_ASSERT(dy != nullptr, "Grad Output is invalid."); TORCH_INTERNAL_ASSERT(tanh_x != nullptr, "Input is invalid"); diff --git a/torch/csrc/jit/codegen/cuda/ops/composite.h b/torch/csrc/jit/codegen/cuda/ops/composite.h index 63e17629f40b6a..99ce3c30a25208 100644 --- a/torch/csrc/jit/codegen/cuda/ops/composite.h +++ b/torch/csrc/jit/codegen/cuda/ops/composite.h @@ -31,8 +31,6 @@ TORCH_CUDA_CU_API TensorView* dropout_backward( TensorView* mask, Val* scale); -TORCH_CUDA_CU_API Val* softplus(Val* x, Val* beta, Val* threshold); - struct LstmResult { TensorView* cell = nullptr; TensorView* hidden = nullptr; @@ -45,10 +43,15 @@ TORCH_CUDA_CU_API LstmResult lstm( TensorView* cell_x, TensorView* out_x); -TORCH_CUDA_CU_API Val* fast_gelu(Val* x); -TORCH_CUDA_CU_API Val* fast_gelu_backward(Val* dy, Val* x); -TORCH_CUDA_CU_API Val* gelu_backward(Val* dy, Val* x); -TORCH_CUDA_CU_API Val* tanh_backward(Val* dy, Val* tanh_x); +TORCH_CUDA_CU_API TensorView* softplus( + TensorView* x, + Val* beta, + Val* threshold); +TORCH_CUDA_CU_API TensorView* gelu(TensorView* x); +TORCH_CUDA_CU_API TensorView* gelu_backward(TensorView* dy, TensorView* x); +TORCH_CUDA_CU_API TensorView* tanh_gelu(TensorView* x); +TORCH_CUDA_CU_API TensorView* tanh_gelu_backward(TensorView* dy, TensorView* x); +TORCH_CUDA_CU_API TensorView* tanh_backward(TensorView* dy, TensorView* tanh_x); } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/ops/normalization.cpp 
b/torch/csrc/jit/codegen/cuda/ops/normalization.cpp index 4a473f662039c8..6311e67dd8f67b 100644 --- a/torch/csrc/jit/codegen/cuda/ops/normalization.cpp +++ b/torch/csrc/jit/codegen/cuda/ops/normalization.cpp @@ -7,6 +7,64 @@ namespace jit { namespace fuser { namespace cuda { +int nonNegativeAxis(int axis, int ndims) { + return (axis >= 0) ? axis : (ndims + axis); +} + +Val* numFeatures(TensorView* x, const std::vector& dims, int ndims) { + Val* num_features = IrBuilder::create(x->container(), 1); + for (const auto dim : dims) { + const int axis = nonNegativeAxis(dim, ndims); + num_features = mul(num_features, x->domain()->domain()[axis]->extent()); + } + return num_features; +} + +TensorView* mean(TensorView* x, const std::vector& dims, bool keepdim) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + + const int kNumberOfDims = + TensorDomain::noReductions(x->getMaybeRFactorDomain()).size(); + + auto sum_x = sum(x, dims, keepdim); + auto y = div(sum_x, numFeatures(x, dims, kNumberOfDims)); + return y; +} + +TensorView* variance( + TensorView* x, + const std::vector& dims, + bool unbiased, + bool keepdim) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + + const int kNumberOfDims = + TensorDomain::noReductions(x->getMaybeRFactorDomain()).size(); + + auto bcast_mean = mean(x, dims, true /* keepdim */); + auto x_mean_sub = sub(x, bcast_mean); + auto x_mean_sub_sq = mul(x_mean_sub, x_mean_sub); + auto sum_x_mean_sub_sq = sum(x_mean_sub_sq, dims, keepdim); + + auto num_features = numFeatures(x, dims, kNumberOfDims); + if (unbiased) { + num_features = + sub(num_features, IrBuilder::create(x->container(), 1.)); + } + auto y = div(sum_x_mean_sub_sq, num_features); + + return y; +} + +TensorView* standard_deviation( + TensorView* x, + const std::vector& dims, + bool unbiased, + bool keepdim) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + return sqrt(variance(x, dims, unbiased, keepdim)); +} + TensorView* softmax(TensorView* x, int dim) { TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); @@ -50,6 +108,45 @@ TensorView* softmax_backward(TensorView* dy, TensorView* y, int dim) { return dx; } +TensorView* log_softmax(TensorView* x, int dim) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + + const int kNumberOfDims = + TensorDomain::noReductions(x->getMaybeRFactorDomain()).size(); + const int kReductionAxis = (dim < 0) ? dim + kNumberOfDims : dim; + TORCH_INTERNAL_ASSERT(kReductionAxis >= 0 && kReductionAxis < kNumberOfDims); + + std::vector broadcast_mask(kNumberOfDims, false); + broadcast_mask[kReductionAxis] = true; + + auto max_val = max(x, {kReductionAxis}); + auto bcast_max = broadcast(max_val, broadcast_mask); + auto x_max_sub = sub(x, bcast_max); + auto exp_val = exp(x_max_sub); + auto bcast_sum = sum(exp_val, {kReductionAxis}, true /* keepdim */); + auto log_sum_exp = log(bcast_sum); + auto y = sub(x_max_sub, log_sum_exp); + + return y; +} + +TensorView* log_softmax_backward(TensorView* dy, TensorView* y, int dim) { + TORCH_INTERNAL_ASSERT(dy != nullptr, "Grad Output is invalid."); + TORCH_INTERNAL_ASSERT(y != nullptr, "Output is invalid."); + + const int kNumberOfDims = + TensorDomain::noReductions(y->getMaybeRFactorDomain()).size(); + const int kReductionAxis = (dim < 0) ? 
dim + kNumberOfDims : dim; + TORCH_INTERNAL_ASSERT(kReductionAxis >= 0 && kReductionAxis < kNumberOfDims); + + auto bcast_sum_grad = sum(dy, {kReductionAxis}, true /* keepdim */); + auto softmax = exp(y); + auto softmax_sum_mul = mul(softmax, bcast_sum_grad); + auto dx = sub(dy, softmax_sum_mul); + + return dx; +} + ForwardNormResult layer_norm( TensorView* x, const std::vector& norm_shape, @@ -59,18 +156,9 @@ ForwardNormResult layer_norm( return layer_norm(x, norm_shape.size(), weight, bias, eps); } -ForwardNormResult layer_norm( - TensorView* x, - const size_t kNormShapeNumDims, - TensorView* weight, - TensorView* bias, - Val* eps) { - TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); - TORCH_INTERNAL_ASSERT( - eps != nullptr && eps->getDataType().has_value() && - eps->getDataType().value() == DataType::Double, - "Epsilon (eps) is not a valid Double."); - +auto norm_properties_from_num_dims( + const TensorView* x, + const size_t kNormShapeNumDims) { // (B, C, H, W, D) tensor // norm_shape = [H, W, D] // M = outer = product of remaining dimensions = B * C @@ -82,13 +170,14 @@ ForwardNormResult layer_norm( std::vector outer_reduction_axes(kOuterNumDims); std::vector outer_broadcast_mask(kNumberOfDims, false); + std::vector inner_reduction_axes(kNormShapeNumDims); + std::vector inner_broadcast_mask(kNumberOfDims, false); + for (const auto idx : c10::irange(kOuterNumDims)) { outer_reduction_axes[idx] = idx; outer_broadcast_mask[idx] = true; } - std::vector inner_reduction_axes(kNormShapeNumDims); - std::vector inner_broadcast_mask(kNumberOfDims, false); Val* num_features = IrBuilder::create(x->container(), 1); for (const auto idx : c10::irange(kNormShapeNumDims)) { const size_t axis = kNumberOfDims - 1 - idx; @@ -96,14 +185,42 @@ ForwardNormResult layer_norm( inner_broadcast_mask[axis] = true; num_features = mul(num_features, x->domain()->domain()[axis]->extent()); } + struct result { + std::vector outer_reduction_axes; + std::vector outer_broadcast_mask; + std::vector inner_reduction_axes; + std::vector inner_broadcast_mask; + Val* num_features = nullptr; + } r; + r.outer_reduction_axes = outer_reduction_axes; + r.outer_broadcast_mask = outer_broadcast_mask; + r.inner_reduction_axes = inner_reduction_axes; + r.inner_broadcast_mask = inner_broadcast_mask; + r.num_features = num_features; + return r; +} + +ForwardNormResult layer_norm( + TensorView* x, + const size_t kNormShapeNumDims, + TensorView* weight, + TensorView* bias, + Val* eps) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + TORCH_INTERNAL_ASSERT( + eps != nullptr && eps->getDataType().has_value() && + eps->getDataType().value() == DataType::Double, + "Epsilon (eps) is not a valid Double."); + + auto r = norm_properties_from_num_dims(x, kNormShapeNumDims); // Main algorithm - auto welford_out = Welford(x, inner_reduction_axes); - auto mean_bcast = broadcast(welford_out.avg, inner_broadcast_mask); + auto welford_out = Welford(x, r.inner_reduction_axes); + auto mean_bcast = broadcast(welford_out.avg, r.inner_broadcast_mask); auto x_sub_mean = sub(x, mean_bcast); - auto var_sum_bcast = broadcast(welford_out.var_sum, inner_broadcast_mask); - auto var = mul(var_sum_bcast, reciprocal(num_features)); + auto var_sum_bcast = broadcast(welford_out.var_sum, r.inner_broadcast_mask); + auto var = mul(var_sum_bcast, reciprocal(r.num_features)); auto var_eps = add(var, eps); auto invstd = rsqrt(var_eps); @@ -111,19 +228,58 @@ ForwardNormResult layer_norm( // Optional: norm * weight if (weight != nullptr) { - auto 
weight_bcast = broadcast(weight, outer_broadcast_mask); + auto weight_bcast = broadcast(weight, r.outer_broadcast_mask); y = mul(y, weight_bcast); } // Optional: norm * weight + bias if (bias != nullptr) { - auto bias_bcast = broadcast(bias, outer_broadcast_mask); + auto bias_bcast = broadcast(bias, r.outer_broadcast_mask); y = add(y, bias_bcast); } return {y, mean_bcast, invstd}; } +ForwardRMSNormResult rms_norm( + TensorView* x, + const std::vector& norm_shape, + TensorView* weight, + Val* eps) { + return rms_norm(x, norm_shape.size(), weight, eps); +} + +ForwardRMSNormResult rms_norm( + TensorView* x, + const size_t kNormShapeNumDims, + TensorView* weight, + Val* eps) { + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + TORCH_INTERNAL_ASSERT( + eps != nullptr && eps->getDataType().has_value() && + eps->getDataType().value() == DataType::Double, + "Epsilon (eps) is not a valid Double."); + + auto r = norm_properties_from_num_dims(x, kNormShapeNumDims); + + // Main algorithm + auto var_sum = sum(mul(x, x), r.inner_reduction_axes); + auto var_sum_bcast = broadcast(var_sum, r.inner_broadcast_mask); + auto var = mul(var_sum_bcast, reciprocal(r.num_features)); + auto var_eps = add(var, eps); + auto invstd = rsqrt(var_eps); + + auto y = mul(x, invstd); + + // Optional: norm * weight + if (weight != nullptr) { + auto weight_bcast = broadcast(weight, r.outer_broadcast_mask); + y = mul(y, weight_bcast); + } + + return {y, invstd}; +} + BackwardNormResult layer_norm_backward( TensorView* dy, TensorView* x, @@ -138,55 +294,30 @@ BackwardNormResult layer_norm_backward( TORCH_INTERNAL_ASSERT(mean != nullptr, "Mean is invalid."); TORCH_INTERNAL_ASSERT(invstd != nullptr, "Inv std is invalid."); - // (B, C, H, W, D) tensor - // norm_shape = [H, W, D] - // M = outer = product of remaining dimensions = B * C - // N = reduction = product of norm_shape = H * W * D - // weight = bias = norm_shape tensor - const size_t kNumberOfDims = - TensorDomain::noReductions(x->getMaybeRFactorDomain()).size(); - const size_t kNormShapeNumDims = norm_shape.size(); - const size_t kOuterNumDims = kNumberOfDims - kNormShapeNumDims; - - std::vector outer_reduction_axes(kOuterNumDims); - std::vector outer_broadcast_mask(kNumberOfDims, false); - for (const auto idx : c10::irange(kOuterNumDims)) { - outer_reduction_axes[idx] = idx; - outer_broadcast_mask[idx] = true; - } - - std::vector inner_reduction_axes(kNormShapeNumDims); - std::vector inner_broadcast_mask(kNumberOfDims, false); - Val* num_features = IrBuilder::create(x->container(), 1); - for (const auto idx : c10::irange(kNormShapeNumDims)) { - const size_t axis = kNumberOfDims - 1 - idx; - inner_reduction_axes[idx] = axis; - inner_broadcast_mask[axis] = true; - num_features = mul(num_features, x->domain()->domain()[axis]->extent()); - } + auto r = norm_properties_from_num_dims(x, norm_shape.size()); auto x_hat = mul(sub(x, mean), invstd); TensorView* grad_x_hat = nullptr; if (weight != nullptr) { - auto* bcast_weight = broadcast(weight, outer_broadcast_mask); + auto* bcast_weight = broadcast(weight, r.outer_broadcast_mask); grad_x_hat = mul(dy, bcast_weight); } else { grad_x_hat = dy; } - auto a = mul(num_features, grad_x_hat); + auto a = mul(r.num_features, grad_x_hat); - auto b = sum(grad_x_hat, inner_reduction_axes); - auto bcast_b = broadcast(b, inner_broadcast_mask); + auto b = sum(grad_x_hat, r.inner_reduction_axes); + auto bcast_b = broadcast(b, r.inner_broadcast_mask); auto c1 = mul(grad_x_hat, x_hat); - auto c2 = sum(c1, inner_reduction_axes); - auto 
bcast_c2 = broadcast(c2, inner_broadcast_mask); + auto c2 = sum(c1, r.inner_reduction_axes); + auto bcast_c2 = broadcast(c2, r.inner_broadcast_mask); auto c3 = mul(x_hat, bcast_c2); auto inner = sub(sub(a, bcast_b), c3); - auto reciprocal_size = reciprocal(num_features); + auto reciprocal_size = reciprocal(r.num_features); TensorView* dx = nullptr; if (output_mask[0]) { @@ -195,16 +326,65 @@ BackwardNormResult layer_norm_backward( TensorView* dw = nullptr; if (output_mask[1] && weight != nullptr) { - dw = sum(mul(dy, x_hat), outer_reduction_axes); + dw = sum(mul(dy, x_hat), r.outer_reduction_axes); } TensorView* db = nullptr; if (output_mask[2] && bias != nullptr) { - db = sum(dy, outer_reduction_axes); + db = sum(dy, r.outer_reduction_axes); } return {dx, dw, db}; } +BackwardRMSNormResult rms_norm_backward( + TensorView* dy, + TensorView* x, + const std::vector& norm_shape, + TensorView* invstd, + TensorView* weight, + const std::vector& output_mask) { + TORCH_INTERNAL_ASSERT(dy != nullptr, "Grad Output is invalid."); + TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); + TORCH_INTERNAL_ASSERT(invstd != nullptr, "Inv std is invalid."); + + auto r = norm_properties_from_num_dims(x, norm_shape.size()); + + auto x_hat = mul(x, invstd); + + TensorView* grad_x_hat = nullptr; + if (weight != nullptr) { + auto* bcast_weight = broadcast(weight, r.outer_broadcast_mask); + grad_x_hat = mul(dy, bcast_weight); + } else { + grad_x_hat = dy; + } + + auto a = mul(r.num_features, grad_x_hat); + + auto b = sum(grad_x_hat, r.inner_reduction_axes); + auto bcast_b = broadcast(b, r.inner_broadcast_mask); + + auto c1 = mul(grad_x_hat, x_hat); + auto c2 = sum(c1, r.inner_reduction_axes); + auto bcast_c2 = broadcast(c2, r.inner_broadcast_mask); + auto c3 = mul(x_hat, bcast_c2); + + auto inner = sub(sub(a, bcast_b), c3); + auto reciprocal_size = reciprocal(r.num_features); + + TensorView* dx = nullptr; + if (output_mask[0]) { + dx = mul(mul(reciprocal_size, invstd), inner); + } + + TensorView* dw = nullptr; + if (output_mask[1] && weight != nullptr) { + dw = sum(mul(dy, x_hat), r.outer_reduction_axes); + } + + return {dx, dw}; +} + ForwardNormResult batch_norm( TensorView* x, TensorView* weight, @@ -300,19 +480,16 @@ ForwardNormResult batch_norm( "Input running stats must have dtype defined"); auto casted_output = castOp(*rm_dtype, aliased_output); - fusion->addOutput(casted_output); fusion->aliasOutputToInput(casted_output, input_to_cast); }; if (running_mean->isFusionInput()) { - fusion->addOutput(new_mean_hat); fusion->aliasOutputToInput(new_mean_hat, running_mean); } else { cast_to_input_dtype(running_mean, new_mean_hat); } if (running_var->isFusionInput()) { - fusion->addOutput(new_var_hat); fusion->aliasOutputToInput(new_var_hat, running_var); } else { cast_to_input_dtype(running_var, new_var_hat); @@ -465,7 +642,8 @@ ForwardNormResult instance_norm( TensorView* running_var, const bool kUseInputStats, Val* momentum, - Val* eps) { + Val* eps, + bool channels_last) { auto fusion = FusionGuard::getCurFusion(); TORCH_INTERNAL_ASSERT(x != nullptr, "Input is invalid."); @@ -489,9 +667,9 @@ ForwardNormResult instance_norm( // N = reduction = H * W * D // weight = bias = C tensor const size_t kBatchDim = 0; - const size_t kChannelsDim = 1; const size_t kNumberOfDims = TensorDomain::noReductions(x->getMaybeRFactorDomain()).size(); + const size_t kChannelsDim = channels_last ? 
kNumberOfDims - 1 : 1; std::vector x_reduction_axes; std::vector x_broadcast_mask(kNumberOfDims, false); @@ -522,31 +700,51 @@ ForwardNormResult instance_norm( // updating running mean and running var if (running_mean != nullptr && running_var != nullptr) { + auto _running_mean = running_mean; + auto _running_var = running_var; + if (_running_mean->getDataType().value() == DataType::Half || + _running_mean->getDataType().value() == DataType::BFloat16) { + _running_mean = castOp(DataType::Float, _running_mean); + } + if (_running_var->getDataType().value() == DataType::Half || + _running_var->getDataType().value() == DataType::BFloat16) { + _running_var = castOp(DataType::Float, running_var); + } auto rev_momentum = sub(IrBuilder::create(x->container(), 1.0), momentum); auto current_mean_hat = mul(welford_out.avg, momentum); - auto mean_hat = mul(running_mean, rev_momentum); + auto mean_hat = mul(_running_mean, rev_momentum); auto new_mean_hat = add(mean_hat, current_mean_hat); // NS: static_cast to workaround VC++ error, see // https://godbolt.org/z/6Prd77xYs auto new_mean_sum = sum(new_mean_hat, {static_cast(kBatchDim)}); auto new_mean_channels_only = mul(new_mean_sum, reciprocal(B)); - fusion->addOutput(new_mean_channels_only); + if (running_mean->getDataType().value() == DataType::Half || + running_mean->getDataType().value() == DataType::BFloat16) { + new_mean_channels_only = + castOp(running_mean->getDataType().value(), new_mean_channels_only); + } + // fusion->addOutput(new_mean_channels_only); fusion->aliasOutputToInput(new_mean_channels_only, running_mean); auto num_feature_decrement = sub(N, x->container()->oneVal()); auto unbiased_var = mul(welford_out.var_sum, reciprocal(num_feature_decrement)); auto current_var_hat = mul(unbiased_var, momentum); - auto var_hat = mul(running_var, rev_momentum); + auto var_hat = mul(_running_var, rev_momentum); auto new_var_hat = add(var_hat, current_var_hat); // NS: static_cast to workaround VC++ error, see // https://godbolt.org/z/6Prd77xYs auto new_var_sum = sum(new_var_hat, {static_cast(kBatchDim)}); auto new_var_channels_only = mul(new_var_sum, reciprocal(B)); - fusion->addOutput(new_var_channels_only); + if (running_var->getDataType().value() == DataType::Half || + running_var->getDataType().value() == DataType::BFloat16) { + new_var_channels_only = + castOp(running_var->getDataType().value(), new_var_channels_only); + } + // fusion->addOutput(new_var_channels_only); fusion->aliasOutputToInput(new_var_channels_only, running_var); } @@ -590,6 +788,121 @@ ForwardNormResult instance_norm( return {y, mean, invstd}; } +BackwardNormResult instance_norm_backward( + TensorView* input, + TensorView* grad_output, + TensorView* weight, + TensorView* running_mean, + TensorView* running_var, + TensorView* save_mean, + TensorView* save_invstd, + const bool kTraining, + Val* eps, + const std::vector& output_mask, + bool channels_last) { + TORCH_INTERNAL_ASSERT(input != nullptr, "Input is invalid."); + TORCH_INTERNAL_ASSERT(grad_output != nullptr, "Grad Output is invalid."); + TORCH_INTERNAL_ASSERT( + eps != nullptr && eps->getDataType().has_value() && + eps->getDataType().value() == DataType::Double, + "Epsilon (eps) is not a valid Double."); + + // (B, C, H, W, D) tensor + // M = outer = channels + // N = reduction = B * H * W * D + // weight = bias = (C) tensor + const size_t kNumberOfDims = + TensorDomain::noReductions(input->getMaybeRFactorDomain()).size(); + // channels last format means C dimension is at axis kNumberOfDims-1 at x / + // grad_out + 
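The axis bookkeeping here reduces over every dimension except the batch and channel axes, with the channel axis moving to the innermost position for channels-last inputs. A small standalone sketch of that selection (a hypothetical helper, not the nvfuser code itself):

    #include <cstddef>
    #include <vector>

    // For instance-norm backward on an N-D tensor, reduce over all axes
    // except batch (axis 0) and channels (axis 1, or the last axis when
    // the input is in channels-last layout).
    std::vector<int> instanceNormReductionAxes(size_t ndims, bool channels_last) {
      const size_t b_axis = 0;
      const size_t c_axis = channels_last ? ndims - 1 : 1;
      std::vector<int> axes;
      for (size_t axis = 0; axis < ndims; ++axis) {
        if (axis != b_axis && axis != c_axis) {
          axes.push_back(static_cast<int>(axis));
        }
      }
      return axes;
    }
    // e.g. ndims = 4 (NCHW) -> {2, 3}; ndims = 4, channels_last (NHWC) -> {1, 2}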
const size_t b_axis = 0; // for clarity + const size_t c_axis = channels_last ? kNumberOfDims - 1 : 1; + + std::vector reduction_axes; + std::vector broadcast_mask(kNumberOfDims, false); + // weight has its own broadcast mask as it is broadcast for the batch unlike + // mean/var + std::vector weight_broadcast_mask(kNumberOfDims, false); + Val* num_features = nullptr; + for (const auto axis : c10::irange(kNumberOfDims)) { + if (axis != c_axis) { + weight_broadcast_mask[axis] = true; + if (axis != b_axis) { + reduction_axes.push_back(axis); + broadcast_mask[axis] = true; + if (num_features == nullptr) { + num_features = castOp( + DataType::Double, input->domain()->domain()[axis]->extent()); + } else { + num_features = + mul(num_features, input->domain()->domain()[axis]->extent()); + } + } + } + } + + auto mean = save_mean; + auto invstd = save_invstd; + if (kTraining) { + TORCH_INTERNAL_ASSERT( + save_mean != nullptr && save_invstd != nullptr, + "When training=True, save_mean and save_invstd are required."); + } else { + mean = running_mean; + invstd = rsqrt(add(running_var, eps)); + } + mean = broadcast(mean, broadcast_mask); + + auto norm = reciprocal(num_features); + + auto grad_output_sum = sum(grad_output, reduction_axes); + auto dot_p = sum(mul(grad_output, sub(input, mean)), reduction_axes); + + auto grad_mean = broadcast(mul(grad_output_sum, norm), broadcast_mask); + + auto proj_scale = + broadcast(mul(mul(dot_p, norm), mul(invstd, invstd)), broadcast_mask); + + TensorView* grad_scale = nullptr; + + if (weight == nullptr) { + grad_scale = + mul(broadcast(invstd, broadcast_mask), + IrBuilder::create(input->container(), 1)); + } else { + grad_scale = + mul(broadcast(invstd, broadcast_mask), + broadcast(weight, weight_broadcast_mask)); + } + + TensorView* grad_input = nullptr; + if (kTraining) { + auto proj = mul(sub(input, mean), proj_scale); + grad_input = mul(sub(sub(grad_output, proj), grad_mean), grad_scale); + } else { + grad_input = mul(grad_output, grad_scale); + } + + TensorView* grad_weight = nullptr; + TensorView* grad_weight_reduced = nullptr; + if (output_mask[1]) { + grad_weight = mul(dot_p, invstd); + // TODO: grad weight needs to be reduced across batch-dim but is this the + // most efficient place or can reduction happen earlier? 
+ grad_weight_reduced = sum(grad_weight, {0}); + } + + TensorView* grad_bias = nullptr; + TensorView* grad_bias_reduced = nullptr; + if (output_mask[2]) { + grad_bias = grad_output_sum; + // TODO: same as above for grad weight + grad_bias_reduced = sum(grad_bias, {0}); + } + + return {grad_input, grad_weight_reduced, grad_bias_reduced}; +} + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/ops/normalization.h b/torch/csrc/jit/codegen/cuda/ops/normalization.h index b28cdf6b33ca88..74d8cc4ab65099 100644 --- a/torch/csrc/jit/codegen/cuda/ops/normalization.h +++ b/torch/csrc/jit/codegen/cuda/ops/normalization.h @@ -28,6 +28,33 @@ struct BackwardNormResult { TensorView* grad_bias = nullptr; }; +struct ForwardRMSNormResult { + TensorView* output = nullptr; + TensorView* invstd = nullptr; +}; + +struct BackwardRMSNormResult { + TensorView* grad_input = nullptr; + TensorView* grad_weight = nullptr; +}; + +TORCH_CUDA_CU_API TensorView* mean( + TensorView* x, + const std::vector& dims, + bool keepdim); + +TORCH_CUDA_CU_API TensorView* variance( + TensorView* x, + const std::vector& dims, + bool unbiased, + bool keepdim); + +TORCH_CUDA_CU_API TensorView* standard_deviation( + TensorView* x, + const std::vector& dims, + bool unbiased, + bool keepdim); + TORCH_CUDA_CU_API TensorView* softmax(TensorView* x, int dim); TORCH_CUDA_CU_API TensorView* softmax_backward( @@ -35,6 +62,13 @@ TORCH_CUDA_CU_API TensorView* softmax_backward( TensorView* y, const int dim); +TORCH_CUDA_CU_API TensorView* log_softmax(TensorView* x, int dim); + +TORCH_CUDA_CU_API TensorView* log_softmax_backward( + TensorView* dy, + TensorView* y, + const int dim); + TORCH_CUDA_CU_API ForwardNormResult layer_norm( TensorView* x, const std::vector& norm_shape, @@ -49,6 +83,18 @@ TORCH_CUDA_CU_API ForwardNormResult layer_norm( TensorView* bias, Val* eps); +TORCH_CUDA_CU_API ForwardRMSNormResult rms_norm( + TensorView* x, + const std::vector& norm_shape, + TensorView* weight, + Val* eps); + +TORCH_CUDA_CU_API ForwardRMSNormResult rms_norm( + TensorView* x, + const size_t kNormShapeNumDims, + TensorView* weight, + Val* eps); + TORCH_CUDA_CU_API BackwardNormResult layer_norm_backward( TensorView* dy, TensorView* x, @@ -59,6 +105,14 @@ TORCH_CUDA_CU_API BackwardNormResult layer_norm_backward( TensorView* bias, const std::vector& output_mask); +TORCH_CUDA_CU_API BackwardRMSNormResult rms_norm_backward( + TensorView* dy, + TensorView* x, + const std::vector& norm_shape, + TensorView* rstd, + TensorView* weight, + const std::vector& output_mask); + TORCH_CUDA_CU_API ForwardNormResult batch_norm( TensorView* x, TensorView* weight, @@ -89,9 +143,23 @@ TORCH_CUDA_CU_API ForwardNormResult instance_norm( TensorView* bias, TensorView* running_mean, TensorView* running_var, - const bool kUseInputStats, + const bool kUseInputStats, // kTraining? 
Val* momentum, - Val* eps); + Val* eps, + bool channels_last = false); + +TORCH_CUDA_CU_API BackwardNormResult instance_norm_backward( + TensorView* x, + TensorView* dy, + TensorView* weight, + TensorView* running_mean, + TensorView* running_var, + TensorView* save_mean, + TensorView* save_invstd, + const bool kTraining, + Val* eps, + const std::vector& output_mask, + bool channels_last = false); } // namespace cuda } // namespace fuser diff --git a/torch/csrc/jit/codegen/cuda/parallel_dimension_map.cpp b/torch/csrc/jit/codegen/cuda/parallel_dimension_map.cpp index d966fc21a971a7..795eab0a634f5c 100644 --- a/torch/csrc/jit/codegen/cuda/parallel_dimension_map.cpp +++ b/torch/csrc/jit/codegen/cuda/parallel_dimension_map.cpp @@ -43,28 +43,22 @@ void ParallelDimensionMap::build(Fusion* fusion) { } void ParallelDimensionMap::registerConstantExtent(IterDomain* id) { - ExpressionEvaluator ee(id->fusion()); - auto extent_int = ee.evaluate(id->extent()); - if (!extent_int.has_value()) { + if (!id->extent()->isConstScalar()) { // Nothing to do if not constant return; } - auto const_extent = extent_int.value(); + ExpressionEvaluator ee(id->fusion()); + auto extent_int = ee.evaluate(id->extent()); + TORCH_INTERNAL_ASSERT( + extent_int.has_value(), + "Extent of ", + id->toString(), + " should have been constant, but could not be evaluated at compile time."); - // Ignore if this is derived from a size-1 domain as it is likely a - // size-1 broadcast domain and that does not represent the actual - // dimension even if it's constant. Being size-1 may not always mean - // it's a broadcast domain, but it'd be safe to assume it is mostly - // the case. If it is not a broadcast, ignoring this domain does not - // impact the correctness. - auto extent_inputs = InputsOf::output(id->fusion(), id->extent()); - if (std::any_of(extent_inputs.begin(), extent_inputs.end(), [](Val* input) { - return input->isOneInt(); - })) { - return; - } + auto const_extent = extent_int.value(); + // Uses index map auto concrete_id = getCAMappedConcreteDomain(id); auto existing_it = constant_extent_map_.find(id); @@ -106,14 +100,13 @@ void ParallelDimensionMap::populateDimensionMapWithSingleCASet( auto it = constant_extent_map_.find(id); if (it != constant_extent_map_.end()) { - if (it->second.size() == 1) { - dim_map_.insert({pt, IrBuilder::create(*(it->second.begin()))}); - exact_types_.insert(pt); - } else { - // Multiple constant dimensions found; Use the corresponding - // symbolic parallel dim - dim_map_.insert({pt, NamedScalar::getParallelDim(pt)}); - } + TORCH_INTERNAL_ASSERT( + it->second.size() == 1, + "Only one value found mapped to parallel type ", + stringifyThread(pt), + " yet its bound to multiple extents."); + dim_map_.insert({pt, IrBuilder::create(*(it->second.begin()))}); + exact_types_.insert(pt); } else { // Prefer to use blockDim/gridDim if not constant dim_map_.insert({pt, NamedScalar::getParallelDim(pt)}); @@ -200,7 +193,9 @@ void ParallelDimensionMap::adjustMappingsForWarpPadding() { // non-exact. 
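The warp-padding adjustment below treats TIDx as padded to a multiple of the warp size, and only considers it exact when blockDim.x itself (or a known single warp) defines the extent. A minimal sketch of the rounding involved, assuming the usual CUDA warp size of 32 purely for illustration:

    // Pad a thread-dimension extent up to a multiple of the warp size,
    // which is what warp-padded TIDx mappings rely on.
    constexpr int kWarpSize = 32;
    int padToWarp(int extent) {
      return ((extent + kWarpSize - 1) / kWarpSize) * kWarpSize;
    }
    // padToWarp(1) == 32, padToWarp(33) == 64, padToWarp(64) == 64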
auto& warp_info = gpu_lower->getWarpPaddedParallelInfo(); - if (!warp_info.is_tidx_padded) { + // TIDx isn't really padded if there isn't a warp reduction (this could + // change) + if (!(warp_info.is_tidx_padded && warp_info.has_warp_reduction)) { return; } @@ -218,11 +213,24 @@ void ParallelDimensionMap::adjustMappingsForWarpPadding() { return; } } + // If tidx is strictly defined as blockDim.x then it must be set to a + // multiple of the warp and can be considered exact + bool tidx_def_trivial = true; + for (auto entry : concrete_dom_map_.at(tidx_pt)) { + if (!entry->isA() || + !entry->as()->sameAs( + NamedScalar::getParallelDim(tidx_pt))) { + tidx_def_trivial = false; + } + } + if (tidx_def_trivial) { + return; + } } // TIDx is padded to a multiple of warp. If it's known to be a // single warp, use the constant warp size as the dimension of - // TIDx. Otherwise, jsut use blockDim.x. + // TIDx. Otherwise, just use blockDim.x. if (warp_info.is_tidx_single_warp) { dim_map_.at(ParallelType::TIDx) = IrBuilder::create(warp_size); } else { @@ -292,6 +300,13 @@ bool ParallelDimensionMap::equalDim(Val* dim1, Val* dim2) { // If both are BinaryOp or UnaryOp, check their inputs. Since these // Vals are IterDomain extents, UnaryOp should not occur, but // checking shouldn't be harmful. + // TODO: + // We might be able to replace this with dim1->toInlineString() == + // dim2->toInlineString() + // If we want this less conservative we could make an "exact map" which + // could be another mode in compute at that maps all iter domains, but not + // concretized broadcast axes and only forwards through non-concretized + // broadcast axes. if ((dim1_def->isA() && dim2_def->isA() && (dim1_def->as()->getBinaryOpType() == dim2_def->as()->getBinaryOpType())) || diff --git a/torch/csrc/jit/codegen/cuda/parallel_type_bitmap.h b/torch/csrc/jit/codegen/cuda/parallel_type_bitmap.h index 3bfb32d38bc027..642017a3c0977f 100644 --- a/torch/csrc/jit/codegen/cuda/parallel_type_bitmap.h +++ b/torch/csrc/jit/codegen/cuda/parallel_type_bitmap.h @@ -3,6 +3,7 @@ #include #include +#include #include #include #include @@ -160,6 +161,20 @@ class ParallelTypeBitmap { *this |= ParallelTypeBitmap(kBIDBits); } + //! Clear all of the TID flags + void clearAllTID() { + auto tid_bits = ParallelTypeBitmap(kTIDBits); + auto not_tid_bits = ~tid_bits; + *this &= not_tid_bits; + } + + //! Clear all of the BID flags + void clearAllBID() { + auto bid_bits = ParallelTypeBitmap(kBIDBits); + auto not_bid_bits = ~bid_bits; + *this &= not_bid_bits; + } + //! Get an iterator to traverse set types Iterator begin() const { return Iterator::begin(*this); @@ -271,6 +286,52 @@ inline ParallelTypeBitmap::Iterator ParallelTypeBitmap::Iterator::end( return Iterator(map, kOffsetEnd); } +//! 
Map from ParallelType to template type T +template +class ParallelTypeMap { + public: + ParallelTypeMap() = default; + + ParallelTypeMap(const T& init) { + std::fill(map_.begin(), map_.end(), init); + } + + T& operator[](ParallelType pt) { + return map_[getParallelTypeBitMapOffset(pt)]; + } + + const T& operator[](ParallelType pt) const { + return map_[getParallelTypeBitMapOffset(pt)]; + } + + T& at(ParallelType pt) { + return map_.at(getParallelTypeBitMapOffset(pt)); + } + + const T& at(ParallelType pt) const { + return map_.at(getParallelTypeBitMapOffset(pt)); + } + + auto begin() { + return map_.begin(); + } + + auto begin() const { + return map_.begin(); + } + + auto end() { + return map_.begin(); + } + + auto end() const { + return map_.begin(); + } + + private: + std::array map_; +}; + } // namespace cuda } // namespace fuser } // namespace jit diff --git a/torch/csrc/jit/codegen/cuda/parser.cpp b/torch/csrc/jit/codegen/cuda/parser.cpp index 94dad076db85ca..419bb028e3dc04 100644 --- a/torch/csrc/jit/codegen/cuda/parser.cpp +++ b/torch/csrc/jit/codegen/cuda/parser.cpp @@ -38,11 +38,15 @@ constexpr auto kNumBinaryOpsWithAlpha = 6; constexpr auto kNumLerpOps = 2; constexpr auto kNumLayernormFwd = 2; constexpr auto kNumBatchnormFwd = 3; +constexpr auto kNumBatchnormBwd = 2; constexpr auto kNumInstancenormFwd = 1; constexpr auto kNumSumToSize = 2; constexpr auto kNumAutocastOps = 2; constexpr auto kNumAliasDimOps = 2; constexpr auto kNumViewOps = 2; +constexpr auto kNumVarOps = 2; +constexpr auto kNumSoftmaxFwd = 2; +constexpr auto kNumSoftmaxBwd = 2; namespace { @@ -64,6 +68,21 @@ const auto& strAttr = Symbol::attr("profiled_str"); typedef Val* CgValue; typedef Expr* CgOp; +bool isReductionNonCompatibleTensor( + const std::shared_ptr& tensor_type) { + return is_zero_dim_tensor(tensor_type) || is_zero_sized_tensor(tensor_type); +} + +bool isInputNonSizeZeroTensor(const Node* node) { + for (const auto& val : node->inputs()) { + auto tensor_type = val->type()->cast(); + if (tensor_type && is_zero_sized_tensor(tensor_type)) { + return false; + } + } + return true; +} + // Note [ Permutation Bookkeeping and Propagation in Parser ] // // The goal in supporting permutation propagation in parser is to: @@ -577,6 +596,9 @@ class IrParser { // return nullptr if entry does not exist static const RegistrationEntry* lookupInRegistry(const Node* node) { + if (parser_skip_set_.count(node->kind()) != 0) { + return nullptr; + } // we need to use maybeSchema for nodes like prim::Constant, which doesn't // have a schema auto schema_ptr = node->maybeSchema(); @@ -600,6 +622,20 @@ class IrParser { return nullptr; } + static bool querySkipSymbolSet(c10::Symbol symbol, bool flip) { + // no need to init registry here (unlike `lookupInSymbolSet`, as + // `parser_skip_set_` is not initialized via initialization + bool ret = parser_skip_set_.count(symbol) != 0; + if (flip) { + if (ret) { + parser_skip_set_.erase(symbol); + } else { + parser_skip_set_.insert(symbol); + } + } + return ret; + } + static void initRegistry() { if (init_registry_) { // TODO: mutex this guy; @@ -733,7 +769,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -769,7 +805,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -825,7 +861,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + 
isInputNonSizeZeroTensor, nullptr); } @@ -874,7 +910,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -917,7 +953,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -988,7 +1024,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1009,7 +1045,7 @@ class IrParser { auto out = randlike(operand); value_map.emplace(node->output()->unique(), out); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1024,14 +1060,14 @@ class IrParser { std::tie(format, list_val) = getConsistentValues( MemoryFormat::Contiguous(), value_map[node->inputs()[0]->unique()]); - auto operand = list_val.front(); + auto operand = list_val.front()->as(); list_val.pop_front(); auto& beta = value_map[node->inputs()[1]->unique()]; auto& threshold = value_map[node->inputs()[2]->unique()]; auto out = softplus(operand, beta, threshold); value_map.emplace(node->output()->unique(), out); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1054,7 +1090,7 @@ class IrParser { auto out = threshold(operand, th, value); value_map.emplace(node->output()->unique(), out); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1086,7 +1122,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1112,7 +1148,7 @@ class IrParser { auto out = clamp(operand, low, high); value_map.emplace(node->output()->unique(), out); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1140,7 +1176,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1171,7 +1207,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } } @@ -1203,7 +1239,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1240,7 +1276,7 @@ class IrParser { ValueHolder(TensorViewBuilder().build(), format)); } }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1273,7 +1309,7 @@ class IrParser { value_map.emplace(node->output()->unique(), input); } }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1301,7 +1337,7 @@ class IrParser { grad->as(), mask->as(), scale); value_map.emplace(node->output()->unique(), output); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -1342,9 +1378,6 @@ class IrParser { static_cast(NoneType::get()))) { running_mean = value_map[node->input(3)->unique()]->as(); - TORCH_INTERNAL_ASSERT( - running_mean->isFusionInput(), - "IO_tensor `instance_norm::running_mean` can only be input tensor to fusion"); } TensorView* running_var = nullptr; @@ -1352,9 +1385,6 @@ class IrParser { static_cast(NoneType::get()))) { running_var = value_map[node->input(4)->unique()]->as(); - TORCH_INTERNAL_ASSERT( - running_var->isFusionInput(), - "IO_tensor `instance_norm::running_var` can only be input tensor to fusion"); } // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) @@ -1397,7 +1427,13 @@ class IrParser { value_map.emplace(node->output()->unique(), result.output); } }, - [](const Node* node) -> bool { return true; }, + [](const Node* node) -> bool { + if 
(isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + return true; + }, [](const Node* node) -> OperatorType { return OperatorType::Normalization; }); @@ -1508,7 +1544,13 @@ class IrParser { ValueHolder(result.output, format)); } }, - [](const Node* node) -> bool { return true; }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + return true; + }, [](const Node* node) -> OperatorType { return OperatorType::Normalization; }); @@ -1516,156 +1558,208 @@ class IrParser { } { - auto ptr_op = getOperatorForLiteral( - "aten::_batch_norm_impl_index_backward(int impl_index, Tensor input, Tensor grad_output, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var_transform, bool train, float eps, bool[3] output_mask, Tensor reservedSpace) -> (Tensor, Tensor, Tensor)"); - REGISTER_PARSE_RULE( - ptr_op, - { - // discard impl_index and reservedSpace since we don't use them - MemoryFormat format; - std::list list_val; - std::tie(format, list_val) = getConsistentValues( - c10::nullopt, - value_map[node->inputs()[1]->unique()], - value_map[node->inputs()[2]->unique()]); - if (format.hasPermutation() && !format.isChannelsLast()) { + std::array BatchNormBwd = { + "aten::_batch_norm_impl_index_backward(int impl_index, Tensor input, Tensor grad_output, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var_transform, bool train, float eps, bool[3] output_mask, Tensor reservedSpace) -> (Tensor, Tensor, Tensor)", + "aten::native_batch_norm_backward(Tensor grad_out, Tensor input, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_invstd, bool train, float eps, bool[3] output_mask) -> (Tensor, Tensor, Tensor)"}; + for (auto signature : BatchNormBwd) { + auto ptr_op = getOperatorForLiteral(signature); + REGISTER_PARSE_RULE( + ptr_op, + { + JitValue* ts_input = nullptr; + JitValue* ts_grad_output; + JitValue* ts_weight = nullptr; + JitValue* ts_r_mean = nullptr; + JitValue* ts_r_var = nullptr; + JitValue* ts_save_mean = nullptr; + JitValue* ts_save_invstd = nullptr; + JitValue* ts_train = nullptr; + JitValue* ts_eps = nullptr; + JitValue* ts_mask = nullptr; + if (node->kind() == + c10::Symbol::fromQualString( + "aten::_batch_norm_impl_index_backward")) { + ts_input = node->input(1); + ts_grad_output = node->input(2); + ts_weight = node->input(3); + ts_r_mean = node->input(4); + ts_r_var = node->input(5); + ts_save_mean = node->input(6); + ts_save_invstd = node->input(7); + ts_train = node->input(8); + ts_eps = node->input(9); + ts_mask = node->input(10); + } else if ( + node->kind() == + c10::Symbol::fromQualString( + "aten::native_batch_norm_backward")) { + ts_grad_output = node->input(0); + ts_input = node->input(1); + ts_weight = node->input(2); + ts_r_mean = node->input(3); + ts_r_var = node->input(4); + ts_save_mean = node->input(5); + ts_save_invstd = node->input(6); + ts_train = node->input(7); + ts_eps = node->input(8); + ts_mask = node->input(9); + } else { + TORCH_INTERNAL_ASSERT( + false, + "Forgot to register the key for BN variation: ", + node->kind().toDisplayString()); + } + + // discard impl_index and reservedSpace since we don't use them + MemoryFormat format; + std::list list_val; std::tie(format, list_val) = getConsistentValues( - MemoryFormat::Contiguous(), - value_map[node->inputs()[1]->unique()], - value_map[node->inputs()[2]->unique()]); - } - auto operand0 = 
list_val.front(); - list_val.pop_front(); - auto operand1 = list_val.front(); - list_val.pop_front(); - auto input = operand0->as(); - auto grad_out = operand1->as(); + c10::nullopt, + value_map[ts_input->unique()], + value_map[ts_grad_output->unique()]); + if (format.hasPermutation() && !format.isChannelsLast()) { + std::tie(format, list_val) = getConsistentValues( + MemoryFormat::Contiguous(), + value_map[ts_input->unique()], + value_map[ts_grad_output->unique()]); + } + auto operand0 = list_val.front(); + list_val.pop_front(); + auto operand1 = list_val.front(); + list_val.pop_front(); + auto input = operand0->as(); + auto grad_out = operand1->as(); - TensorView* weight = nullptr; - if (!node->input(3)->type()->isSubtypeOf( - static_cast(NoneType::get()))) { - weight = value_map[node->input(3)->unique()]->as(); - } + TensorView* weight = nullptr; + if (!ts_weight->type()->isSubtypeOf( + static_cast(NoneType::get()))) { + weight = value_map[ts_weight->unique()]->as(); + } - TensorView* running_mean = nullptr; - if (!node->input(4)->type()->isSubtypeOf( - static_cast(NoneType::get()))) { - running_mean = - value_map[node->input(4)->unique()]->as(); - } + TensorView* running_mean = nullptr; + if (!ts_r_mean->type()->isSubtypeOf( + static_cast(NoneType::get()))) { + running_mean = value_map[ts_r_mean->unique()]->as(); + } - TensorView* running_var = nullptr; - if (!node->input(5)->type()->isSubtypeOf( - static_cast(NoneType::get()))) { - running_var = - value_map[node->input(5)->unique()]->as(); - } + TensorView* running_var = nullptr; + if (!ts_r_var->type()->isSubtypeOf( + static_cast(NoneType::get()))) { + running_var = value_map[ts_r_var->unique()]->as(); + } - TensorView* save_mean = nullptr; - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - if (!node->input(6)->type()->isSubtypeOf( - static_cast(NoneType::get()))) { + TensorView* save_mean = nullptr; // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - save_mean = value_map[node->input(6)->unique()]->as(); - } - - TensorView* save_invstd = nullptr; - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - if (!node->input(7)->type()->isSubtypeOf( - static_cast(NoneType::get()))) { - save_invstd = - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - value_map[node->input(7)->unique()]->as(); - } + if (!ts_save_mean->type()->isSubtypeOf( + static_cast(NoneType::get()))) { + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + save_mean = value_map[ts_save_mean->unique()]->as(); + } - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - auto training = constant_as(node->input(8)); - TORCH_INTERNAL_ASSERT( - training.has_value(), - "The training (bool) parameter is required."); - const bool kTraining = training.value(); + TensorView* save_invstd = nullptr; + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + if (!ts_save_invstd->type()->isSubtypeOf( + static_cast(NoneType::get()))) { + save_invstd = + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + value_map[ts_save_invstd->unique()]->as(); + } - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - Val* eps_ptr = nullptr; - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - if (auto eps = constant_as(node->input(9))) { - eps_ptr = IrBuilder::create(eps.value()); - } else { // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - eps_ptr = value_map[node->input(7)->unique()]; - } + auto training = constant_as(ts_train); + TORCH_INTERNAL_ASSERT( + training.has_value(), + "The training (bool) parameter is required."); 
+ const bool kTraining = training.value(); - // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) - auto out_mask_list = constant_as>(node->input(10)); - TORCH_INTERNAL_ASSERT( - out_mask_list.has_value(), - "output mask for batch_norm_backward"); - std::vector output_mask; - for (const auto value : out_mask_list->vec()) { - output_mask.emplace_back(static_cast(value)); - } + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + Val* eps_ptr = nullptr; + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + if (auto eps = constant_as(ts_eps)) { + eps_ptr = IrBuilder::create(eps.value()); + } else { + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + eps_ptr = value_map[ts_eps->unique()]; + } - // TODO: merge this loop below. - if (kTraining) { - TORCH_INTERNAL_ASSERT( - save_mean != nullptr && save_invstd != nullptr, - "When training=True, save_mean and save_invstd are required."); - } else { - // TODO: this is not a legit assumption? Can't we run with - // track_running_stats == false && training == false - // which should just run through the case above. + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + auto out_mask_list = constant_as>(ts_mask); TORCH_INTERNAL_ASSERT( - running_mean != nullptr && running_var != nullptr, - "When training=False, running_mean and running_invstd are required."); - } + out_mask_list.has_value(), + "output mask for batch_norm_backward"); + std::vector output_mask; + for (const auto value : out_mask_list->vec()) { + output_mask.emplace_back(static_cast(value)); + } - auto grads = batch_norm_backward( - input, - grad_out, - weight, - running_mean, - running_var, - save_mean, - save_invstd, - kTraining, - eps_ptr, - output_mask, - format.isChannelsLast()); + // TODO: merge this loop below. + if (kTraining) { + TORCH_INTERNAL_ASSERT( + save_mean != nullptr && save_invstd != nullptr, + "When training=True, save_mean and save_invstd are required."); + } else { + // TODO: this is not a legit assumption? Can't we run with + // track_running_stats == false && training == false + // which should just run through the case above. 
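The training/eval split being validated here mirrors how the statistics are sourced downstream: saved per-batch mean/invstd when training, running statistics (with invstd recomputed from running_var and eps) when not. A scalar C++ sketch of that selection, assuming plain double inputs rather than TensorViews:

    #include <cmath>

    struct NormStats { double mean; double invstd; };

    // Training: use the mean/invstd saved by the forward pass.
    // Eval: fall back to running statistics, recomputing invstd from running_var.
    NormStats pickBatchNormStats(bool training,
                                 double save_mean, double save_invstd,
                                 double running_mean, double running_var,
                                 double eps) {
      if (training) {
        return {save_mean, save_invstd};
      }
      return {running_mean, 1.0 / std::sqrt(running_var + eps)};
    }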
+ TORCH_INTERNAL_ASSERT( + running_mean != nullptr && running_var != nullptr, + "When training=False, running_mean and running_invstd are required."); + } - if (output_mask[0]) { - TORCH_INTERNAL_ASSERT(grads.grad_input != nullptr); - value_map.emplace( - node->output(0)->unique(), - ValueHolder(grads.grad_input, format)); - } else { - TORCH_INTERNAL_ASSERT(grads.grad_input == nullptr); - value_map.emplace( - node->output(0)->unique(), - ValueHolder(TensorViewBuilder().build(), format)); - } + auto grads = batch_norm_backward( + input, + grad_out, + weight, + running_mean, + running_var, + save_mean, + save_invstd, + kTraining, + eps_ptr, + output_mask, + format.isChannelsLast()); - if (output_mask[1]) { - TORCH_INTERNAL_ASSERT(grads.grad_weight != nullptr); - value_map.emplace(node->output(1)->unique(), grads.grad_weight); - } else { - TORCH_INTERNAL_ASSERT(grads.grad_weight == nullptr); - value_map.emplace( - node->output(1)->unique(), TensorViewBuilder().build()); - } + if (output_mask[0]) { + TORCH_INTERNAL_ASSERT(grads.grad_input != nullptr); + value_map.emplace( + node->output(0)->unique(), + ValueHolder(grads.grad_input, format)); + } else { + TORCH_INTERNAL_ASSERT(grads.grad_input == nullptr); + value_map.emplace( + node->output(0)->unique(), + ValueHolder(TensorViewBuilder().build(), format)); + } - if (output_mask[2]) { - TORCH_INTERNAL_ASSERT(grads.grad_bias != nullptr); - value_map.emplace(node->output(2)->unique(), grads.grad_bias); - } else { - TORCH_INTERNAL_ASSERT(grads.grad_bias == nullptr); - value_map.emplace( - node->output(2)->unique(), TensorViewBuilder().build()); - } - }, - [](const Node* node) -> bool { return true; }, - [](const Node* node) -> OperatorType { - return OperatorType::Normalization; - }); + if (output_mask[1]) { + TORCH_INTERNAL_ASSERT(grads.grad_weight != nullptr); + value_map.emplace(node->output(1)->unique(), grads.grad_weight); + } else { + TORCH_INTERNAL_ASSERT(grads.grad_weight == nullptr); + value_map.emplace( + node->output(1)->unique(), TensorViewBuilder().build()); + } + + if (output_mask[2]) { + TORCH_INTERNAL_ASSERT(grads.grad_bias != nullptr); + value_map.emplace(node->output(2)->unique(), grads.grad_bias); + } else { + TORCH_INTERNAL_ASSERT(grads.grad_bias == nullptr); + value_map.emplace( + node->output(2)->unique(), TensorViewBuilder().build()); + } + }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(1)->type()->cast())) { + return false; + } + return true; + }, + [](const Node* node) -> OperatorType { + return OperatorType::Normalization; + }); + } } { @@ -1727,7 +1821,13 @@ class IrParser { } }, // TODO: #ProfileIValue List should update this - [](const Node* node) -> bool { return true; }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + return true; + }, [](const Node* node) -> OperatorType { return OperatorType::Normalization; }); @@ -1825,42 +1925,9 @@ class IrParser { } }, // TODO: #ProfileIValue List should update this - [](const Node* node) -> bool { return true; }, - [](const Node* node) -> OperatorType { - return OperatorType::Normalization; - }); - } - - { - auto ptr_op = getOperatorForLiteral( - "aten::softmax.int(Tensor self, int dim, int? 
dtype) -> Tensor"); - REGISTER_PARSE_RULE( - ptr_op, - { - MemoryFormat format; - std::list list_val; - std::tie(format, list_val) = getConsistentValues( - MemoryFormat::Contiguous(), - value_map[node->inputs()[0]->unique()]); - auto input_t = list_val.front(); - list_val.pop_front(); - auto input = input_t->as(); - - auto dim_value = constant_as(node->input(1)); - TORCH_INTERNAL_ASSERT( - dim_value.has_value(), "dim in softmax is not valid"); - - auto output = softmax(input, dim_value.value()); - value_map.emplace(node->output()->unique(), output); - }, [](const Node* node) -> bool { - if (node->inputs()[1]->node()->kind() != prim::Constant) { - return false; - } - // TODO: support dynamic input by profiling it - if (!node->inputs()[2]->type()->isSubtypeOf( - static_cast(NoneType::get())) && - node->inputs()[2]->node()->kind() != prim::Constant) { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { return false; } return true; @@ -1870,6 +1937,58 @@ class IrParser { }); } + { + std::array SoftmaxFwd = { + "aten::softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor", + "aten::log_softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor"}; + for (auto signature : SoftmaxFwd) { + auto ptr_op = getOperatorForLiteral(signature); + REGISTER_PARSE_RULE( + ptr_op, + { + MemoryFormat format; + std::list list_val; + std::tie(format, list_val) = getConsistentValues( + MemoryFormat::Contiguous(), + value_map[node->inputs()[0]->unique()]); + auto input_t = list_val.front(); + list_val.pop_front(); + auto input = input_t->as(); + + auto dim_value = constant_as(node->input(1)); + TORCH_INTERNAL_ASSERT( + dim_value.has_value(), "dim in softmax is not valid"); + + bool is_log_softmax = node->kind() == + c10::Symbol::fromQualString("aten::log_softmax"); + + auto output = (is_log_softmax) + ? 
log_softmax(input, dim_value.value()) + : softmax(input, dim_value.value()); + value_map.emplace(node->output()->unique(), output); + }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + if (node->inputs()[1]->node()->kind() != prim::Constant) { + return false; + } + // TODO: support dynamic input by profiling it + if (!node->inputs()[2]->type()->isSubtypeOf( + static_cast(NoneType::get())) && + node->inputs()[2]->node()->kind() != prim::Constant) { + return false; + } + return true; + }, + [](const Node* node) -> OperatorType { + return OperatorType::Normalization; + }); + } + } + { // LTC uses this op for softmax auto ptr_op = getOperatorForLiteral( "aten::_softmax(Tensor self, int dim, bool half_to_float) -> Tensor"); @@ -1893,6 +2012,10 @@ class IrParser { value_map.emplace(node->output()->unique(), output); }, [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } if (node->inputs()[1]->node()->kind() != prim::Constant) { return false; } @@ -1917,35 +2040,104 @@ class IrParser { } { - auto ptr_op = getOperatorForLiteral( - "aten::_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor"); - REGISTER_PARSE_RULE( - ptr_op, - { - auto grad_output = - value_map[node->input(0)->unique()]->as(); + std::array SoftmaxBwd = { + "aten::_log_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor", + "aten::_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor"}; + for (auto signature : SoftmaxBwd) { + auto ptr_op = getOperatorForLiteral(signature); + REGISTER_PARSE_RULE( + ptr_op, + { + auto grad_output = + value_map[node->input(0)->unique()]->as(); - auto output = value_map[node->input(1)->unique()]->as(); + auto output = + value_map[node->input(1)->unique()]->as(); - auto dim_value = constant_as(node->input(2)); - TORCH_INTERNAL_ASSERT( - dim_value.has_value(), "dim in softmax is not valid"); + auto dim_value = constant_as(node->input(2)); + TORCH_INTERNAL_ASSERT( + dim_value.has_value(), "dim in softmax is not valid"); - // input_dtype here is ignored! type_inference handles it - auto grad_input = - softmax_backward(grad_output, output, dim_value.value()); + // input_dtype here is ignored! type_inference handles it + bool is_log_softmax = node->kind() == + c10::Symbol::fromQualString( + "aten::_log_softmax_backward_data"); + auto grad_input = (is_log_softmax) + ? 
log_softmax_backward(grad_output, output, dim_value.value()) + : softmax_backward(grad_output, output, dim_value.value()); - value_map.emplace(node->output()->unique(), grad_input); - }, - [](const Node* node) -> bool { - if (node->inputs()[2]->node()->kind() != prim::Constant) { - return false; - } - return true; - }, - [](const Node* node) -> OperatorType { - return OperatorType::Normalization; - }); + value_map.emplace(node->output()->unique(), grad_input); + }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + if (node->inputs()[2]->node()->kind() != prim::Constant) { + return false; + } + return true; + }, + [](const Node* node) -> OperatorType { + return OperatorType::Normalization; + }); + } + } + + { + std::array Variance = { + "aten::var.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor", + "aten::std.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor"}; + for (auto signature : Variance) { + auto ptr_op = getOperatorForLiteral(signature); + REGISTER_PARSE_RULE( + ptr_op, + { + MemoryFormat format; + std::list list_val; + std::tie(format, list_val) = getConsistentValues( + MemoryFormat::Contiguous(), + value_map[node->inputs()[0]->unique()]); + auto input_t = list_val.front(); + list_val.pop_front(); + auto input = input_t->as(); + + bool is_variance = + node->kind() == c10::Symbol::fromQualString("aten::var"); + + auto dims_list = constant_as>(node->input(1)); + TORCH_INTERNAL_ASSERT( + dims_list.has_value(), "Cannot fuse with dynamic axes"); + std::vector dims; + for (const auto dim : dims_list->vec()) { + dims.emplace_back(static_cast(dim)); + } + + auto unbiased = constant_as(node->input(2)); + TORCH_INTERNAL_ASSERT( + unbiased.has_value(), "Cannot fuse with dynamic unbiased"); + + auto keepdim = constant_as(node->input(3)); + TORCH_INTERNAL_ASSERT( + keepdim.has_value(), "Cannot fuse with dynamic keepdim"); + + auto output = (is_variance) + ? 
variance(input, dims, unbiased.value(), keepdim.value()) + : standard_deviation( + input, dims, unbiased.value(), keepdim.value()); + value_map.emplace(node->output()->unique(), output); + }, + [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } + return true; + }, + [](const Node* node) -> OperatorType { + return OperatorType::Normalization; + }); + } } { @@ -1967,8 +2159,13 @@ class IrParser { dims_list.has_value(), "aten::sum cannot be fused with dynamic axes"); std::vector dims; - for (const auto dim : dims_list->vec()) { - dims.emplace_back(static_cast(dim)); + if (!dims_list->empty()) { + for (const auto dim : dims_list->vec()) { + dims.emplace_back(static_cast(dim)); + } + } else { + dims.resize(self->as()->nDims()); + std::iota(dims.begin(), dims.end(), 0); } auto keepdim = constant_as(node->input(2)); TORCH_INTERNAL_ASSERT( @@ -1978,20 +2175,20 @@ class IrParser { value_map.emplace(node->output()->unique(), out); }, [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } // TODO: support cast of output types if (!node->inputs()[3]->type()->isSubtypeOf( static_cast(NoneType::get()))) { // We can only handle output as half, float, and double; if (const auto opt_ivalue = toIValue(node->input(3))) { const auto scalar_type = opt_ivalue->toScalarType(); - if (scalar_type == at::ScalarType::Double || - scalar_type == at::ScalarType::Float || - scalar_type == at::ScalarType::BFloat16 || - scalar_type == at::ScalarType::Half) { - return true; + if (!at::isFloatingType(scalar_type)) { + return false; } } - return false; } // we don't support dynamic reduction axes; if (node->inputs()[1]->node()->kind() != prim::Constant) { @@ -2027,8 +2224,13 @@ class IrParser { dims_list.has_value(), "aten::mean cannot be fused with dynamic axes"); std::vector dims; - for (const auto dim : dims_list->vec()) { - dims.emplace_back(static_cast(dim)); + if (!dims_list->empty()) { + for (const auto dim : dims_list->vec()) { + dims.emplace_back(static_cast(dim)); + } + } else { + dims.resize(self->as()->nDims()); + std::iota(dims.begin(), dims.end(), 0); } auto keepdim = constant_as(node->input(2)); TORCH_INTERNAL_ASSERT( @@ -2047,20 +2249,20 @@ class IrParser { value_map.emplace(node->output()->unique(), out); }, [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } // TODO: support cast of output types if (!node->inputs()[3]->type()->isSubtypeOf( static_cast(NoneType::get()))) { // We can only handle output as half, float, and double; if (const auto opt_ivalue = toIValue(node->input(3))) { const auto scalar_type = opt_ivalue->toScalarType(); - if (scalar_type == at::ScalarType::Double || - scalar_type == at::ScalarType::Float || - scalar_type == at::ScalarType::BFloat16 || - scalar_type == at::ScalarType::Half) { - return true; + if (!at::isFloatingType(scalar_type)) { + return false; } } - return false; } // we don't support dynamic reduction axes; if (node->inputs()[1]->node()->kind() != prim::Constant) { @@ -2105,13 +2307,15 @@ class IrParser { } }, [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } // we don't support dynamic reduction axes; if (node->inputs()[1]->node()->kind() != prim::Constant) { return false; } return true; - // auto size_to = constant_as>(node->input(1)); - // return size_to.has_value() && 
!size_to->empty(); }, [](const Node* node) -> OperatorType { auto size_to = constant_as>(node->input(1)); @@ -2146,7 +2350,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } } @@ -2184,7 +2388,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2213,7 +2417,7 @@ class IrParser { value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2274,7 +2478,7 @@ class IrParser { node->output()->unique(), ValueHolder(out, format)); } }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2288,27 +2492,22 @@ class IrParser { std::list list_val; std::tie(format, list_val) = getConsistentValues( c10::nullopt, value_map[node->inputs()[0]->unique()]); - auto self = list_val.front(); + auto self = list_val.front()->as(); list_val.pop_front(); auto approximate = constant_as(node->input(1)); TORCH_INTERNAL_ASSERT( approximate.has_value(), "The approximate parameter is required."); - const auto kApproximate = approximate.value(); - - Val* out = nullptr; - if (at::native::get_gelutype_enum(kApproximate) == - at::native::GeluType::Tanh) { - out = fast_gelu(self); - } else { - out = unaryOp(UnaryOpType::Gelu, self); - } + const auto kTanhGelu = + at::native::get_gelutype_enum(approximate.value()) == + at::native::GeluType::Tanh; + auto out = (kTanhGelu) ? tanh_gelu(self) : gelu(self); value_map.emplace( node->output()->unique(), ValueHolder(out, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2324,29 +2523,25 @@ class IrParser { c10::nullopt, value_map[node->inputs()[0]->unique()], value_map[node->inputs()[1]->unique()]); - auto grad_out = list_val.front(); + auto grad_out = list_val.front()->as(); list_val.pop_front(); - auto self = list_val.front(); + auto self = list_val.front()->as(); list_val.pop_front(); auto approximate = constant_as(node->input(2)); TORCH_INTERNAL_ASSERT( approximate.has_value(), "The approximate parameter is required."); - const auto kApproximate = approximate.value(); - - Val* grad_in = nullptr; - if (at::native::get_gelutype_enum(kApproximate) == - at::native::GeluType::Tanh) { - grad_in = fast_gelu_backward(grad_out, self); - } else { - grad_in = gelu_backward(grad_out, self); - } + const auto kTanhGelu = + at::native::get_gelutype_enum(approximate.value()) == + at::native::GeluType::Tanh; + auto grad_in = (kTanhGelu) ? 
tanh_gelu_backward(grad_out, self) + : gelu_backward(grad_out, self); value_map.emplace( node->output()->unique(), ValueHolder(grad_in, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2362,16 +2557,16 @@ class IrParser { c10::nullopt, value_map[node->inputs()[0]->unique()], value_map[node->inputs()[1]->unique()]); - auto grad_out = list_val.front(); + auto grad_out = list_val.front()->as(); list_val.pop_front(); - auto self = list_val.front(); + auto self = list_val.front()->as(); list_val.pop_front(); auto grad_in = tanh_backward(grad_out, self); value_map.emplace( node->output()->unique(), ValueHolder(grad_in, format)); }, - nullptr, + isInputNonSizeZeroTensor, nullptr); } @@ -2393,8 +2588,13 @@ class IrParser { dims_list.has_value(), "aten::amax cannot be fused with dynamic axes"); std::vector dims; - for (const auto dim : dims_list->vec()) { - dims.emplace_back(static_cast(dim)); + if (!dims_list->empty()) { + for (const auto dim : dims_list->vec()) { + dims.emplace_back(static_cast(dim)); + } + } else { + dims.resize(self->as()->nDims()); + std::iota(dims.begin(), dims.end(), 0); } auto keepdim = constant_as(node->input(2)); TORCH_INTERNAL_ASSERT( @@ -2405,6 +2605,10 @@ class IrParser { value_map.emplace(node->output()->unique(), out); }, [](const Node* node) -> bool { + if (isReductionNonCompatibleTensor( + node->input(0)->type()->cast())) { + return false; + } // we don't support dynamic reduction axes; if (node->inputs()[1]->node()->kind() != prim::Constant) { return false; @@ -2449,10 +2653,26 @@ class IrParser { value_map.emplace(node->output()->unique(), output); }, [](const Node* node) -> bool { + auto self_value = node->inputs()[0]; + auto tensor_type = self_value->type()->cast(); + if (tensor_type == nullptr) { + return false; + } + if (!tensor_type->sizes().concrete_sizes().has_value()) { + // Shape information for input tensor is required. + return false; + } + + if (!isInputNonSizeZeroTensor(node)) { + return false; + } // Reject fusing node if view_sizes contains an inferred dimension auto view_sizes = constant_as>(node->input(1)); - TORCH_INTERNAL_ASSERT( - view_sizes.has_value(), "The size parameter is required."); + if (!view_sizes.has_value()) { + // The size parameter is required. + return false; + } + for (auto axis_size : view_sizes->vec()) { if (axis_size == -1) { return false; @@ -2485,7 +2705,18 @@ class IrParser { auto output = squeeze(self, self_sizes); value_map.emplace(node->output()->unique(), output); }, - nullptr, + [](const Node* node) -> bool { + // Shape information for input tensor is required. + auto self_value = node->inputs()[0]; + auto tensor_type = self_value->type()->cast(); + if (tensor_type == nullptr) { + return false; + } + if (!isInputNonSizeZeroTensor(node)) { + return false; + } + return tensor_type->sizes().concrete_sizes().has_value(); + }, nullptr); } @@ -2521,7 +2752,19 @@ class IrParser { } value_map.emplace(node->output()->unique(), output); }, - nullptr, + [](const Node* node) -> bool { + // Shape information for input tensor is required. 
+ auto self_value = node->inputs()[0]; + auto tensor_type = self_value->type()->cast(); + if (tensor_type == nullptr) { + return false; + } + if (!isInputNonSizeZeroTensor(node)) { + return false; + } + auto optional_sizes = tensor_type->sizes().concrete_sizes(); + return tensor_type->sizes().concrete_sizes().has_value(); + }, nullptr); } } @@ -2662,7 +2905,6 @@ class IrParser { nhwc_stride_vec[i]->stride_index_ = n_dim - i - 1; } - // auto updated_tensor_type = c10::TensorType::create( tensor_type = c10::TensorType::create( tensor_type->scalarType(), tensor_type->device(), @@ -2688,6 +2930,7 @@ class IrParser { std::unordered_map value_map_; static std::unordered_set parser_symbol_set_; + static std::unordered_set parser_skip_set_; // parsing rule registry. static std::unordered_map @@ -2701,6 +2944,7 @@ class IrParser { static bool init_registry_; }; std::unordered_set IrParser::parser_symbol_set_; // NOLINT +std::unordered_set IrParser::parser_skip_set_; // NOLINT std::unordered_map IrParser::jit_operator_registry_; // NOLINT std::unordered_map @@ -2995,6 +3239,11 @@ bool shouldProfileNode(const Node* node) { return IrParser::lookupInSymbolSet(node); } +bool skipNodeKind(const std::string& symbol_str, bool flip) { + return IrParser::querySkipSymbolSet( + c10::Symbol::fromQualString(symbol_str), flip); +} + bool insertProfileIValue(ProfilingRecord* pr, Node* node, size_t offset) { // is skip constant necessary? if (node->input(offset)->node()->kind() == prim::Constant) { @@ -3172,6 +3421,38 @@ bool insertProfileIValue(ProfilingRecord* pr, Node* node, size_t offset) { return true; } + static auto gelu_schema = + getOperatorForLiteral( + "aten::gelu(Tensor self, *, str approximate='none') -> Tensor") + ->schema(); + if (node->matches(gelu_schema)) { + switch (offset) { + // argument 1: approximate; + case 1: + profileString(pr, node, offset); + break; + default: + return false; + } + return true; + } + + static auto gelu_backward_schema = + getOperatorForLiteral( + "aten::gelu_backward(Tensor grad_output, Tensor self, *, str approximate='none') -> Tensor") + ->schema(); + if (node->matches(gelu_backward_schema)) { + switch (offset) { + // argument 2: approximate; + case 2: + profileString(pr, node, offset); + break; + default: + return false; + } + return true; + } + static auto native_layer_norm_schema = getOperatorForLiteral( "aten::native_layer_norm(Tensor input, int[] normalized_shape, Tensor? weight, Tensor? bias, float eps) -> (Tensor, Tensor, Tensor)") @@ -3213,6 +3494,26 @@ bool insertProfileIValue(ProfilingRecord* pr, Node* node, size_t offset) { return true; } + static auto batch_norm_backward_schema = + getOperatorForLiteral( + "aten::native_batch_norm_backward(Tensor grad_out, Tensor input, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_invstd, bool train, float eps, bool[3] output_mask) -> (Tensor, Tensor, Tensor)") + ->schema(); + if (node->matches(batch_norm_backward_schema)) { + switch (offset) { + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + case 7: // argument 8: training; + profileBool(pr, node, offset); + break; + // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers) + case 9: + profileBoolList(pr, node, offset); + break; + default: + return false; + } + return true; + } + static auto native_layer_norm_backward_schema = getOperatorForLiteral( "aten::native_layer_norm_backward(Tensor grad_out, Tensor input, int[] normalized_shape, Tensor mean, Tensor rstd, Tensor? weight, Tensor? 
bias, bool[3] output_mask) -> (Tensor, Tensor, Tensor)") @@ -3246,43 +3547,16 @@ bool insertProfileIValue(ProfilingRecord* pr, Node* node, size_t offset) { } } - static auto gelu_schema = - getOperatorForLiteral( - "aten::gelu(Tensor self, *, str approximate='none') -> Tensor") - ->schema(); - if (node->matches(gelu_schema)) { - switch (offset) { - // argument 1: approximate; - case 1: - profileString(pr, node, offset); - break; - default: - return false; - } - return true; - } - - static auto gelu_backward_schema = + static auto log_softmax_backward_data_schema = getOperatorForLiteral( - "aten::gelu_backward(Tensor grad_output, Tensor self, *, str approximate='none') -> Tensor") + "aten::_log_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor") ->schema(); - if (node->matches(gelu_backward_schema)) { - switch (offset) { - // argument 2: approximate; - case 2: - profileString(pr, node, offset); - break; - default: - return false; - } - return true; - } - static auto softmax_backward_data_schema = getOperatorForLiteral( "aten::_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor") ->schema(); - if (node->matches(softmax_backward_data_schema)) { + if (node->matches(log_softmax_backward_data_schema) || + node->matches(softmax_backward_data_schema)) { switch (offset) { case 3: profileInt(pr, node, offset); diff --git a/torch/csrc/jit/codegen/cuda/parser.h b/torch/csrc/jit/codegen/cuda/parser.h index 6d52b325042577..ddfbf7762742a7 100644 --- a/torch/csrc/jit/codegen/cuda/parser.h +++ b/torch/csrc/jit/codegen/cuda/parser.h @@ -44,6 +44,8 @@ TORCH_CUDA_CU_API bool isElementWiseNode(const Node* node); TORCH_CUDA_CU_API bool isNodeParsible(const Node* node); TORCH_CUDA_CU_API bool shouldProfileNode(const Node* node); +TORCH_CUDA_CU_API bool skipNodeKind(const std::string& symbol_str, bool flip); + void InsertProfileNodes(ProfilingRecord* pr); // lowers PyTorch jit graph to `Fusion`. 
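The new `skipNodeKind` entry point declared above (and wired into the fuser through `fn_skip_n` in register_interface.cpp below) lets a caller toggle parsing for individual op kinds at runtime via `IrParser::querySkipSymbolSet`. A minimal sketch of driving it from a comma-separated list of qualified op names follows; the helper name `toggleSkipList`, the enclosing namespace, and the exact meaning of `flip` (handled by `querySkipSymbolSet`, which is not shown in this hunk) are illustrative assumptions, not part of the patch.

    #include <sstream>
    #include <string>

    #include <torch/csrc/jit/codegen/cuda/parser.h>

    // Hypothetical helper, not part of this patch: toggle the parser skip set
    // for a comma-separated list of qualified op names, e.g.
    // "aten::softmax,aten::gelu". Assumes skipNodeKind lives in
    // torch::jit::fuser::cuda as declared in parser.h and that `flip`
    // switches the skip behavior for the given symbol.
    static void toggleSkipList(const std::string& csv, bool flip) {
      std::stringstream ss(csv);
      std::string op;
      while (std::getline(ss, op, ',')) {
        if (!op.empty()) {
          torch::jit::fuser::cuda::skipNodeKind(op, flip);
        }
      }
    }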
diff --git a/torch/csrc/jit/codegen/cuda/partition.cpp b/torch/csrc/jit/codegen/cuda/partition.cpp index 91d68494fd42fe..c5a452dc366982 100644 --- a/torch/csrc/jit/codegen/cuda/partition.cpp +++ b/torch/csrc/jit/codegen/cuda/partition.cpp @@ -6,6 +6,7 @@ #include #include #include +#include namespace torch { namespace jit { @@ -51,6 +52,26 @@ static c10::optional getDevice(const Value* value) { return tensor_type.device(); } +static bool hasBfloat(const Node* node) { + auto has_bfloat = [](const Value* value) { + if (!value->type()->isSubtypeOf(*TensorType::get())) { + return false; + } + auto opt_scalar_type = value->type()->expectRef().scalarType(); + if (opt_scalar_type.has_value() && + opt_scalar_type.value() == at::ScalarType::BFloat16) { + return true; + } + return false; + }; + + if (std::any_of(node->inputs().begin(), node->inputs().end(), has_bfloat) || + std::any_of(node->outputs().begin(), node->outputs().end(), has_bfloat)) { + return true; + } + return false; +} + static c10::optional getDevice(const Node* node) { c10::optional ret = c10::nullopt; auto merge_devices = [&ret](const c10::optional& device) { @@ -87,7 +108,29 @@ static c10::optional getDevice(const Node* node) { return ret; } -static bool isFusibleDevice(const Node* node, const c10::Device device) { +static bool isDeviceCompatible(const Node* node, const c10::Device& device) { + // only fuses cuda device + if (!device.is_cuda()) { + GRAPH_UPDATE("rejecting node (non-cuda device): ", *node); + return false; + } + const auto major = at::cuda::getDeviceProperties(device.index())->major; + // disable non-elementwise fusion on pre-volta devices + if (major < 7 && hasNonElementWiseOperation(node)) { + GRAPH_UPDATE( + "rejecting node (non element-wise op not supported on SM < 7X): ", + *node); + return false; + } + // disable bfloat fusion on pre-ampere devices + if (major < 8 && hasBfloat(node)) { + GRAPH_UPDATE("rejecting node (bfloat not supported on SM < 8X): ", *node); + return false; + } + return true; +} + +static bool isFusibleDevice(const Node* node, const c10::Device& device) { TORCH_INTERNAL_ASSERT( device.index() != INVALID_INDEX, "fusible device needs to be validate"); auto opt_device = getDevice(node); @@ -95,6 +138,12 @@ static bool isFusibleDevice(const Node* node, const c10::Device device) { // node into an existing `device` if (opt_device.has_value() && (opt_device->index() == INVALID_INDEX || opt_device != device)) { + GRAPH_UPDATE( + "rejecting node from fusion (outputs device not matching fusion): ", + *node); + return false; + } + if (!isDeviceCompatible(node, device)) { return false; } return true; @@ -105,12 +154,14 @@ static bool isFusibleDevice(const Node* node) { auto device = getDevice(node); // be conservative and only fuse cuda operations, this avoids us initializing // operations that produces cpu scalar outputs - if (!device.has_value()) { + if (!device.has_value() || device->index() == INVALID_INDEX) { + return false; + } + + if (!isDeviceCompatible(node, device.value())) { return false; } - return device->index() != INVALID_INDEX && device->is_cuda() && - (at::cuda::getDeviceProperties(device->index())->major >= 7 || - !hasNonElementWiseOperation(node)); + return true; } bool compatibleType(const torch::jit::Value* val) { @@ -120,6 +171,11 @@ bool compatibleType(const torch::jit::Value* val) { DataType::Null) { return false; } + // Complex is disabled until its support is completely added + // TODO: remove this logic + if 
(isComplexType(aten_to_data_type(tensor_type->scalarType().value()))) { + return false; + } } } return true; @@ -161,268 +217,35 @@ bool checkOutputTensorTypes(const Node* node) { } inline bool isFusibleNode(const Node* node) { + // Check if already part of a fusion group if (node->kind() == prim::CudaFusionGroup) return true; // Check we have a parsing rule - bool isFusible = isNodeParsible(node); - // Check if we have a tensor type it's one we support - isFusible = isFusible && checkInputTensorTypes(node); - isFusible = isFusible && checkOutputTensorTypes(node); - // Check if already part of a fusion group - return isFusible; -} - -bool maybeBroadcast( - const TensorTypePtr& type, - const std::vector>& shape) { - if (type->dim()) { - if (type->dim().value() < shape.size()) { - // no broadcast for reduction operation; - return false; - } else if (type->dim().value() > shape.size()) { - // increased rank means there is reduction; - return true; - } else { - // same rank, we need to iterate through sizes and check if size-1 - // exists in input `shape` - for (const auto& opt_size : shape) { - // TODO: not sure if we need to check for output size != 1, since we - // are currently marking all size-1 dimension as broadcast in codegen. - if (opt_size.has_value() && opt_size.value() == 1) { - return true; - } - } + if (!isNodeParsible(node)) { + // ignoring profile nodes & constant nodes to avoid noise from debugging + if (node->kind() != prim::Constant && + node->kind() != prim::profile_ivalue && node->kind() != prim::profile && + node->kind() != prim::Param) { + GRAPH_UPDATE("rejecting node from fusion (node not parsible): ", *node); } + return false; } - return false; -} - -// utility function to check if the node implies broadcast on a given shape ( -// assumed to be shape of an input tensor) -// limitations: -// 1. we rely on shape information to judge this. so we would require output -// shape to be available; -// 2. we basically compares given shape to the shape of the only output of -// the node and return true if it implies broadcast from the former to the -// latter. -bool maybeBroadcastOnShape( - const Node* n, - const std::vector>& shape) { - // TODO: we are only checking output 0. This means that our current check for - // normalization is not complete. - // assumes that if output is not a tensor type, it's not broadcasting - if (auto out_type = n->output(0)->type()->cast()) { - return maybeBroadcast(out_type, shape); - } - return false; -}; - -// return true if node is pointwise operation and input tensors all have -// identical shape. -bool isNonBroadcastElementWise(const Node* n) { - if (hasNonElementWiseOperation(n)) { + // Check if we have a tensor type it's one we support + if (!checkInputTensorTypes(node)) { + GRAPH_UPDATE( + "rejecting node from fusion (input scalar type not supported): ", + *node); return false; } - - for (const auto output : n->outputs()) { - const auto& n_output_type = output->type()->cast(); - - // TODO: we need to stay on safer side instead of "default to return true - // when shape information is not available.", Change that when we enable - // profiling on autodiff FW execution. 
- if (n_output_type != nullptr && n_output_type->sizes().sizes()) { - const std::vector>& n_output_shape = - n_output_type->sizes().sizes().value(); - - for (auto input : n->inputs()) { - if (auto t_type = input->type()->cast()) { - if (maybeBroadcast(t_type, n_output_shape)) { - return false; - } - } - } - } + if (!checkOutputTensorTypes(node)) { + GRAPH_UPDATE( + "rejecting node from fusion (output scalar type not supported): ", + *node); + return false; } - return true; } -//! [ Note - tricky broadcasting ] -//! -//! github issue # 190 -//! -//! To extend the issue further, we consider two difficult broadcasting cases -//! that is difficult to naively schedule: -//! scenario 1: single tensor with multiple broadcasting semantics; -//! ``` -//! %t = op(...) -//! %t0_o = op0(%t, %t0) -//! %t1_o = op1(%t, %t1) -//! ``` -//! It's hard to check/validate whether `%t0` and `%t1` implies -//! identical broadcasting for `%t` so that we can simply -//! broadcast it to their common shape and use the broadcasted -//! tensor view in both `op0` and `op1`; or, if `%t0` and `%t1` -//! has different shapes, we would need differently broadcasted -//! `%t` for the two ops. Even with this condition sorted out, -//! scheduling is challenging. As we cannot inline the computation -//! of `%t` to the downstream consumer of `%t0_o` and `%t1_o` -//! easily, because `computeAt` could propagate contradicting -//! transformations on the common ancestor `%t`. See footnote*; -//! scenario 2: output tensor_view which is broadcasted later; -//! ``` -//! %t = op(...) -//! %t0_o = op0(%t, %t0) -//! return (%t, %t0_o) -//! ``` -//! Similarly, if we need to broadcast `%t` to `%t0` for `op0`, -//! and use it as output, it also complicates schedule. -//! -//! Currently we just avoid the two cases in our graph partitioning. -//! -//! We bake the implementation along with our partition, where we merge nodes -//! from producer to consumer. In the example down, we list all "type"s of edges -//! among producer/consumer and the out side world. -//! -//! %input_t0, %input_t1, %input_t2 # inputs from outside world feeding -//! # producer/consumer pair -//! %p_out_t0, %p_out_t1 = producer(%input_t0, %input_t1) -//! %c_out_t, ... = consumer(%input_t0, %input_t2, %p_out_t0) -//! -//! producer/consumer : the nodes that we are trying to merge, each node could -//! be -//! a parsible real operation or a `CudaFusionGroup`. -//! %input_t0 : inputs shared by both producer & consumer -//! %input_t1 : inputs feed only to producer, but not to consumer -//! %input_t2 : inputs feed only to consumer, but not to producer -//! %p_put_t0 : outputs of producer that is fed to consumer -//! %p_put_t1 : outputs of producer that is not fed to consumer -//! %c_put_t0 : outputs of consumer -//! -//! We can see that after merging consumer & producer, we will have: -//! %input_t0, %input_t1, %input_t2 # inputs from outside world feeding -//! # producer/consumer pair -//! %p_out_t, %c_out_t = group(%input_t0, %input_t1, %input_t2) -//! -//! Under the assumption that any existing `CudaFusionGroup` does not have -//! violating broadcasting semantics mentioned above. -//! -//! If we examine the `group`, new cases of scenario 1 (multiple broadcast) -//! could only be created by merging new edges in the new `group`, that is: -//! case 1. `%input_t0`, shared by `producer` and `consumer` -//! case 2. `%p_out_t0`, produced by `producer` and fed to `consumer` -//! -//! new cases of scenario 2 (output was broadcasted later) could only be added -//! via: -//! case 3. 
`%p_out_t0`, produced by `producer` and fed to `consumer`, which -//! could be broadcasted in the consumer subgraph. -//! -//! footnote*: -//! We are only disabling multiple broadcast right on the tensor, instead of -//! tracing all the broadcast further down. -//! I don't think we need to worry about broadcasting further down the -//! dependency chain, as those would create new IterDomain, which doesn't have -//! th problem of conflicting broadcasting. -bool createTrickyBroadcast(const Node* consumer, const Node* producer) { - auto count_broadcasting_in_node = - [](const Node* node, - const std::vector>& shape, - size_t offset) { - int num_broadcasting = 0; - if (node->kind() == prim::CudaFusionGroup) { - // be careful here as `subgraph_input`, as its name suggests, is in a - // different fraph from `node`. - const auto& subgraph_input = - node->g(attr::Subgraph)->inputs()[offset]; - for (const auto& use : subgraph_input->uses()) { - if (maybeBroadcastOnShape(use.user, shape)) { - num_broadcasting++; - } - } - } else { - if (maybeBroadcastOnShape(node, shape)) { - num_broadcasting++; - } - } - return num_broadcasting; - }; - - // case 1. We check shared inputs to `producer` & `consumer`; - for (const auto i : c10::irange(producer->inputs().size())) { - auto n_input = producer->input(i); - auto n_input_type = n_input->type()->cast(); - if (n_input_type != nullptr && n_input_type->sizes().sizes()) { - std::vector> n_input_shape = - n_input_type->sizes().sizes().value(); - int num_broadcasting = 0; - - // check broadcasting for the n_input inside `consumer`; - for (const auto& use : n_input->uses()) { - if (use.user == consumer) { - num_broadcasting += - count_broadcasting_in_node(consumer, n_input_shape, use.offset); - } - } - - // if no broadcasting happened for consumer, there's no point check - // multiple broadcasting in producer alone; - if (num_broadcasting == 0) { - continue; - } - - // check broadcasting for n_input inside `producer`; - num_broadcasting += - count_broadcasting_in_node(producer, n_input_shape, i); - - // encounted multiple broadcasting scheme for a single TV, we will not be - // able to schedule this, prevent the fusion; (case 1) - if (num_broadcasting > 1) { - return true; - } - } - } - - // case 2. We check input to `consumer` that is also the output from - // `producer` - for (const auto i : c10::irange(producer->outputs().size())) { - auto n_output = producer->output(i); - auto n_output_type = n_output->type()->cast(); - if (n_output_type != nullptr && n_output_type->sizes().sizes()) { - std::vector> n_output_shape = - n_output_type->sizes().sizes().value(); - int num_broadcasting = 0; - // If we only look at case 1 & case 2, we need to check broadcast of - // `n_output` inside `producer`, if it is a `prim::CudaFusionGroup`. - // this is actually not necessary when we consider case 3, as we avoid - // broadcasting on outputs already; - - // TODO: merge this code with case 1. - // check broadcasting for the n_output inside `consumer`; - bool use_as_output = false; - for (const auto& use : n_output->uses()) { - if (use.user == consumer) { - num_broadcasting += - count_broadcasting_in_node(consumer, n_output_shape, use.offset); - } else { - // case 3. 
output is used by other nodes not the consumer, no - // broadcasting is allowed; - use_as_output = true; - } - } - - // encounted multiple broadcasting scheme for a single TV, we will not be - // able to schedule this, prevent the fusion; (case 2) - // Alternatively, if use_as_output is true, we would not permit broadcast - // at all. (case 3) - if (num_broadcasting > (use_as_output ? 0 : 1)) { - return true; - } - } - } - - return false; -} - } // namespace bool isFusibleCudaFusionGroup(const Node* node) { diff --git a/torch/csrc/jit/codegen/cuda/register_interface.cpp b/torch/csrc/jit/codegen/cuda/register_interface.cpp index a3fba4b629751d..d47c220a17e926 100644 --- a/torch/csrc/jit/codegen/cuda/register_interface.cpp +++ b/torch/csrc/jit/codegen/cuda/register_interface.cpp @@ -25,6 +25,7 @@ class RegisterInterface { ptr->fn_can_fuse_n = &isFusibleCudaFusionGroup; ptr->fn_insert_profile_inodes = &InsertProfileNodes; ptr->fn_profile_n = &shouldProfileNode; + ptr->fn_skip_n = &skipNodeKind; } }; diff --git a/torch/csrc/jit/codegen/cuda/root_domain_map.cpp b/torch/csrc/jit/codegen/cuda/root_domain_map.cpp index b48c6b00b3a331..011cbcf8098e15 100644 --- a/torch/csrc/jit/codegen/cuda/root_domain_map.cpp +++ b/torch/csrc/jit/codegen/cuda/root_domain_map.cpp @@ -285,6 +285,12 @@ void UnmappableReductionDomains::handle(ReductionOp* op) { handleReductionOutput(out_tv); } +void UnmappableReductionDomains::handle(MmaOp* mma) { + // Builds a map from reduction domains to consumer domains. + TensorView* out_tv = mma->out()->as(); + handleReductionOutput(out_tv); +} + void UnmappableReductionDomains::handle(WelfordOp* op) { // Builds a map from reduction domains to consumer domains. handleReductionOutput(op->outAvg()->as()); diff --git a/torch/csrc/jit/codegen/cuda/root_domain_map.h b/torch/csrc/jit/codegen/cuda/root_domain_map.h index 5156dc604f15b0..57f1c0d299d019 100644 --- a/torch/csrc/jit/codegen/cuda/root_domain_map.h +++ b/torch/csrc/jit/codegen/cuda/root_domain_map.h @@ -187,6 +187,7 @@ class TORCH_CUDA_CU_API UnmappableReductionDomains : private IterVisitor { using IterVisitor::handle; void handle(ReductionOp* op) override; void handle(WelfordOp* op) override; + void handle(MmaOp* op) override; void handleReductionOutput(TensorView* out_tv); @@ -393,10 +394,18 @@ class TORCH_CUDA_CU_API ComputeAtRootDomainMapBuilder mapPointwiseOrReductionOp(wop); } + void handle(MmaOp* wop) override { + mapPointwiseOrReductionOp(wop); + } + void handle(ShiftOp* op) override { mapPointwiseOrReductionOp(op); } + void handle(ViewDtypeOp* op) override { + mapPointwiseOrReductionOp(op); + } + void handle(ViewOp* op) override { mapPointwiseOrReductionOp(op); } diff --git a/torch/csrc/jit/codegen/cuda/runtime/array.cu b/torch/csrc/jit/codegen/cuda/runtime/array.cu new file mode 100644 index 00000000000000..470482d79eaf81 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/runtime/array.cu @@ -0,0 +1,264 @@ +// aligned register array for vectorized load/store +template +struct alignas(sizeof(scalar_t) * align_size) Array { + scalar_t array[size]; + + __device__ void set(scalar_t v) { +#pragma unroll + for (int i = 0; i < size; ++i) { + array[i] = v; + } + } + + __device__ scalar_t& operator[](const unsigned int i) { + return array[i]; + } +}; + +// Used for vectorized allocations that are not in registers +template +__device__ void arraySet(scalar_t* buff, scalar_t val) { +#pragma unroll + for (int i = 0; i < vec_size; ++i) { + buff[i] = val; + } +} + +template +__device__ void loadGeneric(scalar_t* to, scalar_t* from) 
{ + // It would be really nice to use memcpy here, but one example was failing + // with: + // + // memcpy(to, from, vec_size * sizeof(scalar_t)); + // + // Yet passing with: + // + // for(int i = 0; i < vec_size; i++){ + // to[i] = from[i]; + // } + + switch (sizeof(scalar_t) * vec_size) { + case 1: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + case 2: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + case 4: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + case 8: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + case 12: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + case 16: + *reinterpret_cast(to) = *reinterpret_cast(from); + break; + } +} + +// Volatile version only works with c++ fundamnetal types +template < + typename scalar_t, + int vec_size, + bool is_volatile_to, + bool is_volatile_from> +__device__ void loadGenericVolatile( + typename MaybeVolatile::type* to, + typename MaybeVolatile::type* from) { + switch (sizeof(scalar_t) * vec_size) { + // Reinterpret cast like this with volatile types only works for C++ + // fundamental types otherwise the = operator is not defined + case 1: + *reinterpret_cast< + typename MaybeVolatile::type*>(to) = + *reinterpret_cast< + typename MaybeVolatile::type*>( + from); + break; + case 2: + *reinterpret_cast::type*>( + to) = + *reinterpret_cast< + typename MaybeVolatile::type*>(from); + break; + case 4: + *reinterpret_cast< + typename MaybeVolatile::type*>(to) = + *reinterpret_cast< + typename MaybeVolatile::type*>( + from); + break; + case 8: + *reinterpret_cast::type*>( + to) = + *reinterpret_cast< + typename MaybeVolatile::type*>(from); + break; + } +} + +template +__device__ void loadLocalToGlobal( + typename MaybeVolatile::type* to, + scalar_t* from) { + switch (sizeof(scalar_t) * vec_size) { + case 1: + case 2: + case 4: + loadGenericVolatile(to, from); + break; + case 8: { + uint2 const& data = *reinterpret_cast(from); + if (is_volatile) { + asm volatile( + "st.volatile.global.v2.s32 [%0], {%1,%2};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y)); + } else { + asm volatile( + "st.global.cs.v2.s32 [%0], {%1,%2};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y)); + } + break; + } + case 12: { + uint3 const& data = *reinterpret_cast(from); + if (is_volatile) { + asm volatile( + "st.volatile.global.v3.s32 [%0], {%1,%2,%3};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y), + "r"(data.z)); + } else { + asm volatile( + "st.global.cs.v3.s32 [%0], {%1,%2,%3};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y), + "r"(data.z)); + } + break; + } + case 16: { + uint4 const& data = *reinterpret_cast(from); + if (is_volatile) { + asm volatile( + "st.volatile.global.v4.s32 [%0], {%1,%2,%3,%4};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y), + "r"(data.z), + "r"(data.w)); + } else { + asm volatile( + "st.global.cs.v4.s32 [%0], {%1,%2,%3,%4};" ::"l"( + (typename MaybeVolatile::type*)to), + "r"(data.x), + "r"(data.y), + "r"(data.z), + "r"(data.w)); + } + break; + } + } +} + +template +__device__ void loadGlobalToLocal( + scalar_t* to, + typename MaybeVolatile::type* from) { + switch (sizeof(scalar_t) * vec_size) { + case 1: + case 2: + case 4: + loadGenericVolatile(to, from); + break; + case 8: { + if (is_volatile) { + uint2& data = *reinterpret_cast(to); + asm volatile("ld.volatile.global.v2.s32 {%0,%1}, [%2];" + : "=r"(data.x), "=r"(data.y) + : 
"l"((uint2*)from)); + break; + } else { + uint2& data = *reinterpret_cast(to); + asm volatile("ld.global.cs.v2.s32 {%0,%1}, [%2];" + : "=r"(data.x), "=r"(data.y) + : "l"((uint2*)from)); + } + break; + } + case 12: { + if (is_volatile) { + uint3& data = *reinterpret_cast(to); + asm volatile("ld.volatile.global.v3.s32 {%0,%1,%2}, [%3];" + : "=r"(data.x), "=r"(data.y), "=r"(data.z) + : "l"((uint3*)from)); + } else { + uint3& data = *reinterpret_cast(to); + asm volatile("ld.global.cs.v3.s32 {%0,%1,%2}, [%3];" + : "=r"(data.x), "=r"(data.y), "=r"(data.z) + : "l"((uint3*)from)); + } + break; + } + case 16: { + if (is_volatile) { + uint4& data = *reinterpret_cast(to); + asm volatile("ld.volatile.global.v4.s32 {%0,%1,%2,%3}, [%4];" + : "=r"(data.x), "=r"(data.y), "=r"(data.z), "=r"(data.w) + : "l"((uint4*)from)); + } else { + uint4& data = *reinterpret_cast(to); + asm volatile("ld.global.cs.v4.s32 {%0,%1,%2,%3}, [%4];" + : "=r"(data.x), "=r"(data.y), "=r"(data.z), "=r"(data.w) + : "l"((uint4*)from)); + } + break; + } + } +} + +template < + typename scalar_t, + int vec_size, + bool is_volatile_to, + bool is_volatile_from> +__device__ void loadGlobalToGlobal( + typename MaybeVolatile::type* to, + typename MaybeVolatile::type* from) { + switch (sizeof(scalar_t) * vec_size) { + // Reinterpret cast like this with volatile types only works for C++ + // fundamental types otherwise the = operator is not defined + case 1: + case 2: + case 4: + case 8: + loadGenericVolatile( + to, from); + break; + case 12: { + uint3 local_intermediate; + loadGlobalToLocal( + reinterpret_cast(&local_intermediate), from); + loadLocalToGlobal( + to, reinterpret_cast(&local_intermediate)); + break; + } + case 16: { + uint4 local_intermediate; + loadGlobalToLocal( + reinterpret_cast(&local_intermediate), from); + loadLocalToGlobal( + to, reinterpret_cast(&local_intermediate)); + break; + } + } +} diff --git a/torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu b/torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu index 4bd402e84c6041..410f3a7aaea12b 100644 --- a/torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu +++ b/torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu @@ -30,14 +30,3 @@ __device__ float __half2float(const __half h) { asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__NVFUSER_HALF_TO_CUS(h))); return val; } - -// aligned vector generates vectorized load/store on CUDA -template -struct alignas(sizeof(scalar_t) * vec_size) Array { - scalar_t val[vec_size]; - __device__ void set(scalar_t v) { - for (int i = 0; i < vec_size; ++i) { - val[i] = v; - } - } -}; diff --git a/torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu b/torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu new file mode 100644 index 00000000000000..69a36699265334 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu @@ -0,0 +1,529 @@ +namespace fused_reduction { + +// We have 6 dimensions, 3 in the grid, 3 in the block +// They can be 1 of 3 states, +// Reduction Domain - TEMPLATE STATE 0 +// - Participating in the reduction, has values coming in, one value coming +// out across the dimension +// Iteration Domain - TEMPLATE STATE 1 +// - Not participating in the reduction, has values across the dimension after +// the reduction +// Collapsed Domain - TEMPLATE STATE 2 +// - Previously reduced, doesn't need to be reduced on that dimension, doesn't +// have values across that dimension +constexpr __device__ bool isReduce(int STATE) { + return STATE == 0; +} + +constexpr __device__ bool isIter(int STATE) { + return STATE == 
1; +} + +constexpr __device__ bool isPred(int STATE) { + return STATE == 2; +} + +constexpr __device__ bool inactive(int STATE) { + return STATE == 3; +} + +constexpr __device__ bool activeNotIter(int STATE) { + return STATE != 3 && STATE != 1; +} + +// When generating an index into the reduction, we have to stride by iteration +// domains and reduction domains. Collapsed domains we can ignore, but we need +// to make sure they never read or write (need to be predicated to correct +// participation). + +// All inclusive reduction with option to re-broadcast. This reduction class +// does not use predication of parallelization in the read or write predicates. +// Instead there are 3 states each dimension of parallelization can have, +// described above. Predication, indexing, and reduction will be done based on +// this information. +template < + int X_BLOCK, + int Y_BLOCK, + int Z_BLOCK, + int X_THREAD, + int Y_THREAD, + int Z_THREAD, + bool PERSISTENT_REDUCTION, + bool BROADCAST> +class ParallelReduce { + static constexpr bool BLOCK_REDUCE = + isReduce(X_THREAD) || isReduce(Y_THREAD) || isReduce(Z_THREAD); + + static constexpr bool GRID_REDUCE = + isReduce(X_BLOCK) || isReduce(Y_BLOCK) || isReduce(Z_BLOCK); + + // ping-pong between global buffers to avoid a second sync + bool flip = false; + + public: + __device__ ParallelReduce() {} + + template + __device__ __inline__ void reduce( + RefTuple out, + const ConstRefTuple& inp, + VolatilePtrTuple global_work_buffer, + int64_t* global_sync_buffer, // Allocated as product of all + // non-participating Grid dimension + PtrTuple shared_buf, + bool read_pred, // Prevent reading from out of bounds memory + bool write_pred, // Prevent from writing out of bounds + const LocalTuple& init_val, + Func reduction_op) { + // If no reduction needed, just return input + if (!BLOCK_REDUCE && !GRID_REDUCE) { + if (read_pred && write_pred) { + out = inp; + } + return; + } + + // Don't read/write in temporary buffers if in a predicated dimension + bool block_reduce_participate = index_utils:: + maskedIsZero( + threadIdx); + + // Initialize block result + LocalTuple block_result = init_val; + + // Grab input data if participating in the reduction, set to block_result in + // the case there is no block reduction + if (block_reduce_participate && read_pred) { + block_result = inp; + } + + // Only threads that with id == 0 in the dimensions being reduced will + // have a valid result + bool has_block_result = index_utils::maskedIsZero< + isReduce(X_THREAD), + isReduce(Y_THREAD), + isReduce(Z_THREAD)>(threadIdx); + + if (BLOCK_REDUCE) { + // -- START BLOCK REDUCTION -- // + + // Size of the block reduction segment, can be an int since it's limited + // to number of threads + int block_reduction_size = index_utils::maskedSize< + isReduce(X_THREAD), + isReduce(Y_THREAD), + isReduce(Z_THREAD)>(blockDim); + + // Index in the reduction segment, can be an int since it's limited to + // number of threads + int tid_in_block_reduction = index_utils::maskedOffset< + isReduce(X_THREAD), + isReduce(Y_THREAD), + isReduce(Z_THREAD)>(threadIdx, blockDim); + + // ID of the block reduction this thread is participating in + // + // If any of the parallel dimensions are predicated out, that means + // they've already been reduced, so we only care about the first thread in + // that dimension. 
Therefore don't expand the reduction_idx by that + // dimension + int block_reduction_idx = index_utils:: + maskedOffset( + threadIdx, blockDim); + + // Shared memory buffer is 2D + // [iter dimension, reduction dimension] + + // Offset into smem for the current thread + int block_reduce_smem_offset = + block_reduction_idx * block_reduction_size + tid_in_block_reduction; + + // Initialize shared memory + if (block_reduce_participate) { + copyTuple(shared_buf, block_reduce_smem_offset, block_result); + } + + // Sync to make sure smem is completely initialized + block_sync::sync(); + + // Round reduction size down to nearest power of 2 + int np2 = 1 << (31 - __clz(block_reduction_size)); + + // Perform an initial reduction leaving np2 elements + if (block_reduce_participate && tid_in_block_reduction < np2 && + tid_in_block_reduction + np2 < block_reduction_size) { + reduce( + shared_buf, + block_reduce_smem_offset, + shared_buf, + block_reduce_smem_offset + np2, + reduction_op); + } + + // Always need to sync while operating on shared memory + block_sync::sync(); + + // Reduce down until 2 values, leaving 2 values allows us to manually + // perform the last reduction and avoid a syncthreads + for (int factor = np2 / 2; factor > 1; factor >>= 1) { + if (tid_in_block_reduction < factor && block_reduce_participate) { + reduce( + shared_buf, + block_reduce_smem_offset, + shared_buf, + block_reduce_smem_offset + factor, + reduction_op); + } + block_sync::sync(); + } + + // Accumulate that last valid result + if (has_block_result) { + copyTuple(block_result, shared_buf, block_reduce_smem_offset); + if (block_reduction_size > 1) { + reduce( + block_result, + 0, + shared_buf, + block_reduce_smem_offset + 1, + reduction_op); + } + } + + // ===== BLOCK REDUCTION CLEANUP ======= + if (!GRID_REDUCE) { + // If no grid reduction, we don't have to continue. Either broadcast + // back across the block or return the correct reduction + if (has_block_result && write_pred) { + reduce(block_result, 0, out, 0, reduction_op); + out = block_result; + } + if (BROADCAST) { + // No grid reduce, but need to broadcast, perform block broadcast + if (has_block_result && write_pred) { + // Put result back in shared memory, put in the first entry of the + // reduction segment's buffer + copyTuple( + shared_buf, + block_reduction_idx * block_reduction_size, + block_result); + } + + // Sync threads to make sure result is in smem + block_sync::sync(); + // If the thread is participating, and is not attempting to write out + // of bounds, return the broadcasted value. + if (block_reduce_participate && write_pred) { + copyTuple( + out, shared_buf, block_reduction_idx * block_reduction_size); + } + } + + // Forward protect shared memory, don't want threads to continue to + // another reduction/broadcast and pollute shared memory before the + // reduction is completely finished. + // + // This could be avoided in some cases if we added thread syncs from + // block reductions in the syncthread insertion pass. + block_sync::sync(); + return; + } + } + + // -- START GRID REDUCTION -- // + // Grid reductions are more challenging for two reasons, (1) the reduction + // itself is 3D instead of 2D because we now have an iter domain space in + // the grid dimension. (2) a tree reduction isn't performed, instead all + // blocks will populate GMEM and one block will finish the grid reduction. 
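+    // In outline (summarizing the steps implemented below):
+    //   1. every block participating in the grid reduction writes its
+    //      block_result into global_work_buffer, indexed as
+    //      [idx_in_grid_red][block_red_idx_offset][thread_red_idx_offset];
+    //   2. grid_sync::sync counts arriving blocks through
+    //      global_sync_buffer, one semaphore per iteration (non-reduced)
+    //      grid index;
+    //   3. the last block to arrive (every block when PERSISTENT_REDUCTION
+    //      is true) reads the buffer back, finishes the reduction with a
+    //      shared-memory tree, and writes or broadcasts the final result.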
+ + // What is the grid reduction size, block reduction already performed so + // that doesn't have to be taken into consideration + const auto grid_red_size = index_utils:: + maskedSize( + gridDim); + + // Which ID in the reduction is this block. Threads can participate in + // multiple grid reductions, but the block will have the same relative index + // in those reductions + const auto idx_in_grid_red = index_utils:: + maskedOffset( + blockIdx, gridDim); + + if (PERSISTENT_REDUCTION && flip) { + auto global_buffer_size = + index_utils:: + maskedSize( + gridDim) * + grid_red_size; + global_work_buffer += global_buffer_size; + } + flip = ~flip; + + // How many grid reductions have to be performed, in the grid dimension + const auto num_block_iters = index_utils:: + maskedSize(gridDim); + + // Which grid reduction does this block participate in, in the grid + // dimension + const auto block_red_idx_offset = index_utils:: + maskedOffset( + blockIdx, gridDim); + + // How many grid reductions have to be performed, in the block dimension + const auto num_thread_iters = index_utils:: + maskedSize( + blockDim); + + // Which grid reduction does this thread participate in, in the block + // dimension + const auto thread_red_idx_offset = index_utils:: + maskedOffset( + threadIdx, blockDim); + + // 3D buffer of reductions: + // [reduction_offset(grid), iter_offset(grid), iter_offset(block)] + // Offset into the work buffer + const auto work_buf_offset = + (idx_in_grid_red * num_block_iters + block_red_idx_offset) * + num_thread_iters + + thread_red_idx_offset; + + // Don't read/write in temporary buffers if in a predicated dimension + bool grid_reduce_participate = index_utils:: + maskedIsZero( + blockIdx); + + if (grid_reduce_participate && block_reduce_participate) { + if (has_block_result) { + copyTuple(global_work_buffer, work_buf_offset, block_result); + } + } + + // -- GLOBAL BUFFER FILLED -- // + + bool last_block = index_utils:: + maskedIsLast( + blockIdx, gridDim); + + if (grid_reduce_participate) { + // Don't need to sync up blocks that are not participating in this + // reduction + grid_sync::sync< + isReduce(X_BLOCK), + isReduce(Y_BLOCK), + isReduce(Z_BLOCK), + PERSISTENT_REDUCTION>( + global_sync_buffer[block_red_idx_offset], grid_red_size, last_block); + } + + // -- START BLOCK CLEANUP -- // + // All blocks perform the last cleanup, so every block, and every thread + // will have the final result + + // Initialize block result + LocalTuple last_block_result(init_val); + + if ((PERSISTENT_REDUCTION || last_block) && grid_reduce_participate) { + // Can use the last block to reduce all the values the blocks filled in. 
+ // Can use any thread that has been predicated, or has been reduced to do + // this reduction, cannot use any block that's associated with an + // iteration domain + + // Start with non-block reduction + + // Index in the reduction segment + int tid_in_block_reduction_2 = index_utils::maskedOffset< + activeNotIter(X_THREAD), + activeNotIter(Y_THREAD), + activeNotIter(Z_THREAD)>(threadIdx, blockDim); + + int block_reduction_size_2 = index_utils::maskedSize< + activeNotIter(X_THREAD), + activeNotIter(Y_THREAD), + activeNotIter(Z_THREAD)>(blockDim); + + // 3D buffer of reductions: + // [reduction_offset(grid), iter_offset(grid), iter_offset(block)] + // Change the offset, we want to keep the last two dimensions, but the + // first dimension is what we will reduce over + const auto work_buf_offset_2 = + block_red_idx_offset * num_thread_iters + thread_red_idx_offset; + for (auto reduction_i = tid_in_block_reduction_2; + reduction_i < grid_red_size; + reduction_i += block_reduction_size_2) { + reduce( + last_block_result, + 0, + global_work_buffer, + work_buf_offset_2 + + reduction_i * num_block_iters * + num_thread_iters, // Iterating over the outer most + // dimension, so need to stride by the + // total number of grid reductions. Could + // come back and change it so this is the + // contiguous dimension + reduction_op); + } + + // -- START LAST BLOCK - BLOCK REDUCTION -- // + + // Reduced so we have one value per thread, we need to further reduce any + // dimension that is not an iter dimension + + // Which block reduction this thread is participating in + int block_reduction_idx = index_utils:: + maskedOffset( + threadIdx, blockDim); + + // Offset in smem for this thread's result + auto smem_offset = block_reduction_idx * block_reduction_size_2 + + tid_in_block_reduction_2; + + // Similar as before, reduce down to nearest power of 2 so we can do a + // tree reduction + int np2 = 1 << (31 - __clz(min(block_reduction_size_2, grid_red_size))); + + // Threads values are initialized, so all can participate here + if (tid_in_block_reduction_2 >= np2) { + copyTuple(shared_buf, smem_offset, last_block_result); + } + + block_sync::sync(); + + if (tid_in_block_reduction_2 < np2 && + tid_in_block_reduction_2 + np2 < + min(block_reduction_size_2, grid_red_size)) { + reduce( + last_block_result, 0, shared_buf, smem_offset + np2, reduction_op); + } + + if (tid_in_block_reduction_2 < np2) { + copyTuple(shared_buf, smem_offset, last_block_result); + } + + // Always sync when communicating across smem + block_sync::sync(); + + // Reduce down to 2 values, last thread will do the final reduction and + // can save a syncthreads this way + for (int factor = np2 / 2; factor > 1; factor >>= 1) { + if (tid_in_block_reduction_2 < factor) { + reduce( + shared_buf, + smem_offset, + shared_buf, + smem_offset + factor, + reduction_op); + } + block_sync::sync(); + } + + // If this thread in each block has the final result before broadcasting + // to all other threads in block + bool has_block_result_2 = index_utils::maskedIsZero< + activeNotIter(X_THREAD), + activeNotIter(Y_THREAD), + activeNotIter(Z_THREAD)>(threadIdx); + // Do the last reduction, protected by the write predicate + copyTuple(last_block_result, shared_buf, smem_offset); + if (has_block_result && grid_reduce_participate) { + reduce(last_block_result, 0, out, 0, reduction_op); + if (min(block_reduction_size_2, grid_red_size) > 1) { + reduce( + last_block_result, 0, shared_buf, smem_offset + 1, reduction_op); + } + } + if (grid_reduce_participate && 
PERSISTENT_REDUCTION) { + // If persistent reduction, always broadcast reduced values + copyTuple(shared_buf, smem_offset, last_block_result); + block_sync::sync(); + if (write_pred && block_reduce_participate) { + copyTuple( + out, shared_buf, block_reduction_idx * block_reduction_size_2); + } + // For persistent kernels we double the global buffer allocation so we + // don't need to protect those buffers every iteration preventing the + // need of an additional grid_sync. Since we flip back and forth between + // sections of the buffer, the one grid sync protects the other part of + // the buffer. + + } else { + // Forward protect the smem used in this reduction + if (grid_reduce_participate) { + if (last_block && has_block_result && block_reduce_participate && + write_pred) { + copyTuple( + out, shared_buf, block_reduction_idx * block_reduction_size_2); + } + } + block_sync::sync(); + } + } + } + + private: + template + __inline__ __device__ static void reduce( + TupleType0& val0, + nvfuser_index_t offset0, + const TupleType1& val1, + nvfuser_index_t offset1, + Func reduction_op) { + static_assert( + TupleType0::num_vals == TupleType1::num_vals, + "Invalid number of values"); + TupleReduce::reduce( + val0, offset0, val1, offset1, reduction_op); + } + + template < + typename TupleType0, + typename TupleType1, + typename Func, + int num_vals> + struct TupleReduce {}; + + template + struct TupleReduce { + __inline__ __device__ static void reduce( + TupleType0& val0, + nvfuser_index_t offset0, + const TupleType1& val1, + nvfuser_index_t offset1, + Func reduction_op) { + static_assert( + IsSameType< + typename TupleType0::ValTypes, + typename TupleType1::ValTypes>::value, + "Invalid value types"); + reduction_op(val0.val<0>(offset0), val1.val<0>(offset1)); + } + }; + + template + struct TupleReduce { + __inline__ __device__ static void reduce( + TupleType0& val0, + nvfuser_index_t offset0, + const TupleType1& val1, + nvfuser_index_t offset1, + Func reduction_op) { + static_assert( + IsSameType< + typename TupleType0::ValTypes, + typename TupleType1::ValTypes>::value, + "Invalid value types"); + reduction_op( + val0.val<0>(offset0), + val0.val<1>(offset0), + val0.val<2>(offset0), + val1.val<0>(offset1), + val1.val<1>(offset1), + val1.val<2>(offset1)); + } + }; + + // End Parallel reduce class +}; + +} // namespace fused_reduction diff --git a/torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu b/torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu index 83382f4704c6a5..df88b76772a7f9 100644 --- a/torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu +++ b/torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu @@ -272,6 +272,3 @@ __device__ void gridReduce( } } // namespace reduction - -#undef isize -#undef ioffset diff --git a/torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu b/torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu index a134bd81c2da3c..4bb89e17ece43d 100644 --- a/torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu +++ b/torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu @@ -18,7 +18,10 @@ __device__ T globalAsVolatile(volatile T& global_val) { // [X,Y,Z]_BLOCK. The granularity of this sync are those dimensions. I.E. // Marking X and Y but not Z means there should be Z semaphores of size X*Y. 
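// For example, with X_BLOCK and Y_BLOCK marked but not Z_BLOCK, and
// gridDim = {4, 2, 3}, each z-slice synchronizes independently: there are
// gridDim.z = 3 semaphores, and each one counts gridDim.x * gridDim.y = 8
// arriving blocks.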
template -__device__ void sync(int64_t& semaphore, const uint64_t& segment_size) { +__device__ void sync( + int64_t& semaphore, + const uint64_t& segment_size, + const bool last_block) { // Finish all global memory transactions before synchronizing __threadfence(); @@ -36,8 +39,6 @@ __device__ void sync(int64_t& semaphore, const uint64_t& segment_size) { // Makes the assumption that blocks are in increasing order, this is not // guaranteed by CUDA but this is the current behavior, and unlikely to // change. - bool last_block = - index_utils::maskedIsLast(blockIdx, gridDim); if (last_block) { semaphore_increment = FIRST_UINT64_BIT - (segment_size - 1); } @@ -63,4 +64,13 @@ __device__ void sync(int64_t& semaphore, const uint64_t& segment_size) { // Sync block to make sure all other threads are waiting on the sync block_sync::sync(); } + +template +__device__ void sync(int64_t& semaphore, const uint64_t& segment_size) { + sync( + semaphore, + segment_size, + index_utils::maskedIsLast(blockIdx, gridDim)); +} + } // namespace grid_sync diff --git a/torch/csrc/jit/codegen/cuda/runtime/helpers.cu b/torch/csrc/jit/codegen/cuda/runtime/helpers.cu index 02fd8bf8777296..0d27bb50e5f6dd 100644 --- a/torch/csrc/jit/codegen/cuda/runtime/helpers.cu +++ b/torch/csrc/jit/codegen/cuda/runtime/helpers.cu @@ -28,19 +28,19 @@ __device__ constexpr int64_t ceilDiv(int a, int64_t b) { } __device__ constexpr int max(int a, int b) { - return ::max(a, b); + return a > b ? a : b; } __device__ constexpr int64_t max(int64_t a, int b) { - return ::max(a, (int64_t)b); + return a > (int64_t)b ? a : (int64_t)b; } __device__ constexpr int64_t max(int a, int64_t b) { - return ::max((int64_t)a, b); + return (int64_t)a > b ? (int64_t)a : b; } __device__ constexpr int64_t max(int64_t a, int64_t b) { - return ::max(a, b); + return a > b ? a : b; } __device__ double fmax(double a, double b) { @@ -50,7 +50,7 @@ __device__ double fmax(double a, double b) { } else if (b != b) { return b; } else { - return ::fmax(a, b); + return a > b ? a : b; } } @@ -61,24 +61,24 @@ __device__ float fmax(float a, float b) { } else if (b != b) { return b; } else { - return ::fmax(a, b); + return a > b ? a : b; } } __device__ constexpr int min(int a, int b) { - return ::min(a, b); + return a > b ? b : a; } __device__ constexpr int64_t min(int64_t a, int b) { - return ::min(a, (int64_t)b); + return (int64_t)a > b ? b : (int64_t)a; } __device__ constexpr int64_t min(int a, int64_t b) { - return ::min((int64_t)a, b); + return a > (int64_t)b ? (int64_t)b : a; } __device__ constexpr int64_t min(int64_t a, int64_t b) { - return ::min(a, b); + return a > b ? b : a; } __device__ double fmin(double a, double b) { @@ -88,7 +88,7 @@ __device__ double fmin(double a, double b) { } else if (b != b) { return b; } else { - return ::fmin(a, b); + return a > b ? b : a; } } @@ -99,7 +99,7 @@ __device__ float fmin(float a, float b) { } else if (b != b) { return b; } else { - return ::fmin(a, b); + return a > b ? b : a; } } @@ -115,20 +115,20 @@ __device__ float clamp(float x, double minv, double maxv) { return x < minv ? minv : (x > maxv ? maxv : x); } -__device__ double frac(double x) { - return x - trunc(x); +__device__ int clamp(int x, int64_t minv, int64_t maxv) { + return x < minv ? minv : (x > maxv ? maxv : x); } -__device__ float frac(float x) { - return x - trunc(x); +__device__ int64_t clamp(int64_t x, int64_t minv, int64_t maxv) { + return x < minv ? minv : (x > maxv ? 
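+// Note on the fmax/fmin overloads above: unlike ::fmax/::fmin, which return
+// the non-NaN operand when exactly one input is NaN, these hand-written
+// versions propagate NaN. For example, fmax(NAN, 1.0f) returns NaN here,
+// while ::fmax(NAN, 1.0f) returns 1.0f; presumably this is to match the
+// NaN-propagating behavior of torch.maximum / torch.minimum in eager mode.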
maxv : x); } -__device__ double gelu(double x) { - return x * normcdf(x); +__device__ double frac(double x) { + return x - trunc(x); } -__device__ float gelu(float x) { - return x * normcdf(x); +__device__ float frac(float x) { + return x - trunc(x); } __device__ double reciprocal(double x) { @@ -139,6 +139,14 @@ __device__ float reciprocal(float x) { return 1 / x; } +__device__ std::complex reciprocal(std::complex x) { + return 1.0 / x; +} + +__device__ std::complex reciprocal(std::complex x) { + return 1.0f / x; +} + __device__ double relu(double x) { return x <= 0 ? 0 : x; } @@ -170,11 +178,19 @@ __device__ float remainder(float a, float b) { } __device__ double sigmoid(double x) { - return 1 / (1 + exp(-x)); + return 1.0 / (1.0 + exp(-x)); } __device__ float sigmoid(float x) { - return 1 / (1 + exp(-x)); + return 1.0f / (1.0f + exp(-x)); +} + +__device__ std::complex sigmoid(std::complex x) { + return 1.0 / (1.0 + exp(-x)); +} + +__device__ std::complex sigmoid(std::complex x) { + return 1.0f / (1.0f + exp(-x)); } __device__ double silu(double x) { @@ -193,6 +209,28 @@ __device__ float threshold(float x, double t, double v) { return x <= t ? v : x; } +__device__ std::complex where( + bool c, + std::complex a, + std::complex b) { + return c ? a : b; +} + +__device__ std::complex where( + bool c, + std::complex a, + std::complex b) { + return c ? a : b; +} + +__device__ int threshold(int x, int64_t t, int64_t v) { + return x <= t ? v : x; +} + +__device__ int64_t threshold(int64_t x, int64_t t, int64_t v) { + return x <= t ? v : x; +} + __device__ double where(bool c, double a, double b) { return c ? a : b; } @@ -205,6 +243,18 @@ __device__ int64_t where(bool c, int64_t a, int64_t b) { return c ? a : b; } +__device__ int where(bool c, int a, int b) { + return c ? a : b; +} + +__device__ int64_t where(bool c, int64_t a, int b) { + return c ? a : b; +} + +__device__ int64_t where(bool c, int a, int64_t b) { + return c ? 
a : b; +} + __device__ double randLike(Philox& rnd) { return uniform(rnd(), rnd()); } @@ -267,31 +317,59 @@ __device__ T pow(T a, T b) { } } -template int pow(int a, int b); -template int64_t pow(int64_t a, int64_t b); +template __device__ int pow(int a, int b); +template __device__ int64_t pow(int64_t a, int64_t b); template <> -float pow(float a, float b) { +__device__ float pow(float a, float b) { return ::pow(a, b); } template <> -double pow(double a, double b) { +__device__ double pow(double a, double b) { return ::pow(a, b); } -float pow(float a, int b) { +__device__ float pow(float a, int b) { return pow(a, (float)b); } -double pow(double a, int b) { +__device__ double pow(double a, int b) { return pow(a, (double)b); } -float pow(float a, int64_t b) { +__device__ float pow(float a, int64_t b) { return pow(a, (float)b); } -double pow(double a, int64_t b) { +__device__ double pow(double a, int64_t b) { return pow(a, (double)b); } + +int64_t pow(int64_t a, int b) { + return pow(a, (int64_t)b); +} + +int64_t pow(int a, int64_t b) { + return pow((int64_t)a, b); +} + +template +struct alignas(align) TypelessData { + int8_t data[size]; + + template _ = 0> + TypelessData(T x) { + *reinterpret_cast(data) = x; + } + + template _ = 0> + operator T() { + return *reinterpret_cast(data); + } +}; + +template +TypelessData erase_type(T x) { + return x; +} diff --git a/torch/csrc/jit/codegen/cuda/runtime/tensorcore.cu b/torch/csrc/jit/codegen/cuda/runtime/tensorcore.cu new file mode 100644 index 00000000000000..f95978e84475bf --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/runtime/tensorcore.cu @@ -0,0 +1,215 @@ +// Utility macro for this file +#define DEVICE_INLINE __device__ inline + +// MMA instruction wrappers: +// The wrappers are subroutines that implement matrix of size +// A(M,K) X B(K,N) = C(M,N) +// The naming of the wrappers follow similar naming conventions +// as the mma instructions. +// All the mma macros follow the namespace and naming like +// Arch::M (M-dim) N (N-dim) K(K-dim) (Layout), eg. +// Volta::M16N16K4TT, +// with the dimensions describing the size of the sub-matrices being +// multiplied by this wrapper. +// see [Operand Layout Convention] in mma_type.h for details on the layout +// notation. +namespace Volta { + +namespace util { +// MMA instruction wrappers (sm_70+): +// The instruction wrappers below are quarter-warp macros, which currently +// nvfuser +// doesn't explicitly model. 
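+// As a point of reference for the reinterpret_casts in the wrappers below:
+// each operand fragment of four __half values occupies two 32-bit registers
+// (4 x 16 bits), and the eight float accumulators occupy eight 32-bit
+// registers, which is why the inline asm binds two "r" operands for A, two
+// for B, and eight for C (as both inputs and outputs).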
So they are currently only meant to be +// used as building blocks in warp level mma macros + +// 8x8x4 mma instruction, per quarter warp (8 threads), fp32 accumulate +// per thread register: +// A[4] x B[4] -> C[8] +DEVICE_INLINE void mmaM8n8k4tt( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + unsigned const* _A = reinterpret_cast(A); + unsigned const* _B = reinterpret_cast(B); + unsigned* _C = reinterpret_cast(C); + + asm("mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 {%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, {%12,%13,%14,%15,%16,%17,%18,%19};\n" + : "=r"(_C[0]), + "=r"(_C[1]), + "=r"(_C[2]), + "=r"(_C[3]), + "=r"(_C[4]), + "=r"(_C[5]), + "=r"(_C[6]), + "=r"(_C[7]) + : "r"(_A[0]), + "r"(_A[1]), + "r"(_B[0]), + "r"(_B[1]), + "r"(_C[0]), + "r"(_C[1]), + "r"(_C[2]), + "r"(_C[3]), + "r"(_C[4]), + "r"(_C[5]), + "r"(_C[6]), + "r"(_C[7])); +} + +DEVICE_INLINE void mmaM8n8k4tn( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + unsigned const* _A = reinterpret_cast(A); + unsigned const* _B = reinterpret_cast(B); + unsigned* _C = reinterpret_cast(C); + + asm("mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f32 {%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, {%12,%13,%14,%15,%16,%17,%18,%19};\n" + : "=r"(_C[0]), + "=r"(_C[1]), + "=r"(_C[2]), + "=r"(_C[3]), + "=r"(_C[4]), + "=r"(_C[5]), + "=r"(_C[6]), + "=r"(_C[7]) + : "r"(_A[0]), + "r"(_A[1]), + "r"(_B[0]), + "r"(_B[1]), + "r"(_C[0]), + "r"(_C[1]), + "r"(_C[2]), + "r"(_C[3]), + "r"(_C[4]), + "r"(_C[5]), + "r"(_C[6]), + "r"(_C[7])); +} + +DEVICE_INLINE void mmaM8n8k4nt( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + unsigned const* _A = reinterpret_cast(A); + unsigned const* _B = reinterpret_cast(B); + unsigned* _C = reinterpret_cast(C); + + asm("mma.sync.aligned.m8n8k4.col.row.f32.f16.f16.f32 {%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, {%12,%13,%14,%15,%16,%17,%18,%19};\n" + : "=r"(_C[0]), + "=r"(_C[1]), + "=r"(_C[2]), + "=r"(_C[3]), + "=r"(_C[4]), + "=r"(_C[5]), + "=r"(_C[6]), + "=r"(_C[7]) + : "r"(_A[0]), + "r"(_A[1]), + "r"(_B[0]), + "r"(_B[1]), + "r"(_C[0]), + "r"(_C[1]), + "r"(_C[2]), + "r"(_C[3]), + "r"(_C[4]), + "r"(_C[5]), + "r"(_C[6]), + "r"(_C[7])); +} + +// TODO: in a follow up, +// lift this part onto iterdomain ops, once the +// swizzle ops are ready. +template +DEVICE_INLINE Array accToMma(float* _C) { + float C_data[8] = { + _C[0], + _C[1], + _C[acc_stride], + _C[acc_stride + 1], + _C[2], + _C[3], + _C[acc_stride + 2], + _C[acc_stride + 3], + }; + + return *reinterpret_cast*>(&C_data[0]); +} + +template +DEVICE_INLINE void mmaToAcc(float* _C, Array& C) { + float* C_data = reinterpret_cast(&C); + _C[0] = C_data[0]; + _C[1] = C_data[1]; + _C[acc_stride] = C_data[2]; + _C[acc_stride + 1] = C_data[3]; + _C[2] = C_data[4]; + _C[3] = C_data[5]; + _C[acc_stride + 2] = C_data[6]; + _C[acc_stride + 3] = C_data[7]; +} + +// Should be able to lift this with transpose op as well. 
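+// Illustration of the accumulator swizzle implemented by accToMma/mmaToAcc
+// above (acc_stride = 4 is used purely as an example value): accToMma gathers
+// the per-thread accumulator registers in the order
+//   _C[0], _C[1], _C[4], _C[5], _C[2], _C[3], _C[6], _C[7]
+// i.e. pairs from the row at offset 0 interleaved with pairs from the row at
+// offset acc_stride, which is the order the mmaM8n8k4 wrappers above consume
+// for their C fragment; mmaToAcc applies the inverse mapping when writing
+// back.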
+template +DEVICE_INLINE void initM16N16K4(Array& accumulator) { + float* _C = reinterpret_cast(&accumulator); + float zeros[8] = {0, 0, 0, 0, 0, 0, 0, 0}; + mmaToAcc(_C, *reinterpret_cast*>(&zeros[0])); +} + +} // namespace util + +template +DEVICE_INLINE void M16N16K4TT( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + float* _C = reinterpret_cast(C); + Array C_data = util::accToMma(_C); + util::mmaM8n8k4tt(&C_data, A, B); + util::mmaToAcc(_C, C_data); +} + +template +DEVICE_INLINE void M16N16K4TN( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + float* _C = reinterpret_cast(C); + Array C_data = util::accToMma(_C); + util::mmaM8n8k4tn(&C_data, A, B); + util::mmaToAcc(_C, C_data); +} + +template +DEVICE_INLINE void M16N16K4NT( + Array* C, + Array<__half, 4, 4>* A, + Array<__half, 4, 4>* B) { + float* _C = reinterpret_cast(C); + Array C_data = util::accToMma(_C); + util::mmaM8n8k4nt(&C_data, A, B); + util::mmaToAcc(_C, C_data); +} + +// Same initialization for now, will be different in interleaved +// macros +template +DEVICE_INLINE void initM16N16K4TT(Array* accumulator) { + util::initM16N16K4(*accumulator); +} + +template +DEVICE_INLINE void initM16N16K4TN(Array* accumulator) { + util::initM16N16K4(*accumulator); +} + +template +DEVICE_INLINE void initM16N16K4NT(Array* accumulator) { + util::initM16N16K4(*accumulator); +} + +} // namespace Volta + +#undef DEVICE_INLINE diff --git a/torch/csrc/jit/codegen/cuda/runtime/tuple.cu b/torch/csrc/jit/codegen/cuda/runtime/tuple.cu new file mode 100644 index 00000000000000..8e67dba7da72c9 --- /dev/null +++ b/torch/csrc/jit/codegen/cuda/runtime/tuple.cu @@ -0,0 +1,322 @@ +// std::tuple-like type +template +struct Tuple; + +template +struct Tuple { + T0 val0; + + __device__ Tuple(T0 _val0) : val0(_val0) {} + + // Only valid when instantiated for pointer types + __device__ void operator+=(nvfuser_index_t offset) { + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + val0 += offset; + } +}; + +template +struct Tuple { + T0 val0; + T1 val1; + + __device__ Tuple(T0 _val0, T1 _val1) : val0(_val0), val1(_val1) {} + + // Only valid when instantiated for pointer types + __device__ void operator+=(nvfuser_index_t offset) { + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + val0 += offset; + val1 += offset; + } +}; + +template +struct Tuple { + T0 val0; + T1 val1; + T2 val2; + + __device__ Tuple(T0 _val0, T1 _val1, T2 _val2) + : val0(_val0), val1(_val1), val2(_val2) {} + + // Only valid when instantiated for pointer types + __device__ void operator+=(nvfuser_index_t offset) { + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + static_assert(IsPointerType::value, "Invalid for non-pointer types"); + val0 += offset; + val1 += offset; + val2 += offset; + } +}; + +// Accessor for Tuple +template +struct get; + +template <> +struct get<0> { + template + __device__ auto& operator()(Tuple& vals) { + return vals.val0; + } + template + __device__ const auto& operator()(const Tuple& vals) { + return vals.val0; + } +}; + +template <> +struct get<1> { + template + __device__ auto& operator()(Tuple& vals) { + return vals.val1; + } + template + __device__ const auto& operator()(const Tuple& vals) { + return vals.val1; + } +}; + +template <> +struct get<2> { + template + __device__ auto& operator()(Tuple& vals) { + return 
vals.val2; + } + template + __device__ const auto& operator()(const Tuple& vals) { + return vals.val2; + } +}; + +template +__inline__ __device__ static void copyTuple( + DstType& dst, + nvfuser_index_t dst_offset, + const SrcType& src, + nvfuser_index_t src_offset = 0); + +template +__inline__ __device__ static void copyTuple( + DstType& dst, + const SrcType& src, + nvfuser_index_t src_offset = 0); + +template +class LocalTuple { + public: + static constexpr int num_vals = sizeof...(Types); + using ValTypes = TypeList; + + __device__ LocalTuple(Types... args) : vals_(args...) {} + + __device__ LocalTuple(const LocalTuple& other) : vals_(other.vals_) {} + + template
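+// Rough usage sketch for the tuple utilities above (variable names are
+// illustrative only, not taken from the generated kernels):
+//   Tuple<float, int64_t> t(1.0f, 2);
+//   get<1>()(t) += 3;             // access val1 through the get<I> functor
+//   Tuple<float*, float*> ptrs(a, b);
+//   ptrs += 16;                   // operator+= is valid only when every
+//                                 // element type is a pointer
+// copyTuple (declared above, presumably defined further down in this file)
+// copies elements between two such tuple-like objects at the given offsets,
+// and LocalTuple appears to wrap a Tuple of plain values held in registers.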